When I tried to write unit tests for Scrapy, it turned out to be a somewhat special case with little information available, so I've summarized what I learned. Because the HTML a crawler targets can change at any time, I think contracts are better used to shorten the crawl-and-check cycle during development than as strict validation. (* This article is mainly about unit tests for Spiders.) (* Tests for Pipelines and the like are out of scope, since those can be written with plain unittest.)
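For contrast, here is why Pipeline tests are out of scope: a pipeline is plain synchronous code, so an ordinary unittest covers it. This is a minimal sketch; SamplePipeline and its behavior are made up for illustration.

import unittest

class SamplePipeline:
    # hypothetical pipeline that strips whitespace from titles
    def process_item(self, item, spider):
        item['title'] = item['title'].strip()
        return item

class TestSamplePipeline(unittest.TestCase):
    def test_process_item(self):
        item = SamplePipeline().process_item({'title': ' Scrapy '}, spider=None)
        self.assertEqual(item['title'], 'Scrapy')

if __name__ == '__main__':
    unittest.main()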
TL;DR
Use Spiders Contracts
scrapy check spidername
def parse(self, response):
    """ This function parses a sample response. Some contracts are mingled
    with this docstring.

    @url http://www.amazon.com/s?field-keywords=selfish+gene
    @returns items 1 16
    @returns requests 0 0
    @scrapes Title Author Year Price
    """
I think the quickest way to get a feel for it is the sample code below (Python 3.6.2, Scrapy 1.4.0). Each contract is a docstring line of the form @contract_name arg1 arg2 arg3 ...; the built-in contracts are @url (the URL to fetch for the check), @returns (lower and upper bounds on the number of items or requests returned), and @scrapes (fields that must be present on the returned items).
myblog.py
from scrapy import Request

def parse_list(self, response):
    """Parse the list page.

    @url http://www.rhoboro.com/index2.html
    @returns item 0 0
    @returns requests 0 10
    """
    for detail in response.xpath('//div[@class="post-preview"]/a/@href').extract():
        yield Request(url=response.urljoin(detail), callback=self.parse_detail)
Contracts can be extended by creating your own Contract subclasses. Register the contracts you create in settings.py. Inside a subclass the following are available:

- self.args: the arguments that follow the contract name in the docstring
- the adjust_request_args(self, kwargs) method: adjusts the keyword arguments used to build the request
- the pre_process(self, response) method: runs against the response before the callback
- the post_process(self, output) method: runs against the callback's output

contracts.py
# -*- coding: utf-8 -*-
from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail


class ItemValidateContract(Contract):
    """Check whether the Item is as expected.

    Since scraped results can change at any time,
    I think it's best to test only fields whose values you expect to be invariant.
    Checks beyond missing fields are probably better done in a Pipeline.
    """
    name = 'item_validate'  # this name is the one used in the docstring

    def post_process(self, output):
        item = output[0]
        if 'title' not in item:
            raise ContractFail('title is invalid.')


class CookiesContract(Contract):
    """Contract that adds cookies to the (Scrapy) Request.

    @cookies key1 value1 key2 value2
    """
    name = 'cookies'

    def adjust_request_args(self, kwargs):
        # convert self.args into a dict and pass it as cookies
        kwargs['cookies'] = {t[0]: t[1]
                             for t in zip(self.args[::2], self.args[1::2])}
        return kwargs
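The sample above only uses post_process and adjust_request_args; a contract using the pre_process hook would look something like the following sketch (StatusContract and the @status name are my own invention, not part of Scrapy):

from scrapy.contracts import Contract
from scrapy.exceptions import ContractFail


class StatusContract(Contract):
    """Hypothetical contract: fail unless the response has the expected HTTP status.

    @status 200
    """
    name = 'status'

    def pre_process(self, response):
        # self.args holds the docstring arguments as strings
        expected = int(self.args[0])
        if response.status != expected:
            raise ContractFail('got status %d, expected %d'
                               % (response.status, expected))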
The code on the side that uses these contracts looks like this. (As far as I can tell, the numbers in SPIDER_CONTRACTS are ordering values, like middleware priorities.)
settings.py
...
SPIDER_CONTRACTS = {
'item_crawl.contracts.CookiesContract': 10,
'item_crawl.contracts.ItemValidateContract': 20,
}
...
myblog.py
def parse_detail(self, response):
    """Parse the detail page.

    @url http://www.rhoboro.com/2017/08/05/start-onomichi.html
    @returns item 1
    @scrapes title body tags
    @item_validate
    @cookies index 2
    """
    item = BlogItem()
    item['title'] = response.xpath('//div[@class="post-heading"]//h1/text()').extract_first()
    item['body'] = response.xpath('//article').xpath('string()').extract_first()
    item['tags'] = response.xpath('//div[@class="tags"]//a/text()').extract()
    item['index'] = response.request.cookies['index']
    yield item
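For reference, the BlogItem used above isn't shown in this article; a minimal sketch matching the four fields the spider fills in would be:

items.py

import scrapy


class BlogItem(scrapy.Item):
    title = scrapy.Field()
    body = scrapy.Field()
    tags = scrapy.Field()
    index = scrapy.Field()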
Run it with scrapy check spidername. Since it only crawls the pages specified by @url, it's obviously faster than trying things out with scrapy crawl spidername.
(venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check my_blog [master:crawler]
.....
----------------------------------------------------------------------
Ran 5 contracts in 8.919s
OK
And this is the output when contracts fail (in this case, failures in parse_detail()).
(venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check my_blog [master:crawler]
...FF
======================================================================
FAIL: [my_blog] parse_detail (@scrapes post-hook)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/rhoboro/github/scrapy/venv/lib/python3.6/site-packages/scrapy/contracts/__init__.py", line 134, in wrapper
self.post_process(output)
File "/Users/rhoboro/github/scrapy/venv/lib/python3.6/site-packages/scrapy/contracts/default.py", line 89, in post_process
raise ContractFail("'%s' field is missing" % arg)
scrapy.exceptions.ContractFail: 'title' field is missing
======================================================================
FAIL: [my_blog] parse_detail (@item_validate post-hook)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/rhoboro/github/scrapy/venv/lib/python3.6/site-packages/scrapy/contracts/__init__.py", line 134, in wrapper
self.post_process(output)
File "/Users/rhoboro/github/scrapy/crawler/crawler/contracts.py", line 18, in post_process
raise ContractFail('title is invalid.')
scrapy.exceptions.ContractFail: title is invalid.
----------------------------------------------------------------------
Ran 5 contracts in 8.552s
FAILED (failures=2)
By the way, this is what it looks like when there's an error (in my case, I had forgotten to register the contracts in settings.py). Honestly, the output gives you too little information to work with.
(venv) [alpaca]~/github/scrapy/crawler/crawler % scrapy check my_blog [master:crawler]
Unhandled error in Deferred:
----------------------------------------------------------------------
Ran 0 contracts in 0.000s
OK