This example uses Scrapy's CrawlSpider to crawl a site whose links follow the path item list page -> individual item overview page -> individual item detail page, scraping and saving the information on the detail pages.
The pages correspond to URLs as follows.
| Page | URL |
|---|---|
| Item list | example.com/list |
| Item overview | example.com/item/(ID)/ |
| Item details | example.com/item/(ID)/details |
For a site with this structure, if you append /details to each overview-page link extracted from the list page and request the detail page directly, the number of requests sent to the target site is halved and the program finishes faster: two birds with one stone! So here is an implementation example.
The URL transformation is written as a lambda expression passed to the *process_value* argument of LinkExtractor.
example.py
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/list']  # Item list page

    rules = [
        Rule(LinkExtractor(
            # Extract URLs that include /item/
            allow=r'.*/item/.*',
            # Append 'details/' to each extracted URL
            process_value=lambda x: x + 'details/',
        ), callback='parse_details'),
    ]

    def parse_details(self, response):
        # (omitted)
        pass
```
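The body of parse_details is omitted above. As a minimal sketch of what it might contain, assuming the detail page exposes a name and a price (the CSS selectors below are hypothetical placeholders, not taken from a real site):

```python
    def parse_details(self, response):
        # Hypothetical selectors -- adapt them to the detail page's actual markup
        yield {
            'url': response.url,
            'name': response.css('h1.item-name::text').get(),
            'price': response.css('span.price::text').get(),
        }
```

Incidentally, per the Scrapy documentation, the callable passed to *process_value* may also return None to drop a link altogether, which is handy if the list page contains matching links that should not be rewritten. With the standard project layout, the spider runs with `scrapy crawl example`, and adding `-o items.jl` exports whatever parse_details yields.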
That's all!