As the number of scraping targets (spider files) in a Scrapy project grows, the pipeline code crammed into a single file grows with it, hurting both readability and maintainability. I finally managed to separate the pipeline implementation file for each spider, so here is the method.
A Scrapy project has a configuration file called settings.py.
Because the `ITEM_PIPELINES` setting in settings.py applies project-wide, I initially thought that even with multiple spiders I had no choice but to aggregate all pipeline processing into classes in the single implementation file specified there.
settings.py
ITEM_PIPELINES = {
    'example_project.pipelines.DBPipeline': 100,
}
- Pattern that branches on the spider name

I had been routing with the spider name as the key, but it is clear that readability degrades as more spiders are added.
pipelines.py
class DBPipeline(object):
    def process_item(self, item, spider):
        if spider.name in ['example_spider']:
            # Pipeline processing for example_spider
            pass
        if spider.name in ['example_spider2']:
            # Pipeline processing for example_spider2
            pass
        return item
If you set the `ITEM_PIPELINES` item in `custom_settings` for each spider as shown below, the pipeline implementation file can be split per spider. [^1]
example_spider.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    custom_settings = {
        'ITEM_PIPELINES': {
            'example_project.example_pipelines.ValidationPipeline': 100,
            'example_project.example_pipelines.DBPipeline': 200,
        }
    }
example_spider2.py
import scrapy

class ExampleSpider2(scrapy.Spider):
    name = 'example_spider2'
    custom_settings = {
        'ITEM_PIPELINES': {
            'example_project.example_pipelines2.DBPipeline': 100,
        }
    }
Each spider is then routed to its own pipeline code, exactly as configured in custom_settings.
example_pipelines.py
import scrapy

class ValidationPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Executed when example_spider.py runs
        return item

class DBPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Executed when example_spider.py runs
        return item
example_pipelines2.py
import scrapy

class DBPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Executed when example_spider2.py runs
        return item
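For reference, here is a minimal sketch of what a concrete pipeline body might look like; the required field `title` and the use of DropItem are my own illustrative assumptions, not part of the original setup.

from scrapy.exceptions import DropItem

class ValidationPipeline(object):
    def process_item(self, item, spider):
        # Hypothetical check: drop items missing a required 'title' field
        if not item.get('title'):
            raise DropItem('Missing title: %r' % item)
        return item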
With this setup, even as the number of scraping targets (spiders) grows, the pipeline code stays readable and maintainable.
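As a quick way to confirm the routing, here is an untested sketch that runs both spiders in one process; it assumes the project's SPIDER_MODULES setting allows the spiders to be looked up by name.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('example_spider')   # only the pipelines in example_pipelines should run
process.crawl('example_spider2')  # only example_pipelines2.DBPipeline should run
process.start()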
[^1]: Similarly, it seems that other settings such as SPIDER_MIDDLEWARES can also be overridden per spider.
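For illustration, a hedged sketch of such a per-spider override; the middleware path below is a hypothetical placeholder, not from the original project.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            # Hypothetical middleware path, for illustration only
            'example_project.middlewares.ExampleSpiderMiddleware': 543,
        },
    }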