As the number of scraping targets (spider files) in a Scrapy project grows, the pipeline code crammed into a single file grows with it, hurting both readability and maintainability. I finally managed to separate the pipeline implementation file for each spider, so here is the method.
A Scrapy project has a configuration file called settings.py.
Because the `ITEM_PIPELINES` setting in settings.py applies project-wide, I initially thought that even with multiple spiders I had no choice but to aggregate all pipeline processing into classes in the single implementation file specified there.
settings.py
ITEM_PIPELINES = {
    'example_project.pipelines.DBPipeline': 100,
}
- Pattern that branches on the spider name

I had been routing with the spider name as the key, but it is clear that readability degrades as more spiders are added.
pipelines.py
class DBPipeline(object):
    def process_item(self, item, spider):
        if spider.name in ['example_spider']:
            # Pipeline processing for example_spider
            pass
        if spider.name in ['example_spider2']:
            # Pipeline processing for example_spider2
            pass
        return item
If you set the `ITEM_PIPELINES` item in `custom_settings` for each spider as shown below, the pipeline implementation file can be split per spider. [^1]
example_spider.py
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    custom_settings = {
        'ITEM_PIPELINES': {
            'example_project.example_pipelines.ValidationPipeline': 100,
            'example_project.example_pipelines.DBPipeline': 200,
        }
    }
example_spider2.py
import scrapy

class ExampleSpider2(scrapy.Spider):
    name = 'example_spider2'
    custom_settings = {
        'ITEM_PIPELINES': {
            'example_project.example_pipelines2.DBPipeline': 100,
        }
    }
Each spider is then routed to its own pipeline code, exactly as configured in custom_settings.
example_pipelines.py
import scrapy

class ValidationPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Executed when example_spider.py runs
        return item

class DBPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Executed when example_spider.py runs
        return item
example_pipelines2.py
import scrapy

class DBPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Executed when example_spider2.py runs
        return item
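For reference, here is a minimal sketch of what a concrete pipeline body might look like; the required field `title` and the use of DropItem are my own illustrative assumptions, not part of the original setup.

from scrapy.exceptions import DropItem

class ValidationPipeline(object):
    def process_item(self, item, spider):
        # Hypothetical check: drop items missing a required 'title' field
        if not item.get('title'):
            raise DropItem('Missing title: %r' % item)
        return item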
With this setup, even as the number of scraping targets (spiders) grows, the pipeline code stays readable and maintainable.
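As a quick way to confirm the routing, here is an untested sketch that runs both spiders in one process; it assumes the project's SPIDER_MODULES setting allows the spiders to be looked up by name.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('example_spider')   # only the pipelines in example_pipelines should run
process.crawl('example_spider2')  # only example_pipelines2.DBPipeline should run
process.start()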
[^1]: Similarly, it seems that other settings such as SPIDER_MIDDLEWARES can also be overridden per spider.
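For illustration, a hedged sketch of such a per-spider override; the middleware path below is a hypothetical placeholder, not from the original project.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            # Hypothetical middleware path, for illustration only
            'example_project.middlewares.ExampleSpiderMiddleware': 543,
        },
    }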