How to separate pipeline processing code into files by spider in Scrapy

Introduction

As you add more scraping targets (spider files) to a Scrapy project, the pipeline processing code piles up in a single file, hurting readability and maintainability. I eventually found a way to split the pipeline implementation into a separate file for each spider, so I will introduce the method here.

Initial approach

A Scrapy project has a configuration file called `settings.py`. Since `settings.py` has an `ITEM_PIPELINES` setting item, I initially thought that even with multiple spiders I had no choice but to aggregate all pipeline processing into classes in the single implementation file specified there.

settings.py


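# The integer values are priorities (0-1000); pipelines with lower values run first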
ITEM_PIPELINES = {
    'example_project.pipelines.DBPipeline': 100,
}

Pattern that branches on the spider name

I routed items using the spider name as the key, but readability of the code clearly degrades as the number of spiders grows.

pipelines.py


class DBPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'example_spider':
            pass  # pipeline processing for example_spider

        if spider.name == 'example_spider2':
            pass  # pipeline processing for example_spider2

        return item

Conclusion

If you set the `ITEM_PIPELINES` item in `custom_settings` for each spider as shown below, the pipeline implementation file can be separated per spider. [^1]

example_spider.py


import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    custom_settings = {
        'ITEM_PIPELINES': {
            'example_project.example_pipelines.ValidationPipeline': 100,
            'example_project.example_pipelines.DBPipeline': 200,
        }
    }

example_spider2.py


import scrapy

class ExampleSpider2(scrapy.Spider):
    name = 'example_spider2'
    custom_settings = {
        'ITEM_PIPELINES': {
            'example_project.example_pipelines2.DBPipeline': 100,
        }
    }

Items are routed individually to the pipeline implementations below, as configured in each spider's `custom_settings`. For example, running `scrapy crawl example_spider` invokes only the pipelines declared in example_pipelines.py.

example_pipelines.py


import scrapy

class ValidationPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Executed when example_spider.py runs
        return item

class DBPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Executed when example_spider.py runs
        return item

example_pipelines2.py


import scrapy

class DBPipeline(object):
    def process_item(self, item: scrapy.Item, spider: scrapy.Spider):
        # Executed when example_spider2.py runs
        return item
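
To make the per-file pipelines concrete, here is a minimal sketch of what ValidationPipeline in example_pipelines.py might contain; the check on a `url` field is a hypothetical example for illustration, not taken from the original project:


from scrapy.exceptions import DropItem

class ValidationPipeline(object):
    def process_item(self, item, spider):
        # Hypothetical validation: drop items that lack a 'url' field
        if not item.get('url'):
            raise DropItem('Missing url in item from %s' % spider.name)
        return item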

With the above, the pipeline processing code stays readable even as the number of scraping targets (spiders) increases.

[^1]: Other settings, such as `SPIDER_MIDDLEWARES`, can apparently be customized per spider in the same way.
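
As a minimal sketch of that footnote, assuming a hypothetical spider middleware `example_project.middlewares.ExampleMiddleware` (not part of the original project), per-spider middleware could be enabled the same way:


import scrapy

class ExampleSpider3(scrapy.Spider):
    name = 'example_spider3'
    custom_settings = {
        'SPIDER_MIDDLEWARES': {
            # Hypothetical middleware path; replace with your own class
            'example_project.middlewares.ExampleMiddleware': 100,
        }
    }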
