Previously, for study purposes, I built a zip code search API with AWS Lambda + API Gateway; the zip code data was scraped with Scrapy and uploaded to S3 for use.
This time, I will write about the pitfalls I ran into while getting that Scrapy project deployed to Scrapy Cloud and running on a regular schedule.
For Scrapy Cloud, I referred to the following URLs.

- Scrapy + Scrapy Cloud for a comfortable Python crawl + scraping life - Gunosy data analysis blog
- Scraping with Python - Introduction to Scrapy, 2nd step - Qiita
The figure below shows the overall workflow.

From creating a Scrapy project to deploying it to Scrapinghub, the flow is as follows.

- Create a Scrapy project: `$ scrapy startproject {your project}`
- Implement the spider locally
- Try scraping locally: `$ scrapy runspider {spider_file.py}`
- Deploy to Scrapinghub: `$ shub deploy`

It worked fine up to the deploy to Scrapinghub, but I hit a few pitfalls when actually running the job there.
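For reference, a minimal sketch of deploying and running a job from the shub CLI; the project ID `12345` and spider name `zipcode` are placeholders, not the actual values used in this article:

```
$ pip install shub                # install the Scrapinghub CLI
$ shub login                      # authenticate with your Scrapinghub API key
$ shub deploy 12345               # deploy the project to Scrapinghub
$ shub schedule 12345/zipcode     # run the "zipcode" spider as a job
```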
Boto is used for AWS operations, but be careful: Scrapy Cloud comes with boto (v2) pre-installed.
If you specify requirements_file in scrapinghub.yml, the required libraries will be installed, so you can use boto3, as in the sketch below.
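A minimal sketch of the two files, assuming the project ID `12345` is a placeholder and boto3 is the only extra dependency:

scrapinghub.yml
```yaml
projects:
  default: 12345
requirements_file: requirements.txt
```

requirements.txt
```
boto3
```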
If requirements_file is processed successfully at deploy time, you can check the additionally installed libraries under requirements in Code & Deploys.

For the AWS credentials, go to Spider Settings -> Spider settings and register the setting values as shown below.

Access them from the code as follows.

```python
import boto
from scrapy.conf import settings

s3 = boto.connect_s3(settings['AWS_ACCESS_KEY_ID'], settings['AWS_SECRET_ACCESS_KEY'])
```
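Since boto3 is installed through requirements_file, the same thing can also be written with boto3; a minimal sketch, where the local file, bucket, and key names are hypothetical:

```python
import boto3
from scrapy.conf import settings

# Build an S3 client from the credentials registered in Spider Settings
# (or in settings.py when running locally)
s3 = boto3.client(
    's3',
    aws_access_key_id=settings['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=settings['AWS_SECRET_ACCESS_KEY'],
)

# Upload the scraped zip code data (file, bucket, and key are placeholders)
s3.upload_file('zipcode.csv', 'your-bucket', 'zipcode/zipcode.csv')
```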
If you want to check locally, write the setting values in settings.py.
However, if the credentials already exist in ~/.aws/credentials, you do not need to write them.
settings.py
```python
AWS_ACCESS_KEY_ID = 'xxxxxx'
AWS_SECRET_ACCESS_KEY = 'xxxxxx'
```
- File upload to S3 is extremely slow; scraping takes about 20 minutes
  - It's for study purposes, so it's not a big problem, but I'd like to improve it somehow
- Zip code data is updated once a month
- When the zip code data is updated, I want the API on the `AWS Lambda` side to automatically refer to the latest data as well
  - Currently, the update date of the zip code data is managed as version information in stage variables
  - I want to update the stage variable after the upload to `S3` is complete (see the sketch below)
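As a rough idea for that last point, the stage variable could be updated with boto3 once the upload to S3 has finished; a minimal sketch, where the REST API ID, stage name, and variable name are all hypothetical:

```python
import boto3

apigateway = boto3.client('apigateway')

# After the S3 upload completes, point the stage variable at the new data version
# (restApiId, stageName, and the variable name 'version' are placeholders)
apigateway.update_stage(
    restApiId='abc123def4',
    stageName='prod',
    patchOperations=[
        {
            'op': 'replace',
            'path': '/variables/version',
            'value': '20161101',  # e.g. the update date of the zip code data
        },
    ],
)
```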
Initially, I was trying to run the Scrapy project on `AWS Lambda` as well, just like the API. However, I had to compress the source code, including its libraries, into a ZIP file and upload it, and that did not work well, so I simply deployed it to `Scrapy Cloud` instead.
I was also invited to the Bot Crawler Advent Calendar 2016, so I'm planning to make some bots with Scrapy Cloud and write articles there as well.
See you soon.