Previously, for study purposes, I built a zip code search API with AWS Lambda + API Gateway. The zip code data was scraped with Scrapy and uploaded to S3 for the API to use.
This time, I will write about the points I got stuck on while deploying that Scrapy project to Scrapy Cloud and running it on a regular schedule.
For Scrapy Cloud, I referred to the following articles.
- Scrapy + Scrapy Cloud for a comfortable Python crawl + scraping life - Gunosy data analysis blog
- Scraping with Python - Introduction to Scrapy 2nd step - Qiita
The figure below shows the overall workflow.
Scrapinghub
From creating a Scrapy project to deploying it to Scrapinghub, the flow is as follows.
- Create a Scrapy project
  `$ scrapy startproject {your project}`
- Implement the spider locally (a minimal sketch follows this list)
- Try scraping locally
  `$ scrapy runspider {spider_file.py}`
- Deploy to Scrapinghub
  `$ shub deploy`
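For the "implement the spider locally" step, the spider for this project is not shown here, so the following is only a minimal sketch of what such a spider could look like. The spider name, start URL, and CSS selectors are placeholders, not the actual zip code site used in this project.

```python
# Minimal spider sketch for the "implement the spider locally" step.
# The name, start URL, and CSS selectors are placeholders, not the
# actual zip code site used in this project.
import scrapy


class ZipcodeSpider(scrapy.Spider):
    name = 'zipcode'
    start_urls = ['https://example.com/zipcodes']

    def parse(self, response):
        # Yield one item per table row; adjust the selectors to the real page.
        for row in response.css('table tr'):
            yield {
                'zip_code': row.css('td.zip::text').extract_first(),
                'address': row.css('td.address::text').extract_first(),
            }
```

Saved as spider_file.py, this can be run locally with `$ scrapy runspider spider_file.py`.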
Scrapinghub
Everything went fine up to the deploy to Scrapinghub, but there were a few points I got stuck on when running the job on Scrapinghub.
Boto is used for the AWS operations, but be careful: Scrapy Cloud comes with boto (v2) pre-installed.
If you specify requirements_file in scrapinghub.yml, the required libraries are installed for you, so boto3 can be used.
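As a reference, a minimal scrapinghub.yml might look like the following; the project ID here is a placeholder, not the one from this project.

```yaml
# scrapinghub.yml (minimal sketch; the project ID is a placeholder)
projects:
  default: 12345
requirements_file: requirements.txt
```

Here requirements.txt would simply list boto3 (plus anything else the spider needs).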
If requirements_file is processed successfully at deploy time, you can check the additionally installed libraries under requirements in Code & Deploys.
For the AWS credentials, go to Spider Settings -> Spider settings and register AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as setting values.
They can then be accessed from the code as follows.
```python
import boto
from scrapy.conf import settings

# The values registered in Spider settings are exposed through the Scrapy settings
s3 = boto.connect_s3(settings['AWS_ACCESS_KEY_ID'], settings['AWS_SECRET_ACCESS_KEY'])
```
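If you use boto3 installed via requirements_file instead, the equivalent would be along these lines; the bucket name and file name below are placeholders.

```python
# boto3 equivalent (sketch): the bucket and key names are placeholders
import boto3
from scrapy.conf import settings

s3 = boto3.client(
    's3',
    aws_access_key_id=settings['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=settings['AWS_SECRET_ACCESS_KEY'],
)
s3.upload_file('zipcode.json', 'your-bucket-name', 'zipcode.json')
```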
If you want to check locally, write the setting values in settings.py. However, if the credentials already exist in ~/.aws/credentials, there is no need to write them.
settings.py

```python
AWS_ACCESS_KEY_ID = 'xxxxxx'
AWS_SECRET_ACCESS_KEY = 'xxxxxx'
```
Other points I noticed and want to improve:

- The file upload to S3 is extremely slow; the scraping run takes about 20 minutes
  - It's only for study purposes, so it isn't a big problem, but I'd like to improve it somehow
- The zip code data is updated once a month
- When the zip code data is updated, I want the API on the `AWS Lambda` side to automatically refer to the latest data as well
  - Currently, the update date of the zip code data is managed as version information in stage variables
  - I want to update the stage variable once the upload to `S3` is complete (a sketch follows this list)
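For that last point, I imagine something like the following would work with boto3. This is only a sketch: the API ID, stage name, and variable name are placeholders, not the actual values used in this project.

```python
# Sketch: update an API Gateway stage variable after the S3 upload finishes.
# restApiId, stageName, and the variable name are placeholders.
import boto3

apigateway = boto3.client('apigateway')
apigateway.update_stage(
    restApiId='xxxxxxxxxx',
    stageName='prod',
    patchOperations=[
        {
            'op': 'replace',
            'path': '/variables/version',
            'value': '20161201',  # e.g. the update date of the zip code data
        },
    ],
)
```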
Initially, I tried to run the Scrapy project on `AWS Lambda` as well, just like the API. However, I had to compress the source code together with its libraries into a ZIP file and upload it, and in the end it did not work well, so I simply deployed it to `Scrapy Cloud` instead.
I was also invited to the Bot Crawler Advent Calendar 2016, so I'm planning to build some bots with Scrapy Cloud and write an article there as well.
See you soon.