To state the conclusion first: I could not get this to work on Lambda. There are other approaches, so if one of them works I will add it here or post it as a separate article.
This time we will put the weather_spider.py created last time (1)
onto AWS Lambda so that it can run serverlessly.
It's been a while since the last post; the reason is explained at the end.
Use Lambda to get Yahoo! Weather (Tokyo) data every 6 hours.
This time we will build the Lambda function using the Serverless Application Model (SAM).
See here for details on SAM.
The following assumes that the aws command and sam command can be executed.
Since we will implement it in Python 3.7 this time, specify python3.7 as the runtime for sam init.
$ sam init --runtime python3.7
[+] Initializing project structure...
Project generated: ./sam-app
Steps you can take next within the project folder
===================================================
[*] Invoke Function: sam local invoke HelloWorldFunction --event event.json
[*] Start API Gateway locally: sam local start-api
Read sam-app/README.md for further instructions
[*] Project initialization is now complete
A SAM project is created in an instant, like this.
The folder structure is as follows.
sam-app
├── README.md
├── events
│ └── event.json
├── hello_world
│ ├── __init__.py
│ ├── __pycache__
│ │ ├── __init__.cpython-37.pyc
│ │ └── app.cpython-37.pyc
│ ├── app.py
│ └── requirements.txt
├── template.yaml
└── tests
└── unit
├── __init__.py
├── __pycache__
│ ├── __init__.cpython-37.pyc
│ └── test_handler.cpython-37.pyc
└── test_handler.py
Copy the yahoo_weather_crawl project created last time directly under sam-app.
$ cp -r yahoo_weather_crawl sam-app/
$ cd sam-app/
$ ls
README.md hello_world tests
events template.yaml yahoo_weather_crawl
Add a handler so that the spider can be invoked from Lambda.
spider/weather_spider.py
# -*- coding: utf-8 -*-
import scrapy
from yahoo_weather_crawl.items import YahooWeatherCrawlItem
from scrapy.crawler import CrawlerProcess

# spider
class YahooWeatherSpider(scrapy.Spider):
    name = "yahoo_weather_crawler"
    allowed_domains = ['weather.yahoo.co.jp']
    start_urls = ["https://weather.yahoo.co.jp/weather/jp/13/4410.html"]

    # Extraction process for the response
    def parse(self, response):
        # Announcement date and time
        yield YahooWeatherCrawlItem(announcement_date=response.xpath('//*[@id="week"]/p/text()').extract_first())
        table = response.xpath('//*[@id="yjw_week"]/table')

        # Loop over the dates
        for day in range(2, 7):
            yield YahooWeatherCrawlItem(
                # Data extraction
                date=table.xpath('//tr[1]/td[%d]/small/text()' % day).extract_first(),
                weather=table.xpath('//tr[2]/td[%d]/small/text()' % day).extract_first(),
                temperature=table.xpath('//tr[3]/td[%d]/small/font/text()' % day).extract(),
                rainy_percent=table.xpath('//tr[4]/td[%d]/small/text()' % day).extract_first(),
            )

# lambda handler
def lambda_handler(event, context):
    process = CrawlerProcess({
        'FEED_FORMAT': 'json',
        'FEED_URI': '/tmp/result.json'
    })
    process.crawl(YahooWeatherSpider)
    process.start()
    print('crawl success')
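The handler writes the scraped items to /tmp, the only writable path on a Lambda function. As a minimal sketch (assuming the FEED_URI setting above), reading the feed back after process.start() returns could look like this:

```python
import json

def load_crawl_result(path='/tmp/result.json'):
    """Read the JSON feed that Scrapy's feed exporter wrote to FEED_URI."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```

From here you could, for example, push the result to S3 or DynamoDB before the function exits.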
Modify the template.yaml
created earlier by the sam init
command.
template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Yahoo weather crawler template on SAM

Globals:
  Function:
    Timeout: 3

Resources:
  WeatherCrawlerFunction:
    Type: AWS::Serverless::Function # More info about Function Resource: https://github.com/awslabs/serverless-application-model/blob/master/versions/2016-10-31.md#awsserverlessfunction
    Properties:
      CodeUri: ./yahoo_weather_crawl/spiders
      Handler: weather_spider.lambda_handler
      Runtime: python3.7
      Events:
        WeatherCrawlEvent:
          Type: Schedule
          Properties:
            # Run every 6 hours, every day
            Schedule: cron(0 */6 * * ? *)
Here, the `Events` section defines a cron schedule that runs every 6 hours, every day. The function is triggered from CloudWatch Events.
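As a small illustration (not part of the project), the hour field of cron(0 */6 * * ? *) can be expanded to the concrete UTC hours at which the rule fires:

```python
# Expand a cron step expression like '*/6' into the explicit values it matches.
def expand_step(field, upper):
    if field.startswith('*/'):
        step = int(field[2:])
        return list(range(0, upper, step))
    return [int(field)]

print(expand_step('*/6', 24))  # -> [0, 6, 12, 18]: four runs per day, in UTC
```

Note that Lambda schedule expressions are evaluated in UTC, so the crawl times in JST are shifted by +9 hours.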
Put the modules needed for deployment into a folder named build.
Before that, one caveat: this project imports scrapy, and among scrapy's dependencies is a library called lxml.
If you run pip install scrapy, lxml is installed automatically, but when I uploaded it to AWS Lambda with the Python 3.7 runtime, the module could not be loaded as-is, because lxml contains compiled C extensions that must match the Lambda execution environment.
(I struggled here for a long time...)
So this time, the secret sauce created by this article (an lxml library compiled on EC2; see the article for details) is saved in a folder named lib, and the build shell script copies it into the build folder.
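The underlying problem is that compiled wheels built on macOS cannot load on Lambda's Amazon Linux runtime. A minimal sketch (my own illustration, not from the article) of checking a wheel's platform tag from its filename:

```python
# Decide from a wheel's filename whether its platform tag can load on the
# Lambda Python runtime (Amazon Linux, x86_64). Pure-Python wheels ('any')
# always work; compiled wheels such as lxml must be manylinux builds.
def is_lambda_compatible(wheel_name):
    stem = wheel_name.rsplit('.whl', 1)[0]
    platform_tag = stem.split('-')[-1]
    return platform_tag == 'any' or 'manylinux' in platform_tag

print(is_lambda_compatible('lxml-4.3.2-cp37-cp37m-manylinux1_x86_64.whl'))   # True
print(is_lambda_compatible('lxml-4.3.2-cp37-cp37m-macosx_10_9_x86_64.whl'))  # False
```

Compiling on EC2 (as the article does) or in an Amazon Linux container produces a manylinux-compatible build.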
build.sh
# build
dir=yahoo_weather_crawl
echo 'Create a virtual environment'
python3 -m venv .venv
echo 'Enable virtual environment'
. .venv/bin/activate
rm -rf ${dir}/build
#Create build folder
echo "Building ${dir}"
mkdir ${dir}/build
#pip install in build folder
echo 'pip install from requirements.txt'
pip3 install -r ${dir}/requirements.txt -t ${dir}/build
#Copy from lib folder to build folder
echo 'Copy the required modules from the lib folder to the build folder'
cp -rf ./lib/* ${dir}/build
#Copy py file
echo 'Copy the py file to the build folder'
cp -f ${dir}/*.py ${dir}/build
cp -f ${dir}/spiders/*.py ${dir}/build
echo 'Disable the virtual environment'
deactivate
echo 'Build completed'
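The copy steps above could equivalently be done with Python's shutil; a minimal sketch of the same build-folder assembly (paths assumed to match the script above):

```python
import shutil
from pathlib import Path

def assemble_build(project='yahoo_weather_crawl'):
    """Recreate <project>/build and copy libs and .py sources into it."""
    build = Path(project) / 'build'
    shutil.rmtree(build, ignore_errors=True)
    build.mkdir(parents=True)
    # Precompiled libraries (the lib folder from the lxml article)
    lib = Path('lib')
    if lib.exists():
        for item in lib.iterdir():
            if item.is_dir():
                shutil.copytree(item, build / item.name)
            else:
                shutil.copy(item, build)
    # The project's .py files, including the spiders
    for py in list(Path(project).glob('*.py')) + list((Path(project) / 'spiders').glob('*.py')):
        shutil.copy(py, build)
    return build
```

Either way, the point is that everything Lambda imports at runtime must sit flat inside the folder that CodeUri points at.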
Create a shell for deploy so that you can deploy from the command.
deploy.sh
# build
echo 'Build YahooWeatherCrawler'
sh build.sh
#Creating an S3 bucket to upload a template
#The bucket name must be unique around the world, so change the bucket name if you want to copy it.
if aws s3 ls "s3://weather-crawl-bucket" 2>&1 | grep -q 'NoSuchBucket' ; then
echo "Creating the weather-crawl-bucket."
aws s3 mb s3://weather-crawl-bucket
else
echo "Emptying the weather-crawl-bucket."
aws s3 rm s3://weather-crawl-bucket --recursive
fi
#Creating a package for deployment
#Upload the created package to S3. Please specify the created bucket name.
echo "Create a package for deployment."
aws cloudformation package --template-file template.yaml \
--output-template-file output-template.yaml \
--s3-bucket weather-crawl-bucket
#Deploy
aws cloudformation deploy --template-file output-template.yaml \
--stack-name weather-crawler \
--capabilities CAPABILITY_IAM
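If you rename the bucket, the new name still has to satisfy S3's naming rules. A minimal validator sketch (my own illustration, covering a subset of the rules: 3-63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit):

```python
import re

# Subset of S3 bucket naming rules: 3-63 chars of [a-z0-9-],
# starting and ending with a letter or digit.
BUCKET_NAME = re.compile(r'^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$')

def is_valid_bucket_name(name):
    return bool(BUCKET_NAME.match(name))

print(is_valid_bucket_name('weather-crawl-bucket'))  # True
print(is_valid_bucket_name('Weather_Crawl'))         # False
```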
sam-app $ sh deploy.sh
..
Successfully created/updated stack - weather-crawler
Go to the AWS console and run!
This time I got an ImportError for a different module... Building natively on a Mac seems tricky, so I'd like to try another approach.
It's been over a month since I decided to post an article on Qiita every week, and I ended up writing only three articles this year. (Once you get stuck on something, it's hard to get out!)
Next year I'll keep at it, so thank you for your continued support.