To state the conclusion first: I could not get this to work on Lambda. There are other approaches, so if one of them works I will add it here or post it as a separate article.
This time we will put the weather_spider.py created last time (1)
onto AWS Lambda so that it can run serverlessly.
It's been a while since the last post; the reason is explained at the end.
Use Lambda to get Yahoo! Weather (Tokyo) data every 6 hours.
This time we will build the Lambda function using the Serverless Application Model (SAM).
See here for details on SAM.
The following assumes that the aws command and sam command can be executed.
Since we will implement it in Python 3.7 this time, specify python3.7 as the runtime for sam init.
$ sam init --runtime python3.7
[+] Initializing project structure...
Project generated: ./sam-app
Steps you can take next within the project folder
===================================================
[*] Invoke Function: sam local invoke HelloWorldFunction --event event.json
[*] Start API Gateway locally: sam local start-api
Read sam-app/README.md for further instructions
[*] Project initialization is now complete
A SAM project is created in an instant, like this.
The folder structure is as follows.
sam-app
├── README.md
├── events
│ └── event.json
├── hello_world
│ ├── __init__.py
│ ├── __pycache__
│ │ ├── __init__.cpython-37.pyc
│ │ └── app.cpython-37.pyc
│ ├── app.py
│ └── requirements.txt
├── template.yaml
└── tests
└── unit
├── __init__.py
├── __pycache__
│ ├── __init__.cpython-37.pyc
│ └── test_handler.cpython-37.pyc
└── test_handler.py
Copy the yahoo_weather_crawl project created last time directly under sam-app.
$ cp -r yahoo_weather_crawl sam-app/
$ cd sam-app/
$ ls
README.md hello_world tests
events template.yaml yahoo_weather_crawl
Add a handler so that the spider can be invoked from Lambda.
spider/weather_spider.py
# -*- coding: utf-8 -*-
import scrapy
from yahoo_weather_crawl.items import YahooWeatherCrawlItem
from scrapy.crawler import CrawlerProcess

# spider
class YahooWeatherSpider(scrapy.Spider):
    name = "yahoo_weather_crawler"
    allowed_domains = ['weather.yahoo.co.jp']
    start_urls = ["https://weather.yahoo.co.jp/weather/jp/13/4410.html"]

    # Extraction process for the response
    def parse(self, response):
        # Announcement date and time
        yield YahooWeatherCrawlItem(announcement_date=response.xpath('//*[@id="week"]/p/text()').extract_first())
        table = response.xpath('//*[@id="yjw_week"]/table')

        # Loop over the dates
        for day in range(2, 7):
            yield YahooWeatherCrawlItem(
                # Data extraction
                date=table.xpath('//tr[1]/td[%d]/small/text()' % day).extract_first(),
                weather=table.xpath('//tr[2]/td[%d]/small/text()' % day).extract_first(),
                temperature=table.xpath('//tr[3]/td[%d]/small/font/text()' % day).extract(),
                rainy_percent=table.xpath('//tr[4]/td[%d]/small/text()' % day).extract_first(),
            )

# lambda handler
def lambda_handler(event, context):
    process = CrawlerProcess({
        'FEED_FORMAT': 'json',
        'FEED_URI': '/tmp/result.json'
    })
    process.crawl(YahooWeatherSpider)
    process.start()
    print('crawl success')
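The handler writes the scraped items to /tmp, the only writable path on a Lambda function. As a minimal sketch (assuming the FEED_URI setting above), reading the feed back after process.start() returns could look like this:

```python
import json

def load_crawl_result(path='/tmp/result.json'):
    """Read the JSON feed that Scrapy's feed exporter wrote to FEED_URI."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```

From here you could, for example, push the result to S3 or DynamoDB before the function exits.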
Modify the template.yaml
created earlier by the sam init
command.
template.yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Yahoo weather crawler template on SAM

Globals:
  Function:
    Timeout: 3

Resources:
  WeatherCrawlerFunction:
    Type: AWS::Serverless::Function # More info about Function Resource: https://github.com/awslabs/serverless-application-model/blob/master/versions/2016-10-31.md#awsserverlessfunction
    Properties:
      CodeUri: ./yahoo_weather_crawl/spiders
      Handler: weather_spider.lambda_handler
      Runtime: python3.7
      Events:
        WeatherCrawlEvent:
          Type: Schedule
          Properties:
            # Run every 6 hours, every day
            Schedule: cron(0 */6 * * ? *)
Here, the `Events` section defines a cron schedule that runs every 6 hours, every day. The function is triggered from CloudWatch Events.
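As a small illustration (not part of the project), the hour field of cron(0 */6 * * ? *) can be expanded to the concrete UTC hours at which the rule fires:

```python
# Expand a cron step expression like '*/6' into the explicit values it matches.
def expand_step(field, upper):
    if field.startswith('*/'):
        step = int(field[2:])
        return list(range(0, upper, step))
    return [int(field)]

print(expand_step('*/6', 24))  # -> [0, 6, 12, 18]: four runs per day, in UTC
```

Note that Lambda schedule expressions are evaluated in UTC, so the crawl times in JST are shifted by +9 hours.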
Put the modules needed for deployment into a folder named build.
Before that, one caveat: this project imports scrapy, and among scrapy's dependencies is a library called lxml.
If you run pip install scrapy, lxml is installed automatically, but when I uploaded it to AWS Lambda with the Python 3.7 runtime, the module could not be loaded as-is, because lxml contains compiled C extensions that must match the Lambda execution environment.
(I struggled here for a long time...)
So this time, the secret sauce created by this article (an lxml library compiled on EC2; see the article for details) is saved in a folder named lib, and the build shell script copies it into the build folder.
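The underlying problem is that compiled wheels built on macOS cannot load on Lambda's Amazon Linux runtime. A minimal sketch (my own illustration, not from the article) of checking a wheel's platform tag from its filename:

```python
# Decide from a wheel's filename whether its platform tag can load on the
# Lambda Python runtime (Amazon Linux, x86_64). Pure-Python wheels ('any')
# always work; compiled wheels such as lxml must be manylinux builds.
def is_lambda_compatible(wheel_name):
    stem = wheel_name.rsplit('.whl', 1)[0]
    platform_tag = stem.split('-')[-1]
    return platform_tag == 'any' or 'manylinux' in platform_tag

print(is_lambda_compatible('lxml-4.3.2-cp37-cp37m-manylinux1_x86_64.whl'))   # True
print(is_lambda_compatible('lxml-4.3.2-cp37-cp37m-macosx_10_9_x86_64.whl'))  # False
```

Compiling on EC2 (as the article does) or in an Amazon Linux container produces a manylinux-compatible build.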
build.sh
# build
dir=yahoo_weather_crawl
echo 'Create a virtual environment'
python3 -m venv .venv
echo 'Enable virtual environment'
. .venv/bin/activate
rm -rf ${dir}/build
#Create build folder
echo "Building ${dir}"
mkdir ${dir}/build
#pip install in build folder
echo 'pip install from requirements.txt'
pip3 install -r ${dir}/requirements.txt -t ${dir}/build
#Copy from lib folder to build folder
echo 'Copy the required modules from the lib folder to the build folder'
cp -rf ./lib/* ${dir}/build
#Copy py file
echo 'Copy the py file to the build folder'
cp -f ${dir}/*.py ${dir}/build
cp -f ${dir}/spiders/*.py ${dir}/build
echo 'Disable the virtual environment'
deactivate
echo 'Build completed'
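The copy steps above could equivalently be done with Python's shutil; a minimal sketch of the same build-folder assembly (paths assumed to match the script above):

```python
import shutil
from pathlib import Path

def assemble_build(project='yahoo_weather_crawl'):
    """Recreate <project>/build and copy libs and .py sources into it."""
    build = Path(project) / 'build'
    shutil.rmtree(build, ignore_errors=True)
    build.mkdir(parents=True)
    # Precompiled libraries (the lib folder from the lxml article)
    lib = Path('lib')
    if lib.exists():
        for item in lib.iterdir():
            if item.is_dir():
                shutil.copytree(item, build / item.name)
            else:
                shutil.copy(item, build)
    # The project's .py files, including the spiders
    for py in list(Path(project).glob('*.py')) + list((Path(project) / 'spiders').glob('*.py')):
        shutil.copy(py, build)
    return build
```

Either way, the point is that everything Lambda imports at runtime must sit flat inside the folder that CodeUri points at.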
Create a shell for deploy so that you can deploy from the command.
deploy.sh
# build
echo 'Build YahooWeatherCrawler'
sh build.sh
#Creating an S3 bucket to upload a template
#The bucket name must be unique around the world, so change the bucket name if you want to copy it.
if aws s3 ls "s3://weather-crawl-bucket" 2>&1 | grep -q 'NoSuchBucket' ; then
echo "Creating the weather-crawl-bucket."
aws s3 mb s3://weather-crawl-bucket
else
echo "Emptying the weather-crawl-bucket."
aws s3 rm s3://weather-crawl-bucket --recursive
fi
#Creating a package for deployment
#Upload the created package to S3. Please specify the created bucket name.
echo "Create a package for deployment."
aws cloudformation package --template-file template.yaml \
--output-template-file output-template.yaml \
--s3-bucket weather-crawl-bucket
#Deploy
aws cloudformation deploy --template-file output-template.yaml \
--stack-name weather-crawler \
--capabilities CAPABILITY_IAM
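If you rename the bucket, the new name still has to satisfy S3's naming rules. A minimal validator sketch (my own illustration, covering a subset of the rules: 3-63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit):

```python
import re

# Subset of S3 bucket naming rules: 3-63 chars of [a-z0-9-],
# starting and ending with a letter or digit.
BUCKET_NAME = re.compile(r'^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$')

def is_valid_bucket_name(name):
    return bool(BUCKET_NAME.match(name))

print(is_valid_bucket_name('weather-crawl-bucket'))  # True
print(is_valid_bucket_name('Weather_Crawl'))         # False
```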
sam-app $ sh deploy.sh
..
Successfully created/updated stack - weather-crawler
Go to the AWS console and run!
This time I got an ImportError for a different module... Building natively on a Mac seems tricky, so I'd like to try another approach.
It's been over a month since I decided to post an article on Qiita every week, and I ended up writing only three articles this year. (Once you get stuck on something, it's hard to get out!)
Next year I'll keep at it, so thank you for your continued support.