Regular serverless scraping with AWS Lambda + Scrapy, Part 1.8

Introduction

To state the conclusion first: I couldn't get this to work on Lambda. There are other approaches, so if one of them works I'll add it here or post it as a separate article.

This time we'll take the weather_spider.py created last time (Part 1) and put it on AWS Lambda so that it can be executed serverlessly. It's been a while since the previous post; the reason for that comes later...

Target

Use Lambda to fetch Yahoo! Weather (Tokyo) data every six hours.

Method

This time, we will build the Lambda function using the AWS Serverless Application Model (SAM).

See here for details on SAM.

The following assumes that the aws and sam commands can be executed.

Try

1. Create SAM Project (sam init)

Since we will implement it in Python 3.7 this time, specify python3.7 as the runtime and run sam init.

$ sam init --runtime python3.7
[+] Initializing project structure...

Project generated: ./sam-app

Steps you can take next within the project folder
===================================================
[*] Invoke Function: sam local invoke HelloWorldFunction --event event.json
[*] Start API Gateway locally: sam local start-api

Read sam-app/README.md for further instructions

[*] Project initialization is now complete

A SAM project is created in an instant, like this.

The folder structure is as follows.


sam-app
├── README.md
├── events
│   └── event.json
├── hello_world
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-37.pyc
│   │   └── app.cpython-37.pyc
│   ├── app.py
│   └── requirements.txt
├── template.yaml
└── tests
    └── unit
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-37.pyc
        │   └── test_handler.cpython-37.pyc
        └── test_handler.py

Copy the yahoo_weather_crawl project created last time directly under sam-app:

$ cp -r yahoo_weather_crawl sam-app/

$ cd sam-app/
$ ls
README.md		hello_world		tests
events			template.yaml		yahoo_weather_crawl

2. Modify weather_spider.py

Add a handler so the spider can be invoked from Lambda.

spiders/weather_spider.py



# -*- coding: utf-8 -*-
import scrapy
from yahoo_weather_crawl.items import YahooWeatherCrawlItem
from scrapy.crawler import CrawlerProcess

# spider
class YahooWeatherSpider(scrapy.Spider):

    name = "yahoo_weather_crawler"
    allowed_domains = ['weather.yahoo.co.jp']
    start_urls = ["https://weather.yahoo.co.jp/weather/jp/13/4410.html"]

    # Extraction process for the response
    def parse(self, response):
        # Announcement date and time
        yield YahooWeatherCrawlItem(announcement_date = response.xpath('//*[@id="week"]/p/text()').extract_first())
        table = response.xpath('//*[@id="yjw_week"]/table')

        # Loop over the dates
        for day in range(2, 7):

            yield YahooWeatherCrawlItem(
                # Data extraction
                date=table.xpath('//tr[1]/td[%d]/small/text()' % day).extract_first(),
                weather=table.xpath('//tr[2]/td[%d]/small/text()' % day).extract_first(),
                temperature=table.xpath('//tr[3]/td[%d]/small/font/text()' % day).extract(),
                rainy_percent=table.xpath('//tr[4]/td[%d]/small/text()' % day).extract_first(),
                )

# lambda handler (module-level, so it matches Handler: weather_spider.lambda_handler)
def lambda_handler(event, context):
    process = CrawlerProcess({
        'FEED_FORMAT': 'json',
        'FEED_URI': '/tmp/result.json'  # /tmp is the only writable path on Lambda
    })

    # Pass the spider class defined above (the original referenced an
    # undefined YahooWeatherCrawler). Note that Twisted's reactor cannot be
    # restarted, so a warm re-invocation of this handler may fail.
    process.crawl(YahooWeatherSpider)
    process.start()
    print('crawl success')
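
Note that FEED_URI points at /tmp because that is the only path Lambda allows writes to. Before deploying, the handler can be smoke-tested locally. Here is a minimal sketch; local_test.py is a hypothetical helper, and it assumes Scrapy and the project modules are importable from the current directory:

local_test.py

# Hypothetical local smoke test: run the handler once and print the feed.
import json

from weather_spider import lambda_handler

lambda_handler({}, None)             # event/context are unused by the handler
with open('/tmp/result.json') as f:  # the FEED_URI set in the handler
    print(json.load(f))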

3. Modify template.yaml

Modify the template.yaml that was generated earlier by the sam init command.

template.yaml


AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Yahoo weather crawler template on SAM

Globals:
  Function:
    Timeout: 3

Resources:
  WeatherCrawlerFunction:
    Type: AWS::Serverless::Function # More info about Function Resource: https://github.com/awslabs/serverless-application-model/blob/master/versions/2016-10-31.md#awsserverlessfunction
    Properties:
      CodeUri: ./yahoo_weather_crawl/spiders
      Handler: weather_spider.lambda_handler
      Runtime: python3.7
      Events:
        WeatherCrawlEvent:
          Type: Schedule
          Properties:
            #Run every 6 hours daily
            Schedule: cron(0 */6 * * ? *)

Here, `Events` is configured with a cron expression that runs every six hours, every day, so the function is triggered from CloudWatch Events. (Note that the global Timeout of 3 seconds is quite short for a crawl; you will likely need to raise it.)
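
After deployment, one way to confirm that the schedule took effect is to list the CloudWatch Events rules from code. A minimal boto3 sketch, assuming default AWS credentials and region (check_schedule.py is a hypothetical helper):

check_schedule.py

# Hypothetical helper: print every CloudWatch Events rule and its schedule
# expression, so the cron(0 */6 * * ? *) rule created by SAM can be spotted.
import boto3

events = boto3.client('events')
for rule in events.list_rules()['Rules']:
    print(rule['Name'], rule.get('ScheduleExpression', '-'))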

4. Create a build shell script (collect the modules to be deployed)

Put the modules that need to be deployed to AWS into a folder named build.

Before that, though: this time we run Python with Scrapy imported, and among Scrapy's dependencies there is a library called lxml.

If you pip install scrapy, lxml is installed automatically, but when uploaded to AWS Lambda on the Python 3.7 runtime, the module cannot be loaded as-is. (I struggled here for a long time...)
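
If you want to check whether the lxml you bundle actually loads on the Python 3.7 runtime, you can first deploy a minimal diagnostic handler. This is a sketch (lxml_check.py is hypothetical and not part of this project):

lxml_check.py

# Hypothetical diagnostic handler: try to load lxml's compiled C extension
# in the Lambda runtime and report the result instead of crashing.
def lambda_handler(event, context):
    try:
        from lxml import etree
        return {'lxml_version': '.'.join(map(str, etree.LXML_VERSION))}
    except ImportError as e:
        return {'import_error': str(e)}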

So, this time, the secret sauce produced by this article (an lxml library compiled on EC2; see that article for details) is saved in a folder named lib, and the build script copies it into the build folder.

build.sh


# build

dir=yahoo_weather_crawl

echo 'Create a virtual environment'
python3 -m venv .venv

echo 'Activate the virtual environment'
. .venv/bin/activate

rm -rf ${dir}/build

# Create the build folder
echo "Create the ${dir}/build folder"
mkdir ${dir}/build

# pip install into the build folder
echo 'pip install from requirements.txt'
pip3 install -r ${dir}/requirements.txt -t ${dir}/build

# Copy from the lib folder to the build folder
echo 'Copy the required modules from the lib folder to the build folder'
cp -rf ./lib/* ${dir}/build

# Copy the py files
echo 'Copy the py files to the build folder'
cp -f ${dir}/*.py ${dir}/build
cp -f ${dir}/spiders/*.py ${dir}/build

echo 'Deactivate the virtual environment'
deactivate

echo 'Build completed'
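
After running sh build.sh, it is worth sanity-checking that the compiled lxml actually landed in the build folder. A small sketch (check_build.py is a hypothetical helper, run from the sam-app directory):

check_build.py

# Hypothetical helper: list everything lxml-related in the build folder.
# The compiled .so extension modules must be present for Lambda to import lxml.
import pathlib

build = pathlib.Path('yahoo_weather_crawl/build')
for path in sorted(build.glob('lxml*')):
    print(path.name)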

5. Create a shell for sam-deploy

Create a shell for deploy so that you can deploy from the command.

deploy.sh



# build
echo 'Build YahooWeatherCrawler'
sh build.sh

# Create an S3 bucket to upload the template to
# The bucket name must be globally unique, so change it if you reuse this script.
if aws s3 ls "s3://weather-crawl-bucket" 2>&1 | grep -q 'NoSuchBucket' ; then
    echo "Create the weather-crawl-bucket."
    aws s3 mb s3://weather-crawl-bucket
else
    echo "Empty the weather-crawl-bucket."
    aws s3 rm s3://weather-crawl-bucket --recursive
fi

# Create the deployment package
# Upload the created package to S3, specifying the bucket created above.
echo "Create a package for deployment."
aws cloudformation package --template-file template.yaml \
--output-template-file output-template.yaml \
--s3-bucket weather-crawl-bucket

# Deploy
aws cloudformation deploy --template-file output-template.yaml \
--stack-name weather-crawler \
--capabilities CAPABILITY_IAM

6. Deploy

sam-app $ sh deploy.sh
..
Successfully created/updated stack - weather-crawler
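
To confirm the stack from code rather than the console, a minimal boto3 sketch (check_stack.py is a hypothetical helper, assuming default AWS credentials):

check_stack.py

# Hypothetical helper: print the status of the weather-crawler stack
# created by deploy.sh.
import boto3

cf = boto3.client('cloudformation')
stack = cf.describe_stacks(StackName='weather-crawler')['Stacks'][0]
print(stack['StackName'], stack['StackStatus'])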

7. Run

Go to the AWS console and run it! (Screenshot of the Lambda console omitted.)

This time, an ImportError for yet another module... Building locally on a Mac seems tricky, so I'd like to consider another approach.

At the end

It's been over a month since I decided to post an article to Qiita every week, and in the end I wrote only three articles this year. (Once you get stuck on something, you can't get out!)

I'll keep at it next year as well, so thank you for reading.
