AWS-Perform web scraping regularly with Lambda + Python + Cron

Introduction

Although this article uses a Mac environment, the procedure is the same on Windows. Adjust the environment-dependent parts as you follow along.

Purpose

After reading this article to the end, you will be able to:

| No. | Overview | Keyword |
| --- | --- | --- |
| 1 | Coding | Python |
| 2 | Web scraping | Selenium, chromedriver, headless-chromium |
| 3 | Lambda settings | Lambda |

Execution environment

| Environment | Ver. |
| --- | --- |
| macOS Catalina | 10.15.3 |
| Python | 3.7.3 |
| selenium | 3.141.0 |

Source code

Your understanding will deepen if you read while following along with the actual implementation and source code. Please make use of it.

GitHub

Related articles

Features of AWS-Lambda

AWS Lambda is a pay-as-you-go service; be mindful of the cost.

- Features
- Price

Overall flow

  1. Write Python code
  2. Create a zip for uploading to Lambda
  3. Create a Lambda function
  4. Upload the zip to your Lambda function
  5. (Supplement) Upload using Layers
  6. Set environment variables for Lambda functions
  7. (Supplement) Set environment variables when using Layers
  8. Set up Cron to run on a regular basis

1. Write Python code

coding

app/lambda_function.py


"""app/lambda_function.py
"""
import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def lambda_handler(event, context):
    """lambda_handler
    """
    print('event: {}'.format(event))
    print('context: {}'.format(context))

    headless_chromium = os.getenv('HEADLESS_CHROMIUM', '')
    chromedriver = os.getenv('CHROMEDRIVER', '')

    options = Options()
    options.binary_location = headless_chromium
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--single-process')
    options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome(executable_path=chromedriver, options=options)
    driver.get('https://info.finance.yahoo.co.jp/fx/')
    usd_jpy = driver.find_element(By.ID, 'USDJPY_top_bid').text
    driver.close()
    driver.quit()

    return {
        'status_code': 200,
        'usd_jpy': usd_jpy
    }


if __name__ == '__main__':
    print(lambda_handler(event=None, context=None))

To run on Lambda, you need to set `Options()` as shown above.

2. Create a zip for uploading to Lambda

Script creation

- You need to change the version/path of chromedriver and headless-chromium to suit your environment.
- Operation confirmed as of May 2020.

make_upload.sh


rm upload.zip
rm -r upload/
rm -r download/

mkdir -p download/bin
curl -L https://chromedriver.storage.googleapis.com/2.41/chromedriver_linux64.zip -o download/chromedriver.zip
curl -L https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-55/stable-headless-chromium-amazonlinux-2017-03.zip -o download/headless-chromium.zip
unzip download/chromedriver.zip -d download/bin
unzip download/headless-chromium.zip -d download/bin

mkdir upload
cp -r download/bin upload/bin
cp app/lambda_function.py upload/
pip install -r app/requirements.txt -t upload/
cd upload/
zip -r ../upload.zip --exclude=__pycache__/* .
cd ../

rm -r upload/
rm -r download/

Create upload.zip

command_line.sh


sh make_upload.sh
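If the `zip` and `unzip` commands are unavailable (for example on Windows), the archive step of `make_upload.sh` can be reproduced with Python's standard `zipfile` module. A minimal sketch, assuming the `upload/` directory has already been populated as above; `make_upload_zip` is a helper name introduced here, not part of the original script:

```python
import os
import zipfile


def make_upload_zip(src_dir: str, zip_path: str) -> list:
    """Zip the contents of src_dir into zip_path, skipping __pycache__."""
    entries = []
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            # Skip __pycache__, matching `zip -r --exclude=__pycache__/*`
            dirs[:] = [d for d in dirs if d != '__pycache__']
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to src_dir, as `cd upload/ && zip -r` does
                arcname = os.path.relpath(full, src_dir)
                zf.write(full, arcname)
                entries.append(arcname)
    return entries
```

Running `make_upload_zip('upload', 'upload.zip')` mirrors the `cd upload/ && zip -r ../upload.zip --exclude=__pycache__/* .` step of the script.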

3. Create a Lambda function

Prerequisites

- AWS account created
- Lambda function role created

Lambda function creation

  1. Log in to AWS
  2. Open Lambda from the service
  3. Select Function from the submenu and click Create Function.
  4. Select `Create from scratch`, enter the function name, runtime, and execution role, and click `Create function`
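The console steps above can also be scripted with boto3's `create_function` API. A sketch under assumptions: the function name `scraper`, the timeout, and the memory size are illustrative choices rather than values from this article, and the client is passed in so the call can be exercised without AWS credentials:

```python
def create_scraper_function(lambda_client, role_arn, zip_bytes):
    """Create the function from scratch, mirroring the console steps above."""
    return lambda_client.create_function(
        FunctionName='scraper',                    # hypothetical name
        Runtime='python3.7',                       # runtime used in this article
        Role=role_arn,
        Handler='lambda_function.lambda_handler',  # module.function inside upload.zip
        Code={'ZipFile': zip_bytes},
        Timeout=60,                                # assumption: headless Chromium starts slowly
        MemorySize=512,                            # assumption
    )


# Usage (hypothetical names):
#   import boto3
#   with open('upload.zip', 'rb') as f:
#       create_scraper_function(boto3.client('lambda'), 'arn:aws:iam::...:role/...', f.read())
```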

4. Upload the zip to your Lambda function

I uploaded via S3 because the zip exceeded 10 MB.

upload

  1. Display the function code section of your Lambda function
  2. Select `Upload .zip file`
  3. Select `upload.zip` from `Upload` and click `Save`

Upload when zip size exceeds 10MB

  1. Upload `upload.zip` to S3
  2. Display the function code section of your Lambda function
  3. Select `Upload a file from Amazon S3`
  4. Enter the `Amazon S3 link URL` and click `Save`
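The S3-based upload can likewise be scripted: `update_function_code` with `S3Bucket`/`S3Key` is the boto3 call behind the console's `Upload a file from Amazon S3` option. A sketch with placeholder names, with the client injected so it can be tested without AWS credentials:

```python
def deploy_from_s3(lambda_client, function_name, bucket, key):
    """Point a Lambda function's code at a zip already uploaded to S3."""
    return lambda_client.update_function_code(
        FunctionName=function_name,
        S3Bucket=bucket,
        S3Key=key,
    )


# Usage (hypothetical names):
#   import boto3
#   deploy_from_s3(boto3.client('lambda'), 'scraper', 'my-bucket', 'upload.zip')
```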

5. (Supplement) Upload using Layers

Upload flow

  1. Separate `bin` from `upload.zip` and create a zip containing `bin` alone
  2. Register `bin` in Layers
  3. Add the layer to your Lambda function
  4. Select `Upload .zip file` to upload `upload.zip`

5-1. Separate `bin` from `upload.zip` and create a zip containing `bin` alone

bin.sh


bin.zip
├── chromedriver
└── headless-chromium
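The `bin.sh` script body is not reproduced above, only the resulting archive layout. One way to perform the split in Python, assuming `upload.zip` stores the binaries under a `bin/` prefix (the prefix must be kept so the layer exposes the files at `/opt/bin/`); `split_bin` is a helper name introduced here:

```python
import zipfile


def split_bin(upload_zip: str, bin_zip: str) -> list:
    """Copy only the bin/ entries of upload_zip into a standalone bin.zip."""
    copied = []
    with zipfile.ZipFile(upload_zip) as src, \
         zipfile.ZipFile(bin_zip, 'w', zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if name.startswith('bin/'):
                # Keep the bin/ prefix so the layer mounts files at /opt/bin/
                dst.writestr(name, src.read(name))
                copied.append(name)
    return copied
```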

5-2. Register bin in Layers

  1. Select `Layers` in the submenu and click `Create layer`
  2. Enter any name
  3. Click `Upload` and select `bin.zip`
  4. Optionally select a compatible runtime and click `Create`

5-3. Add Layers to your Lambda function

  1. Select `Layers` shown in the center of the Designer section of your Lambda function
  2. Click `Add a layer` in the layer panel displayed at the bottom
  3. Select the layer registered from `bin.zip` and click `Add`
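Steps 5-2 and 5-3 can also be automated with boto3's `publish_layer_version` and `update_function_configuration` APIs. A sketch; the layer name `chromium-bin` is a placeholder, and the client is injected for testability:

```python
def attach_bin_layer(lambda_client, function_name, bin_zip_bytes):
    """Publish bin.zip as a layer version and attach it to the function."""
    layer = lambda_client.publish_layer_version(
        LayerName='chromium-bin',          # hypothetical name
        Content={'ZipFile': bin_zip_bytes},
        CompatibleRuntimes=['python3.7'],
    )
    # Attaching replaces the function's layer list with the one given here
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Layers=[layer['LayerVersionArn']],
    )
```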

5-4. Select `Upload .zip file` to upload `upload.zip`

6. Set environment variables for Lambda functions

Environment variable settings

  1. Display the environment variables section of your Lambda function and set:

| Key | Value |
| --- | --- |
| CHROMEDRIVER | /var/task/bin/chromedriver |
| HEADLESS_CHROMIUM | /var/task/bin/headless-chromium |

7. (Supplement) Set environment variables when using Layers

Layers environment variable settings

  1. Display the environment variables section of your Lambda function and set:

| Key | Value |
| --- | --- |
| CHROMEDRIVER | /opt/bin/chromedriver |
| HEADLESS_CHROMIUM | /opt/bin/headless-chromium |
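The only difference from step 6 is the base path: `/var/task` for binaries bundled in the function zip versus `/opt` for Layers. As an illustration (not part of the original article), the handler could be made tolerant of both layouts with a small resolver; `resolve_binary` is a hypothetical helper:

```python
import os


def resolve_binary(env_key: str, filename: str) -> str:
    """Return the configured binary path, falling back to both known mounts."""
    configured = os.getenv(env_key, '')
    if configured:
        return configured
    # /var/task/bin: binaries bundled in the function zip (step 6)
    # /opt/bin: binaries provided through a layer (step 7)
    for base in ('/var/task/bin', '/opt/bin'):
        candidate = os.path.join(base, filename)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(filename)
```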

8. Set up Cron to run on a regular basis

Create Cron

  1. Go to the Designer section of your Lambda function and click `Add trigger`
  2. In the trigger settings, select `CloudWatch Events/EventBridge`
  3. For the rule, select `Create a new rule`
  4. Enter any rule name
  5. For the rule type, select `Schedule expression`
  6. For the schedule expression, enter `cron(0 17 ? * MON-FRI *)` and click `Add`

Example of Cron expression

| Frequency | Expression |
| --- | --- |
| 10:15 AM (UTC) every day | cron(15 10 * * ? *) |
| 6:00 PM every Monday to Friday | cron(0 18 ? * MON-FRI *) |
| 8:00 AM on the first day of every month | cron(0 8 1 * ? *) |
| Every 10 minutes on weekdays | cron(0/10 * ? * MON-FRI *) |
| Every 5 minutes from 8:00 AM to 5:55 PM, Monday to Friday | cron(0/5 8-17 ? * MON-FRI *) |
| 9:00 AM on the first Monday of every month | cron(0 9 ? * 2#1 *) |
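Unlike classic five-field Unix cron, these AWS schedule expressions take six fields (minutes, hours, day-of-month, month, day-of-week, year), and exactly one of the day-of-month/day-of-week fields must be `?`. A shallow sanity checker for the expressions above, purely illustrative:

```python
def is_valid_aws_cron(expression: str) -> bool:
    """Shallow validity check for an AWS cron() schedule expression."""
    if not (expression.startswith('cron(') and expression.endswith(')')):
        return False
    fields = expression[5:-1].split()
    if len(fields) != 6:  # minutes hours day-of-month month day-of-week year
        return False
    day_of_month, day_of_week = fields[2], fields[4]
    # AWS requires '?' in exactly one of the two day fields
    return (day_of_month == '?') != (day_of_week == '?')
```

For example, `is_valid_aws_cron('cron(0 17 ? * MON-FRI *)')` accepts the trigger used above, while a five-field Unix-style expression is rejected.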
