While scraping with Python, I ran into authentication, solved it quickly with Selenium, and then stumbled again when I tried to run it on Lambda, so I'm leaving this as a memo (much of it is rough notes).
Q. What do I need to use Selenium on Lambda? A. headless-chromium and chromedriver. Compress the two together and register them as a layer.
A headless browser that works on AWS Lambda is provided here; matching versions of Chrome and chromedriver are bundled and compressed into a single file: https://github.com/adieuadieu/serverless-chrome
Reference article: https://qiita.com/mishimay/items/afd7f247f101fbe25f30
Files registered in a layer are placed under /opt/xxxx. For example, if you create a chrome directory, place the headless-chromium and chromedriver files under it, and register the zipped result as a layer, the driver definition looks like this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Point Selenium at the headless Chrome binary placed in the layer
options.binary_location = '/opt/chrome/headless-chromium'
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--single-process")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1280,1024")
options.add_argument("--disable-application-cache")
options.add_argument("--disable-infobars")
options.add_argument("--hide-scrollbars")
options.add_argument("--enable-logging")
options.add_argument("--log-level=0")
options.add_argument("--ignore-certificate-errors")

# chromedriver also comes from the layer
driver = webdriver.Chrome(
    options=options, executable_path='/opt/chrome/chromedriver')
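For reference, here is a minimal sketch of how this setup might be wrapped into a Lambda handler. The create_driver() helper and the returned shape are my own assumptions for illustration, not from the original setup:

# create_driver() is a hypothetical helper assumed to contain the option
# setup shown above and return the webdriver.Chrome instance.
def lambda_handler(event, context):
    driver = create_driver()
    try:
        driver.get('https://example.com')
        title = driver.title  # any scraping logic goes here
    finally:
        driver.quit()  # always shut down the browser process
    return {'title': title}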
Q. Selenium doesn't work on Lambda. A. On the Python 3.8 runtime it fails, because that runtime is based on Amazon Linux 2 and some shared libraries are missing.
At run time you get a status code 127 error (a quick search suggests missing shared libraries). There may be workarounds, but the quickest fix is to run in a Python 3.7 environment.
Implementing everything with Selenium is slow.
With 512 MB of memory, parsing a table spanning 70 pages and about 2,100 items (30 items per page) and registering them in DynamoDB took 12 minutes.
Locating elements through Selenium was very costly.
So, for example, after loading the page with Selenium, hand the HTML to BeautifulSoup and do the parsing there.
Example)
from bs4 import BeautifulSoup

driver.get('https://example.com')
# Parse the fetched page with BeautifulSoup instead of Selenium lookups
html = BeautifulSoup(driver.page_source, 'html.parser')
table = html.select_one('table')
rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    # todo: process the cells here
driver.quit()
As a result, processing time under the same conditions improved from 12 minutes to 2 minutes.
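For the DynamoDB registration step mentioned above, a minimal sketch using boto3's batch_writer might look like the following; the table name and item attributes are assumptions for illustration:

import boto3

# Hypothetical table name; replace with your own.
table = boto3.resource('dynamodb').Table('scraped-items')

# batch_writer buffers put_item calls and sends them in batches,
# which is far cheaper than one request per row.
with table.batch_writer() as batch:
    for row in rows:
        cells = row.find_all('td')
        batch.put_item(Item={
            'id': cells[0].get_text(strip=True),    # assumed key attribute
            'name': cells[1].get_text(strip=True),  # assumed attribute
        })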
When Lambda executes Python, libraries under /opt/python/ are loaded. So if you dump your dependencies with pip install -r ./requirements.txt -t . etc. (arranged so they end up inside a python/ directory in the zip), and register the zipped result as a layer, you only need to upload the source file to be executed, and you can even edit the source from the Lambda web console.
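You can check this from inside a function; a tiny sketch (the exact entries vary by runtime version):

import sys

# On Lambda, layer contents under /opt/python are added to sys.path,
# so libraries placed there can be imported as usual.
print([p for p in sys.path if p.startswith('/opt')])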
I didn't think Selenium could run on Lambda and was considering a cron job on WSL on my local PC, but it turned out I didn't need to.
Layers are awesome :laughing: Next, I want to be able to deploy all of this with SAM, using template.yml and a layer yml file :sweat: