While scraping with Python, I ran into authentication, solved it quickly with Selenium, and then stumbled again when I tried to run it on Lambda, so I'm leaving this as a memo (much of it is rough notes).
Q. What do I need to use Selenium on Lambda? A. headless-chromium and chromedriver. Compress the two together and register them as a layer.
A headless browser that works on AWS Lambda is provided here; matching versions of Chrome and chromedriver are bundled and compressed into a single file: https://github.com/adieuadieu/serverless-chrome
Reference article: https://qiita.com/mishimay/items/afd7f247f101fbe25f30
Files registered in a layer are placed under /opt/xxxx. For example, if you create a chrome directory, place the headless-chromium and chromedriver files under it, and register the zipped result as a layer, the driver definition looks like this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Point Selenium at the headless Chrome binary placed in the layer
options.binary_location = '/opt/chrome/headless-chromium'
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--single-process")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1280,1024")
options.add_argument("--disable-application-cache")
options.add_argument("--disable-infobars")
options.add_argument("--hide-scrollbars")
options.add_argument("--enable-logging")
options.add_argument("--log-level=0")
options.add_argument("--ignore-certificate-errors")

# chromedriver also comes from the layer
driver = webdriver.Chrome(
    options=options, executable_path='/opt/chrome/chromedriver')
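For reference, here is a minimal sketch of how this setup might be wrapped into a Lambda handler. The create_driver() helper and the returned shape are my own assumptions for illustration, not from the original setup:

# create_driver() is a hypothetical helper assumed to contain the option
# setup shown above and return the webdriver.Chrome instance.
def lambda_handler(event, context):
    driver = create_driver()
    try:
        driver.get('https://example.com')
        title = driver.title  # any scraping logic goes here
    finally:
        driver.quit()  # always shut down the browser process
    return {'title': title}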
Q. Selenium doesn't work on Lambda. A. On the Python 3.8 runtime it fails, because that runtime is based on Amazon Linux 2 and some shared libraries are missing.
At run time you get a status code 127 error (a quick search suggests missing shared libraries). There may be workarounds, but the quickest fix is to run in a Python 3.7 environment.
Implementing everything with Selenium is slow.
With 512 MB of memory, parsing a table spanning 70 pages and about 2,100 items (30 items per page) and registering them in DynamoDB took 12 minutes.
Locating elements through Selenium was very costly.
So, for example, after loading the page with Selenium, hand the HTML to BeautifulSoup and do the parsing there.
Example)
from bs4 import BeautifulSoup

driver.get('https://example.com')
# Parse the fetched page with BeautifulSoup instead of Selenium lookups
html = BeautifulSoup(driver.page_source, 'html.parser')
table = html.select_one('table')
rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    # todo: process the cells here
driver.quit()
As a result, processing time under the same conditions improved from 12 minutes to 2 minutes.
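For the DynamoDB registration step mentioned above, a minimal sketch using boto3's batch_writer might look like the following; the table name and item attributes are assumptions for illustration:

import boto3

# Hypothetical table name; replace with your own.
table = boto3.resource('dynamodb').Table('scraped-items')

# batch_writer buffers put_item calls and sends them in batches,
# which is far cheaper than one request per row.
with table.batch_writer() as batch:
    for row in rows:
        cells = row.find_all('td')
        batch.put_item(Item={
            'id': cells[0].get_text(strip=True),    # assumed key attribute
            'name': cells[1].get_text(strip=True),  # assumed attribute
        })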
When Lambda executes Python, libraries under /opt/python/ are loaded. So if you dump your dependencies with pip install -r ./requirements.txt -t . etc. (arranged so they end up inside a python/ directory in the zip), and register the zipped result as a layer, you only need to upload the source file to be executed, and you can even edit the source from the Lambda web console.
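You can check this from inside a function; a tiny sketch (the exact entries vary by runtime version):

import sys

# On Lambda, layer contents under /opt/python are added to sys.path,
# so libraries placed there can be imported as usual.
print([p for p in sys.path if p.startswith('/opt')])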
I didn't think Selenium could run on Lambda and was considering a cron job on WSL on my local PC, but it turned out I didn't need to.
Layers are awesome :laughing: Next, I want to be able to deploy all of this with SAM, using template.yml and a layer yml file :sweat: