Prepare a scraping environment with the following stack so that JavaScript-heavy sites can be handled.
Required libraries

For local execution:
pip install python-lambda-local
For deployment:
pip install lambda-uploader
If you want to try it quickly, please visit the repository below. https://github.com/akichim21/python_scraping_in_lambda
The script renders the HTML with JavaScript executed via Selenium (driver: PhantomJS), extracts the page title, and finally shuts PhantomJS down with close() and quit().
lambda_function.py
#!/usr/bin/env python
import time # for sleep
import os # for path
import signal
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
def lambda_handler(event, context):
    # Set a desktop Chrome user agent and make sure JavaScript is enabled.
    user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36")
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = user_agent
    dcap["phantomjs.page.settings.javascriptEnabled"] = True

    # Start PhantomJS from the binary bundled with the deployment package.
    browser = webdriver.PhantomJS(
        service_log_path=os.path.devnull,
        executable_path="./phantomjs",
        service_args=['--ignore-ssl-errors=true', '--load-images=no', '--ssl-protocol=any'],
        desired_capabilities=dcap
    )

    # Load the page, read the title after JS has run, then shut PhantomJS down.
    browser.get('http://google.com')
    title = browser.title
    browser.close()
    browser.quit()
    return title
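If a phantomjs process ever survives quit() (the unused signal import hints at this concern), a common workaround is to signal the child process explicitly before quitting. A minimal sketch, assuming the Selenium PhantomJS driver exposes its service process as browser.service.process:

import signal

def shutdown(browser):
    # Terminate the PhantomJS child process explicitly, then quit the driver.
    browser.service.process.send_signal(signal.SIGTERM)
    browser.quit()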
It is not used this time, but this is where you put arguments that the handler reads as event["key1"] and so on.
event.json
{
  "key3": "value3",
  "key2": "value2",
  "key1": "value1"
}
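As an illustration only (not part of the script above), the handler would read these values like this:

def lambda_handler(event, context):
    # event is the parsed contents of event.json.
    key1 = event["key1"]      # "value1"
    key2 = event.get("key2")  # "value2", or None if the key is absent
    return key1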
Create & replace role. This time it is not 50M or more, so I will not use s3, but if I use other libraries, it will soon be 50MB or more, so s3 is required. If you set s3_bucket, the file will be uploaded via s3.
Since name is often used at runtime, use an appropriate name in production. Set memory, timeout, etc. appropriately according to the script.
lambda.json
{
  "name": "python_scraping_test",
  "description": "python_scraping_test",
  "region": "ap-northeast-1",
  "runtime": "python2.7",
  "handler": "lambda_function.lambda_handler",
  "role": "arn:aws:iam::00000000:role/lambda_basic_execution",
  "timeout": 60,
  "memory": 128,
  "variables": {
    "production": "True"
  },
  "ignore": [
    "\\.git.*",
    "/.*\\.pyc$",
    "/.*\\.zip$"
  ]
}
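Assuming the entries under variables are deployed as Lambda environment variables (how lambda-uploader normally treats them; treat this as an assumption here), the handler can read them with os.environ. A minimal sketch:

import os

# Assumes "production" from lambda.json ends up as an environment variable.
IS_PRODUCTION = os.environ.get("production", "False") == "True"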
Only selenium is installed with pip; phantomjs is saved directly as a binary in the project directory.
requirements.txt
selenium
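As a quick sanity check before packaging (my own addition, not from the original steps), you can confirm the phantomjs binary is present and executable from the project root:

import os
import stat

# Verify the bundled PhantomJS binary exists and has the execute bit set.
path = "./phantomjs"
assert os.path.isfile(path), "phantomjs binary not found next to lambda_function.py"
if not os.access(path, os.X_OK):
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)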
Run it locally with python-lambda-local; -f is the handler function name and -t is the timeout in seconds.
python-lambda-local -f lambda_handler -l ./ -t 60 lambda_function.py event.json
result
[root - INFO - 2017-03-19 08:16:05,271] Event: {u'test': u'test'}
[root - INFO - 2017-03-19 08:16:05,271] START RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6
[root - INFO - 2017-03-19 08:16:06,766] END RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6
[root - INFO - 2017-03-19 08:16:06,766] RESULT:
Google
[root - INFO - 2017-03-19 08:16:06,766] REPORT RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6 Duration: 1494.06 ms
lambda-uploader uses ~/.aws/credentials, so configure them if they are not set yet.
Install the AWS CLI if you don't have it, then run aws configure:
pip install awscli
aws configure
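To confirm the credentials are picked up, a quick check (my own sketch, assuming boto3 is installed) is to ask STS who you are:

import boto3

# Prints the AWS account ID that the configured credentials belong to.
print(boto3.client("sts").get_caller_identity()["Account"])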
If your virtualenv (e.g. a conda environment) lives in a non-standard location, note where it is; you can find the Location with:
pip show virtualenv
lambda-uploader
## If the virtualenv is in a non-standard location, pass the Location directory as an argument (with vagrant, anaconda3, and a py2 env, it looks like this):
lambda-uploader --virtualenv=/home/vagrant/.pyenv/versions/anaconda3-4.1.0/envs/py2/lib/python2.7/site-packages
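Once deployed, one way to confirm the function works (a sketch on my part, assuming the name and region from lambda.json and that boto3 is installed) is to invoke it from Python:

import json
import boto3

# Invoke the deployed function synchronously and print the returned title.
client = boto3.client("lambda", region_name="ap-northeast-1")
resp = client.invoke(
    FunctionName="python_scraping_test",
    InvocationType="RequestResponse",
    Payload=json.dumps({"key1": "value1"}),
)
print(resp["Payload"].read())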