Prepare a scraping environment with the following stack so that JavaScript-heavy sites can be handled.
Required libraries

For local execution:
pip install python-lambda-local
For deployment:
pip install lambda-uploader
If you want to try it quickly, please visit the repository below. https://github.com/akichim21/python_scraping_in_lambda
The script renders the HTML with JavaScript executed via Selenium (driver: PhantomJS), extracts the page title, and finally shuts PhantomJS down with close() and quit().
lambda_function.py
#!/usr/bin/env python
import time # for sleep
import os # for path
import signal
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
def lambda_handler(event, context):
    # Set a desktop Chrome user agent and make sure JavaScript is enabled.
    user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36")
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = user_agent
    dcap["phantomjs.page.settings.javascriptEnabled"] = True

    # Start PhantomJS from the binary bundled with the deployment package.
    browser = webdriver.PhantomJS(
        service_log_path=os.path.devnull,
        executable_path="./phantomjs",
        service_args=['--ignore-ssl-errors=true', '--load-images=no', '--ssl-protocol=any'],
        desired_capabilities=dcap
    )

    # Load the page, read the title after JS has run, then shut PhantomJS down.
    browser.get('http://google.com')
    title = browser.title
    browser.close()
    browser.quit()
    return title
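If a phantomjs process ever survives quit() (the unused signal import hints at this concern), a common workaround is to signal the child process explicitly before quitting. A minimal sketch, assuming the Selenium PhantomJS driver exposes its service process as browser.service.process:

import signal

def shutdown(browser):
    # Terminate the PhantomJS child process explicitly, then quit the driver.
    browser.service.process.send_signal(signal.SIGTERM)
    browser.quit()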
It is not used this time, but this is where you put arguments that the handler reads as event["key1"] and so on.
event.json
{
  "key3": "value3",
  "key2": "value2",
  "key1": "value1"
}
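As an illustration only (not part of the script above), the handler would read these values like this:

def lambda_handler(event, context):
    # event is the parsed contents of event.json.
    key1 = event["key1"]      # "value1"
    key2 = event.get("key2")  # "value2", or None if the key is absent
    return key1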
Create & replace role. This time it is not 50M or more, so I will not use s3, but if I use other libraries, it will soon be 50MB or more, so s3 is required. If you set s3_bucket, the file will be uploaded via s3.
Since name is often used at runtime, use an appropriate name in production. Set memory, timeout, etc. appropriately according to the script.
lambda.json
{
  "name": "python_scraping_test",
  "description": "python_scraping_test",
  "region": "ap-northeast-1",
  "runtime": "python2.7",
  "handler": "lambda_function.lambda_handler",
  "role": "arn:aws:iam::00000000:role/lambda_basic_execution",
  "timeout": 60,
  "memory": 128,
  "variables": {
    "production": "True"
  },
  "ignore": [
    "\\.git.*",
    "/.*\\.pyc$",
    "/.*\\.zip$"
  ]
}
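Assuming the entries under variables are deployed as Lambda environment variables (how lambda-uploader normally treats them; treat this as an assumption here), the handler can read them with os.environ. A minimal sketch:

import os

# Assumes "production" from lambda.json ends up as an environment variable.
IS_PRODUCTION = os.environ.get("production", "False") == "True"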
Only selenium is installed with pip; phantomjs is saved directly as a binary in the project directory.
requirements.txt
selenium
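As a quick sanity check before packaging (my own addition, not from the original steps), you can confirm the phantomjs binary is present and executable from the project root:

import os
import stat

# Verify the bundled PhantomJS binary exists and has the execute bit set.
path = "./phantomjs"
assert os.path.isfile(path), "phantomjs binary not found next to lambda_function.py"
if not os.access(path, os.X_OK):
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)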
Run it locally with python-lambda-local; -f is the handler function name and -t is the timeout in seconds.
python-lambda-local -f lambda_handler -l ./ -t 60 lambda_function.py event.json
result
[root - INFO - 2017-03-19 08:16:05,271] Event: {u'test': u'test'}
[root - INFO - 2017-03-19 08:16:05,271] START RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6
[root - INFO - 2017-03-19 08:16:06,766] END RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6
[root - INFO - 2017-03-19 08:16:06,766] RESULT:
Google
[root - INFO - 2017-03-19 08:16:06,766] REPORT RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6 Duration: 1494.06 ms
lambda-uploader uses ~/.aws/credentials, so configure them if they are not set yet.
Install the AWS CLI if you don't have it, then run aws configure:
pip install awscli
aws configure
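To confirm the credentials are picked up, a quick check (my own sketch, assuming boto3 is installed) is to ask STS who you are:

import boto3

# Prints the AWS account ID that the configured credentials belong to.
print(boto3.client("sts").get_caller_identity()["Account"])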
If your virtualenv (e.g. a conda environment) lives in a non-standard location, note where it is; you can find the Location with:
pip show virtualenv
lambda-uploader
## If the virtualenv is in a non-standard location, pass the Location directory as an argument (with vagrant, anaconda3, and a py2 env, it looks like this):
lambda-uploader --virtualenv=/home/vagrant/.pyenv/versions/anaconda3-4.1.0/envs/py2/lib/python2.7/site-packages
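Once deployed, one way to confirm the function works (a sketch on my part, assuming the name and region from lambda.json and that boto3 is installed) is to invoke it from Python:

import json
import boto3

# Invoke the deployed function synchronously and print the returned title.
client = boto3.client("lambda", region_name="ap-northeast-1")
resp = client.invoke(
    FunctionName="python_scraping_test",
    InvocationType="RequestResponse",
    Payload=json.dumps({"key1": "value1"}),
)
print(resp["Payload"].read())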