Notes from getting PhantomJS to load a website on AWS Lambda. This article does not cover scraping itself; it only goes as far as loading a page with PhantomJS and printing its HTML. Once that works, I think the rest is straightforward.

final goals

When a file is placed in S3, the Lambda function is executed. The function uses PhantomJS through the Python 2.7 selenium library to print the HTML of google.com.

Once you get this far, you should be able to extract information from whatever site you like.

Let's go through each step.

environment

	version
python	2.7

Write your code in Python 2.7 syntax, since that is the Python version AWS Lambda supports. Reference: https://docs.aws.amazon.com/ja_jp/lambda/latest/dg/current-supported-versions.html

The following libraries are required.

Selenium
PhantomJS

selenium: download it from https://pypi.python.org/pypi/selenium.

Download the tarball from the bottom of that page. The entire selenium folder under the py directory of the archive is used later.

PhantomJS: from http://phantomjs.org/download.html, download the Linux 64-bit archive that contains the phantomjs executable. The phantomjs binary under bin is used later.

You can also obtain the libraries and executables by following https://docs.aws.amazon.com/ja_jp/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html instead of the steps above. To be honest, I had overlooked that this page existed.

Let's put together the collected materials and write the code.

file organization

lambda_function.py
selenium/
phantomjs
setup.cfg.py

Create the above structure in a directory of your choice. The actual processing goes in lambda_function.py. The file can have any name, but remember it, because you will need it when registering the function in the AWS Management Console.
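Before zipping, it can help to check that the directory really has this layout. The following is only a minimal sketch: `check_layout` and `REQUIRED_ENTRIES` are names I made up for illustration, and the executable-bit check is my own assumption about what is worth verifying, since the operating system cannot start a binary without it.

```python
import os

# Entries the article places at the top level of the package directory.
REQUIRED_ENTRIES = ["lambda_function.py", "selenium", "phantomjs"]


def check_layout(package_dir):
    """Return a list of problems found in package_dir (empty if none)."""
    problems = []
    for name in REQUIRED_ENTRIES:
        if not os.path.exists(os.path.join(package_dir, name)):
            problems.append("missing: " + name)
    binary = os.path.join(package_dir, "phantomjs")
    # A binary without the executable bit cannot be launched.
    if os.path.isfile(binary) and not os.access(binary, os.X_OK):
        problems.append("not executable: phantomjs")
    return problems
```

Calling `check_layout(".")` from the package directory returns an empty list when everything is in place.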

lambda_function.py

#!/usr/bin/env python
# Python 2.7 syntax (the Lambda runtime used in this article)

import os   # for os.path.devnull
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


def lambda_handler(event, context):
  # User agent that PhantomJS reports to the site
  user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36")

  # Copy the default capabilities, then override the page settings
  dcap = dict(DesiredCapabilities.PHANTOMJS)
  dcap["phantomjs.page.settings.userAgent"] = user_agent
  dcap["phantomjs.page.settings.javascriptEnabled"] = True

  # Lambda unpacks the zip to /var/task, so the bundled binary lives there
  browser = webdriver.PhantomJS(service_log_path=os.path.devnull, executable_path="/var/task/phantomjs", service_args=['--ignore-ssl-errors=true'], desired_capabilities=dcap)

  try:
    browser.get('http://google.com')
    # Grab the HTML of the loaded page
    html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    print html
  finally:
    # Always stop the PhantomJS process, even if loading the page fails
    browser.quit()

First, import the required modules. PhantomJS lets you specify the User-Agent, so be sure to set it as well.
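The capability setup can be pulled out into a small helper, which also shows why the handler copies the defaults with `dict(...)` before changing them. This is just a sketch: `build_capabilities` is a hypothetical name, and `BASE_CAPS` is a hand-written stand-in for `DesiredCapabilities.PHANTOMJS` (its values are assumed for illustration) so the snippet runs without selenium installed.

```python
def build_capabilities(base_caps, user_agent):
    """Return a copy of base_caps with the PhantomJS page settings applied."""
    caps = dict(base_caps)  # copy, so the shared defaults stay untouched
    caps["phantomjs.page.settings.userAgent"] = user_agent
    caps["phantomjs.page.settings.javascriptEnabled"] = True
    return caps


# Stand-in for DesiredCapabilities.PHANTOMJS (assumed values, illustration only).
BASE_CAPS = {"browserName": "phantomjs", "javascriptEnabled": True}

caps = build_capabilities(BASE_CAPS, "Mozilla/5.0 (example user agent)")
```

In the real handler you would pass `DesiredCapabilities.PHANTOMJS` as `base_caps` and hand the result to `webdriver.PhantomJS` via `desired_capabilities`.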

setup.cfg.py

[install]
prefix=

This file is the setup configuration file described in the AWS documentation, where it is named setup.cfg: https://docs.aws.amazon.com/ja_jp/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html

Fill it in as described there.

Before moving to the management screen ...

The files must be zipped before they can be uploaded. Go to the directory with the file structure above and run:

$ zip -r upload.zip *

This generates upload.zip in that directory.
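If you would rather build the archive from a script, the same result can be sketched with Python's standard zipfile module. `build_package` is a hypothetical helper name. The point it illustrates is that entries are stored relative to the package directory, so lambda_function.py ends up at the root of the archive just as with the zip command above.

```python
import os
import zipfile


def build_package(package_dir, zip_path):
    """Zip everything under package_dir so entries sit at the archive root."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full_path = os.path.join(root, name)
                if os.path.abspath(full_path) == zip_abs:
                    continue  # do not add the archive to itself
                # Store paths relative to package_dir, e.g. "lambda_function.py"
                arcname = os.path.relpath(full_path, package_dir)
                archive.write(full_path, arcname)
    return zip_path
```

For example, `build_package(".", "upload.zip")` run from the package directory produces an upload.zip comparable to the one created by the command.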

Let's register the information on the AWS management screen.

At Select blueprint

Choose Python 2.7 under Select runtime, then select s3-get-object-python as the blueprint.

At Configure triggers

Bucket: select the S3 bucket (let's say test.com)
Event type: Put
Prefix: test/
Suffix: enter whatever you like.

There are several items here, but together they determine which S3 bucket and directory trigger the Lambda function. With the above settings, the function is executed whenever a file is placed in

test.com/test/

This is important: make sure to check Enable trigger.
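When the trigger fires, the handler receives an event describing the uploaded object. The handler in this article ignores it, but if you want to know which file was placed, something along these lines should work. `uploaded_object` is a hypothetical helper, and `sample_event` is a hand-written event that contains only the fields read here, not a full S3 notification.

```python
try:
    from urllib import unquote_plus  # Python 2.7, the runtime used here
except ImportError:
    from urllib.parse import unquote_plus  # Python 3


def uploaded_object(event):
    """Return (bucket, key) for the first record of an S3 event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded (for example, spaces as "+").
    key = unquote_plus(record["object"]["key"])
    return bucket, key


# Hand-written sample containing only the fields read above.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "test.com"},
                "object": {"key": "test/my+file.txt"}}}
    ]
}
```

Inside `lambda_handler` you could call `uploaded_object(event)` at the top and log the result.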

Click Next at the bottom to move to the next screen.

Configure function

Name and Description are what you will see in the management console. If you fill them in carelessly, you will not be able to tell later what the function does, so make them descriptive.

Runtime should already be set to Python 2.7.

Lambda function code

Code entry type offers three options:

Edit code inline
Upload a .ZIP file
Upload a file from Amazon S3

Since we created the zip earlier, select the middle option, Upload a .ZIP file. Under Function package, upload the upload.zip created earlier.

Lambda function handler and role

The Handler value is extremely important. Note that it takes the form main file name.function name. Earlier we created the file lambda_function (lambda_function.py) containing the function lambda_handler, so enter lambda_function.lambda_handler as the Handler.
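One way to picture how the Handler string maps to your code is the sketch below. It is an illustration only and not Lambda's actual loader, which this article does not look into. `resolve_handler` is a made-up name.

```python
import importlib


def resolve_handler(handler):
    """Split "module.function" at the last dot and return the function."""
    module_name, function_name = handler.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, function_name)
```

With lambda_function.py on the import path, `resolve_handler("lambda_function.lambda_handler")` returns the handler function. A typo in either half raises an ImportError or AttributeError, which is why the value has to match the file and function names exactly.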

Then select a Role. If you do not have one yet, create a new one; if you already do, choose it under Existing role.

Advanced settings

Here you can choose Memory (MB) and Timeout. Memory ranges from 128 MB to 1536 MB, and CPU performance appears to improve with the memory setting. (http://qiita.com/hama_du/items/12303d9f9cb800db14d3)

For reference, see https://aws.amazon.com/jp/lambda/pricing/.

Timeout can be set to at most 5 minutes.
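Because page loads can be slow, it may be worth checking how much of the Timeout is left before starting more work. The context object passed to the handler provides `get_remaining_time_in_millis()`. The `has_time_left` helper, the 10-second margin, and the `FakeContext` class for trying it locally are all my own assumptions for this sketch.

```python
def has_time_left(context, margin_ms=10000):
    """Return True if more than margin_ms remains before the timeout."""
    return context.get_remaining_time_in_millis() > margin_ms


class FakeContext(object):
    """Local stand-in for the Lambda context object (illustration only)."""

    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self._remaining_ms
```

In the handler, a check such as `if not has_time_left(context): return` before each `browser.get` would let the function stop cleanly instead of being cut off.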

Finally, let's choose whether to run in the VPC.

Complete!

Once you have registered your Lambda function, click the Test button and press Save and test.

When the function runs, the output of print html appears in the execution result. Since the HTML of google.com is actually printed, this confirms for now that the function works. The function is also executed when you actually place a file in S3.

That's all.

Postscript

I don't often write long articles, so there are probably some mistakes. I plan to add to and correct this little by little.

Edit 1) Edited because there was a discrepancy between the title and the content (7/11)

Using PhantomJS with AWS Lambda until displaying the html of the website