These are my notes on interacting with a website using PhantomJS on AWS Lambda. This article does not do any actual scraping; it only goes as far as loading a page with PhantomJS and printing it.
The Lambda function is executed when a file is placed in S3. The function uses PhantomJS through the selenium library on Python 2.7 to output the HTML of google.com.
Once you get this far, extracting information from your favorite sites should be possible.
Let's go through each step.
| | version |
|---|---|
| python | 2.7 |
Watch your syntax, since AWS Lambda supports Python 2.7. Reference: https://docs.aws.amazon.com/ja_jp/lambda/latest/dg/current-supported-versions.html
The required libraries are listed below.
selenium https://pypi.python.org/pypi/selenium Download it from this page.
Download the tarball from the bottom of that page. The entire selenium folder under the py directory of the extracted archive is used later.
PhantomJS http://phantomjs.org/download.html Download the Linux 64-bit archive, which contains the phantomjs executable, from this page. The phantomjs file under bin is used later.
For libraries and executables other than the above, you can also follow https://docs.aws.amazon.com/ja_jp/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html. In fact, I had overlooked that this page existed.
Create the file structure shown above in a working directory.
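As a rough way to check the layout, here is a minimal sketch. It assumes the directory holds `lambda_function.py`, `setup.cfg`, the `phantomjs` executable and the `selenium` folder at the top level, which is my reading of the steps in this article rather than an official list. The name `missing_entries` is a hypothetical helper.

```python
import os

# Entries this article places at the top level of the package directory.
# This list is an assumption pieced together from the steps in the article.
EXPECTED_ENTRIES = ["lambda_function.py", "setup.cfg", "phantomjs", "selenium"]


def missing_entries(package_dir, expected=EXPECTED_ENTRIES):
    """Return the expected entries that are not present in package_dir."""
    present = set(os.listdir(package_dir))
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Check the current directory and report anything that is missing.
    missing = missing_entries(".")
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all expected entries are present")
```

Running it from inside the working directory before zipping should catch a forgotten file early.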
The actual processing is written in lambda_function.py. The file name can be anything, but remember it, because you will need it when registering the function in the AWS Management Console.
lambda_function.py
```python
#!/usr/bin/env python
import time  # for sleep
import os  # for path
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


def lambda_handler(event, context):
    # set user agent
    user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36")
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = user_agent
    dcap["phantomjs.page.settings.javascriptEnabled"] = True
    # start PhantomJS from the package root (/var/task on Lambda)
    browser = webdriver.PhantomJS(
        service_log_path=os.path.devnull,
        executable_path="/var/task/phantomjs",
        service_args=['--ignore-ssl-errors=true'],
        desired_capabilities=dcap)
    # load the page and read back the rendered HTML
    browser.get('http://google.com')
    html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    print html  # Python 2.7 print statement
    browser.quit()  # stop the PhantomJS process
```
First, import the required modules. PhantomJS lets you specify a User-Agent, so be sure to set one as well.
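The handler hard-codes `/var/task/phantomjs`. As an alternative, the path could be built from the `LAMBDA_TASK_ROOT` environment variable, which Lambda sets to the directory where the package is unpacked. This is only a sketch: `phantomjs_path` is a hypothetical helper, and the fallback to the current working directory is my assumption for local runs.

```python
import os


def phantomjs_path(filename="phantomjs"):
    """Build the path to the bundled PhantomJS executable.

    On Lambda, LAMBDA_TASK_ROOT points at the unpacked package (/var/task).
    Outside Lambda the variable is unset, so fall back to the current
    working directory (an assumption for local testing).
    """
    root = os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())
    return os.path.join(root, filename)
```

The result could then be passed as `executable_path=phantomjs_path()` when creating `webdriver.PhantomJS`.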
setup.cfg
[install]
prefix=
This file is the setup configuration file described at https://docs.aws.amazon.com/ja_jp/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html
Create it as described there.
The files must be zipped before they can be uploaded. Move to the directory that contains the file structure above and run the following command:
$ zip -r upload.zip *
This generates `upload.zip` in that directory.
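If the `zip` command is not available, the same archive can be approximated with Python's standard `zipfile` module. This is a minimal sketch, not the method used in this article; `build_zip` is a hypothetical helper. Unlike `zip -r upload.zip *`, it also includes hidden files at the top level.

```python
import os
import zipfile


def build_zip(source_dir, zip_path):
    """Zip everything under source_dir so entries sit at the archive root."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_abs, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(folder, name)
                if os.path.abspath(full_path) == zip_abs:
                    continue  # do not add the archive to itself
                # Store paths relative to source_dir, e.g. "selenium/__init__.py"
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_abs
```

Calling `build_zip(".", "upload.zip")` from the working directory would then play the role of the command above. The important point in either case is that `lambda_function.py` ends up at the root of the archive, not inside a subfolder.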
On the blueprint selection screen, choose Python 2.7 as the runtime and select s3-get-object-python as the blueprint.
There are various items here, but this is where you decide the S3 bucket and directory that trigger the Lambda function. With the settings above, the function is triggered when a file is placed in
test.com/test/
This is important: make sure to check Enable trigger.
Click Next at the bottom to move to the next screen.
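When the S3 trigger fires, the handler receives the bucket name and object key in its `event` argument. The handler in this article ignores `event`, but a sketch of reading it might look like the following. `uploaded_objects` is a hypothetical helper, and the sample event is hand-written and trimmed to the fields used here; real events carry many more fields, and keys arrive URL-encoded.

```python
def uploaded_objects(event):
    """Return (bucket, key) pairs for every record in an S3 event."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        pairs.append((s3_info["bucket"]["name"], s3_info["object"]["key"]))
    return pairs


# Hand-written sample event, trimmed to the fields used above.
# The key "test/sample.txt" is made up for illustration.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "test.com"}, "object": {"key": "test/sample.txt"}}}
    ]
}
print(uploaded_objects(sample_event))
```

This could be used, for example, to pick the URL to fetch based on the uploaded file.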
Configure function
Name and Description are what you will see on the management screen. If you choose them carelessly, you will not be able to tell later what the function does, so give it a meaningful name.
Runtime should already be set to Python 2.7.
Lambda function code
There are three choices for Code entry type. Since we created a zip file earlier, select Upload a .ZIP file, the middle option. For Function package, upload the `upload.zip` created earlier.
Lambda function handler and role
The Handler field is extremely important. Its value takes the form file name.function name, where the file name is the main file without its extension. Earlier we created
File name: lambda_function
Function name: lambda_handler
so enter lambda_function.lambda_handler in the Handler field.
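To illustrate how a Handler string maps to code, the sketch below splits the value at the last dot, imports the module and looks up the function. This is only an illustration of the naming rule, not Lambda's actual loader, and `resolve_handler` is a hypothetical name.

```python
import importlib


def resolve_handler(handler):
    """Split "module.function" at the last dot and look the function up."""
    module_name, function_name = handler.rsplit(".", 1)
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


# With this article's files in the current directory, this would return
# the lambda_handler function defined in lambda_function.py:
# resolve_handler("lambda_function.lambda_handler")

# Demonstration with a standard-library module instead:
dumps = resolve_handler("json.dumps")
print(dumps({"ok": True}))
```

If the file or function name in the Handler field does not match what is in the zip, this kind of lookup fails, which is why the field matters.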
Then select a Role. If none exists, create a new one; if one already exists, choose it from Existing role.
Advanced settings
Here you can set Memory (MB) and Timeout. Memory ranges from 128 MB to 1536 MB, and CPU performance apparently scales with the memory setting. (http://qiita.com/hama_du/items/12303d9f9cb800db14d3)
For reference, see the pricing table. (source: https://aws.amazon.com/jp/lambda/pricing/)
The Timeout can be set to at most 5 minutes.
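Because page loads can be slow, it may help to check how much time is left before the Timeout. The Lambda `context` object provides `get_remaining_time_in_millis()` for this. The sketch below is an assumption about how one might use it: `has_time_left`, the 5000 ms margin and `FakeContext` (a stand-in for local runs) are all made up for illustration.

```python
def has_time_left(context, margin_ms=5000):
    """True while more than margin_ms remain before Lambda's timeout."""
    return context.get_remaining_time_in_millis() > margin_ms


class FakeContext(object):
    """Stand-in for the Lambda context object, for local runs only."""

    def __init__(self, remaining_ms):
        self.remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self.remaining_ms


print(has_time_left(FakeContext(60000)))  # plenty of time
print(has_time_left(FakeContext(1000)))   # about to time out
```

Inside `lambda_handler`, a check like `if not has_time_left(context): browser.quit()` before each additional page load would let the function stop cleanly before it is killed.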
Finally, choose whether to run the function in a VPC.
Once the Lambda function is registered, click the Test button and then press Save and test.
When the function runs, you can see that the contents of print html are output, as shown above.
If you click as in the image above, Google's HTML is actually printed, which confirms for now that the function runs. The function is also executed when you actually put a file in S3.
That's all.
I don't often write long articles, and there are probably some mistakes, so I will add to and correct this little by little.
Edit 1) Edited because there was a discrepancy between the title and the content (7/11)