There seem to be quite a few people who want to run a Python script on AWS on a regular schedule. This can be done by setting up an EC2 instance and running the script with cron, but here I will explain how to do it using AWS Data Pipeline.
Note, however, that Data Pipeline has a limitation: the execution interval can only be set to 15 minutes or longer, so it cannot run a script every minute.
It is also possible to periodically execute a Lambda Function from Data Pipeline. If your script is written in Node.js or Java, that is probably the easier way.
The items to set up are as follows. It is assumed that the Python script itself is already written.
Create an S3 bucket to hold the Python script. Of course, an existing bucket can be used instead. Go to AWS Console → S3 and create a bucket as follows.
Select `Create Bucket` and give it an appropriate name (here it is assumed you created a bucket called `datapipeline-python-test`).
Next, upload the Python script to the S3 bucket as follows.
Select `datapipeline-python-test` → `Upload` and upload the Python script.
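If you prefer the command line, the same bucket creation and upload can also be done with the AWS CLI. This is only a sketch, assuming the CLI is already configured with credentials and a default region, and that the script is in the current directory:
# Create the bucket and upload the script (equivalent to the console steps above)
aws s3 mb s3://datapipeline-python-test
aws s3 cp datapipeline_test.py s3://datapipeline-python-test/datapipeline_test.py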
Here, it is assumed that the following script, `datapipeline_test.py`, which simply prints the current time, has been uploaded.
datapipeline_test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import datetime
print 'Script run at ' + datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
Go to AWS Console → Data Pipeline and create a Data Pipeline as follows.
Give it an appropriate name (here it is assumed you created a pipeline called `Test Pipeline`).
Select `Run AWS CLI command` in `Build using a template`.
Select `15 minutes` for `Run every` (select `once on pipeline activation` if you want to run it only once).
For the S3 location for logs, specify the `datapipeline-python-test` bucket created above.
Enter the following as the `AWS CLI command`:
sudo yum -y install python-devel gcc && sudo update-alternatives --set python /usr/bin/python2.7 && curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py" && sudo python ./get-pip.py && pip install boto3 --user && aws s3 cp s3://datapipeline-python-test/datapipeline_test.py ./datapipeline_test.py && cat datapipeline_test.py && python ./datapipeline_test.py
Select `Edit Architect` with these settings to create the Data Pipeline. When it is created, two IAM Roles, `DataPipelineDefaultResourceRole` and `DataPipelineDefaultRole`, are created automatically.
Immediately after creation these IAM Roles lack some privileges, so grant access to S3 to both `DataPipelineDefaultResourceRole` and `DataPipelineDefaultRole`.
Go to AWS Console → Identity & Access Management → Roles and grant the permissions as follows.
Select `DataPipelineDefaultResourceRole`, choose `Attach Policy`, and attach a policy that grants access to S3.
Set the same permissions for `DataPipelineDefaultRole`.
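For reference, the same thing can be done from the AWS CLI; a minimal sketch, assuming you are fine with the AWS-managed `AmazonS3FullAccess` policy (a narrower S3 policy also works):
# Attach an S3 access policy to both Data Pipeline roles
aws iam attach-role-policy --role-name DataPipelineDefaultResourceRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name DataPipelineDefaultRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess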
Go to AWS Console → Data Pipeline and activate the Data Pipeline you just created.
Select `Test Pipeline` → `Activate`.
Periodic execution of the Data Pipeline is now active. It runs every 15 minutes, so wait a while.
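Activation can also be triggered from the AWS CLI once you know the pipeline ID; a sketch with a placeholder ID:
# Look up the pipeline ID, then activate the pipeline (df-0123456789EXAMPLE is a placeholder)
aws datapipeline list-pipelines
aws datapipeline activate-pipeline --pipeline-id df-0123456789EXAMPLE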
Go to AWS Console → Data Pipeline, select `Test Pipeline`, open the `Attempts` tab of `CliActivity`, select `Stdout`, and confirm that the current time has been printed by the Python script.
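The status of each run can also be checked from the AWS CLI (same placeholder pipeline ID as above); the stdout itself is easiest to read in the console:
# List recent runs and their status
aws datapipeline list-runs --pipeline-id df-0123456789EXAMPLE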
The shell script specified as the `AWS CLI command` above does not do much, but let me supplement its contents.
sudo yum -y install python-devel gcc
Installs additional packages on the OS (assuming some Python libraries need gcc etc. to build). This can be removed if the standard Amazon Linux packages are sufficient.
sudo update-alternatives --set python /usr/bin/python2.7
Pins the Python version to 2.7. This avoids errors that some libraries raise with the default Python version.
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py" && sudo python ./get-pip.py
Installs pip. This can be removed if the standard Python library is sufficient.
pip install boto3 --user
Installs additional Python libraries with pip. The `--user` argument is needed because of permissions. If you want to install multiple libraries, list them like `pip install requests boto3 numpy --user`.
aws s3 cp s3://datapipeline-python-test/datapipeline_test.py ./datapipeline_test.py
Copies the Python script from S3 to the local disk.
cat datapipeline_test.py
Displays the contents of the file downloaded from S3; this can be removed if you don't need it.
python ./datapipeline_test.py
Finally, runs the Python script.
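For reference, here is the same sequence spaced out with comments. This is only to make it easier to read; what you actually paste into the `AWS CLI command` field is the one-liner above, chained with `&&`:
# Install build tools that some Python libraries need to compile
sudo yum -y install python-devel gcc
# Pin the default python to 2.7
sudo update-alternatives --set python /usr/bin/python2.7
# Install pip
curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
sudo python ./get-pip.py
# Install additional Python libraries into the user site-packages
pip install boto3 --user
# Fetch the script from S3, show it, and run it
aws s3 cp s3://datapipeline-python-test/datapipeline_test.py ./datapipeline_test.py
cat datapipeline_test.py
python ./datapipeline_test.py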
It is also possible to send an alarm email via AWS SNS when the Python script fails. I will omit the explanation of AWS SNS itself, but here is a brief supplement on the Data Pipeline settings.
Select `Test Pipeline` and select `Edit Pipeline`. Open `On Fail` under `Activities` in the right pane and select `Create new: Action`; a `DefaultAction1` is created and added to `Activities`.
Select `DefaultAction1` under `Others` in the right pane and set `Type` to `SnsAlarm`, then set `Topic Arn` to the ARN of the SNS topic to notify. `Message` is the body of the alarm email, and `Subject` is its subject. `Role` is `DataPipelineDefaultRole` by default, but it is OK to select any role that has the `AmazonSNSFullAccess` permission. SNS can be fired either when the script fails or when it succeeds. Don't forget to give the role permission to publish to SNS.
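Before relying on the alarm, it may help to confirm that the SNS topic actually delivers mail, for example with a test publish from the AWS CLI (the topic ARN below is a placeholder):
# Publish a test message to the topic set in Topic Arn above
aws sns publish --topic-arn arn:aws:sns:ap-northeast-1:123456789012:datapipeline-test --subject "Data Pipeline test" --message "Test message from the AWS CLI"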
Once Python scripts can be run periodically with Data Pipeline, there is no need to provision and manage hosts just for scheduled jobs or to guarantee their execution yourself, and many things become easier.