Running a Docker image (provisionally) with ShellCommandActivity on AWS Data Pipeline

Background and purpose

Our team has long used AWS Data Pipeline (https://aws.amazon.com/jp/datapipeline/) to manage ETL programs in the following ways:

However, after several years of operation, the following problems have emerged.

It is difficult to migrate the entire system at once, though, so we are proceeding with the following policy.

I decided to stick with Data Pipeline for now, because the goal was to move the project forward without significantly changing operations such as the batch re-execution procedure. The result is therefore not a particularly clean design, and a more appropriate service (AWS Batch or MWAA, within AWS) may eventually be needed, but in the meantime I researched and implemented a way to run a container image on Data Pipeline.

I also found the article "Try running docker container periodically using Data Pipeline", but introducing an ECS cluster is difficult for the time being, so my method instead has the EC2 instance launched by Data Pipeline run Docker itself.

Implementation method

I implemented it with the following approach.

Implementation of ShellCommandActivity

I implemented it with the shell script below. Some notes follow.

I feel it could be written a bit more cleanly, but for the time being I'm going with this.
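
As a prerequisite, the values the script reads must be registered in the Parameter Store beforehand. A minimal sketch of registering the BigQuery credential file, reusing the document's placeholder key name (the local file path is illustrative):

aws ssm put-parameter --name "{Parameter store key name}" --type SecureString \
  --value file:///path/to/bigquery.json --region ap-northeast-1

The ShellCommandActivity script itself is as follows.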

set -eu

account="{AWS account ID}"
region="{Region name}"
repository="{ECR repository name}"
tag="latest"
batch="{Docker arguments}"

sudo yum -y update

# Read environment variables and a configuration file from the Parameter Store
IMAGE_AWS_ACCESS_KEY_ID=$(aws ssm get-parameter --name "{Parameter store key name}" \
  --with-decryption --region "${region}" --output text --query Parameter.Value)
IMAGE_AWS_SECRET_ACCESS_KEY=$(aws ssm get-parameter --name "{Parameter store key name}" \
  --with-decryption --region "${region}" --output text --query Parameter.Value)
aws ssm get-parameter --name "{Parameter store key name}" --with-decryption \
  --region "${region}" --output text --query Parameter.Value > /tmp/bigquery.json

# The `aws ecr get-login-password` approach requires AWS CLI version 2, so install it.
# Files left over from a previous Data Pipeline re-execution can cause errors, so delete them first.
rm -rf ./aws
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --update

# Install and start Docker
sudo amazon-linux-extras install -y docker
sudo service docker start

# Log in to ECR and pull the image
aws ecr get-login-password --region "${region}" | \
  sudo docker login --username AWS --password-stdin "https://${account}.dkr.ecr.${region}.amazonaws.com"
sudo docker pull "${account}.dkr.ecr.${region}.amazonaws.com/${repository}:${tag}"

# Run the container, passing in the environment variables and the configuration file
sudo docker run --env "AWS_ACCESS_KEY_ID=${IMAGE_AWS_ACCESS_KEY_ID}" \
  --env "AWS_SECRET_ACCESS_KEY=${IMAGE_AWS_SECRET_ACCESS_KEY}" \
  --env "GOOGLE_APPLICATION_CREDENTIALS=/credential/bigquery.json" \
  -v "/tmp/bigquery.json:/credential/bigquery.json:ro" \
  "${account}.dkr.ecr.${region}.amazonaws.com/${repository}:${tag}" "${batch}"

How to set the Resource Role

Another configuration point is the Resource Role assigned to the EC2 instance. Since the actual processing runs inside the Docker image, the IAM user that the image uses should hold the appropriate permissions, but the instance itself still needs permission to access ECR and the Parameter Store.

Policy for retrieving data from the Parameter Store


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "ssm:DescribeParameters",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "ssm:GetParameters",
            "Resource": "arn:aws:ssm:ap-northeast-1:{Account ID}:parameter/{Appropriate parameter store hierarchy}/*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": "ssm:GetParameter",
            "Resource": "arn:aws:ssm:ap-northeast-1:{Account ID}:parameter/{Appropriate parameter store hierarchy}/*"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:kms:ap-northeast-1:{Account ID}:key/{Decryption key}"
            ]
        }
    ]
}
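
The Resource Role also needs to be able to pull the image from ECR. A minimal sketch of such a policy, with an illustrative repository ARN:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EcrAuth",
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*"
        },
        {
            "Sid": "EcrPull",
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage"
            ],
            "Resource": "arn:aws:ecr:ap-northeast-1:{Account ID}:repository/{ECR repository name}"
        }
    ]
}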

Summary

With these settings, we can continue the container migration without changing how batches are managed in Data Pipeline. There is still a long way to go, but I'll keep at it.
