Our team has long used AWS Data Pipeline (https://aws.amazon.com/jp/datapipeline/) to manage ETL programs in the following ways:
However, after several years of operation, the following problems have emerged.
That said, it is difficult to migrate the entire system at once, so we are proceeding with the following policy.
I decided to stay on Data Pipeline for now, because the goal is to move the project forward without significantly changing operations such as how batches are re-executed. As a result this is not a particularly clean setup, and a more appropriate service (within AWS, something like AWS Batch or MWAA) may eventually be needed, but for the time being I researched and implemented a way to run a container image on Data Pipeline.
I also found the article "Try running a Docker container periodically using Data Pipeline", but introducing an ECS cluster is difficult for now, so the method here is to have the EC2 instance launched by Data Pipeline run Docker itself.
I implemented it as follows.
The AMI used was ami-00f045aed21a55240 (64-bit x86). I implemented it with the following shell script. A few notes:

- Environment variables and the configuration file are read from the Parameter Store with aws ssm get-parameter; refer to this article for how to do that. For AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, you need to create an IAM user with the appropriate privileges in advance and register its keys in the Parameter Store (a registration sketch follows these notes).
- Thanks to set -eu, processing stops at the line where an error occurs. For details, refer to "Set -eu when writing a shell script".
- I feel I could write this a little more cleanly, but for the time being I'm going with it.
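For reference, the values the script reads would need to be registered in the Parameter Store beforehand, roughly like this. The parameter names and paths below are placeholders for illustration, not the ones we actually use; a customer managed KMS key can also be specified with --key-id.

# Register the IAM user's keys as SecureString parameters (names are examples)
aws ssm put-parameter --name "/etl/AWS_ACCESS_KEY_ID" --value "{Access key ID}" \
--type SecureString --region ap-northeast-1
aws ssm put-parameter --name "/etl/AWS_SECRET_ACCESS_KEY" --value "{Secret access key}" \
--type SecureString --region ap-northeast-1
# Register the BigQuery service account JSON from a local file
aws ssm put-parameter --name "/etl/bigquery_json" --value file://bigquery.json \
--type SecureString --region ap-northeast-1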
set -eu
account="{AWS account ID}"
region="{Region name}"
repository="{ECR repository name}"
tag="latest"
batch="{Docker arguments}"
sudo yum -y update
# Read environment variables and configuration files from the Parameter Store
IMAGE_AWS_ACCESS_KEY_ID=$(aws ssm get-parameter --name "{Parameter store key for AWS_ACCESS_KEY_ID}" \
--with-decryption --region "${region}" --output text --query Parameter.Value)
IMAGE_AWS_SECRET_ACCESS_KEY=$(aws ssm get-parameter --name "{Parameter store key for AWS_SECRET_ACCESS_KEY}" \
--with-decryption --region "${region}" --output text --query Parameter.Value)
aws ssm get-parameter --name "{Parameter store key for the BigQuery credential}" --with-decryption \
--region "${region}" --output text --query Parameter.Value > /tmp/bigquery.json
# The login method using `aws ecr get-login-password` requires AWS CLI version 2, so install it
# Leftover files remain when the Data Pipeline is re-executed, so without this delete command an error can occur
rm -rf ./aws
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --update
# Install and start Docker
sudo amazon-linux-extras install -y docker
sudo service docker start
# Log in to ECR and pull the image
aws ecr get-login-password --region "${region}" | \
sudo docker login --username AWS --password-stdin "https://${account}.dkr.ecr.${region}.amazonaws.com"
sudo docker pull "${account}.dkr.ecr.${region}.amazonaws.com/${repository}:${tag}"
# Run the container, passing in the environment variables and the configuration file
sudo docker run --env "AWS_ACCESS_KEY_ID=${IMAGE_AWS_ACCESS_KEY_ID}" \
--env "AWS_SECRET_ACCESS_KEY=${IMAGE_AWS_SECRET_ACCESS_KEY}" \
--env "GOOGLE_APPLICATION_CREDENTIALS=/credential/bigquery.json" \
-v "/tmp/bigquery.json:/credential/bigquery.json:ro" \
"${account}.dkr.ecr.${region}.amazonaws.com/${repository}:${tag}" "${batch}"
Another point to be careful about in the settings is the Resource Role assigned to the EC2 instance. Since the actual processing runs inside the Docker image, the IAM user it uses should be granted the appropriate permissions, but the Resource Role itself still needs permission to access ECR and the Parameter Store.
Policy for retrieving data from the parameter store
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "ssm:DescribeParameters",
      "Resource": "*"
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": "ssm:GetParameters",
      "Resource": "arn:aws:ssm:ap-northeast-1:{Account ID}:parameter/{Appropriate parameter store hierarchy}/*"
    },
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": "ssm:GetParameter",
      "Resource": "arn:aws:ssm:ap-northeast-1:{Account ID}:parameter/{Appropriate parameter store hierarchy}/*"
    },
    {
      "Sid": "VisualEditor3",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": [
        "arn:aws:kms:ap-northeast-1:{Account ID}:key/{Decryption key}"
      ]
    }
  ]
}
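The policy above only covers the Parameter Store and KMS; as noted, the Resource Role also needs read access to ECR in order to pull the image. A minimal sketch of such a policy is below (the managed policy AmazonEC2ContainerRegistryReadOnly grants equivalent permissions).

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "arn:aws:ecr:ap-northeast-1:{Account ID}:repository/{ECR repository name}"
    }
  ]
}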
With these settings, we can proceed with container migration while keeping batch management in Data Pipeline as before. There is still a long way to go, but I'll keep at it.