Our team has long used AWS Data Pipeline (https://aws.amazon.com/jp/datapipeline/) to manage ETL programs in the following ways:
However, after several years of operation, the following problems have emerged.
That said, it is difficult to migrate the entire system at once, so we are proceeding with the following policy.
I decided to stay on Data Pipeline for now, because the goal is to move the project forward without significantly changing operations such as how batches are re-executed. As a result this is not a particularly clean setup, and a more appropriate service (within AWS, something like AWS Batch or MWAA) may eventually be needed, but for the time being I researched and implemented a way to run a container image on Data Pipeline.
I also found the article "Try running a Docker container periodically using Data Pipeline", but introducing an ECS cluster is difficult for now, so the method here is to have the EC2 instance launched by Data Pipeline run Docker itself.
I implemented it as follows.
The AMI used was ami-00f045aed21a55240 (64-bit x86). I implemented it with the following shell script. A few notes:

- Environment variables and the configuration file are read from the Parameter Store with aws ssm get-parameter; refer to this article for how to do that. For AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, you need to create an IAM user with the appropriate privileges in advance and register its keys in the Parameter Store (a registration sketch follows these notes).
- Thanks to set -eu, processing stops at the line where an error occurs. For details, refer to "Set -eu when writing a shell script".
- I feel I could write this a little more cleanly, but for the time being I'm going with it.
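For reference, the values the script reads would need to be registered in the Parameter Store beforehand, roughly like this. The parameter names and paths below are placeholders for illustration, not the ones we actually use; a customer managed KMS key can also be specified with --key-id.

# Register the IAM user's keys as SecureString parameters (names are examples)
aws ssm put-parameter --name "/etl/AWS_ACCESS_KEY_ID" --value "{Access key ID}" \
--type SecureString --region ap-northeast-1
aws ssm put-parameter --name "/etl/AWS_SECRET_ACCESS_KEY" --value "{Secret access key}" \
--type SecureString --region ap-northeast-1
# Register the BigQuery service account JSON from a local file
aws ssm put-parameter --name "/etl/bigquery_json" --value file://bigquery.json \
--type SecureString --region ap-northeast-1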
set -eu
account="{AWS account ID}"
region="{Region name}"
repository="{ECR repository name}"
tag="latest"
batch="{Docker arguments}"
sudo yum -y update
# Read environment variables and configuration files from the Parameter Store
IMAGE_AWS_ACCESS_KEY_ID=$(aws ssm get-parameter --name "{Parameter store key for AWS_ACCESS_KEY_ID}" \
--with-decryption --region "${region}" --output text --query Parameter.Value)
IMAGE_AWS_SECRET_ACCESS_KEY=$(aws ssm get-parameter --name "{Parameter store key for AWS_SECRET_ACCESS_KEY}" \
--with-decryption --region "${region}" --output text --query Parameter.Value)
aws ssm get-parameter --name "{Parameter store key for the BigQuery credential}" --with-decryption \
--region "${region}" --output text --query Parameter.Value > /tmp/bigquery.json
# The login method using `aws ecr get-login-password` requires AWS CLI version 2, so install it
# Leftover files remain when the Data Pipeline is re-executed, so without this delete command an error can occur
rm -rf ./aws
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --update
# Install and start Docker
sudo amazon-linux-extras install -y docker
sudo service docker start
# Log in to ECR and pull the image
aws ecr get-login-password --region "${region}" | \
sudo docker login --username AWS --password-stdin "https://${account}.dkr.ecr.${region}.amazonaws.com"
sudo docker pull "${account}.dkr.ecr.${region}.amazonaws.com/${repository}:${tag}"
# Run the container, passing in the environment variables and the configuration file
sudo docker run --env "AWS_ACCESS_KEY_ID=${IMAGE_AWS_ACCESS_KEY_ID}" \
--env "AWS_SECRET_ACCESS_KEY=${IMAGE_AWS_SECRET_ACCESS_KEY}" \
--env "GOOGLE_APPLICATION_CREDENTIALS=/credential/bigquery.json" \
-v "/tmp/bigquery.json:/credential/bigquery.json:ro" \
"${account}.dkr.ecr.${region}.amazonaws.com/${repository}:${tag}" "${batch}"
Another point to be careful about in the settings is the Resource Role assigned to the EC2 instance. Since the actual processing runs inside the Docker image, the IAM user it uses should be granted the appropriate permissions, but the Resource Role itself still needs permission to access ECR and the Parameter Store.
Policy for retrieving data from the parameter store
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "ssm:DescribeParameters",
      "Resource": "*"
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": "ssm:GetParameters",
      "Resource": "arn:aws:ssm:ap-northeast-1:{Account ID}:parameter/{Appropriate parameter store hierarchy}/*"
    },
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": "ssm:GetParameter",
      "Resource": "arn:aws:ssm:ap-northeast-1:{Account ID}:parameter/{Appropriate parameter store hierarchy}/*"
    },
    {
      "Sid": "VisualEditor3",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": [
        "arn:aws:kms:ap-northeast-1:{Account ID}:key/{Decryption key}"
      ]
    }
  ]
}
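The policy above only covers the Parameter Store and KMS; as noted, the Resource Role also needs read access to ECR in order to pull the image. A minimal sketch of such a policy is below (the managed policy AmazonEC2ContainerRegistryReadOnly grants equivalent permissions).

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "arn:aws:ecr:ap-northeast-1:{Account ID}:repository/{ECR repository name}"
    }
  ]
}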
With these settings, we can proceed with container migration while keeping batch management in Data Pipeline as before. There is still a long way to go, but I'll keep at it.