When downloading a large number of objects from S3, you can't get decent speed no matter the object size. In Python I had been getting by with **concurrent.futures** and the like, but I figured goroutines might handle this better, so I made my debut in Go.
- Use ListObjectsV2 to get all keys under a specific prefix in S3
- Download the retrieved keys in parallel with goroutines
- I tried writing the ListObjects part first, and frankly, it's slow
- So slow that it makes me wonder whether writing it in Python would have been faster
Hmm? If both were the same speed because they just hit the same API, I could accept that, but the idea that **Go is slower than a scripting language** bothers me. I hastily changed my plans and decided to verify this.
So I wrote the following, referring to a couple of examples, including [the ListObjectsV2 documentation](https://docs.aws.amazon.com/sdk-for-go/api/service/s3/#S3.ListObjectsV2).
main.go
package main

import (
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	bucket := os.Getenv("BUCKET")
	prefix := os.Getenv("PREFIX")
	region := os.Getenv("REGION")

	sess := session.Must(session.NewSession())
	svc := s3.New(sess, &aws.Config{
		Region: &region,
	})

	params := &s3.ListObjectsV2Input{
		Bucket: &bucket,
		Prefix: &prefix,
	}

	fmt.Println("Start:")
	err := svc.ListObjectsV2Pages(params,
		func(p *s3.ListObjectsV2Output, last bool) (shouldContinue bool) {
			for _, obj := range p.Contents {
				fmt.Println(*obj.Key)
			}
			return true
		})
	fmt.Println("End:")
	if err != nil {
		fmt.Println(err.Error())
		return
	}
}
I will write the Python version as well, using the low-level client so the conditions match the Go code.
main.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os

import boto3

bucket = os.environ["BUCKET"]
prefix = os.environ["PREFIX"]
region = os.environ["REGION"]

# r = boto3.resource('s3').Bucket(bucket).objects.filter(Prefix=prefix)
# [print(r.key) for r in r]
# I usually fetch objects as above, but to match the conditions with the Go
# code I measure with the low-level client below.
s3_client = boto3.client('s3', region)

contents = []
next_token = ''
while True:
    if next_token == '':
        response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    else:
        response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix,
                                             ContinuationToken=next_token)
    contents.extend(response['Contents'])
    if 'NextContinuationToken' in response:
        next_token = response['NextContinuationToken']
    else:
        break

[print(r["Key"]) for r in contents]
- Everything runs on Cloud9 on EC2 (t2.micro).
- I didn't want to pollute the environment (and it's a hassle anyway), so I built everything with Docker.
$ docker-compose up -d --build
- The build files are shown below.
Dockerfile
FROM golang:1.13.5-stretch as build
RUN go get \
    github.com/aws/aws-sdk-go/aws \
    github.com/aws/aws-sdk-go/aws/session \
    github.com/aws/aws-sdk-go/service/s3
COPY . /work
WORKDIR /work
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o main main.go
FROM python:3.7.6-stretch as release
RUN pip install boto3
COPY --from=build /work/main /usr/local/bin/main
COPY --from=build /work/main.py /usr/local/bin/main.py
WORKDIR /usr/local/bin/
docker-compose.yml
version: '3'
services:
  app:
    build:
      context: .
    container_name: "app"
    tty: true
    environment:
      BUCKET: <Bucket>
      PREFIX: test/
      REGION: ap-northeast-1
Create a bucket in the Tokyo region and then create roughly 1,000 objects with the following script.
#!/bin/bash
Bucket=<Bucket>
Prefix="test"

# Create the test file
dd if=/dev/zero of=testobj bs=1 count=30

# Copy the master file to S3
aws s3 cp testobj s3://${Bucket}/${Prefix}/testobj

# Duplicate the master file
for i in $(seq 0 9); do
  for k in $(seq 0 99); do
    aws s3 cp s3://${Bucket}/${Prefix}/testobj s3://${Bucket}/${Prefix}/${i}/${k}/${i}_${k}.obj
  done
done
$ time docker-compose exec app ./main
(output omitted)
real 0m21.888s
user 0m0.580s
sys 0m0.107s
$ time docker-compose exec app ./main.py
(output omitted)
real 0m2.671s
user 0m0.577s
sys 0m0.104s
Go is about 10 times slower than Python. Why?!
- Let's increase the number of objects a bit. For now, around 10,000.
# Only the changed part
for i in $(seq 0 99); do
  for k in $(seq 0 99); do
- By the way, it took 3-4 hours for the upload to finish. I should have built the upload tool properly... (a rough sketch of what that could look like follows below).
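For what it's worth, here is a rough idea of what a "proper" tool could have looked like: instead of invoking aws s3 cp once per object, upload the duplicates from a pool of goroutines with PutObject. This is only a sketch I did not actually use for this article; the worker count of 20, the file name upload.go, and the key layout mirroring the shell script are my own assumptions.

upload.go (sketch)
package main

import (
	"bytes"
	"fmt"
	"os"
	"strings"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	bucket := os.Getenv("BUCKET")
	prefix := strings.TrimSuffix(os.Getenv("PREFIX"), "/")
	region := os.Getenv("REGION")

	sess := session.Must(session.NewSession())
	svc := s3.New(sess, &aws.Config{Region: &region})

	// Same 30-byte zero-filled payload as the dd command above.
	body := make([]byte, 30)

	keys := make(chan string)
	var wg sync.WaitGroup
	for w := 0; w < 20; w++ { // 20 workers is an arbitrary choice
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range keys {
				_, err := svc.PutObject(&s3.PutObjectInput{
					Bucket: &bucket,
					Key:    aws.String(key),
					Body:   bytes.NewReader(body),
				})
				if err != nil {
					fmt.Println("put failed:", key, err)
				}
			}
		}()
	}

	// Generate the same 100x100 key layout as the shell script.
	for i := 0; i < 100; i++ {
		for k := 0; k < 100; k++ {
			keys <- fmt.Sprintf("%s/%d/%d/%d_%d.obj", prefix, i, k, i, k)
		}
	}
	close(keys)
	wg.Wait()
}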
$ time docker-compose exec app ./main
(output omitted)
real 0m23.276s
user 0m0.617s
sys 0m0.128s
$ time docker-compose exec app ./main.py
(output omitted)
real 0m5.973s
user 0m0.576s
sys 0m0.114s
This time the difference is about 4x. Or rather, there seems to be a roughly constant gap of about 18 seconds regardless of the number of objects (≈19 s with 1,000 objects, ≈17 s with 10,000). Hmm.
- It may well be a library setting rather than the language itself, and I clearly don't understand it yet, so I'd like to gather more information.
- If the **parallel download processing** with goroutines, which was the original goal, turns out to be efficient, a constant overhead of about 20 seconds is within tolerable error, so I'll implement the rest anyway (a sketch of that part follows below).
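Here is a minimal sketch of the part I actually set out to build: collect the keys with ListObjectsV2Pages as above, then fan them out to a fixed pool of goroutines that call GetObject and write each object under a local directory. The worker count of 10, the file name download.go, and the ./downloads output directory are arbitrary choices for illustration, not the final tool.

download.go (sketch)
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// download fetches one object and writes it under ./downloads/<key>.
func download(svc *s3.S3, bucket, key string) error {
	out, err := svc.GetObject(&s3.GetObjectInput{Bucket: &bucket, Key: &key})
	if err != nil {
		return err
	}
	defer out.Body.Close()

	path := filepath.Join("downloads", key)
	if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
		return err
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, out.Body)
	return err
}

func main() {
	bucket := os.Getenv("BUCKET")
	prefix := os.Getenv("PREFIX")
	region := os.Getenv("REGION")

	sess := session.Must(session.NewSession())
	svc := s3.New(sess, &aws.Config{Region: &region})

	keys := make(chan string, 1000)
	var wg sync.WaitGroup

	// Fixed pool of download workers; 10 is an arbitrary choice.
	for w := 0; w < 10; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range keys {
				if err := download(svc, bucket, key); err != nil {
					fmt.Println("download failed:", key, err)
				}
			}
		}()
	}

	// List the keys and feed them to the workers as they arrive.
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{Bucket: &bucket, Prefix: &prefix},
		func(p *s3.ListObjectsV2Output, last bool) bool {
			for _, obj := range p.Contents {
				keys <- *obj.Key
			}
			return true
		})
	close(keys)
	wg.Wait()
	if err != nil {
		fmt.Println(err.Error())
	}
}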
- Looking closely, user and sys times are about the same for both, so the I/O against S3 is the suspect.
- Rough print debugging of the Go code (the "Start:" / "End:" markers) shows that listing the objects accounts for almost all of the elapsed time. Are the S3 client defaults different from boto3's?
- Since both run in the same container, I don't think the CPU-credit behaviour of T-type instances or a difference in network bandwidth is the cause...
- Switching to an m5.large didn't fix the slowness either, so the instance type can be ruled out. (A timing sketch to isolate where the time goes is below.)
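To narrow down where the fixed ~18 seconds goes, one rough approach is to time the pieces separately: client setup, credential resolution, and the actual listing. The split below is just my instrumentation sketch (with an assumed file name timing.go), not the code used for the measurements above.

timing.go (sketch)
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	bucket := os.Getenv("BUCKET")
	prefix := os.Getenv("PREFIX")
	region := os.Getenv("REGION")

	t0 := time.Now()
	sess := session.Must(session.NewSession())
	svc := s3.New(sess, &aws.Config{Region: &region})
	fmt.Println("client setup:", time.Since(t0))

	// Force credential resolution before the first API call so its cost
	// shows up separately from the ListObjectsV2 round trips.
	t1 := time.Now()
	if _, err := sess.Config.Credentials.Get(); err != nil {
		fmt.Println("credential error:", err)
	}
	fmt.Println("credential resolution:", time.Since(t1))

	t2 := time.Now()
	n := 0
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{Bucket: &bucket, Prefix: &prefix},
		func(p *s3.ListObjectsV2Output, last bool) bool {
			n += len(p.Contents)
			return true
		})
	fmt.Printf("listing %d objects: %v\n", n, time.Since(t2))
	if err != nil {
		fmt.Println(err.Error())
	}
}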
As advised in the comments, I tried debugging the SDK and found that resolving the IAM credentials was taking a long time. Maybe a default in the "-stretch" base image was to blame? I tried several more times afterwards but couldn't reproduce it in this environment, so I'm going to call it resolved. Not entirely satisfying, but...
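For reference, the SDK itself can log what it is doing if you raise the log level in aws.Config. If the first logged request starts long after the program does, the delay is happening before the API call (for example in credential resolution) rather than in the listing itself. The particular combination of log flags and the file name debug.go below are just one reasonable choice, not necessarily what I used.

debug.go (sketch)
package main

import (
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	bucket := os.Getenv("BUCKET")
	region := os.Getenv("REGION")

	sess := session.Must(session.NewSession())
	// Log every HTTP request/response and retry the SDK makes; slow steps
	// then show up as gaps between the log timestamps.
	svc := s3.New(sess, &aws.Config{
		Region:   &region,
		LogLevel: aws.LogLevel(aws.LogDebugWithHTTPBody | aws.LogDebugWithRequestRetries),
	})

	out, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{Bucket: &bucket})
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	fmt.Println("objects in first page:", len(out.Contents))
}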
@nabeken Thank you!