Run Embulk on Docker to convert files

Let's run Embulk in a Docker container and, as a test, convert a CSV file to an ORC file.

What is Embulk?

A tool for batch-extracting data from files or databases and loading it into another storage system or database. For details, refer to the official documentation or the developer's blog post below.

-Parallel data transfer tool "Embulk" released!

Environment

Manually install Embulk inside the Docker container

First, start a container and try installing Embulk manually.

docker run -it --rm java:8 bash

Install Embulk as described in the official documentation, with the following changes:

--Paths under $HOME are awkward to rely on once this is turned into an image, so install the executable in /usr/local/bin.
--That is, replace ~/.embulk/bin/embulk in the official procedure with /usr/local/bin/embulk.
--Skip the PATH update step in the procedure.

curl --create-dirs -o /usr/local/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
chmod +x /usr/local/bin/embulk

Try running it by following the quick start in the official documentation.

embulk example ./try1
embulk guess ./try1/seed.yml -o config.yml
embulk run config.yml

If embulk run prints the following to standard output, it worked.

1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,

What the steps above did

--The contents of try1/csv/sample_01.csv.gz were parsed and written to stdout (standard output).

Contents of config.yml

root@7e9764e79b83:/# cat config.yml
in:
  type: file
  path_prefix: /./try1/csv/sample_
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out: {type: stdout}

Check the contents of try1/csv/sample_01.csv.gz

root@7e9764e79b83:/# zcat try1/csv/sample_*
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL

Create Dockerfile & start container

Now that the operation check is complete, convert the above manual procedure into a Dockerfile.

--I wanted to specify /usr/local/bin/embulk directly in ENTRYPOINT, but the jar cannot be started directly from ENTRYPOINT, so it is started with java -jar instead.

Dockerfile


FROM java:8

# Install embulk
RUN curl --create-dirs -o /usr/local/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar" &&\
    chmod +x /usr/local/bin/embulk

ENTRYPOINT ["java", "-jar", "/usr/local/bin/embulk"]

Build

docker build -t embulk .

Verification

docker run -it --rm embulk:latest --version
Embulk v0.9.23

Convert a CSV file to ORC with Embulk in the container

While we're at it, let's actually use Embulk to convert a CSV file to an ORC file.

Modify the Dockerfile

--Add the following to the earlier Dockerfile
-Installation of the embulk-output-orc plugin
--Plugins are installed with embulk gem install ${package name}
--Set WORKDIR to /work

Dockerfile


FROM java:8

# Install embulk
RUN curl --create-dirs -o /usr/local/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar" &&\
    chmod +x /usr/local/bin/embulk

RUN embulk gem install embulk-output-orc

WORKDIR /work

ENTRYPOINT ["java", "-jar", "/usr/local/bin/embulk"]

Build

docker build -t embulk .

Working directory structure

.
├── Dockerfile
└── work
    ├── config.yml
    ├── csv_inputs
    │   └── sample_01.csv.gz
    └── orc_outputs
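
The layout above can be prepared with a few shell commands. The following is a minimal sketch, assuming it is run from the directory that holds the Dockerfile and that the quick-start sample data is recreated by hand rather than copied out of the container. config.yml still has to be written separately.

```shell
# Create the input and output directories next to the Dockerfile.
mkdir -p work/csv_inputs work/orc_outputs

# Recreate the quick-start sample data and gzip it.
cat <<'EOF' | gzip -c > work/csv_inputs/sample_01.csv.gz
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL
EOF

# Sanity check: header plus four rows, so the count should be 5.
gzip -dc work/csv_inputs/sample_01.csv.gz | wc -l
```

The file name has to start with sample_ so that it matches the path_prefix used in config.yml.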

What the run will do

--Convert ./work/csv_inputs/sample_01.csv.gz to ORC according to the settings in work/config.yml and write the result under ./work/orc_outputs/

Contents of the work/config.yml file

--Take the config.yml created in the quick start above and change out to type: orc.
--Because the data set is small, distributing it across several output tasks would leave empty files behind, so set min_output_tasks to 1 under exec to keep the output from being split.

config.yml


exec:
  min_output_tasks: 1
in:
  type: file
  path_prefix: /work/csv_inputs/sample_
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    null_string: 'NULL'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out:
  type: orc
  path_prefix: /work/orc_outputs/sample
  compression_kind: ZLIB
  overwrite: true

The contents of the input csv file are the same as in the quick start.

# Note: on macOS, zcat is gzcat
gzcat work/csv_inputs/sample_01.csv.gz

id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL

Run

--Mount ./work on the host at WORKDIR (/work) in the container

docker run -it --rm -v "$(pwd)/work:/work" embulk:latest run config.yml
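
Typing the full docker run line for every Embulk command gets tedious. A small wrapper function can forward its arguments to the container, where they are appended to the image's ENTRYPOINT. This is only a sketch: the function name embulk_docker is made up for this example, and it assumes the image is tagged embulk:latest as above and that ./work exists in the current directory. The -it flags are left out so the function also works in non-interactive scripts.

```shell
# Hypothetical helper: run Embulk inside the container built above.
# Every argument is appended to the ENTRYPOINT (java -jar /usr/local/bin/embulk).
embulk_docker() {
  docker run --rm -v "$(pwd)/work:/work" embulk:latest "$@"
}
```

With this defined, the conversion becomes embulk_docker run config.yml.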

Verification

--An ORC file has been created

ls work/orc_outputs

sample.000.orc
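
For scripted runs, the same check can be automated. The sketch below succeeds only if the given directory contains at least one .orc file. The function name has_orc_output is made up for this example.

```shell
# Hypothetical helper: succeed if the given directory holds at least one .orc file.
has_orc_output() {
  dir="$1"
  for f in "$dir"/*.orc; do
    # If the glob matched nothing, $f is the literal pattern and -f fails.
    [ -f "$f" ] && return 0
  done
  return 1
}
```

For example, has_orc_output work/orc_outputs && echo "ORC file found" prints the message only when the conversion produced a file.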

Check the contents of the orc file

-The contents can be checked with orc-tools

orc-tools data work/orc_outputs/sample.000.orc

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/Cellar/orc-tools/1.6.6/libexec/orc-tools-1.6.6-uber.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Processing data file work/orc_outputs/sample.000.orc [length: 787]
{"id":1,"account":32864,"time":"2015-01-27 19:23:49.0","purchase":"2015-01-27 00:00:00.0","comment":"embulk"}
{"id":2,"account":14824,"time":"2015-01-27 19:01:23.0","purchase":"2015-01-27 00:00:00.0","comment":"embulk jruby"}
{"id":3,"account":27559,"time":"2015-01-28 02:20:02.0","purchase":"2015-01-28 00:00:00.0","comment":"Embulk \"csv\" parser plugin"}
{"id":4,"account":11270,"time":"2015-01-29 11:54:36.0","purchase":"2015-01-29 00:00:00.0","comment":null}
________________________________________________________________________________________________________________________
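
The log4j and JVM warnings make the rows hard to spot. In the output above, each row is printed as one JSON object per line, so one way to keep only the data is to filter for lines that start with {. This is a sketch that relies on that output shape, and the function name only_rows is made up for this example.

```shell
# Hypothetical helper: keep only the JSON row lines from orc-tools output.
only_rows() {
  grep '^{'
}
```

It can be used as orc-tools data work/orc_outputs/sample.000.orc 2>&1 | only_rows, which merges both output streams first so that the warnings are dropped wherever they were written.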

It has been converted!

References

--Embulk official documentation
