Let's run Embulk in a Docker container and, as a test, convert a CSV file to an ORC file.
Embulk?
A tool for extracting data in bulk from a file or database and loading it into another storage system or database. For details, see the official documentation or the developer's blog post below.
-Parallel data transfer tool "Embulk" released!
First, start up a container and try installing it manually
docker run -it --rm java:8 bash
Install Embulk as described in the official documentation
--Since getting a path under $HOME onto PATH is a hassle, and considering that this will be turned into an image later, install the executable to /usr/local/bin
--So, replace ~/.embulk/bin/embulk in the official procedure with /usr/local/bin/embulk
--Skip the PATH-update step in the procedure
curl --create-dirs -o /usr/local/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
chmod +x /usr/local/bin/embulk
Try running it following the quick start in the official documentation
embulk example ./try1
embulk guess ./try1/seed.yml -o config.yml
embulk run config.yml
If embulk run prints the following to standard output, it worked:
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,
What did the above do?
--It output the contents of try1/csv/sample_01.csv.gz to stdout (standard output).
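For reference, the seed file that embulk guess expanded into config.yml is minimal. The one generated by embulk example looks roughly like this (the exact contents may differ by Embulk version):

```yaml
# seed.yml generated by `embulk example` (approximate; version-dependent)
in:
  type: file
  path_prefix: './try1/csv/sample_'
out:
  type: stdout
```

embulk guess reads a sample of the input files and fills in the parser details (charset, delimiter, column types) that appear in config.yml below.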
Contents of config.yml
root@7e9764e79b83:/# cat config.yml
in:
type: file
path_prefix: /./try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: LF
type: csv
delimiter: ','
quote: '"'
escape: '"'
null_string: 'NULL'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
out: {type: stdout}
Check the contents of try1/csv/sample_01.csv.gz
root@7e9764e79b83:/# zcat try1/csv/sample_*
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL
Now that the operation check is complete, convert the above manual procedure into a Dockerfile.
--I wanted to specify /usr/local/bin/embulk in ENTRYPOINT, but a jar can't be started directly there, so start it with java -jar instead
Dockerfile
FROM java:8
# Install embulk
RUN curl --create-dirs -o /usr/local/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar" &&\
chmod +x /usr/local/bin/embulk
ENTRYPOINT ["java", "-jar", "/usr/local/bin/embulk"]
Build
docker build -t embulk .
Verification
docker run -it --rm embulk:latest embulk --version
Embulk v0.9.23
While we're at it, let's actually use Embulk to convert the CSV file to an ORC file
Modify the Dockerfile
--Add the following to the earlier Dockerfile:
-Install the embulk-output-orc plugin
--Plugins are installed with embulk gem install ${package name}
--Set WORKDIR to /work
Dockerfile
FROM java:8
# Install embulk
RUN curl --create-dirs -o /usr/local/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar" &&\
chmod +x /usr/local/bin/embulk
RUN embulk gem install embulk-output-orc
WORKDIR /work
ENTRYPOINT ["java", "-jar", "/usr/local/bin/embulk"]
Build
docker build -t embulk .
Working directory structure
.
├── Dockerfile
└── work
├── config.yml
├── csv_inputs
│ └── sample_01.csv.gz
└── orc_outputs
Intended behavior
--Convert ./work/csv_inputs/sample_01.csv.gz to ORC according to the settings in work/config.yml and write the output under ./work/orc_outputs/
Contents of work/config.yml
--Rewrite the out section of the config.yml created in the earlier quick start to type: orc
--Since the amount of data is small this time, set min_output_tasks to 1 under exec so the output is not split across tasks, which would create empty files
config.yml
exec:
min_output_tasks: 1
in:
type: file
path_prefix: /work/csv_inputs/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: LF
type: csv
delimiter: ','
quote: '"'
escape: '"'
null_string: 'NULL'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
out:
type: orc
path_prefix: /work/orc_outputs/sample
compression_kind: ZLIB
overwrite: true
The contents of the input csv file are the same as in the quick start.
# Note: on macOS, zcat is gzcat
gzcat work/csv_inputs/sample_01.csv.gz
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL
Run
--Mount ./work on the host to the container's WORKDIR (/work)
docker run -it --rm -v $(pwd)/work:/work embulk:latest run config.yml
Verification
--An ORC file has been created
ls work/orc_outputs
sample.000.orc
Check the contents of the orc file
--It can be inspected with orc-tools
orc-tools data work/orc_outputs/sample.000.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/Cellar/orc-tools/1.6.6/libexec/orc-tools-1.6.6-uber.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Processing data file work/orc_outputs/sample.000.orc [length: 787]
{"id":1,"account":32864,"time":"2015-01-27 19:23:49.0","purchase":"2015-01-27 00:00:00.0","comment":"embulk"}
{"id":2,"account":14824,"time":"2015-01-27 19:01:23.0","purchase":"2015-01-27 00:00:00.0","comment":"embulk jruby"}
{"id":3,"account":27559,"time":"2015-01-28 02:20:02.0","purchase":"2015-01-28 00:00:00.0","comment":"Embulk \"csv\" parser plugin"}
{"id":4,"account":11270,"time":"2015-01-29 11:54:36.0","purchase":"2015-01-29 00:00:00.0","comment":null}
________________________________________________________________________________________________________________________
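Besides data, the orc-tools CLI also has a meta subcommand, which prints the file's schema, stripe layout, and compression settings; this is a handy way to confirm that compression_kind: ZLIB took effect (output format varies by orc-tools version):

```shell
# Inspect the schema and compression of the generated ORC file
orc-tools meta work/orc_outputs/sample.000.orc
```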
It has been converted!
--Embulk official documentation