Let's run Embulk in a Docker container and, as a test, convert a CSV file to an ORC file.
Embulk?
A tool for extracting data in bulk from a file or database and loading it into another storage system or database. For details, see the official documentation or the developer's blog post below.
-Parallel data transfer tool "Embulk" released!
First, start up a container and try installing it manually
docker run -it --rm java:8 bash
Install Embulk as described in the official documentation
--Since getting a path under $HOME onto PATH is a hassle, and considering that this will be turned into an image later, install the executable to /usr/local/bin
--So, replace ~/.embulk/bin/embulk in the official procedure with /usr/local/bin/embulk
--Skip the PATH-update step in the procedure
curl --create-dirs -o /usr/local/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
chmod +x /usr/local/bin/embulk
Try running it following the quick start in the official documentation
embulk example ./try1
embulk guess ./try1/seed.yml -o config.yml
embulk run config.yml
If embulk run prints the following to standard output, it worked:
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,Embulk "csv" parser plugin
4,11270,2015-01-29 11:54:36,20150129,
What did the above do?
--It output the contents of try1/csv/sample_01.csv.gz to stdout (standard output).
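For reference, the seed file that embulk guess expanded into config.yml is minimal. The one generated by embulk example looks roughly like this (the exact contents may differ by Embulk version):

```yaml
# seed.yml generated by `embulk example` (approximate; version-dependent)
in:
  type: file
  path_prefix: './try1/csv/sample_'
out:
  type: stdout
```

embulk guess reads a sample of the input files and fills in the parser details (charset, delimiter, column types) that appear in config.yml below.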
Contents of config.yml
root@7e9764e79b83:/# cat config.yml
in:
type: file
path_prefix: /./try1/csv/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: LF
type: csv
delimiter: ','
quote: '"'
escape: '"'
null_string: 'NULL'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
out: {type: stdout}
Check the contents of try1/csv/sample_01.csv.gz
root@7e9764e79b83:/# zcat try1/csv/sample_*
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL
Now that the operation check is complete, convert the above manual procedure into a Dockerfile.
--I wanted to specify /usr/local/bin/embulk in ENTRYPOINT, but a jar can't be started directly there, so start it with java -jar instead
Dockerfile
FROM java:8
# Install embulk
RUN curl --create-dirs -o /usr/local/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar" &&\
chmod +x /usr/local/bin/embulk
ENTRYPOINT ["java", "-jar", "/usr/local/bin/embulk"]
Build
docker build -t embulk .
Verification
docker run -it --rm embulk:latest embulk --version
Embulk v0.9.23
While we're at it, let's actually use Embulk to convert the CSV file to an ORC file
Modify the Dockerfile
--Add the following to the earlier Dockerfile:
-Install the embulk-output-orc plugin
--Plugins are installed with embulk gem install ${package name}
--Set WORKDIR to /work
Dockerfile
FROM java:8
# Install embulk
RUN curl --create-dirs -o /usr/local/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar" &&\
chmod +x /usr/local/bin/embulk
RUN embulk gem install embulk-output-orc
WORKDIR /work
ENTRYPOINT ["java", "-jar", "/usr/local/bin/embulk"]
Build
docker build -t embulk .
Working directory structure
.
├── Dockerfile
└── work
├── config.yml
├── csv_inputs
│ └── sample_01.csv.gz
└── orc_outputs
Intended behavior
--Convert ./work/csv_inputs/sample_01.csv.gz to ORC according to the settings in work/config.yml and write the output under ./work/orc_outputs/
Contents of work/config.yml
--Rewrite the out section of the config.yml created in the earlier quick start to type: orc
--Since the amount of data is small this time, set min_output_tasks to 1 under exec so the output is not split across tasks, which would create empty files
config.yml
exec:
min_output_tasks: 1
in:
type: file
path_prefix: /work/csv_inputs/sample_
decoders:
- {type: gzip}
parser:
charset: UTF-8
newline: LF
type: csv
delimiter: ','
quote: '"'
escape: '"'
null_string: 'NULL'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: account, type: long}
- {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
- {name: purchase, type: timestamp, format: '%Y%m%d'}
- {name: comment, type: string}
out:
type: orc
path_prefix: /work/orc_outputs/sample
compression_kind: ZLIB
overwrite: true
The contents of the input csv file are the same as in the quick start.
# Note: on macOS, zcat is gzcat
gzcat work/csv_inputs/sample_01.csv.gz
id,account,time,purchase,comment
1,32864,2015-01-27 19:23:49,20150127,embulk
2,14824,2015-01-27 19:01:23,20150127,embulk jruby
3,27559,2015-01-28 02:20:02,20150128,"Embulk ""csv"" parser plugin"
4,11270,2015-01-29 11:54:36,20150129,NULL
Run
--Mount ./work on the host to the container's WORKDIR (/work)
docker run -it --rm -v $(pwd)/work:/work embulk:latest run config.yml
Verification
--An ORC file has been created
ls work/orc_outputs
sample.000.orc
Check the contents of the orc file
--It can be inspected with orc-tools
orc-tools data work/orc_outputs/sample.000.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/usr/local/Cellar/orc-tools/1.6.6/libexec/orc-tools-1.6.6-uber.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Processing data file work/orc_outputs/sample.000.orc [length: 787]
{"id":1,"account":32864,"time":"2015-01-27 19:23:49.0","purchase":"2015-01-27 00:00:00.0","comment":"embulk"}
{"id":2,"account":14824,"time":"2015-01-27 19:01:23.0","purchase":"2015-01-27 00:00:00.0","comment":"embulk jruby"}
{"id":3,"account":27559,"time":"2015-01-28 02:20:02.0","purchase":"2015-01-28 00:00:00.0","comment":"Embulk \"csv\" parser plugin"}
{"id":4,"account":11270,"time":"2015-01-29 11:54:36.0","purchase":"2015-01-29 00:00:00.0","comment":null}
________________________________________________________________________________________________________________________
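Besides data, the orc-tools CLI also has a meta subcommand, which prints the file's schema, stripe layout, and compression settings; this is a handy way to confirm that compression_kind: ZLIB took effect (output format varies by orc-tools version):

```shell
# Inspect the schema and compression of the generated ORC file
orc-tools meta work/orc_outputs/sample.000.orc
```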
It has been converted!
--Embulk official documentation