ETL is an acronym for extract, transform, and load. ETL processing applies these three stages (extraction, transformation, and loading) to some data, such as text files or CSV files.
BrainPad designed and implemented the common infrastructure part of the ETL processing it had developed and operated in-house as an application framework. The result is cliboa.
GitHub https://github.com/BrainPad/cliboa
PyPI https://pypi.org/project/cliboa/
In cliboa, extract is defined as downloading data from some storage location, transform as processing the downloaded data, and load as uploading the processed data to some storage location. A conceptual diagram is shown below.
- Implemented in Python.
- A simple ETL process can be run just by writing a YAML file.
- Additional steps can be implemented in Python.
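To illustrate the idea of driving an ETL run from a declarative file, here is a minimal sketch in plain Python. It is not cliboa's implementation: the `Upper` and `Append` step classes, the `STEP_CLASSES` registry, and the `run_scenario` function are hypothetical names, and the scenario is a Python dict standing in for a parsed YAML file.

```python
# Hypothetical sketch, not cliboa's API. A scenario is plain data: each step
# names a class and its arguments, and a runner executes the steps in order.


class Upper:
    """Illustrative transform step: upper-cases the text."""

    def execute(self, data):
        return data.upper()


class Append:
    """Illustrative step: appends a configurable suffix."""

    def __init__(self, suffix=""):
        self.suffix = suffix

    def execute(self, data):
        return data + self.suffix


# Registry mapping the class names used in the scenario to Python classes.
STEP_CLASSES = {"Upper": Upper, "Append": Append}

# Stand-in for a parsed YAML scenario file.
scenario = {
    "scenario": [
        {"class": "Upper", "arguments": {}},
        {"class": "Append", "arguments": {"suffix": "!"}},
    ]
}


def run_scenario(scenario, data):
    """Instantiate each step from its class name and run the steps in order."""
    for step in scenario["scenario"]:
        cls = STEP_CLASSES[step["class"]]
        data = cls(**step.get("arguments", {})).execute(data)
    return data


result = run_scenario(scenario, "etl")
print(result)
```

Keeping the step list as data is what lets a simple pipeline be changed without touching Python code; new behaviour only requires a new class and a registry entry.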
It runs on Linux OS such as Debian, Ubuntu and CentOS.
Prepare Python 3.0 or later, then install cliboa with the pip command.
sudo pip3 install cliboa
After the installation is complete, the cliboadmin command becomes available. Run cliboadmin in any directory.
$ cd /usr/local
$ cliboadmin init sample
$ cd sample
$ cliboadmin create simple-etl
The program structure initialized by cliboadmin is as follows.
sample
|-- bin
|   `-- clibomanager.py
|-- common
|   |-- __init__.py
|   |-- environment.py
|   |-- scenario
|   `-- scenario.yml
|-- conf
|-- logs
|-- project
|   `-- simple-etl
|       |-- scenario
|       `-- scenario.yml
`-- requirements.txt
The Python packages required to run cliboa are listed in requirements.txt, so install them with the pip command.
$ cd sample
$ pip3 install -r requirements.txt
As an example, write the following process in project/simple-etl/scenario.yml.
Processing content: download test.csv.gz from the SFTP server, decompress the downloaded file, and upload the decompressed test.csv to the SFTP server.
scenario:
- step:
  class: SftpDownload
  arguments:
    host: localhost
    user: root
    password: pass
    src_dir: /usr/local
    src_pattern: test.csv.gz
    dest_dir: /tmp
- step:
  class: FileDecompress
  arguments:
    src_dir: /tmp
    src_pattern: test.*\.csv.*\.gz
- step:
  class: SftpUpload
  arguments:
    host: localhost
    user: root
    password: pass
    src_dir: /tmp
    src_pattern: test.*\.csv
    dest_dir: /usr/local
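To make the data flow of the scenario above concrete, here is a rough plain-Python equivalent that uses temporary local directories in place of the SFTP server. It is a sketch, not what cliboa runs internally: `copy_matching` and `decompress_matching` are hypothetical helpers, and matching each `src_pattern` against the whole file name with `re.fullmatch` is an assumption about how the patterns are meant to behave.

```python
import gzip
import re
import shutil
import tempfile
from pathlib import Path


def copy_matching(src_dir, pattern, dest_dir):
    """Copy files whose names fully match the regex (stand-in for an SFTP transfer)."""
    copied = []
    for path in sorted(Path(src_dir).iterdir()):
        if path.is_file() and re.fullmatch(pattern, path.name):
            shutil.copy(path, Path(dest_dir) / path.name)
            copied.append(path.name)
    return copied


def decompress_matching(src_dir, pattern):
    """Gunzip every matching file, writing NAME next to NAME.gz."""
    outputs = []
    for path in sorted(Path(src_dir).iterdir()):
        if path.is_file() and re.fullmatch(pattern, path.name):
            target = path.with_suffix("")  # strips the trailing ".gz"
            with gzip.open(path, "rb") as fin, open(target, "wb") as fout:
                shutil.copyfileobj(fin, fout)
            outputs.append(target.name)
    return outputs


# Temporary directories stand in for /usr/local (remote side) and /tmp (work area).
remote = Path(tempfile.mkdtemp())
work = Path(tempfile.mkdtemp())
with gzip.open(remote / "test.csv.gz", "wt") as f:
    f.write("id,name\n1,alice\n")

downloaded = copy_matching(remote, r"test.csv.gz", work)        # extract
decompressed = decompress_matching(work, r"test.*\.csv.*\.gz")  # transform
uploaded = copy_matching(work, r"test.*\.csv", remote)          # load

print(downloaded, decompressed, uploaded)
```

With full-name matching, the upload pattern `test.*\.csv` selects test.csv but not test.csv.gz, which is still present in the work directory at that point.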
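Before execution, make sure that an SFTP connection to localhost works with the user and password given in the scenario, and that test.csv.gz is placed under /usr/local.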
Execute with the following command
cd sample
bin/clibomanager.py simple-etl
The run has succeeded if the following holds after execution.
- test.csv.gz placed under /usr/local has been decompressed under /tmp as test.csv.
- test.csv exists under /usr/local.
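A check like this can also be scripted. The following is a small sketch with a hypothetical helper, `verify_decompressed`, that confirms a CSV file exists and has the same content as the decompressed archive. The demo uses temporary files so that it is self-contained; for the run described here, the arguments would be paths such as /usr/local/test.csv.gz and /tmp/test.csv.

```python
import gzip
import tempfile
from pathlib import Path


def verify_decompressed(gz_path, csv_path):
    """Return True if csv_path exists and equals the decompressed gz_path."""
    gz_path, csv_path = Path(gz_path), Path(csv_path)
    if not (gz_path.is_file() and csv_path.is_file()):
        return False
    with gzip.open(gz_path, "rb") as f:
        return f.read() == csv_path.read_bytes()


# Self-contained demo using temporary files instead of the real directories.
demo_dir = Path(tempfile.mkdtemp())
payload = b"id,name\n1,alice\n"
with gzip.open(demo_dir / "test.csv.gz", "wb") as f:
    f.write(payload)
(demo_dir / "test.csv").write_bytes(payload)

ok = verify_decompressed(demo_dir / "test.csv.gz", demo_dir / "test.csv")
missing = verify_decompressed(demo_dir / "test.csv.gz", demo_dir / "absent.csv")
print(ok, missing)
```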