[JAVA] How to build parquet-tools and merge Parquet files

A note on how to merge multiple Parquet files into a single Parquet file.

Conclusion

Merging is easy with parquet-tools. However, the jar that is distributed prebuilt does not seem to include hadoop-client, so it could not be run locally:

# java -jar parquet-tools-1.9.0.jar cat test.parquet
org/apache/hadoop/fs/Path

Therefore, you need to build it yourself. I built it by following this page: https://www.lancork.net/2016/10/inspect-parquet-files-using-parquet-tools/
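Before rebuilding, the cause of the error can be confirmed by looking inside the jar. The following is only a minimal sketch: jar_has_class is a helper name I made up, and it assumes the unzip command is installed. It checks whether the class named in the error message is listed in the jar.

```shell
# Hypothetical helper (not part of parquet-tools): succeeds when the jar's
# file listing contains the given class entry.
jar_has_class() {
    jar_path="$1"
    class_entry="$2"   # e.g. org/apache/hadoop/fs/Path.class
    unzip -l "$jar_path" 2>/dev/null | grep -q -F -- "$class_entry"
}

# Usage, with the jar name taken from the command above.
if jar_has_class parquet-tools-1.9.0.jar org/apache/hadoop/fs/Path.class; then
    echo "Hadoop classes are bundled"
else
    echo "Hadoop classes are missing"
fi
```

If the second message is printed for the distributed jar, that matches the error shown above.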

Preparation

I used a CentOS 7.5 server that had already been set up (installed with a GUI).

**1. Install packages that you may need**

# yum install gcc gcc-c++ java-1.8.0-openjdk-devel boost-devel openssl-devel 

**2. Install Maven**

# wget http://ftp.yz.yamagata-u.ac.jp/pub/network/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
# tar zxvf apache-maven-3.3.9-bin.tar.gz
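Extracting the archive alone does not put the mvn command on the PATH. A minimal sketch of that step follows, assuming the archive was extracted in the current directory. The variable name maven_dir is my own choice.

```shell
# Assumption: apache-maven-3.3.9 was extracted in the current directory.
maven_dir="$PWD/apache-maven-3.3.9"

# Put Maven's bin directory at the front of PATH for this shell session.
export PATH="$maven_dir/bin:$PATH"
```

After this, mvn -version should print the Maven version. The setting only lasts for the current shell, so it has to be repeated (or added to a login script) in a new session.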

**3. Install Thrift**

# wget -nv http://archive.apache.org/dist/thrift/0.12.0/thrift-0.12.0.tar.gz
# tar zxvf thrift-0.12.0.tar.gz
# cd thrift-0.12.0/
# ./configure --disable-gen-erl --disable-gen-hs --without-ruby --without-haskell --without-erlang --without-php --without-nodejs
# make install
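It may be worth checking that the installed compiler is the one that was just built before moving on. This is a rough sketch: thrift_version_is is a hypothetical helper, and it simply looks for the expected version string in the output of thrift -version.

```shell
# Hypothetical helper: succeeds when the thrift compiler found on PATH
# reports a version string containing the expected version.
thrift_version_is() {
    expected="$1"
    actual=$(thrift -version 2>/dev/null) || return 1
    case "$actual" in
        *"$expected"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage, with the version taken from the download step above.
if thrift_version_is 0.12.0; then
    echo "thrift 0.12.0 is installed"
else
    echo "thrift 0.12.0 was not found on PATH"
fi
```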

**4. Download the complete source code**

Download the complete source code from https://github.com/apache/parquet-mr as a zip archive, then extract it.

# unzip parquet-mr-master.zip
# cd parquet-mr-master/

**5. Build**

Run the mvn command directly under [current directory]/parquet-mr-master/. I overlooked this at first and ran it under [current directory]/parquet-mr-master/parquet-tools, which cost me a lot of time.

# mvn clean package -Plocal
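To avoid starting the build from the wrong directory, a small guard can be put in front of the mvn command. This is only a sketch under my own assumptions: the function names are made up, and the check (a top-level pom.xml plus a parquet-tools subdirectory) is just a heuristic for recognizing the source root.

```shell
# Hypothetical guard: succeeds only when the directory looks like the
# parquet-mr source root (top-level pom.xml plus a parquet-tools module).
is_parquet_mr_root() {
    dir="${1:-.}"
    [ -f "$dir/pom.xml" ] && [ -d "$dir/parquet-tools" ]
}

# Run the build only when the current directory is the source root.
build_parquet_tools() {
    if is_parquet_mr_root .; then
        mvn clean package -Plocal
    else
        echo "Run this from the parquet-mr-master directory" >&2
        return 1
    fi
}
```

Calling build_parquet_tools from inside parquet-tools then stops with a message instead of starting the build in the wrong place.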

Merge Parquet files with parquet-tools

The build generates a file called parquet-tools-1.12.0-SNAPSHOT.jar (as of May 8, 2020) under [current directory]/parquet-mr-master/parquet-tools/target. Use this jar to merge.
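Since the version in the file name changes over time, the jar can also be looked up instead of typed by hand. The following is a minimal sketch. find_tools_jar is a hypothetical helper, and skipping names ending in -tests.jar or -sources.jar is only a precaution in case such jars are present.

```shell
# Hypothetical helper: prints the first parquet-tools jar found under
# <source root>/parquet-tools/target, skipping test and source jars.
find_tools_jar() {
    target_dir="$1/parquet-tools/target"
    for jar in "$target_dir"/parquet-tools-*.jar; do
        [ -f "$jar" ] || continue
        case "$jar" in
            *-tests.jar|*-sources.jar) continue ;;
        esac
        printf '%s\n' "$jar"
        return 0
    done
    return 1
}
```

For example, find_tools_jar ./parquet-mr-master prints the path of the built jar, or returns a non-zero status when the build has not produced one yet.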

Any Parquet files would do for the merge, so I used the files at the link below, which I found through a Google search. http://anson.ucdavis.edu/~clarkf/

Check if you can see the contents


# java -jar ./parquet-tools-1.12.0-SNAPSHOT.jar cat test.parquet
timeperiod = 01/01/2016 00:00:05
flow1 = 0
occupancy1 = 0.0
speed1 = 0.0
flow2 = 0
occupancy2 = 0.0
speed2 = 0.0
flow3 = 0
occupancy3 = 0.0
speed3 = 0.0

timeperiod = 01/01/2016 00:00:35
flow1 = 0
occupancy1 = 0.0
speed1 = 0.0
flow2 = 0
occupancy2 = 0.0
speed2 = 0.0
flow3 = 0
occupancy3 = 0.0
speed3 = 0.0

···(omitted)···

Merge


# java -jar ./parquet-tools-1.12.0-SNAPSHOT.jar merge ./data/*.parquet ./merge.parquet
Warning: file data/part-r-00000-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet is too small, length: 16979
Warning: file data/part-r-00001-ddaee723-f3f6-4f25-a34b-3312172aa6d7.snappy.parquet is too small, length: 18350
···(omitted)···
Warning: you merged too small files. Although the size of the merged file is bigger, it STILL contains small row groups, thus you don't have the advantage of big row groups, which usually leads to bad query performance!

The tool warns that the merged file still contains small row groups, which usually leads to bad query performance, but the merge itself succeeded.
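To see in advance how many input files are likely to trigger this warning, the file sizes can be checked before merging. This is a rough sketch: count_small_files is a hypothetical helper, and the threshold is passed in as a parameter because the output above does not show which size parquet-tools treats as too small.

```shell
# Hypothetical helper: counts the *.parquet files in a directory whose size
# in bytes is below the given threshold.
count_small_files() {
    dir="$1"
    threshold="$2"
    count=0
    for f in "$dir"/*.parquet; do
        [ -f "$f" ] || continue
        # Arithmetic expansion strips any padding that wc may print.
        size=$(( $(wc -c < "$f") ))
        if [ "$size" -lt "$threshold" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

For example, count_small_files ./data 1048576 prints the number of input files smaller than 1 MiB (an arbitrary value chosen for illustration). After merging, the meta subcommand of parquet-tools can be run on merge.parquet to look at the row groups it actually contains.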

That's all.
