I want to create Parquet files in Ruby, too


Python's pandas and its DataFrame.to_parquet are so convenient that there is a tendency to think "Parquet files are something you handle in Python". https://pandas.pydata.org/pandas-docs/version/0.22.0/generated/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet

It turned out to be easy to do in Ruby as well, so I'll share how.


You can use the official Apache gem, red-parquet. (Note: the gem to install is red-parquet, not red-arrow.) https://github.com/apache/arrow/tree/master/ruby/red-parquet


Creating the file

Install the gem

$ gem install red-parquet

Create a test file (CSV)

$ echo colA,colB > test.csv
$ echo 1,2  >> test.csv

Convert it in Ruby (CSV -> Parquet)

$ irb
irb(main):001:0> require "parquet"
=> true
irb(main):002:0> table = Arrow::Table.load("./test.csv")
=> #<Arrow::Table:0x7fbb0d3e6708 ptr=0x7fbb0e0a4010>
	 colA	 colB
0	    1	    2
irb(main):003:0> table.save("./test.parquet")
=> true


Upload test.parquet to S3 and check it with S3 Select

(Screenshot: 2020-06-08 19.08.04.png)

It worked!! (It even infers the column types...!)


Judging from these slides, it seems you can work with this kind of data file in Ruby more than you might expect. https://www.slideshare.net/kou/datasciencerb

