This time, I used Apache Arrow to investigate the exchange of data between Java and python.
It is expected that data exchange using Apache Arrow can be copied between systems and languages without copying in memory. To write the conclusion first, in this survey, it was exchanged via a byte array in Apache Arrow format. You don't have to write it to your local disk, but Java serialization and python deserialization will result in in-memory data copying. I think the advantage of the method introduced this time is that it is easy to link languages because there is a high-speed common format called Apache Arrow.
Apache Arrow had a library called jvm for exchanging data between Java and python. This library implements a function that if you pass a java Apache Arrow object VectorSchemaRoot, it will convert it to a python RecordBatch.
def record_batch(jvm_vector_schema_root):
"""
Construct a (Python) RecordBatch from a JVM VectorSchemaRoot
Parameters
----------
jvm_vector_schema_root : org.apache.arrow.vector.VectorSchemaRoot
Returns
-------
record_batch: pyarrow.RecordBatch
"""
Therefore, it is assumed that data exchange between Java and python can be easily done by using py4j.
Get the jars you need to use py4j. Bring the arrow and py4j jars locally with the following pom.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>gtest</groupId>
<artifactId>gtest</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>gtest</name>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-vector</artifactId>
<version>0.12.0</version>
</dependency>
<dependency>
<groupId>net.sf.py4j</groupId>
<artifactId>py4j</artifactId>
<version>0.10.8.1</version>
</dependency>
</dependencies>
</project>
The maven command is as follows.
$ mvn dependency:copy-dependencies -DoutputDirectory=./lib -f ./pom.xml
Create a simple java class by referring to the py4j sample. It also implements a function that returns VecotrSchemaRoot for validation on py4j.
import java.util.List;
import java.util.ArrayList;
import py4j.GatewayServer;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BitVector;
import org.apache.arrow.vector.VectorSchemaRoot;
public class Py4jExample {
public static void main( final String[] args ) {
Py4jExample app = new Py4jExample();
GatewayServer server = new GatewayServer( app );
server.start();
}
/**
*Prepare a function to create a VectorSchemaRoot to try.
*/
public static VectorSchemaRoot create() {
RootAllocator allocator = new RootAllocator( Integer.MAX_VALUE );
BitVector vector = new BitVector( "test" , allocator );
vector.allocateNew( 1 );
vector.setSafe( 0 , 1 );
vector.setValueCount( 1 );
List list = new ArrayList();
list.add( vector );
VectorSchemaRoot root = new VectorSchemaRoot( list );
root.setRowCount( 1 );
return root;
}
}
Compile and start.
$ javac -cp lib/*:. Py4jExample.java
$ java -cp lib/*:. Py4jExample &
Now that we're ready to call from python, let's write the python process.
from py4j.java_gateway import JavaGateway
import pyarrow as pa
import pyarrow.jvm as j
from pyarrow import RecordBatch
gateway = JavaGateway()
root = gateway.jvm.Py4jExample.create()
rb = j.record_batch( root )
df = rb.to_pandas()
I thought it would be easy to convert from java to python record_batch using the jvm library like this ...
$ python test.py
Segmentation fault
You can run it up to the point where you get RecordBatch, but it seems to crash when you call to_pandas (). I can't tell if my environment was bad, but I didn't seem to understand it right away, so I decided to give up this time.
Alternatively, Apache Arrow also implements a binary format for the file, so it can be read and written in Java and python as a byte array. Looking at the current (as of 03/03/2019) python implementation of the jvm, it seems that it is copying in memory when it takes a Java object and converts it to a python object. Therefore, serialization and deserialization are required for data exchange in byte arrays, but only serialization to byte arrays in Java is costly.
Add a function to your Java class that creates a byte array.
/**
*Prepare a function to create an Arrow byte array to try.
*/
public static byte[] createArrowFile() throws IOException {
RootAllocator allocator = new RootAllocator( Integer.MAX_VALUE );
BitVector vector = new BitVector( "test" , allocator );
vector.allocateNew( 1 );
vector.setSafe( 0 , 1 );
vector.setValueCount( 1 );
List list = new ArrayList();
list.add( vector );
VectorSchemaRoot root = new VectorSchemaRoot( list );
root.setRowCount( 1 );
ByteArrayOutputStream out = new ByteArrayOutputStream();
ArrowFileWriter writer = new ArrowFileWriter( root, null, Channels.newChannel( out ) );
writer.start();
writer.writeBatch();
writer.end();
writer.close();
return out.toByteArray();
}
Compile this class and start it again. The processing in python is modified to get RecordBatch from the byte array.
from py4j.java_gateway import JavaGateway
import pyarrow as pa
import pyarrow.jvm as j
from pyarrow import RecordBatch
gateway = JavaGateway()
reader = pa.RecordBatchFileReader( pa.BufferReader( gateway.jvm.Py4jExample.createArrowFile() ) )
rb = reader.get_record_batch(0);
df = rb.to_pandas()
print df
As expected, the result of this execution was that VectorSchemaRoot created in Java could be treated as RecordBatch in python.
$ python test2.py
test
0 True
Data exchange between Java and python has now become a brute force implementation. However, Apache Arrow has the advantage that it is compatible with libraries that process data with python such as pandas, and that you do not have to implement your own serialization and deserialization.
The purpose of this research was to make the file format we are currently developing implemented in Java and also available in python, which is commonly used in data processing. Next time, based on this research, I would like to write an implementation and operation example of reading and writing from python to a file format with a byte array.
Recommended Posts