(Translation) From Arrow to pandas at 10 GB/s

INTRODUCTION: Wes McKinney, the author of pandas, wrote a very interesting blog post about Python data tools, so I asked if I could translate it and publish it for the Japanese PyData community. I received permission, so I will translate and publish it a little at a time.

From Arrow to pandas at 10 GB/s

(Original: http://wesmckinney.com/blog/high-perf-arrow-to-pandas/)

2016/12/27

This post describes recent work with Apache Arrow to enable fast conversion of generic Arrow columnar memory into pandas objects.

Challenges in building pandas DataFrame objects at high speed

One of the difficulties in building pandas DataFrame objects at high speed is that the "native" internal memory structure is more complex than a dictionary or list of one-dimensional NumPy arrays. I won't go into the reasons for that complexity here, but I hope to get it under control in the pandas 2.0 work. There are two layers to this complexity:

- Some pandas data types change their memory representation depending on whether null values are present: with nulls, boolean data becomes dtype=object and integer data becomes dtype=float64 (see the short example below).
- When you call pandas.DataFrame, pandas internally "consolidates" the input data by copying it into an internal 2D block structure. Building that precise block structure is the only real way to construct a DataFrame with zero copies.
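For example, the dtype shift in the first point can be checked directly in an interactive session (a minimal illustration with classic NumPy-backed dtypes; output may vary with newer pandas versions):

>>> import pandas as pd
>>> pd.Series([1, 2, 3]).dtype
dtype('int64')
>>> pd.Series([1, 2, None]).dtype
dtype('float64')
>>> pd.Series([True, False, None]).dtype
dtype('O')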

Let's look at a benchmark to get an idea of the overhead this consolidation creates. Consider the setup code below, which creates a dictionary of 100 float64 arrays containing 1 GB of data in total.

import numpy as np
import pandas as pd
import pyarrow as pa

type_ = np.dtype('float64')
DATA_SIZE = (1 << 30)  # 1 GB in total
NCOLS = 100
# Integer division so that NROWS is an int, as np.random.randn requires
NROWS = DATA_SIZE // NCOLS // type_.itemsize

data = {
    'c' + str(i): np.random.randn(NROWS)
    for i in range(NCOLS)
}

Then create a DataFrame with pd.DataFrame(data):

>>> %timeit df = pd.DataFrame(data)
10 loops, best of 3: 132 ms per loop

(For those doing the math, that's 7.58 GB/s, and all it is doing is copying internal memory.)

The important thing here is that at this point pandas' "native" memory representation (with nulls stored as NaN in the arrays) has been built, but the input was already a collection of one-dimensional arrays.
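A quick, if unofficial, way to confirm that the constructor copies rather than shares memory is to compare a resulting column with its input array (a small sketch assuming the setup code above):

>>> df = pd.DataFrame(data)
>>> np.shares_memory(df['c0'].values, data['c0'])
False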

Arrow columnar memory to pandas conversion

I've been deeply involved in this project since Apache Arrow was born in 2016. Apache Arrow is a language-independent, in-memory columnar representation and a set of interprocess communication (IPC) tools. It supports nested JSON-like data and is designed to be a building block for fast analytics engines.

Compared to pandas, Arrow represents null values unambiguously, in bitmaps stored separately from the values. So even a zero-copy conversion to a dictionary of arrays suitable for pandas requires further processing.
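As a rough illustration with a reasonably recent pyarrow (the API has evolved since this post was written), a primitive array with nulls carries a validity bitmap buffer alongside its values buffer:

>>> arr = pa.array([1.0, 2.0, None, 4.0])
>>> arr.null_count
1
>>> len(arr.buffers())  # validity bitmap buffer plus values buffer
2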

One of my main goals in working on Arrow is to use it as a high-bandwidth I/O pipe for the Python ecosystem. Interactions with the JVM, database systems, and various file formats can all go through Arrow as a columnar exchange format. In this use case, it's important to be able to convert back to a pandas DataFrame as fast as possible.

Last month I finished the engineering to construct pandas' internal block structure at high bandwidth from Arrow memory. If you look at the Feather file format, you'll see that this whole process is closely related to it.

Let's go back to the same gigabyte of data as before and add some nulls.

>>> df = pd.DataFrame(data)
>>> df.values[::5] = np.nan

Now let's convert this DataFrame to an Arrow table. This builds a columnar representation of Arrow.

>>> table = pa.Table.from_pandas(df)
>>> table
<pyarrow.table.Table at 0x7f18ec65abd0>

To convert back to pandas, call the table's to_pandas method. It supports multi-threaded conversion, so let's start with a single-threaded conversion and then compare.

>>> %timeit df2 = table.to_pandas(nthreads=1)
10 loops, best of 3: 158 ms per loop

That's 6.33 GB/s, about 20% slower than the pure memcpy-based construction above. On my desktop, I can speed things up by using all four cores:

>>> %timeit df2 = table.to_pandas(nthreads=4)
10 loops, best of 3: 103 ms per loop

At 9.71 GB/s, I'm not yet running into main memory bandwidth limits on my consumer desktop hardware (though I'm no expert in this area).

The performance gains from multithreading may be even more dramatic on other hardware. The speedup ratio on my desktop is only 1.53, but on my laptop (also a quad core) it is 3.29.

However, keep in mind that working with numerical data is the best case. For string and binary data, pandas will continue to use Python objects in its memory representation, which adds some overhead.
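As a small sketch of that (the column name and data here are only for illustration), a string column round-tripped through an Arrow table still comes back as Python objects on the pandas side:

>>> strings = pd.DataFrame({'s': ['foo', 'bar', None] * 1000})
>>> t = pa.Table.from_pandas(strings)
>>> t.to_pandas()['s'].dtype
dtype('O')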

Impact on the future and roadmap

You can now build Arrow arrays, record batches (collections of equal-length arrays), and tables (collections of record batches) from a variety of sources with zero copies, so this is a flexible and efficient way to move tabular data between systems. Combined with high-speed conversion to pandas, you can now get a complete pandas DataFrame at a small conversion cost (5-10 GB/s is usually negligible compared to the I/O performance of other media).
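As a small sketch of those building blocks (the names are illustrative, and the API shown is from a reasonably recent pyarrow):

>>> batch = pa.RecordBatch.from_arrays(
...     [pa.array([1, 2, 3]), pa.array(['a', 'b', 'c'])],
...     names=['ints', 'strs'])
>>> t = pa.Table.from_batches([batch])
>>> t.num_rows, t.num_columns
(3, 2)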

I'll write another post about the technical details of Arrow's C++ I/O subsystem, which has low overhead (and uses zero copies wherever possible).

As we move along the pandas 2.0 roadmap, we hope to further reduce (and in some cases eliminate) the overhead of interacting with columnar memory like Arrow's. A simpler memory representation will also make it easier for other applications to interact with pandas at a low level.
