Yesterday I covered Pig in "Basic grammar of Apache Pig (1)", so naturally today should be part (2). As it turns out, though, the reference linked from the article the day before yesterday (yutakikuchi/20130107/1357514830) covers everything you need, so the grammar story ends in one installment. This is suddenly the finale.
Instead, today I would like to compare how usable pandas, R, and Pig, the tools covered so far, are when dealing with a reasonably large amount of data.
Consider a text file made up of lines like the one below. Each line holds a date, a primary key, a store name, a Unix timestamp, a converted timestamp, and a numeric value, separated by tabs.
20140205 XXXXXXAABBCC Shop7 1391568621 2014-02-05 11:50:21 +0900 0
The file holds about 100 million rows, roughly 7.5 GB of data. This time, let's compute the average of the numeric column and see which of pandas, R, and Pig is best suited to the task.
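Before comparing the three tools, the task itself can be sketched as a single streaming pass in plain Python. This is a minimal baseline, assuming the layout above: tab-delimited lines with the numeric value in the sixth column (the file name `sample.txt` is taken from the article).

```python
def column_mean(path, col=5, sep='\t'):
    """Streaming mean of one column of a delimited text file.

    Reads line by line, so memory use stays constant no matter
    how large the file is.
    """
    total = 0.0
    count = 0
    with open(path) as f:
        for line in f:
            fields = line.rstrip('\n').split(sep)
            total += float(fields[col])
            count += 1
    return total / count if count else float('nan')

# column_mean('sample.txt')
```

A single pass like this is slow in pure Python for 100 million rows, which is exactly why the tools below exist.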
The machine used for verification has a Core i7 (Haswell) CPU and 32 GB of memory.
Python (pandas)
In Python, the library for dataframe manipulation is pandas. Unless you have a particular constraint, this is the one to use.
$ pip install pandas
$ ipython
In [1]: import pandas as pd
In [2]: df = pd.read_table('sample.txt', header=None)
In [3]: df.iloc[:, 5].mean()
Out[3]: 305.4479883399822
To sum up pandas: it loads everything into memory, so it is a good fit as long as the machine has enough memory to hold the data you want to operate on.
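When the data does not fit in memory, pandas can still compute the mean by reading the file in chunks. A minimal sketch, assuming the same tab-delimited `sample.txt` as above (the chunk size is an arbitrary choice):

```python
import pandas as pd

def chunked_mean(path, col=5, chunksize=1_000_000):
    """Mean of one column computed chunk by chunk, so the whole
    file never has to be resident in memory at once."""
    total = 0.0
    count = 0
    # read_table defaults to tab-delimited; chunksize makes it
    # return an iterator of DataFrames instead of one big frame.
    for chunk in pd.read_table(path, header=None, chunksize=chunksize):
        s = chunk.iloc[:, col]
        total += s.sum()
        count += len(s)
    return total / count

# chunked_mean('sample.txt')
```

This trades one-liner convenience for bounded memory use, which narrows the gap with Pig's approach below.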
R
When it comes to data frame operations, R is the classic choice.
df <- read.table("sample.txt", sep="\t")
colMeans(df[6])
#=> V6
# 305.448
As with pandas, R loads the data into memory on read.table, but with several gigabytes of data it is noticeably slower than pandas at reading. On the other hand, when computing the average with colMeans(), R's statistical functions executed faster than pandas'.
To sum up R: it is also memory-bound like pandas, with slower loading but fast statistical functions.
Pig
Finally, Apache Pig. Since we are handling a text file on a single machine, we run it with pig -x local.
df = LOAD 'sample.txt' USING PigStorage('\t') AS (date: chararray, key: chararray, shop: chararray, unixtime: int, humantime: chararray, times: int);
grouped = group df all;
times_mean = foreach grouped generate AVG(df.times);
dump times_mean;
#=> (305.4479883399822)
With Pig, the LOAD statement and the ones after it do not read the data or allocate memory; the interactive shell responds instantly. MapReduce actually runs only when the final dump times_mean; is issued.
My conclusion: use pandas if a single machine can handle your data, and Pig when a single machine's capacity is not enough for it.