Yesterday I covered Pig in "Basic grammar of Apache Pig (1)", so naturally today should be part (2). As it turns out, though, the reference linked from the article the day before yesterday (yutakikuchi/20130107/1357514830) covers everything you need, so the grammar story ends in one installment. This is suddenly the finale.
Instead, today I would like to compare how usable pandas, R, and Pig, the tools covered so far, are when dealing with a reasonably large amount of data.
Consider a text file made up of lines like the one below. Each line holds a date, a primary key, a store name, a Unix timestamp, a converted timestamp, and a numeric value, separated by tabs.
20140205 XXXXXXAABBCC Shop7 1391568621 2014-02-05 11:50:21 +0900 0
The file holds about 100 million rows, roughly 7.5 GB of data. This time, let's compute the average of the numeric column and see which of pandas, R, and Pig is best suited to the task.
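Before comparing the three tools, the task itself can be sketched as a single streaming pass in plain Python. This is a minimal baseline, assuming the layout above: tab-delimited lines with the numeric value in the sixth column (the file name `sample.txt` is taken from the article).

```python
def column_mean(path, col=5, sep='\t'):
    """Streaming mean of one column of a delimited text file.

    Reads line by line, so memory use stays constant no matter
    how large the file is.
    """
    total = 0.0
    count = 0
    with open(path) as f:
        for line in f:
            fields = line.rstrip('\n').split(sep)
            total += float(fields[col])
            count += 1
    return total / count if count else float('nan')

# column_mean('sample.txt')
```

A single pass like this is slow in pure Python for 100 million rows, which is exactly why the tools below exist.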
The machine used for verification has a Core i7 (Haswell) CPU and 32 GB of memory.
Python (pandas)
In Python, the library for dataframe manipulation is pandas. Unless you have a particular constraint, this is the one to use.
$ pip install pandas
$ ipython
In [1]: import pandas as pd
In [2]: df = pd.read_table('sample.txt', header=None)
In [3]: df.iloc[:, 5].mean()
Out[3]: 305.4479883399822
To sum up pandas: it loads everything into memory, so it is a good fit as long as the machine has enough memory to hold the data you want to operate on.
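When the data does not fit in memory, pandas can still compute the mean by reading the file in chunks. A minimal sketch, assuming the same tab-delimited `sample.txt` as above (the chunk size is an arbitrary choice):

```python
import pandas as pd

def chunked_mean(path, col=5, chunksize=1_000_000):
    """Mean of one column computed chunk by chunk, so the whole
    file never has to be resident in memory at once."""
    total = 0.0
    count = 0
    # read_table defaults to tab-delimited; chunksize makes it
    # return an iterator of DataFrames instead of one big frame.
    for chunk in pd.read_table(path, header=None, chunksize=chunksize):
        s = chunk.iloc[:, col]
        total += s.sum()
        count += len(s)
    return total / count

# chunked_mean('sample.txt')
```

This trades one-liner convenience for bounded memory use, which narrows the gap with Pig's approach below.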
R
When it comes to data frame operations, R is the classic choice.
df <- read.table("sample.txt", sep="\t")
colMeans(df[6])
#=> V6
# 305.448
As with pandas, R loads the data into memory on read.table, but with several gigabytes of data it is noticeably slower than pandas at reading. On the other hand, when computing the average with colMeans(), R's statistical functions executed faster than pandas'.
To sum up R: it is also memory-bound like pandas, with slower loading but fast statistical functions.
Pig
Finally, Apache Pig. Since we are handling a text file on a single machine, we run it with pig -x local.
df = LOAD 'sample.txt' USING PigStorage('\t') AS (date: chararray, key: chararray, shop: chararray, unixtime: int, humantime: chararray, times: int);
grouped = group df all;
times_mean = foreach grouped generate AVG(df.times);
dump times_mean;
#=> (305.4479883399822)
With Pig, the LOAD statement and the ones after it do not read the data or allocate memory; the interactive shell responds instantly. MapReduce actually runs only when the final dump times_mean; is issued.
My conclusion: use pandas if a single machine can handle your data, and Pig when a single machine's capacity is not enough for it.