Let's process the same table data in the same way with dplyr in R and with pandas in Python. Which is faster? I was curious, so I checked.
The task: build a CSV that ranks batters by total bases, using 2013 Major League at-bat (event) data (77 MB, about 190,000 rows).
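For reference, here is a minimal Python sketch for peeking at the two columns the scripts below rely on. The interpretation of BAT_ID as the batter ID and H_FL as a per-at-bat hit/base value is an assumption inferred from how the aggregation uses them, not something stated in the data itself.
import pandas as pd
# Load only the two columns used by the scripts below (column names assumed as above)
df = pd.read_csv("all2013.csv", usecols=["BAT_ID", "H_FL"])
print(df.head())   # batter ID and hit value per at-bat
print(len(df))     # roughly 190,000 rows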
The script dplyr.R, written using R's dplyr, is
library(data.table)
library(dplyr)
## Read the data (fread from data.table handles a large CSV quickly)
dat = fread("all2013.csv")
## Sum H_FL (total bases) per batter, sort descending, write to CSV
dat %>% select(BAT_ID, H_FL) %>%
  group_by(BAT_ID) %>%
  summarise(BASE = sum(H_FL)) %>%
  arrange(desc(BASE)) %>%
  write.csv("hoge.csv")
Like this.
> time R -f dplyr.R
R -f dplyr.R 3.13s user 0.15s system 99% cpu 3.294 total
With Python's pandas,
#!/usr/bin/python
import pandas as pd
# Read the data, sum H_FL (total bases) per batter, sort descending, write to CSV
df = pd.read_csv('all2013.csv')
df[["BAT_ID", "H_FL"]].groupby("BAT_ID").sum().sort("H_FL", ascending=False).to_csv('hoge.csv')
Like this.
> time ./pd.py
./pd.py 3.12s user 0.40s system 98% cpu 3.567 total
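One note: DataFrame.sort() used above was removed in later pandas releases, so on a current pandas the same one-liner would use sort_values() instead. A sketch of that version (not re-benchmarked here):
#!/usr/bin/python
import pandas as pd
df = pd.read_csv('all2013.csv')
# sort() no longer exists in recent pandas; sort_values() is the replacement
df[["BAT_ID", "H_FL"]].groupby("BAT_ID").sum().sort_values("H_FL", ascending=False).to_csv('hoge.csv')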
3.29 seconds for dplyr, 3.57 seconds for pandas.
dplyr is slightly faster.
With only 77 MB of data, though, neither seems particularly fast.
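Keep in mind that time as used above measures interpreter startup, CSV parsing, and the aggregation together. If you wanted to time only the aggregation step, something like the following Python sketch would do it (the R side could be split the same way with system.time()):
import time
import pandas as pd
df = pd.read_csv('all2013.csv')      # load once, outside the timed section
start = time.time()
out = df[["BAT_ID", "H_FL"]].groupby("BAT_ID").sum().sort_values("H_FL", ascending=False)
print(time.time() - start)           # aggregation and sort only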
You might as well just use whichever one you're more comfortable with.
That's all.