Use pandas_ply.
pandas_ply lets you do something like R's dplyr with pandas.
Method chaining makes the code easy and fun to write.
This time, let's play with election data.
We'll try out pandas-ply on the names of the House of Representatives single-seat constituency candidates, together with the number of search hits for each name.
Preparations for using pandas_ply.
import pandas as pd
from ply import install_ply, X, sym_call
install_ply(pd)
Read the data.
Each row is a candidate's name plus the number of hits you get when you google it.
data = pd.read_csv("../kouho.hit.list", encoding="utf-8", header=0)
print(data.head(2))
   BLOCK                  NAME                AGE  PARTY      STATUS     HIT
0  Hokkaido 1st District  Takahiro Yokomichi   73  Democracy  Former  153000
1  Hokkaido 1st District  Hiroyuki Noroda      56  Communist  New     346000
You can group them and aggregate them.
partySummarize = (
    data
    .groupby('PARTY')
    .ply_select(
        meanAge=X.AGE.mean(),        # mean age per party
        candidateNum=X.NAME.size(),  # number of candidates per party
    )
)
print(partySummarize)
                          candidateNum    meanAge
PARTY
Komei                                9  52.111111
Communism                          292  53.188356
Next generation                     39  50.461538
Democracy                          178  50.595506
Nowhere                             45  53.177778
Life                                13  54.230769
Social Democratic Party             18  56.833333
Restoration                         77  45.311688
Liberal Democratic Party           283  53.346290
Various factions                     5  52.400000
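For comparison, here is a rough sketch of the same kind of aggregation in plain pandas, using named aggregation (this assumes pandas 0.25 or later). The `toy` frame and its party labels are made up for illustration and are not the real election data.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from the real election data
toy = pd.DataFrame({
    "NAME": ["A", "B", "C", "D"],
    "AGE": [30, 50, 40, 60],
    "PARTY": ["P1", "P1", "P2", "P2"],
})

party_summary = (
    toy
    .groupby("PARTY")
    .agg(
        meanAge=("AGE", "mean"),        # mean age per party
        candidateNum=("NAME", "size"),  # number of rows per party
    )
)
print(party_summary)
```

With named aggregation the output columns come out in the order they are written.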
dplyr::filter corresponds to ply_where.
# candidates under 30
print(data
    .ply_where(X.AGE < 30)
    .head(10)
)
     BLOCK                   NAME                  AGE  PARTY                    STATUS      HIT
 21  Hokkaido 7th District   Takako Suzuki          28  Democracy                Former  1670000
 88  Akita 2nd District      Takashi Midorikawa     29  Democratic               New      170000
174  Saitama 1st District    Sho Matsumoto          29  Social Democratic Party  New     3070000
221  Chiba 1st District      Naoyoshi Yoshida       27  Communist                New     1690000
269  Tokyo 1st District      Takanobu Nozaki        27  Nowhere                  New      530000
271  Tokyo 2nd District      Noriyuki Ishizawa      27  Communist                New      156000
297  Tokyo 8th District      Shingo Sawada          29  Communist                New      400000
306  Tokyo 11th District     Shimomura Mei          27  Nosho                    New      380000
390  Kanagawa 8th District   Yasuhisa Wakabayashi   29  Communist                New      525000
403  Kanagawa 12th District  Kotaro Amimura         25  Communist                New      106000
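The same kind of row filter can be written in plain pandas with a boolean mask or with `DataFrame.query`. This is only a small sketch on a hypothetical `toy` frame (made-up names and ages), not the real data.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from the real election data
toy = pd.DataFrame({
    "NAME": ["A", "B", "C"],
    "AGE": [28, 45, 29],
})

# Boolean-mask version
under_30 = toy[toy["AGE"] < 30]

# query() version; this one chains nicely because it is a method
under_30_q = toy.query("AGE < 30")

print(under_30)
```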
The operation corresponding to dplyr::mutate can also be done with ply_select.
print(data
    .ply_select(
        NAME=X.NAME,
        HIT_x10000=X.HIT / 10000  # hits in units of 10,000
    )
    .head(10)
)
   HIT_x10000  NAME
0       15.30  Takahiro Yokomichi
1       34.60  Hiroyuki Noroda
2      268.00  Toshimitsu Funahashi
3       54.30  Yoshihiro Iida
4       54.10  Takamori Yoshikawa
5      152.00  Maki Ikeda
6        7.42  Kenko Matsuki
7        5.92  Masatoshi Kanakura
8       33.50  Satoshi Arai
9       30.30  Hiroko Yoshioka
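I wonder why the column order changes. Probably because ply_select receives the new columns as keyword arguments, and Python 2 does not preserve keyword-argument order.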
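For reference, a rough plain-pandas version of the same step uses `DataFrame.assign` and then picks the columns explicitly, which fixes their order. The two-row `toy` frame is a hypothetical stand-in, and the lambda form assumes a pandas version where `assign` accepts callables.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real data
toy = pd.DataFrame({
    "NAME": ["A", "B"],
    "HIT": [153000, 346000],
})

scaled = (
    toy
    .assign(HIT_x10000=lambda d: d["HIT"] / 10000)  # hits in units of 10,000
    [["NAME", "HIT_x10000"]]                         # explicit column order
)
print(scaled)
```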
More fun than using raw pandas.
I don't know how to sort, though. What corresponds to dplyr::arrange?
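One possible answer, using plain pandas rather than anything from pandas_ply: `DataFrame.sort_values` (pandas 0.17 or later; older versions had `DataFrame.sort`). Since it is an ordinary DataFrame method, it should slot into the same method chain. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from the real election data
toy = pd.DataFrame({
    "NAME": ["A", "B", "C"],
    "HIT": [153000, 346000, 7400],
})

# Roughly like dplyr's arrange(desc(HIT))
sorted_by_hit = toy.sort_values("HIT", ascending=False)
print(sorted_by_hit)
```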
That's all.