When you want to add per-group statistics to each record as a feature, you do not need to build something like a dict with collections, or run groupby and then merge the result back. Printing the statistics on their own is easy, but I had a hard time with pandas.DataFrame.groupby when I wanted to attach them to every record as a feature, so I am leaving this as a memo.
In short: groupby.transform is convenient.
import pandas as pd
df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})
| | site | dat |
|---|---|---|
| 0 | A | 15 |
| 1 | A | 30 |
| 2 | A | 30 |
| 3 | B | 30 |
| 4 | B | 10 |
| 5 | C | 50 |
The code below computes the mean of dat for each site and stores it on every row. Other statistics can be generated the same way by changing the argument of transform, for example to np.max or np.min (or the equivalent strings "max" and "min"); the same applies to median, var, and so on.
import numpy as np
# Select the column explicitly so transform returns a Series aligned to df's index
df["site_mean"] = df.groupby("site")["dat"].transform("mean")
| | site | dat | site_mean |
|---|---|---|---|
| 0 | A | 15 | 25 |
| 1 | A | 30 | 25 |
| 2 | A | 30 | 25 |
| 3 | B | 30 | 20 |
| 4 | B | 10 | 20 |
| 5 | C | 50 | 50 |
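To make the contrast with the groupby-and-merge route concrete, here is a minimal sketch using the same sample data. The variable names (site_stats, merged, direct) and the extra site_max column are assumptions added for illustration, not part of the original memo.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

# Route 1: aggregate per site, then merge the small result back onto every row.
site_stats = (
    df.groupby("site")["dat"]
    .agg(["mean", "max"])
    .add_prefix("site_")   # columns become site_mean, site_max
    .reset_index()
)
merged = df.merge(site_stats, on="site", how="left")

# Route 2: transform returns one value per original row, aligned to df's index,
# so each statistic can be assigned as a new column directly.
direct = df.copy()
direct["site_mean"] = direct.groupby("site")["dat"].transform("mean")
direct["site_max"] = direct.groupby("site")["dat"].transform("max")

print(merged)
print(direct)
```

Both routes should give the same columns here; transform just skips the intermediate table and the join.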
Count Encoding
Count encoding turns the number of occurrences of each value of a (categorical) column into a new feature. Combined with groupby, it can express something like the rarity of a value within an attribute. You could do this with collections.Counter, but transform handles it as well.
The code below converts each row to the number of occurrences of its (site, dat) pair. (For example, dat = 30 appears twice in site A.)
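# "size" counts the rows in each (site, dat) group; selecting one column keeps the result a Series
df["count_site_dat"] = df.groupby(["site", "dat"])["dat"].transform("size")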
| | site | dat | site_mean | count_site_dat |
|---|---|---|---|---|
| 0 | A | 15 | 25 | 1 |
| 1 | A | 30 | 25 | 2 |
| 2 | A | 30 | 25 | 2 |
| 3 | B | 30 | 20 | 1 |
| 4 | B | 10 | 20 | 1 |
| 5 | C | 50 | 50 | 1 |
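As a rough check that transform really replaces the collections.Counter approach, the sketch below computes the pair counts both ways on the same sample data. The names pairs, pair_counts, counter_based, and the extra count_site column are assumptions made for this example.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

# Counter route: count every (site, dat) pair, then look each row up again.
pairs = list(zip(df["site"], df["dat"]))
pair_counts = Counter(pairs)
counter_based = [pair_counts[p] for p in pairs]

# transform route: one line, and the result is already aligned to the rows.
df["count_site_dat"] = df.groupby(["site", "dat"])["dat"].transform("size")

# Plain count encoding of a single column works the same way.
df["count_site"] = df.groupby("site")["site"].transform("size")

print(counter_based)
print(df)
```

Note that "size" counts every row in the group, while "count" would skip missing values; for data without NaN the two agree.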
You can also rank values within a group: among the rows that share the same site, compute where each row's dat stands (1 is the smallest).
df["site_rank"] = df.groupby("site")["dat"].rank(method="dense")
| | site | dat | site_mean | count_site_dat | site_rank |
|---|---|---|---|---|---|
| 0 | A | 15 | 25 | 1 | 1 |
| 1 | A | 30 | 25 | 2 | 2 |
| 2 | A | 30 | 25 | 2 | 2 |
| 3 | B | 30 | 20 | 1 | 2 |
| 4 | B | 10 | 20 | 1 | 1 |
| 5 | C | 50 | 50 | 1 | 1 |
Changing the method argument of rank mainly changes how ties (equal values) are ranked. For details, refer to the rank method for ranking pandas.DataFrame and Series.
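To see how the tie handling differs, the following sketch ranks dat within each site using several values of method side by side. The ranks frame and its column names are assumptions for illustration; only the tied pair of 30s in site A is affected by the choice of method.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "C"],
    "dat": [15, 30, 30, 30, 10, 50],
})

by_site = df.groupby("site")["dat"]

ranks = pd.DataFrame({
    "site": df["site"],
    "dat": df["dat"],
    # Ties share the mean of the positions they occupy (the default).
    "average": by_site.rank(method="average"),
    # Ties all take the highest position they occupy.
    "max": by_site.rank(method="max"),
    # Ties share a rank and the next rank is not skipped.
    "dense": by_site.rank(method="dense"),
    # Ties are broken by order of appearance.
    "first": by_site.rank(method="first"),
    # ascending=False ranks the largest value in each site as 1.
    "dense_desc": by_site.rank(method="dense", ascending=False),
})

print(ranks)
```

If the goal is "largest value within the group first", the dense_desc column is probably the closest fit.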