While preprocessing data for machine learning with pandas, I wanted to standardize some columns per group rather than over the whole table. The group name column itself does not need to be standardized, but I wanted to run the standardization while keeping the group name in the DataFrame. This is just a memo.
Environment: pandas 0.25.3, numpy 1.18.0
Standardize the columns for each class name in a table like the one below:
class | a | b | c
---|---|---|---
a | 1.0 | 2.0 | 3.0
a | 4.0 | 5.0 | 6.0
b | 7.0 | 8.0 | 9.0
b | 10.0 | 11.0 | 12.0
import pandas as pd
import numpy as np

# Make the data set (float values, so the standardized results fit the column dtype)
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                  columns=['col_0', 'col_1', 'col_2'],
                  index=['row_0', 'row_1', 'row_2', 'row_3'])
df["class"] = ["a", "a", "b", "b"]

# Standardize each group separately, leaving the "class" column untouched
value_cols = df.columns.drop("class")
for name in df["class"].unique():
    mask = df["class"] == name
    df_tmp = df.loc[mask, value_cols]
    # std() uses the sample standard deviation (ddof=1) by default
    df.loc[mask, value_cols] = (df_tmp - df_tmp.mean()) / df_tmp.std()
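As one possible alternative to the loop above, `groupby(...).transform(...)` can apply the same per-group calculation in a single step. The following is only a minimal sketch: the names `value_cols` and `standardized` are introduced here for illustration, and float sample values are assumed so that the results fit the column dtype.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the example above (float values assumed here)
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                  columns=["col_0", "col_1", "col_2"],
                  index=["row_0", "row_1", "row_2", "row_3"])
df["class"] = ["a", "a", "b", "b"]

# Columns to standardize: everything except the group label
value_cols = [c for c in df.columns if c != "class"]

# transform() returns a result aligned to the original index,
# so it can be written straight back into the value columns
standardized = df.copy()
standardized[value_cols] = (
    df.groupby("class")[value_cols]
      .transform(lambda x: (x - x.mean()) / x.std())
)
print(standardized)
```

Because only `value_cols` is overwritten, the `class` column is kept as it is and does not need to be saved and restored. Note that `std()` defaults to the sample standard deviation (`ddof=1`); if the population version is wanted, `x.std(ddof=0)` could be used instead.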
This is my first post, and it is just a memo. Please let me know if there is a better way.