When analyzing data with Python's pandas, the `groupby` function is a convenient way to compute statistics per group. I often use orthodox patterns such as `df.groupby(df['col1'])['col2'].mean()` and `.describe()`. Sometimes, however, I want to process each split DataFrame individually, and I found that combining a `for` statement with `get_group` is convenient for this, so I will introduce it here.
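For context, the "orthodox" per-group aggregation mentioned above looks like this; a minimal sketch with a hypothetical two-column frame (`col1`, `col2` are placeholder names, not from the iris data):

```python
import pandas as pd

# Toy frame standing in for a real dataset (column names are hypothetical)
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'],
                   'col2': [1.0, 3.0, 2.0, 4.0]})

# Per-group mean: the usual groupby idiom
means = df.groupby(df['col1'])['col2'].mean()
print(means)
# col1
# a    2.0
# b    3.0
# Name: col2, dtype: float64
```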
groupby_get_group.py

```python
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

iris_dataset = load_iris()
df_iris = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
# Add a column for the target
df_iris.loc[:, 'target'] = iris_dataset.target
# Create a dictionary mapping target codes to species names
iris_map = dict(zip([0, 1, 2], iris_dataset.target_names))
# Use map() with the dictionary to add a 'target_names' column
df_iris.loc[:, 'target_names'] = df_iris['target'].map(iris_map)
```
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_names |
|---|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| ... | ... | ... | ... | ... | ... |
| 5.7 | 2.8 | 4.1 | 1.3 | 1 | versicolor |
| ... | ... | ... | ... | ... | ... |
| 6.3 | 3.3 | 6.0 | 2.5 | 2 | virginica |
Split the DataFrame (`df_iris`) by species (`'target_names'`). The result of the split is `gp`.

groupby_get_group.py

```python
gp = df_iris.groupby('target_names')
```
In[0]:type(gp)
Out[0]:pandas.core.groupby.generic.DataFrameGroupBy
In[1]:print(gp)
Out[1]:<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028788A33708>
A dataset split with `groupby` cannot be used directly as a DataFrame. So let's investigate its attributes and contents using a `for` statement.
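As a quick aside, the keys and sizes of a `DataFrameGroupBy` object can also be inspected without iterating; a minimal self-contained sketch on a toy frame (hypothetical column names, not the iris data):

```python
import pandas as pd

# Toy frame with a categorical column (names are hypothetical)
df = pd.DataFrame({'species': ['setosa', 'setosa', 'virginica'],
                   'val': [1, 2, 3]})
gp = df.groupby('species')

# .groups maps each group key to the row index labels of its group
print(list(gp.groups))       # ['setosa', 'virginica']
# .size() gives the number of rows per group
print(gp.size()['setosa'])   # 2
```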
In[2]: for d_gp in gp:
           print(d_gp)
Out[2]:
147 6.5 3.0 ... 2 virginica
148 6.2 3.4 ... 2 virginica
149 5.9 3.0 ... 2 virginica
[50 rows x 6 columns])
In[3]:type(d_gp)
Out[3]: tuple
It seems that each split DataFrame is stored in a tuple-type variable (`d_gp`). To check the contents of the tuple, type the following:
In[4]:d_gp[0]
Out[4]:'virginica'
In[5]:d_gp[1]
Out[5]:
sepal length (cm) sepal width (cm) ... target target_names
100 6.3 3.3 ... 2 virginica
101 5.8 2.7 ... 2 virginica
102 7.1 3.0 ... 2 virginica
103 6.3 2.9 ... 2 virginica
147 6.5 3.0 ... 2 virginica
148 6.2 3.4 ... 2 virginica
149 5.9 3.0 ... 2 virginica
[50 rows x 6 columns]
So after the `for` statement finishes, `d_gp` holds the last group: the DataFrame for the third level, `'virginica'`, of `'target_names'`. You can confirm this above.
You could iterate over `d_gp[1]` alone, but here we take advantage of `d_gp[0]` and retrieve each specific dataset with the `get_group` function before processing it.
Iterating over the groupby object with a `for` statement yields tuples: the first element holds the row level produced by `groupby` (the species: setosa, versicolor, virginica), and the second element holds the corresponding DataFrame.
Using the level stored in the first element of the tuple as a variable, `get_group` extracts the DataFrame stored in the second element, so each level can be processed in turn.
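As a minimal self-contained sketch of this mechanism (toy column names, not the iris data), `get_group` pulls out one group's DataFrame by its key:

```python
import pandas as pd

df = pd.DataFrame({'species': ['a', 'a', 'b'],
                   'val': [10, 20, 30]})
gp = df.groupby('species')

# get_group returns the sub-frame whose key matches
sub = gp.get_group('b')
print(sub['val'].tolist())  # [30]
```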
The following splits the DataFrame by species (setosa, versicolor, virginica), retrieves the DataFrame for each species, and plots "sepal length" against "sepal width".
groupby_get_group.py

```python
for d_gp in gp:
    df_g = gp.get_group(d_gp[0])
    ## Put whatever processing you want to apply to each split DataFrame below
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    X = df_g[df_g.columns[0]].values
    y = df_g[df_g.columns[1]].values
    ax.set_title(str.capitalize(d_gp[0]) + " " +
                 str.capitalize(df_g.columns[0]) +
                 ' vs ' + str.capitalize(df_g.columns[1]))
    ax.scatter(X, y, marker='o', color='darkblue', edgecolor="")
    cor = np.corrcoef(X, y)[0, 1]
    ax.set_xlabel(str.capitalize(df_g.columns[0]))
    ax.set_ylabel(str.capitalize(df_g.columns[1]))
    ax.text(0.99, 0.01, "correlation:{0:.2}".format(cor),
            horizontalalignment='right', verticalalignment='bottom',
            fontsize=12, color="blue", transform=ax.transAxes)
    plt.show()
```
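Note that the same loop can also be written by unpacking the `(key, DataFrame)` tuple directly, which skips the `get_group` lookup; a minimal sketch on a toy stand-in for `df_iris` (the values here are hypothetical):

```python
import pandas as pd

# Toy stand-in for df_iris (hypothetical values)
df = pd.DataFrame({'target_names': ['setosa', 'setosa', 'virginica'],
                   'sepal length (cm)': [5.1, 4.9, 5.9]})

# Unpacking (name, df_g) replaces d_gp[0] / gp.get_group(d_gp[0])
for name, df_g in df.groupby('target_names'):
    print(name, len(df_g))
# setosa 2
# virginica 1
```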
That's all.
This is my first post on Qiita. Qiita has helped me a great deal, so I hope this helps someone in turn.