When analyzing data with Python's pandas, the `groupby` function is a convenient way to compute statistics per group. I often use orthodox patterns such as `df.groupby(df['col1'])['col2'].mean()` and `.describe()`. Sometimes, however, I want to process each split DataFrame individually, and I found that combining a `for` statement with `get_group` is convenient for this, so I will introduce it here.
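For context, the "orthodox" per-group aggregation mentioned above looks like this; a minimal sketch with a hypothetical two-column frame (`col1`, `col2` are placeholder names, not from the iris data):

```python
import pandas as pd

# Toy frame standing in for a real dataset (column names are hypothetical)
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'],
                   'col2': [1.0, 3.0, 2.0, 4.0]})

# Per-group mean: the usual groupby idiom
means = df.groupby(df['col1'])['col2'].mean()
print(means)
# col1
# a    2.0
# b    3.0
# Name: col2, dtype: float64
```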
groupby_get_group.py

```python
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

iris_dataset = load_iris()
df_iris = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
# Add a column for the target
df_iris.loc[:, 'target'] = iris_dataset.target
# Create a dictionary mapping target codes to species names
iris_map = dict(zip([0, 1, 2], iris_dataset.target_names))
# Use map() with the dictionary to add a 'target_names' column
df_iris.loc[:, 'target_names'] = df_iris['target'].map(iris_map)
```
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_names |
|---|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | 0 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | 0 | setosa |
| ... | ... | ... | ... | ... | ... |
| 5.7 | 2.8 | 4.1 | 1.3 | 1 | versicolor |
| ... | ... | ... | ... | ... | ... |
| 6.3 | 3.3 | 6.0 | 2.5 | 2 | virginica |
Split the DataFrame (`df_iris`) by species (`'target_names'`). The result of the split is `gp`.

groupby_get_group.py

```python
gp = df_iris.groupby('target_names')
```
In[0]:type(gp)
Out[0]:pandas.core.groupby.generic.DataFrameGroupBy
In[1]:print(gp)
Out[1]:<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028788A33708>
A dataset split with `groupby` cannot be used directly as a DataFrame. So let's investigate its attributes and contents using a `for` statement.
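As a quick aside, the keys and sizes of a `DataFrameGroupBy` object can also be inspected without iterating; a minimal self-contained sketch on a toy frame (hypothetical column names, not the iris data):

```python
import pandas as pd

# Toy frame with a categorical column (names are hypothetical)
df = pd.DataFrame({'species': ['setosa', 'setosa', 'virginica'],
                   'val': [1, 2, 3]})
gp = df.groupby('species')

# .groups maps each group key to the row index labels of its group
print(list(gp.groups))       # ['setosa', 'virginica']
# .size() gives the number of rows per group
print(gp.size()['setosa'])   # 2
```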
In[2]: for d_gp in gp:
           print(d_gp)
Out[2]:
147 6.5 3.0 ... 2 virginica
148 6.2 3.4 ... 2 virginica
149 5.9 3.0 ... 2 virginica
[50 rows x 6 columns])
In[3]:type(d_gp)
Out[3]: tuple
It seems that each split DataFrame is stored in a tuple-type variable (`d_gp`). To check the contents of the tuple, type the following:
In[4]:d_gp[0]
Out[4]:'virginica'
In[5]:d_gp[1]
Out[5]:
sepal length (cm) sepal width (cm) ... target target_names
100 6.3 3.3 ... 2 virginica
101 5.8 2.7 ... 2 virginica
102 7.1 3.0 ... 2 virginica
103 6.3 2.9 ... 2 virginica
147 6.5 3.0 ... 2 virginica
148 6.2 3.4 ... 2 virginica
149 5.9 3.0 ... 2 virginica
[50 rows x 6 columns]
So after the `for` statement finishes, `d_gp` holds the last group: the DataFrame for the third level, `'virginica'`, of `'target_names'`. You can confirm this above.
You could iterate over `d_gp[1]` alone, but here we take advantage of `d_gp[0]` and retrieve each specific dataset with the `get_group` function before processing it.
Iterating over the groupby object with a `for` statement yields tuples: the first element holds the row level produced by `groupby` (the species: setosa, versicolor, virginica), and the second element holds the corresponding DataFrame.
Using the level stored in the first element of the tuple as a variable, `get_group` extracts the DataFrame stored in the second element, so each level can be processed in turn.
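As a minimal self-contained sketch of this mechanism (toy column names, not the iris data), `get_group` pulls out one group's DataFrame by its key:

```python
import pandas as pd

df = pd.DataFrame({'species': ['a', 'a', 'b'],
                   'val': [10, 20, 30]})
gp = df.groupby('species')

# get_group returns the sub-frame whose key matches
sub = gp.get_group('b')
print(sub['val'].tolist())  # [30]
```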
The following splits the DataFrame by species (setosa, versicolor, virginica), retrieves the DataFrame for each species, and plots "sepal length" against "sepal width".
groupby_get_group.py

```python
for d_gp in gp:
    df_g = gp.get_group(d_gp[0])
    ## Put whatever processing you want to apply to each split DataFrame below
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    X = df_g[df_g.columns[0]].values
    y = df_g[df_g.columns[1]].values
    ax.set_title(str.capitalize(d_gp[0]) + " " +
                 str.capitalize(df_g.columns[0]) +
                 ' vs ' + str.capitalize(df_g.columns[1]))
    ax.scatter(X, y, marker='o', color='darkblue', edgecolor="")
    cor = np.corrcoef(X, y)[0, 1]
    ax.set_xlabel(str.capitalize(df_g.columns[0]))
    ax.set_ylabel(str.capitalize(df_g.columns[1]))
    ax.text(0.99, 0.01, "correlation:{0:.2}".format(cor),
            horizontalalignment='right', verticalalignment='bottom',
            fontsize=12, color="blue", transform=ax.transAxes)
    plt.show()
```
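Note that the same loop can also be written by unpacking the `(key, DataFrame)` tuple directly, which skips the `get_group` lookup; a minimal sketch on a toy stand-in for `df_iris` (the values here are hypothetical):

```python
import pandas as pd

# Toy stand-in for df_iris (hypothetical values)
df = pd.DataFrame({'target_names': ['setosa', 'setosa', 'virginica'],
                   'sepal length (cm)': [5.1, 4.9, 5.9]})

# Unpacking (name, df_g) replaces d_gp[0] / gp.get_group(d_gp[0])
for name, df_g in df.groupby('target_names'):
    print(name, len(df_g))
# setosa 2
# virginica 1
```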
That's all.
This is my first post on Qiita. Qiita has helped me a great deal, so I hope this helps someone in turn.