How to divide and process a data frame using the groupby function

About this article

In data analysis with Python's pandas, groupby is a convenient method for computing results per group. I usually use it in the orthodox way, for example df.groupby(df['col1'])['col2']**.mean()** or **.describe()**. Sometimes, though, I want to process each of the split data frames separately. I found that combining a **for** statement with **get_group** is convenient for this, so I will introduce it here.
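For reference, the orthodox usage mentioned above might look like the following. This is only a minimal sketch: the toy data frame and its values are hypothetical, and only the column names `col1` and `col2` come from the text.

```python
import pandas as pd

# Hypothetical toy data: 'col1' is the grouping key, 'col2' the value column.
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "b"],
    "col2": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Mean of col2 within each group of col1
group_means = df.groupby("col1")["col2"].mean()
print(group_means)

# Summary statistics (count, mean, std, min, quartiles, max) per group
group_summary = df.groupby("col1")["col2"].describe()
print(group_summary)
```

Both calls return one result per group, which is enough when a single aggregate is all that is needed.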

Data preparation

groupby_get_group.py


import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import numpy as np

iris_dataset = load_iris()
df_iris=pd.DataFrame(iris_dataset.data,columns=iris_dataset.feature_names)
# Add a column for the target (class label)
df_iris.loc[:,'target']=iris_dataset.target
# Create a dictionary mapping class labels to species names
iris_map=dict(zip([0,1,2],iris_dataset.target_names))
# Map the target column through the dictionary to add a target_names column
df_iris.loc[:,'target_names']=df_iris['target'].map(iris_map)

sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target  target_names
              5.1               3.5                1.4               0.2       0        setosa
              4.9               3.0                1.4               0.2       0        setosa
              ...               ...                ...               ...     ...           ...
              5.7               2.8                4.1               1.3       1    versicolor
              ...               ...                ...               ...     ...           ...
              6.3               3.3                6.0               2.5       2     virginica

Try applying the groupby function to target_names

Split the data frame (**df_iris**) by species (**'target_names'**). The resulting grouped object is **gp**.

groupby_get_group.py


gp = df_iris.groupby('target_names')

Examine the attributes of the split object

In[0]:type(gp)
Out[0]:pandas.core.groupby.generic.DataFrameGroupBy

In[1]:print(gp)
Out[1]:<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028788A33708>
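Besides `type` and `print`, a few members of the grouped object give a quick overview of how the data was split. The following is a minimal sketch that uses a small hypothetical data frame in place of df_iris; `ngroups`, `size()` and `groups` are standard members of a pandas GroupBy object.

```python
import pandas as pd

# Hypothetical stand-in for df_iris with only two columns
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.7, 6.3, 5.8],
    "target_names": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
})
gp = df.groupby("target_names")

n_groups = gp.ngroups           # number of groups
group_sizes = gp.size()         # number of rows per group, as a Series
group_keys = sorted(gp.groups)  # gp.groups maps each group label to its row index
print(n_groups, group_keys)
print(group_sizes)
```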

The object returned by groupby cannot be used directly like a data frame. So let's inspect its contents with a for statement.

Use a for statement

In[2]:for d_gp in gp:
          print(d_gp)
Out[2]:
 147                6.5               3.0  ...       2     virginica
 148                6.2               3.4  ...       2     virginica
 149                5.9               3.0  ...       2     virginica
 
 [50 rows x 6 columns])

In[3]:type(d_gp)
Out[3]:tuple

Each split data frame is apparently yielded inside a tuple (**d_gp**). To check the contents of the tuple, inspect its two elements:

In[4]:d_gp[0]
Out[4]:'virginica'
 
In[5]:d_gp[1]
Out[5]:
     sepal length (cm)  sepal width (cm)  ...  target  target_names
100                6.3               3.3  ...       2     virginica
101                5.8               2.7  ...       2     virginica
102                7.1               3.0  ...       2     virginica
103                6.3               2.9  ...       2     virginica

147                6.5               3.0  ...       2     virginica
148                6.2               3.4  ...       2     virginica
149                5.9               3.0  ...       2     virginica

[50 rows x 6 columns]

So, after the for statement has finished, **d_gp** holds the last group: the key of the third level, **'virginica'**, of **"target_names"**, together with its data frame.

You could therefore loop over the groups using only **d_gp[1]**, but here I make use of **d_gp[0]** and retrieve a specific data frame with the **get_group** method before processing it.
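As a side note on using the second element directly, each tuple can be unpacked in the for statement itself. This is a minimal sketch with a small hypothetical data frame in place of df_iris; the names `name`, `df_g` and `row_counts` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_iris
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.7, 6.3, 5.8],
    "target_names": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
})
gp = df.groupby("target_names")

row_counts = {}
for name, df_g in gp:
    # name is the group key (d_gp[0]); df_g is that group's data frame (d_gp[1])
    row_counts[name] = len(df_g)
    print(name, df_g.shape)
```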

Retrieve the data stored in the tuple with get_group

Iterating over the grouped object with a **for** statement yields one tuple per group. The first element of each tuple is the group key, that is, a level of the column passed to **groupby** (species: setosa, versicolor, virginica). The second element is the data frame for that group.

Passing the key stored in the first element to **get_group** retrieves the same data frame that is stored in the second element, so each level can be processed in turn.
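get_group can also be called with a key written out by hand, which is useful when only one group is of interest. A minimal sketch, again assuming a small hypothetical data frame in place of df_iris:

```python
import pandas as pd

# Hypothetical stand-in for df_iris
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.7, 6.0, 6.3],
    "target_names": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})
gp = df.groupby("target_names")

# Retrieve the data frame of a single group by its key
df_versicolor = gp.get_group("versicolor")
print(df_versicolor)

# A key that is not among the groups raises KeyError
try:
    gp.get_group("unknown")
    missing_key_raised = False
except KeyError:
    missing_key_raised = True
```

The rows keep their original index labels, so they can still be matched against the source data frame.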

The following code retrieves the data frame for each species (setosa, versicolor, virginica) by specifying its key, and plots "sepal length" against "sepal width".

groupby_get_group.py


for d_gp in gp:
    # d_gp[0] is the group key (species name); fetch that group's data frame
    df_g=gp.get_group(d_gp[0])
    ## Write the processing you want to apply to each split data frame below
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    X=df_g[df_g.columns[0]].values
    y=df_g[df_g.columns[1]].values
    ax.set_title(str.capitalize(d_gp[0])+"  "+\
                 str.capitalize(df_g.columns[0])+\
                 ' vs '+str.capitalize(df_g.columns[1]))
    # edgecolor='none' draws the markers without an outline
    ax.scatter(X,y,marker='o',color='darkblue',edgecolor='none')
    # Pearson correlation coefficient between the two columns
    cor=np.corrcoef(X, y)[0,1]
    ax.set_xlabel(str.capitalize(df_g.columns[0]))
    ax.set_ylabel(str.capitalize(df_g.columns[1]))
    ax.text(0.99, 0.01,"correlation:{0:.2}".format(cor),
                    horizontalalignment='right', verticalalignment='bottom',
                    fontsize=12,color="blue",transform=ax.transAxes)
    plt.show()
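The same loop pattern works when only the numbers are needed and no plot is drawn. The sketch below collects the per-group correlation into a dictionary; the toy data and the column names `x` and `y` are hypothetical and chosen so that the result is easy to check.

```python
import pandas as pd

# Hypothetical toy data standing in for df_iris
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
    "target_names": ["setosa"] * 3 + ["versicolor"] * 3,
})
gp = df.groupby("target_names")

correlations = {}
for name, df_g in gp:
    # Pearson correlation between the two columns within this group
    correlations[name] = df_g["x"].corr(df_g["y"])
print(correlations)
```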

Figures (one scatter plot per species): setosa.png, Versicolor.png, Virginica.png

That's all.

In conclusion

This is my first post on Qiita. Qiita has always been a great help to me, so I hope this post helps someone in turn.
