Pandas / DataFrame Tips for practical use

Introduction

In data analysis such as machine learning, data preprocessing is indispensable, and in practice real data rarely looks the way you expect. Fortunately, Python has a powerful library called pandas, which can handle almost any processing of structured data. Within pandas, DataFrame, which handles tabular data, is used very frequently, so this article focuses on DataFrame. The actual code is available on GitHub for reference: https://github.com/andever-flatfish/pandas-tips/blob/master/notebook/pandas-tips.ipynb

Each tip below uses the following DataFrame.

import pandas as pd

df = pd.DataFrame([['tokyo', 'male', 21, 165, 63],
                   ['osaka', 'male', 28, 170, 71],
                   ['fukuoka', 'female', 32, 175, 58],
                   ['tokyo', 'male', 21, 165, 63],
                   ['osaka', 'female', 28, 175, 70],
                   ['fukuoka', 'male', 32, 155, 58],
                   ['tokyo', 'female', 21, 165, 63],
                   ['osaka', 'male', 28, 172, 67],
                   ['fukuoka', 'male', 42, 155, 48]],
                   columns=['area', 'gender', 'age', 'height', 'weight'])

Data type confirmation / type conversion

Because Python is a dynamically typed language, you usually do not need to think about types. Even so, unexpected values often cause errors. For example, a column you assumed was int type may contain a single space in one row, which makes the whole column object type. In such cases, check the data types as follows.


df.dtypes
area      object
gender    object
age        int64
height     int64
weight     int64
dtype: object

To check only one column, do the following.


df['height'].dtype
dtype('int64')

To convert an int column to float, use astype as follows. If you check with df['height'].dtype afterwards, you can see that the type has been converted.


df['height'] = df['height'].astype(float)

However, to convert an object column to int or float, you first need to fix or remove the values that made it object type (for example, a space), and then convert it with astype.
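
The clean-then-convert step could look roughly like the sketch below. The `raw` DataFrame and its values are made up for illustration, and using `pd.to_numeric` with `errors='coerce'` to locate the bad values is one possible approach, not necessarily the one the article's notebook uses.

```python
import pandas as pd

# Hypothetical column: one entry is a lone space, so the column is not numeric.
raw = pd.DataFrame({'height': ['165', '170', ' ', '175']})
print(raw['height'].dtype)

# Turn anything that cannot be parsed as a number into NaN.
converted = pd.to_numeric(raw['height'], errors='coerce')
print(converted.isna().sum())  # 1

# Drop the rows that failed to parse, then convert to int with astype.
clean = raw.loc[converted.notna()].copy()
clean['height'] = converted.dropna().astype(int)
print(clean['height'].tolist())  # [165, 170, 175]
```

Whether to drop the bad rows or fill them with a substitute value depends on the data, so treat the drop here as only one option.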

Check the column name of DataFrame

You will often want to see the column names of a DataFrame. If you also want to process each column in turn, it is convenient to keep the column names in a list.


df_columns = list(df.columns)
print(df_columns)
['area', 'gender', 'age', 'height', 'weight']
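
As a small illustration of processing each column through the saved list, the sketch below counts the distinct values per column. The shortened DataFrame and the choice of `nunique` are assumptions made only for this example.

```python
import pandas as pd

# A shortened, illustrative version of the sample data.
df = pd.DataFrame([['tokyo', 'male', 21],
                   ['osaka', 'female', 28],
                   ['tokyo', 'male', 32]],
                  columns=['area', 'gender', 'age'])

df_columns = list(df.columns)

# Loop over the saved column names and count distinct values in each column.
unique_counts = {col: int(df[col].nunique()) for col in df_columns}
print(unique_counts)  # {'area': 2, 'gender': 2, 'age': 3}
```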

Extract only the data of a specific row

You can extract only the rows that meet a certain condition. For example, to extract the rows of df whose 'area' is 'fukuoka', do the following.


part_of_df1 = df[df['area']=='fukuoka']

Similarly, to extract the rows of df whose 'area' is 'tokyo' or 'osaka', do the following.


part_of_df2 = df[df['area'].isin(['tokyo', 'osaka'])]
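
It may help to see what the condition inside the brackets actually is: a boolean Series with one True or False per row. The sketch below, on a small made-up DataFrame, prints that mask and also combines two conditions with `&`, which is an extension not covered in the text above.

```python
import pandas as pd

# Small illustrative data, not the full sample DataFrame.
df = pd.DataFrame({'area': ['tokyo', 'osaka', 'fukuoka', 'fukuoka'],
                   'gender': ['male', 'male', 'female', 'male'],
                   'age': [21, 28, 32, 42]})

# The comparison yields one boolean per row; df[mask] keeps the True rows.
mask = df['area'] == 'fukuoka'
print(mask.tolist())  # [False, False, True, True]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
fukuoka_males = df[(df['area'] == 'fukuoka') & (df['gender'] == 'male')]
print(fukuoka_males['age'].tolist())  # [42]
```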

Reassign the numbers in the index

When you extract only specific rows, the result keeps the index of the original DataFrame, so renumber the index as follows.


part_of_df1 = part_of_df1.reset_index(drop=True)
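
To make the effect visible, the following sketch prints the index before and after the reset on a small made-up DataFrame. It also shows what happens without `drop=True`, as a point of comparison.

```python
import pandas as pd

# Small illustrative data, not the full sample DataFrame.
df = pd.DataFrame({'area': ['tokyo', 'osaka', 'fukuoka', 'tokyo', 'fukuoka'],
                   'age': [21, 28, 32, 21, 42]})

# The filtered rows keep their original row labels.
subset = df[df['area'] == 'fukuoka']
print(subset.index.tolist())  # [2, 4]

# drop=True discards the old labels and renumbers from 0.
renumbered = subset.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]

# Without drop=True, the old labels are kept as a new column named 'index'.
kept = subset.reset_index()
print(kept.columns.tolist())  # ['index', 'area', 'age']
```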

Change the order of columns

To change the order of the columns in a DataFrame, list the column names in the order you want and index the DataFrame with that list.


sorted_columns_list1 = ['height', 'weight', 'gender', 'area', 'age']
sorted_df1 = df[sorted_columns_list1]

The same approach can also be used to extract only specific columns.


sorted_columns_list2 = ['area', 'gender', 'age']
sorted_df2 = df[sorted_columns_list2]

Change column name

Use rename to modify the column names as follows:


# Only the columns whose names change need to be listed.
rename_df = df.rename(columns={'gender': 'sex',
                               'weight': 'body weight'})

Union other data frames

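Use pd.concat to union DataFrames that have the same column names. (DataFrame.append, which older code uses for this, was deprecated in pandas 1.4 and removed in pandas 2.0.)


other_df = pd.DataFrame([['hokkaido', 'male', 25, 162, 60],
                         ['hokkaido', 'female', 38, 179, 81]],
                        columns=['area', 'gender', 'age', 'height', 'weight'])

# ignore_index=True renumbers the rows of the combined DataFrame from 0.
union_df = pd.concat([df, other_df], ignore_index=True)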

Calculate basic statistics

When checking numerical data, calculating the basic statistics can help you spot outliers.


df.describe()
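
One way the statistics might be used to spot an outlier is sketched below. The height values include a deliberately planted typo (1550), and the 1.5 x IQR rule is a common heuristic chosen here as an assumption; the article itself does not prescribe a specific rule.

```python
import pandas as pd

# Illustrative heights; 1550 is a planted data-entry error.
df = pd.DataFrame({'height': [165, 170, 175, 165, 175, 155, 165, 172, 1550]})

stats = df['height'].describe()
print(stats['max'])  # 1550.0, far above the other values

# Assumed heuristic: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = stats['25%'], stats['75%']
iqr = q3 - q1
outliers = df[(df['height'] < q1 - 1.5 * iqr) | (df['height'] > q3 + 1.5 * iqr)]
print(outliers['height'].tolist())  # [1550]
```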

Import / export csv

A CSV file can be read easily as follows. If the text in df is garbled, specify the encoding argument.


df = pd.read_csv('data.csv') 
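
As a rough illustration of the encoding argument, the sketch below first writes a small file in Shift_JIS and then reads it back. The file name, the temporary directory, and the choice of Shift_JIS are all assumptions for the example; use whatever encoding your file was actually saved in.

```python
import os
import tempfile

import pandas as pd

# Write a small CSV in Shift_JIS to simulate a file that is not UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'data_sjis.csv')
pd.DataFrame({'area': ['東京', '大阪'], 'age': [21, 28]}).to_csv(
    path, index=False, encoding='shift_jis')

# read_csv assumes UTF-8 by default; naming the real encoding reads it correctly.
df_sjis = pd.read_csv(path, encoding='shift_jis')
print(df_sjis['area'].tolist())  # ['東京', '大阪']
```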

If there is no header in the original csv, write as follows.


df = pd.read_csv('data.csv', header=None) 

Writing a CSV file is just as easy. When the index is only a sequential number, it is usually left out of the output with index=False.


df.to_csv('data.csv', index=False)

It is less common, but if you do not want to write the header either, do the following.


df.to_csv('data.csv', header=False, index=False)
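
The sketch below ties the write and read options together: it writes a small DataFrame without header or index, then reads it back with `header=None`. It uses an in-memory `io.StringIO` buffer instead of a real file, and passes column names through `names`; both are choices made only to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({'area': ['tokyo', 'osaka'], 'age': [21, 28]})

# Write only the data rows: no header line and no index column.
buffer = io.StringIO()
df.to_csv(buffer, header=False, index=False)
print(buffer.getvalue().splitlines())  # ['tokyo,21', 'osaka,28']

# Read it back; header=None tells pandas the first line is data, not names.
buffer.seek(0)
restored = pd.read_csv(buffer, header=None, names=['area', 'age'])
print(restored['area'].tolist())  # ['tokyo', 'osaka']
```

Without `names`, the restored columns would simply be labeled 0 and 1.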
