Introduction

Data preprocessing is indispensable in data analysis tasks such as machine learning, and in practice real-world data rarely arrives in the form you expect. Fortunately, Python has a powerful library called pandas that covers almost everything you need for processing structured data. Within pandas, the DataFrame, which handles tabular data, is used very frequently, so this article focuses on it. The full code is available on GitHub, so please refer to it: https://github.com/andever-flatfish/pandas-tips/blob/master/notebook/pandas-tips.ipynb

Each tip below uses the following DataFrame.

import pandas as pd

df = pd.DataFrame([['tokyo', 'male', 21, 165, 63],
                   ['osaka', 'male', 28, 170, 71],
                   ['fukuoka', 'female', 32, 175, 58],
                   ['tokyo', 'male', 21, 165, 63],
                   ['osaka', 'female', 28, 175, 70],
                   ['fukuoka', 'male', 32, 155, 58],
                   ['tokyo', 'female', 21, 165, 63],
                   ['osaka', 'male', 28, 172, 67],
                   ['fukuoka', 'male', 42, 155, 48]],
                   columns=['area', 'gender', 'age', 'height', 'weight'])

Data type confirmation / type conversion

Because Python is dynamically typed, you usually don't need to think about types. Even so, unexpected values often cause errors. For example, a column you expected to be int may contain a single stray space in one row, which makes the whole column object type. In such cases, check the data types as follows.


df.dtypes

area      object
gender    object
age        int64
height     int64
weight     int64
dtype: object

To check just one column, do the following.


df['height'].dtype

dtype('int64')

To convert an int column to float, do the following. If you then check df['height'].dtype as above, you can see that it has been converted.


df['height'] = df['height'].astype(float)

However, to convert an object column to int or float, you must first fix or remove the values that made it object type (for example, a stray space) and then convert it with astype.
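
The cleanup described above can be sketched roughly as follows. The `raw` DataFrame and its values are made up for illustration, and the two options shown (pd.to_numeric with errors='coerce', or filtering out blank strings before astype) are only two of several reasonable approaches.

```python
import pandas as pd

# Hypothetical data: one height is a lone space, so the column is not numeric.
raw = pd.DataFrame({'height': ['165', '170', ' ', '175']})
print(raw['height'].dtype)

# Option 1: turn anything that cannot be parsed into NaN, then handle it later.
as_float = pd.to_numeric(raw['height'], errors='coerce')

# Option 2: remove the blank rows first, then convert with astype.
cleaned = raw[raw['height'].str.strip() != ''].copy()
cleaned['height'] = cleaned['height'].astype(int)

print(as_float)
print(cleaned)
```

Option 1 keeps every row but leaves a missing value to deal with, while Option 2 drops the offending row, so which one fits depends on whether the row is still useful.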

Check the column names of a DataFrame

You often want to see the column names of a DataFrame. If you also want to process each column in turn, it is convenient to keep the column names in a list.


df_columns = list(df.columns)
print(df_columns)

['area', 'gender', 'age', 'height', 'weight']
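
As a small illustration of doing something for each column, the sketch below loops over the column names and counts the distinct values in each. The shortened DataFrame and the `unique_counts` dictionary are assumptions made for this example only.

```python
import pandas as pd

# A shortened stand-in for the article's DataFrame.
df = pd.DataFrame([['tokyo', 'male', 21],
                   ['osaka', 'female', 28],
                   ['tokyo', 'male', 21]],
                  columns=['area', 'gender', 'age'])

# Loop over the column names and count the distinct values in each column.
unique_counts = {}
for column in list(df.columns):
    unique_counts[column] = df[column].nunique()

print(unique_counts)
```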

Extract only the data of a specific row

You can extract only the rows that meet a certain condition. For example, to extract the rows of df whose 'area' is 'fukuoka', do the following.


part_of_df1 = df[df['area']=='fukuoka']

Likewise, to extract the rows of df whose 'area' is 'tokyo' or 'osaka', do the following.


part_of_df2 = df[df['area'].isin(['tokyo', 'osaka'])]
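
Conditions on different columns can also be combined. The following is a minimal sketch, assuming you want the rows where 'area' is 'fukuoka' and 'age' is 40 or more; the name `part_of_df3` and the threshold are hypothetical and not part of the original notebook.

```python
import pandas as pd

# A shortened stand-in for the article's DataFrame.
df = pd.DataFrame([['tokyo', 'male', 21],
                   ['fukuoka', 'female', 32],
                   ['fukuoka', 'male', 32],
                   ['fukuoka', 'male', 42]],
                  columns=['area', 'gender', 'age'])

# Wrap each condition in parentheses; & means "and", | means "or".
mask = (df['area'] == 'fukuoka') & (df['age'] >= 40)
part_of_df3 = df[mask]

print(part_of_df3)
```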

Reassign the numbers in the index

When you extract only specific rows, the result keeps the index of the original DataFrame, so renumber the index as follows.


part_of_df1 = part_of_df1.reset_index(drop=True)
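
To see what the renumbering does, the sketch below prints the index before and after the reset. It uses a small made-up DataFrame, and the names `part` and `renumbered` are for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'area': ['tokyo', 'osaka', 'fukuoka', 'tokyo', 'fukuoka']})

# Filtering keeps the row labels of the original DataFrame.
part = df[df['area'] == 'fukuoka']
print(list(part.index))        # [2, 4]

# drop=True discards the old labels instead of keeping them as a new column.
renumbered = part.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```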

Change the order of columns

To change the order of the columns in a DataFrame, list the column names in the order you want and index the DataFrame with that list.


sorted_columns_list1 = ['height', 'weight', 'gender', 'area', 'age']
sorted_df1 = df[sorted_columns_list1]

Besides reordering, the same approach also lets you extract only specific columns.


sorted_columns_list2 = ['area', 'gender', 'age']
sorted_df2 = df[sorted_columns_list2]

Change column name

Use rename to modify the column names as follows:


# Only the columns being renamed need to appear in the mapping.
rename_df = df.rename(columns={'gender': 'sex',
                               'weight': 'body weight'})

Union other data frames

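Use pd.concat to union DataFrames that have the same column names. (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0.)


other_df = pd.DataFrame([['hokkaido', 'male', 25, 162, 60],
                         ['hokkaido', 'female', 38, 179, 81]],
                         columns=['area', 'gender', 'age', 'height', 'weight'])

# ignore_index=True renumbers the rows of the combined DataFrame from 0.
union_df = pd.concat([df, other_df], ignore_index=True)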

Calculate basic statistics

Computing basic statistics is a quick way to check the numerical columns, and it can reveal outliers.


df.describe()
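
As a rough sketch of how the statistics can point to an outlier, the example below uses made-up heights that include one deliberate data-entry error. The 1.5 × IQR rule used to flag it is a common rule of thumb assumed here for illustration, not something taken from the original notebook.

```python
import pandas as pd

# Hypothetical heights in cm; 1650 is a deliberate data-entry error.
heights = pd.Series([165, 170, 175, 165, 1650], name='height')

stats = heights.describe()
print(stats)

# Assumed rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
iqr = stats['75%'] - stats['25%']
lower = stats['25%'] - 1.5 * iqr
upper = stats['75%'] + 1.5 * iqr
outliers = heights[(heights < lower) | (heights > upper)]

print(outliers)
```

In the describe output, a maximum far above the 75% value is the first hint that something is wrong.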

Import / export csv

CSV data can be read easily as follows. If the text in df comes out garbled, specify the encoding argument.


df = pd.read_csv('data.csv')
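
The encoding case might look like the sketch below. The file name `data_sjis.csv`, its contents, and the choice of cp932 (Shift_JIS) are all assumptions for illustration; use whatever encoding your file was actually saved in.

```python
import os
import tempfile

import pandas as pd

# Write a small cp932 (Shift_JIS) file to stand in for a problematic CSV.
path = os.path.join(tempfile.mkdtemp(), 'data_sjis.csv')
with open(path, 'w', encoding='cp932', newline='') as f:
    f.write('area,age\n東京,21\n大阪,28\n')

# The default is UTF-8, which does not match this file, so name the encoding.
df_sjis = pd.read_csv(path, encoding='cp932')

print(df_sjis)
```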

If the original CSV has no header row, write the following.


df = pd.read_csv('data.csv', header=None)

Writing a CSV is just as easy. When the index is only a sequential number, it is usually left out of the output by passing index=False.


df.to_csv('data.csv', index=False)

Less commonly, if you also don't want to write the header, do the following.


df.to_csv('data.csv', header=False, index=False)
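
To tie the write and read options together, here is a minimal round-trip sketch. It keeps the CSV text in memory with io.StringIO instead of a real file, and the `names` argument is used to supply column names on reading; the small DataFrame is made up for this example.

```python
import io

import pandas as pd

df = pd.DataFrame({'area': ['tokyo', 'osaka'], 'age': [21, 28]})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(header=False, index=False)

# The text has no header row, so pass header=None and give the names yourself.
restored = pd.read_csv(io.StringIO(csv_text), header=None,
                       names=['area', 'age'])

print(restored)
```

Without names, the columns of a headerless file are simply numbered 0, 1, and so on.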

Pandas / DataFrame Tips for practical use