Data preprocessing is indispensable in data analysis such as machine learning, and in practice real data rarely arrives in the shape you expect. Fortunately, Python has a powerful library called pandas that can handle almost any processing of structured data. Within pandas, DataFrame, which handles tabular data, is used very frequently, so this article covers tips for working with DataFrame. The full code is on GitHub for reference: https://github.com/andever-flatfish/pandas-tips/blob/master/notebook/pandas-tips.ipynb
Each tip below uses the following DataFrame.
import pandas as pd
df = pd.DataFrame([['tokyo', 'male', 21, 165, 63],
['osaka', 'male', 28, 170, 71],
['fukuoka', 'female', 32, 175, 58],
['tokyo', 'male', 21, 165, 63],
['osaka', 'female', 28, 175, 70],
['fukuoka', 'male', 32, 155, 58],
['tokyo', 'female', 21, 165, 63],
['osaka', 'male', 28, 172, 67],
['fukuoka', 'male', 42, 155, 48]],
columns=['area', 'gender', 'age', 'height', 'weight'])
Because Python is a dynamically typed language, you usually don't need to think about types. Unexpected values still cause errors, though. For example, a column you expected to be int may contain a single space in one row, which makes pandas store the whole column as object. When that happens, check the data types as follows.
df.dtypes
area object
gender object
age int64
height int64
weight int64
dtype: object
To check the type of a single column, use the following.
df['height'].dtype
dtype('int64')
To convert an int column to float, use astype as follows. Running df['height'].dtype again afterwards confirms that the column has been converted.
df['height'] = df['height'].astype(float)
However, to convert an object column to int or float, you first need to fix or remove the values that made it object (for example, a stray space) and then convert it with astype.
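As a minimal sketch of that cleanup, assuming the offending values are whitespace-only strings or numbers padded with spaces (the `raw` DataFrame and its contents are made up for illustration), one option is to strip the whitespace and let `pd.to_numeric` with `errors='coerce'` turn anything it cannot parse into NaN:

```python
import pandas as pd

# Hypothetical column read as text because one row holds only a space
raw = pd.DataFrame({'height': ['165', '170', ' ', '175 ']})

# Strip surrounding whitespace, then coerce unparseable entries to NaN
# (NaN forces a floating-point result, so the column is not int yet)
cleaned = pd.to_numeric(raw['height'].str.strip(), errors='coerce')
print(cleaned.isna().sum())  # number of values that could not be parsed

# Once the missing rows are dropped (or filled), astype(int) works
as_int = cleaned.dropna().astype(int)
print(as_int.tolist())
```

Whether to drop, fill, or manually correct the NaN rows depends on the data, so treat the `dropna()` step as just one possible choice.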
You will often want to see the column names of a DataFrame. If you also want to process each column in turn, it is convenient to keep the column names in a list.
df_columns = list(df.columns)
print(df_columns)
['area', 'gender', 'age', 'height', 'weight']
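For instance, a list of column names makes per-column processing a simple loop. The sketch below uses a small stand-in DataFrame and counts the distinct values in each column; the `unique_counts` name is only illustrative.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
df = pd.DataFrame({'area': ['tokyo', 'osaka', 'tokyo'],
                   'age': [21, 28, 21]})
df_columns = list(df.columns)

# Collect the number of distinct values in every column
unique_counts = {}
for col in df_columns:
    unique_counts[col] = df[col].nunique()
print(unique_counts)  # {'area': 2, 'age': 2}
```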
You can extract only the rows that meet a condition. For example, to extract the rows of df whose 'area' is 'fukuoka', write the following.
part_of_df1 = df[df['area']=='fukuoka']
To extract the rows whose 'area' is 'tokyo' or 'osaka', use isin as follows.
part_of_df2 = df[df['area'].isin(['tokyo', 'osaka'])]
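Conditions on different columns can also be combined. The following sketch, which uses a reduced stand-in DataFrame, shows the `&` (AND) and `~` (NOT) operators; each comparison needs its own parentheses because these operators bind more tightly than `==`.

```python
import pandas as pd

# Reduced stand-in DataFrame for illustration
df = pd.DataFrame([['tokyo', 'male', 21],
                   ['osaka', 'female', 28],
                   ['fukuoka', 'male', 32],
                   ['osaka', 'male', 28]],
                  columns=['area', 'gender', 'age'])

# Rows where both conditions hold
osaka_males = df[(df['area'] == 'osaka') & (df['gender'] == 'male')]

# Rows where the condition does NOT hold
not_tokyo = df[~df['area'].isin(['tokyo'])]

print(osaka_males)
print(not_tokyo)
```

Use `|` in the same way for OR.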
Extracted rows keep the index they had in the original DataFrame, so reset the index as follows.
part_of_df1.reset_index(drop=True, inplace=True)
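To see what the reset does, the sketch below prints the index before and after on a small stand-in DataFrame. It uses the non-inplace form, which returns a new DataFrame and is an equally valid alternative to `inplace=True`.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
df = pd.DataFrame({'area': ['tokyo', 'fukuoka', 'osaka', 'fukuoka'],
                   'age': [21, 32, 28, 42]})

filtered = df[df['area'] == 'fukuoka']
print(filtered.index.tolist())    # [1, 3] (positions in the original df)

# drop=True discards the old index instead of keeping it as a column
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```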
To change the column order of a DataFrame, list the columns in the order you want and index the DataFrame with that list.
sorted_columns_list1 = ['height', 'weight', 'gender', 'area', 'age']
sorted_df1 = df[sorted_columns_list1]
The same approach also lets you extract only specific columns.
sorted_columns_list2 = ['area', 'gender', 'age']
sorted_df2 = df[sorted_columns_list2]
Use rename to change column names. Only the columns you want to rename need to appear in the mapping; the others are left unchanged.
rename_df = df.rename(columns={'gender': 'sex',
'weight': 'body weight'})
Use pd.concat to union DataFrames that have the same column names. (The older df.append method was deprecated in pandas 1.4 and removed in pandas 2.0.)
other_df = pd.DataFrame([['hokkaido', 'male', 25, 162, 60],
['hokkaido', 'female', 38, 179, 81]],
columns=['area', 'gender', 'age', 'height', 'weight'])
union_df = pd.concat([df, other_df], ignore_index=True)
To sanity-check numerical data, compute the basic statistics; they often make outliers easy to spot.
df.describe()
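As a rough illustration, the sketch below plants an obviously wrong height (a made-up value of 1650, as if a digit had been typed twice) and reads it back from the summary. By default only numeric columns are summarized; `include='all'` adds the others.

```python
import pandas as pd

# Stand-in data with one deliberately implausible height
df = pd.DataFrame({'area': ['tokyo', 'osaka', 'fukuoka', 'tokyo'],
                   'height': [165, 170, 175, 1650]})

stats = df.describe()                    # numeric columns only
print(stats.loc[['min', 'max'], 'height'])
print(stats.loc['max', 'height'])        # 1650.0 stands out against the rest

# Also summarize non-numeric columns (adds rows such as unique, top, freq)
all_stats = df.describe(include='all')
print(all_stats)
```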
A CSV file can be read as follows. If the text in df comes out garbled, specify the encoding argument.
df = pd.read_csv('data.csv')
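As a sketch of the encoding case, the example below first writes a small file in cp932 (Shift-JIS as used on Japanese Windows) and then reads it back. The file name and contents are made up for illustration; which encoding your own file needs is something you have to find out from its source.

```python
import os
import tempfile

import pandas as pd

# Create a stand-in CSV encoded as cp932 rather than UTF-8
path = os.path.join(tempfile.mkdtemp(), 'data_sjis.csv')
with open(path, 'w', encoding='cp932', newline='') as f:
    f.write('area,age\n東京,21\n大阪,28\n')

# read_csv assumes UTF-8 by default, so name the actual encoding explicitly
df_sjis = pd.read_csv(path, encoding='cp932')
print(df_sjis)
```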
If the original CSV has no header row, write the following.
df = pd.read_csv('data.csv', header=None)
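With `header=None`, pandas numbers the columns 0, 1, 2, and so on. The sketch below reads made-up CSV text from memory to show this, and also passes `names` to assign column names at read time.

```python
import io

import pandas as pd

# Stand-in CSV text with no header row
csv_text = 'tokyo,male,21\nosaka,female,28\n'

no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.columns.tolist())  # [0, 1, 2]

# Supply the column names yourself
named = pd.read_csv(io.StringIO(csv_text), header=None,
                    names=['area', 'gender', 'age'])
print(named.columns.tolist())      # ['area', 'gender', 'age']
```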
A DataFrame can be written to CSV as follows. When the index is just a sequential row number, it is usually left out with index=False.
df.to_csv('data.csv', index=False)
Less commonly, if you also want to omit the header row, write the following.
df.to_csv('data.csv', header=False, index=False)
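To compare the effect of these options without touching the disk, note that `to_csv` returns the CSV text as a string when no path is given. The sketch below uses a small stand-in DataFrame.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
df = pd.DataFrame({'area': ['tokyo', 'osaka'], 'age': [21, 28]})

with_index = df.to_csv()                          # index written as first column
without_index = df.to_csv(index=False)            # header and data only
no_header = df.to_csv(header=False, index=False)  # data only

print(with_index.splitlines()[0])   # ,area,age
print(without_index.splitlines())   # ['area,age', 'tokyo,21', 'osaka,28']
print(no_header.splitlines())       # ['tokyo,21', 'osaka,28']
```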