Previous posts in this series:

You will become an engineer in 100 days - Day 76 - Programming - About machine learning
You will become an engineer in 100 days - Day 70 - Programming - About scraping
You will become an engineer in 100 days - Day 66 - Programming - About natural language processing
You will become an engineer in 100 days - Day 63 - Programming - Probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About databases
You will become an engineer in 100 days - Day 24 - Python - Basics of the Python language 1
You will become an engineer in 100 days - Day 18 - JavaScript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
This post is a continuation of the series on machine learning.

The workflow when incorporating machine learning is as follows. Of these steps, parts 2-3 are called data preprocessing. This time, I would like to carry out this preprocessing and build a data mart.

The language is Python. For handling the data we use the pandas and NumPy libraries, and for visualization we use seaborn and matplotlib.
** Loading library **
```python
# Load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
```
** Data details **
The data used this time is the Titanic passenger list.

- PassengerId: passenger ID
- Survived: survival result (0 = died, 1 = survived)
- Pclass: passenger class (1 appears to be the highest)
- Name: passenger name
- Sex: sex
- Age: age
- SibSp: number of siblings and spouses aboard
- Parch: number of parents and children aboard
- Ticket: ticket number
- Fare: fare
- Cabin: cabin number
- Embarked: port of embarkation

Suppose you have the data in a file called `titanic_train.csv`.
** Read file **
The pandas library has a family of reading methods named `read_xxx`, one per file format; use the one that matches your file. This time it is a CSV file, so we use `read_csv`.

pandas handles tabular data in a structure called a data frame. Load the file into a data frame.
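Before touching the actual file, here is a tiny self-contained sketch of how `read_csv` behaves (the data below is invented and read from an in-memory string via `io.StringIO` instead of a file on disk):

```python
import io
import pandas as pd

# A tiny CSV held in memory; io.StringIO lets read_csv treat it like a file
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n"
df = pd.read_csv(io.StringIO(csv_text), encoding='utf-8')

print(df.shape)          # number of rows and columns
print(list(df.columns))  # column names parsed from the header row
```

The same call works on a file path; only the source of the bytes differs.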
```python
# Read data from the file
file_path = 'data/titanic_train.csv'
train_df = pd.read_csv(file_path, encoding='utf-8')
train_df.head()
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S |
The data looks like this.
Last time, we looked through the contents of the data and considered which columns might be usable. This time we continue from there: we take the columns that look usable and shape them into data for machine learning.
** Check for missing values **
When the file is read, empty entries are treated as missing values in the data frame.

You can check the number of missing values in each column with `dataframe.isnull().sum()`.
```python
train_df.isnull().sum()
```
```
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
```
Only a few columns have missing values: "Age", "Cabin", and "Embarked". Let's display only the rows that contain missing values.
** Extract rows that match a condition **

`dataframe[conditional expression]`

** Extract rows with missing values **

`dataframe[dataframe['column name'].isnull()]`
```python
train_df[train_df['Embarked'].isnull()]
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38 | 0 | 0 | 113572 | 80 | B28 | NaN |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62 | 0 | 0 | 113572 | 80 | B28 | NaN |
Looking at the `Embarked` column, the value is `NaN`. Missing values in a data frame are displayed as `NaN`.

For numeric columns, missing values are often filled with the mean, the median, or an arbitrary value. Categorical values such as `Embarked` are not numbers, so they cannot simply be replaced with a number.
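These filling strategies can be sketched on toy data (values invented for illustration): numeric gaps get the mean or median, while a categorical gap gets a placeholder or the most frequent value:

```python
import numpy as np
import pandas as pd

age = pd.Series([20.0, 30.0, np.nan, 40.0])
port = pd.Series(['S', 'C', None, 'S'])

# Numeric column: fill with the mean (here (20+30+40)/3 = 30.0) or the median
age_mean = age.fillna(age.mean())
print(age_mean.tolist())

# Categorical column: fill with a placeholder value...
port_n = port.fillna('N')
# ...or with the most frequent value (the mode, which ignores missing entries)
port_mode = port.fillna(port.mode()[0])
print(port_n.tolist(), port_mode.tolist())
```

Which strategy is appropriate depends on the column; the post below uses the mean for `Age` and a placeholder for `Embarked`.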
To fill in missing values, use `fillna`.

** Fill missing values with an arbitrary value **

`dataframe.fillna(fill value)`
To fill with the mean of a column, first compute the mean.

** Calculate the mean of a column **

`dataframe['column name'].mean()`

** Calculate the median of a column **

`dataframe['column name'].median()`
```python
print(train_df['Fare'].mean())
print(train_df['Fare'].median())
```

```
32.2042079685746
14.4542
```
```python
# Fill missing ages with the mean
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())
```
** Vectorizing categorical values **

In machine learning, essentially all values used in the calculation must be numeric. Categorical values made of strings usually cannot be used as machine learning data as they are, except by a few models.

Therefore, we convert categorical values into numbers as a `one-hot vector`.

** Turn a categorical value into a one-hot vector **

A `one-hot vector` creates one column per category value and sets the value to 1 where the row's category matches the column and 0 where it does not.

`pd.get_dummies(dataframe[['column name']])`
```python
# One-hot encode the port of embarkation
train_df["Embarked"] = train_df["Embarked"].fillna('N')
one_hot_df = pd.get_dummies(train_df["Embarked"], prefix='Em')
one_hot_df.head()
```
| | Em_C | Em_N | Em_Q | Em_S |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 |
Since the `Embarked` column has missing values, they are first replaced with `N` and then the column is one-hot encoded. A new data frame is generated in which the matching category is 1 and every other category is 0.

A column is created for each distinct value, so if there are too many distinct values the data becomes sparse (almost all 0). It is a good idea to one-hot encode only columns whose categories are limited to some extent.
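One way to judge this is to count the distinct values with `nunique()` before encoding. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'S', 'S', 'C'],
    'Ticket':   ['A1', 'B2', 'C3', 'D4', 'E5', 'F6'],
})

# Few distinct values -> a reasonable one-hot candidate
print(df['Embarked'].nunique())

# Every row unique -> one-hot encoding would create one column per row
print(df['Ticket'].nunique())

# get_dummies emits one column per category, in sorted order
dummies = pd.get_dummies(df['Embarked'], prefix='Em')
print(list(dummies.columns))
```

Here `Embarked` yields only three columns, while encoding `Ticket` would explode the table width for no benefit.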
** Converting between numbers and strings **

We convert string data to numbers (or numbers to strings) to make it usable for machine learning.

Since `Sex` is a string, it cannot be used for machine learning as it is. We convert it from a string to a number.

`dataframe['column name'].replace({value: value, value: value, ...})`
```python
# Encode sex as a number (0 = male, 1 = female)
train_df['Sex2'] = train_df['Sex'].replace({'male': 0, 'female': 1})
train_df.head()
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 |
This creates a new numeric sex column.

You can also use `replace` to convert numbers back to strings. To change the data type of an entire column, use:

`dataframe['column name'].astype(data type)`
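Both directions can be sketched on toy data (values invented for illustration): `replace` maps individual values, while `astype` casts the whole column:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'male']})

# String -> number with a value-by-value mapping
df['Sex2'] = df['Sex'].replace({'male': 0, 'female': 1})
print(df['Sex2'].tolist())

# Number -> string: either map values back...
df['Sex3'] = df['Sex2'].replace({0: 'male', 1: 'female'})
# ...or cast the whole column's data type
df['Sex2_str'] = df['Sex2'].astype(str)
print(df['Sex2_str'].tolist())
```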
** Binning numeric values **

Let's compute an age band from the age. Dividing the age by 10 with floor division gives the decade. You can also create a column for rows whose age is missing.
```python
# Turn age into decade bands
train_df['period'] = train_df['Age'] // 10
train_df['period'] = train_df['period'].fillna('NaN')
# np.str was removed from NumPy; use the built-in str instead
train_df['period'] = train_df['period'].astype(str)
period_df = pd.get_dummies(train_df["period"], prefix='Pe')
period_df.head()
```
| | Pe_0.0 | Pe_1.0 | Pe_2.0 | Pe_3.0 | Pe_4.0 | Pe_5.0 | Pe_6.0 | Pe_7.0 | Pe_8.0 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
** Combine data frames **

Combine the newly created data frames into one using `pd.concat`.

`pd.concat([dataframe, dataframe], axis=1)`
```python
con_df = pd.concat([train_df, period_df], axis=1)
con_df = pd.concat([con_df, one_hot_df], axis=1)
con_df.head(1)
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Pe_3.0 | Pe_4.0 | Pe_5.0 | Pe_6.0 | Pe_7.0 | Pe_8.0 | Em_C | Em_N | Em_Q | Em_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
You have now concatenated the data frames horizontally.
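To see what `axis` controls, here is a minimal sketch with two invented frames: `axis=1` joins columns side by side, while `axis=0` (the default) stacks rows:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# axis=1: side by side (rows are aligned by index, columns combined)
wide = pd.concat([a, b], axis=1)
print(wide.shape)

# axis=0: stacked vertically (rows appended)
tall = pd.concat([a, a], axis=0)
print(tall.shape)
```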
** Delete unnecessary columns **

After joining the data frames, drop the original columns, since the pre-conversion data is no longer needed.

`dataframe.drop(['column name'], axis=1)`
```python
data_df = con_df.drop(['PassengerId','Pclass','Name','Age','Ticket','Cabin','Embarked','period','Sex'], axis=1)
data_df.head()
```
| | Survived | SibSp | Parch | Fare | Sex2 | Pe_0.0 | Pe_1.0 | Pe_2.0 | Pe_3.0 | Pe_4.0 | Pe_5.0 | Pe_6.0 | Pe_7.0 | Pe_8.0 | Em_C | Em_N | Em_Q | Em_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 7.25 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 7.925 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 8.05 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
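As a quick sanity check that nothing non-numeric slipped through, `select_dtypes` can list any remaining string columns. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1], 'Fare': [7.25, 71.28], 'Name': ['a', 'b']})

# Columns that are NOT numeric; these would still need converting or dropping
non_numeric = df.select_dtypes(exclude='number').columns
print(list(non_numeric))

# After dropping them, no non-numeric columns remain
df2 = df.drop(list(non_numeric), axis=1)
print(df2.select_dtypes(exclude='number').empty)
```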
Now all of the data is numeric. Data in this form can be used as the final input for machine learning.

Today we processed the data and created a data mart for machine learning, although we covered only a handful of processing methods.

First, learn the rough flow. Once you understand it to a certain extent, it is a good idea to devise ways to improve accuracy or to try new methods.
21 days until you become an engineer
Otsu py's HP: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython