Previous posts in this series:

You will become an engineer in 100 days - Day 76 - Programming - About machine learning
You will become an engineer in 100 days - Day 70 - Programming - About scraping
You will become an engineer in 100 days - Day 66 - Programming - About natural language processing
You will become an engineer in 100 days - Day 63 - Programming - Probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About databases
You will become an engineer in 100 days - Day 24 - Python - Basics of the Python language 1
You will become an engineer in 100 days - Day 18 - JavaScript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
This post is a continuation of the series on machine learning.

The workflow when incorporating machine learning is as follows. Of these steps, parts 2-3 are called data preprocessing. This time, I would like to carry out this preprocessing and build a data mart.

The language is Python. For handling the data we use the pandas and NumPy libraries, and for visualization we use seaborn and matplotlib.
** Loading library **
```python
# Load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
```
** Data details **
The data used this time is the Titanic passenger list.

- PassengerId: passenger ID
- Survived: survival result (0 = died, 1 = survived)
- Pclass: passenger class (1 appears to be the highest)
- Name: passenger name
- Sex: sex
- Age: age
- SibSp: number of siblings and spouses aboard
- Parch: number of parents and children aboard
- Ticket: ticket number
- Fare: fare
- Cabin: cabin number
- Embarked: port of embarkation

Suppose you have the data in a file called `titanic_train.csv`.
** Read file **
The pandas library has a family of reading methods named `read_xxx`, one per file format; use the one that matches your file. This time it is a CSV file, so we use `read_csv`.

pandas handles tabular data in a structure called a data frame. Load the file into a data frame.
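Before touching the actual file, here is a tiny self-contained sketch of how `read_csv` behaves (the data below is invented and read from an in-memory string via `io.StringIO` instead of a file on disk):

```python
import io
import pandas as pd

# A tiny CSV held in memory; io.StringIO lets read_csv treat it like a file
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n"
df = pd.read_csv(io.StringIO(csv_text), encoding='utf-8')

print(df.shape)          # number of rows and columns
print(list(df.columns))  # column names parsed from the header row
```

The same call works on a file path; only the source of the bytes differs.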
```python
# Read data from the file
file_path = 'data/titanic_train.csv'
train_df = pd.read_csv(file_path, encoding='utf-8')
train_df.head()
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S |
The data looks like this.
Last time, we looked through the contents of the data and considered which columns might be usable. This time we continue from there: we take the columns that look usable and shape them into data for machine learning.
** Check for missing values **
When the file is read, empty entries are treated as missing values in the data frame.

You can check the number of missing values in each column with `dataframe.isnull().sum()`.
```python
train_df.isnull().sum()
```
```
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
```
Only a few columns have missing values: "Age", "Cabin", and "Embarked". Let's display only the rows that contain missing values.
** Extract rows that match a condition **

`dataframe[conditional expression]`

** Extract rows with missing values **

`dataframe[dataframe['column name'].isnull()]`
```python
train_df[train_df['Embarked'].isnull()]
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38 | 0 | 0 | 113572 | 80 | B28 | NaN |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62 | 0 | 0 | 113572 | 80 | B28 | NaN |
Looking at the `Embarked` column, the value is `NaN`. Missing values in a data frame are displayed as `NaN`.

For numeric columns, missing values are often filled with the mean, the median, or an arbitrary value. Categorical values such as `Embarked` are not numbers, so they cannot simply be replaced with a number.
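These filling strategies can be sketched on toy data (values invented for illustration): numeric gaps get the mean or median, while a categorical gap gets a placeholder or the most frequent value:

```python
import numpy as np
import pandas as pd

age = pd.Series([20.0, 30.0, np.nan, 40.0])
port = pd.Series(['S', 'C', None, 'S'])

# Numeric column: fill with the mean (here (20+30+40)/3 = 30.0) or the median
age_mean = age.fillna(age.mean())
print(age_mean.tolist())

# Categorical column: fill with a placeholder value...
port_n = port.fillna('N')
# ...or with the most frequent value (the mode, which ignores missing entries)
port_mode = port.fillna(port.mode()[0])
print(port_n.tolist(), port_mode.tolist())
```

Which strategy is appropriate depends on the column; the post below uses the mean for `Age` and a placeholder for `Embarked`.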
To fill in missing values, use `fillna`.

** Fill missing values with an arbitrary value **

`dataframe.fillna(fill value)`
To fill with the mean of a column, first compute the mean.

** Calculate the mean of a column **

`dataframe['column name'].mean()`

** Calculate the median of a column **

`dataframe['column name'].median()`
```python
print(train_df['Fare'].mean())
print(train_df['Fare'].median())
```

```
32.2042079685746
14.4542
```
```python
# Fill missing ages with the mean
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())
```
** Vectorizing categorical values **

In machine learning, essentially all values used in the calculation must be numeric. Categorical values made of strings usually cannot be used as machine learning data as they are, except by a few models.

Therefore, we convert categorical values into numbers as a `one-hot vector`.

** Turn a categorical value into a one-hot vector **

A `one-hot vector` creates one column per category value and sets the value to 1 where the row's category matches the column and 0 where it does not.

`pd.get_dummies(dataframe[['column name']])`
```python
# One-hot encode the port of embarkation
train_df["Embarked"] = train_df["Embarked"].fillna('N')
one_hot_df = pd.get_dummies(train_df["Embarked"], prefix='Em')
one_hot_df.head()
```
| | Em_C | Em_N | Em_Q | Em_S |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 |
Since the `Embarked` column has missing values, they are first replaced with `N` and then the column is one-hot encoded. A new data frame is generated in which the matching category is 1 and every other category is 0.

A column is created for each distinct value, so if there are too many distinct values the data becomes sparse (almost all 0). It is a good idea to one-hot encode only columns whose categories are limited to some extent.
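One way to judge this is to count the distinct values with `nunique()` before encoding. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'S', 'S', 'C'],
    'Ticket':   ['A1', 'B2', 'C3', 'D4', 'E5', 'F6'],
})

# Few distinct values -> a reasonable one-hot candidate
print(df['Embarked'].nunique())

# Every row unique -> one-hot encoding would create one column per row
print(df['Ticket'].nunique())

# get_dummies emits one column per category, in sorted order
dummies = pd.get_dummies(df['Embarked'], prefix='Em')
print(list(dummies.columns))
```

Here `Embarked` yields only three columns, while encoding `Ticket` would explode the table width for no benefit.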
** Converting between numbers and strings **

We convert string data to numbers (or numbers to strings) to make it usable for machine learning.

Since `Sex` is a string, it cannot be used for machine learning as it is. We convert it from a string to a number.

`dataframe['column name'].replace({value: value, value: value, ...})`
```python
# Encode sex as a number (0 = male, 1 = female)
train_df['Sex2'] = train_df['Sex'].replace({'male': 0, 'female': 1})
train_df.head()
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | NaN | S | 0 |
This creates a new numeric sex column.

You can also use `replace` to convert numbers back to strings. To change the data type of an entire column, use:

`dataframe['column name'].astype(data type)`
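Both directions can be sketched on toy data (values invented for illustration): `replace` maps individual values, while `astype` casts the whole column:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female', 'male']})

# String -> number with a value-by-value mapping
df['Sex2'] = df['Sex'].replace({'male': 0, 'female': 1})
print(df['Sex2'].tolist())

# Number -> string: either map values back...
df['Sex3'] = df['Sex2'].replace({0: 'male', 1: 'female'})
# ...or cast the whole column's data type
df['Sex2_str'] = df['Sex2'].astype(str)
print(df['Sex2_str'].tolist())
```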
** Binning numeric values **

Let's compute an age band from the age. Dividing the age by 10 with floor division gives the decade. You can also create a column for rows whose age is missing.
```python
# Turn age into decade bands
train_df['period'] = train_df['Age'] // 10
train_df['period'] = train_df['period'].fillna('NaN')
# np.str was removed from NumPy; use the built-in str instead
train_df['period'] = train_df['period'].astype(str)
period_df = pd.get_dummies(train_df["period"], prefix='Pe')
period_df.head()
```
| | Pe_0.0 | Pe_1.0 | Pe_2.0 | Pe_3.0 | Pe_4.0 | Pe_5.0 | Pe_6.0 | Pe_7.0 | Pe_8.0 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
** Combine data frames **

Combine the newly created data frames into one using `pd.concat`.

`pd.concat([dataframe, dataframe], axis=1)`
```python
con_df = pd.concat([train_df, period_df], axis=1)
con_df = pd.concat([con_df, one_hot_df], axis=1)
con_df.head(1)
```
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Pe_3.0 | Pe_4.0 | Pe_5.0 | Pe_6.0 | Pe_7.0 | Pe_8.0 | Em_C | Em_N | Em_Q | Em_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
You have now concatenated the data frames horizontally.
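To see what `axis` controls, here is a minimal sketch with two invented frames: `axis=1` joins columns side by side, while `axis=0` (the default) stacks rows:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# axis=1: side by side (rows are aligned by index, columns combined)
wide = pd.concat([a, b], axis=1)
print(wide.shape)

# axis=0: stacked vertically (rows appended)
tall = pd.concat([a, a], axis=0)
print(tall.shape)
```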
** Delete unnecessary columns **

After joining the data frames, drop the original columns, since the pre-conversion data is no longer needed.

`dataframe.drop(['column name'], axis=1)`
```python
data_df = con_df.drop(['PassengerId','Pclass','Name','Age','Ticket','Cabin','Embarked','period','Sex'], axis=1)
data_df.head()
```
| | Survived | SibSp | Parch | Fare | Sex2 | Pe_0.0 | Pe_1.0 | Pe_2.0 | Pe_3.0 | Pe_4.0 | Pe_5.0 | Pe_6.0 | Pe_7.0 | Pe_8.0 | Em_C | Em_N | Em_Q | Em_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 7.25 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 7.925 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 8.05 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
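As a quick sanity check that nothing non-numeric slipped through, `select_dtypes` can list any remaining string columns. A minimal sketch on invented data:

```python
import pandas as pd

df = pd.DataFrame({'Survived': [0, 1], 'Fare': [7.25, 71.28], 'Name': ['a', 'b']})

# Columns that are NOT numeric; these would still need converting or dropping
non_numeric = df.select_dtypes(exclude='number').columns
print(list(non_numeric))

# After dropping them, no non-numeric columns remain
df2 = df.drop(list(non_numeric), axis=1)
print(df2.select_dtypes(exclude='number').empty)
```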
Now all of the data is numeric. Data in this form can be used as the final input for machine learning.

Today we processed the data and created a data mart for machine learning, although we covered only a handful of processing methods.

First, learn the rough flow. Once you understand it to a certain extent, it is a good idea to devise ways to improve accuracy or to try new methods.
21 days until you become an engineer
Otsu py's HP: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython