So far, this series has covered the theme of "How can machine learning be used for work?". In this third installment, we introduce concrete programming under the theme of "Python coding procedure".
If you read the previous installments as well, you can follow the whole flow from the basics of machine learning to Python coding, so please take advantage of them.

- Part 1: Understanding the purpose of machine learning
- Part 2: Overview of an AI development project

We also share various information on social media, so if you enjoyed this article I would be grateful if you could follow the Twitter account "Saku731".
First, the skills required for machine learning are as follows. Let's code each of them in turn.

- 1) Data visualization: Grasp the overall feel of the data and decide the preprocessing policy
- 2) Data preprocessing: Clean the data so that prediction accuracy will be high
- 3) Algorithm selection: Determine the appropriate algorithm for the data
- 4) Model training: Have the computer learn the rules in the data
- 5) Model validation: Confirm the prediction accuracy of the completed model
**Jupyter Notebook** is required to proceed with programming in Python. If you do not have a programming environment on your PC, please set one up by referring to the following. The material is written very carefully, so even beginners can follow it with confidence.
The "Titanic" dataset is widely used on web services such as Kaggle, which means there are plenty of reference articles to turn to if you get stuck while studying, so this article also uses the Titanic data.
Please download the data from here. Now let's code a series of machine learning steps.
The purpose of data visualization is to "get a sense of the whole data and decide the preprocessing policy."
First, check what kind of data it is.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
#Read CSV data
df_train = pd.read_csv('train.csv')
#Confirmation of read CSV data
df_train.head()
If everything goes well, the data will be displayed as shown in the figure below. Keep in mind that data in this format is called a **DataFrame**.
The meaning of each column is as follows. Since the task this time is to predict the survival outcome, "Survived" is the prediction target.

- PassengerID: Passenger ID
- Survived: Survival outcome (1: survived, 0: died)
- Pclass: Passenger class
- Name: Passenger's name
- Sex: Gender
- Age: Age
- SibSp: Number of siblings and spouses aboard
- Parch: Number of parents and children aboard
- Ticket: Ticket number
- Fare: Fare
- Cabin: Cabin number
- Embarked: Port of embarkation
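As a supplementary check (not part of the original walkthrough), you can also inspect the column data types and non-null counts with pandas' `info()`; a minimal sketch, assuming the same `df_train` loaded above:

#Overview of column data types and non-null counts (also useful for spotting missing values and string columns early)
df_train.info()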
Next, check the mean value of the data and the standard deviation that indicates the variation.
df_train.describe()
The points to check here are as follows. How to make use of the mean and standard deviation is a skill you pick up with experience, so I will introduce it in another article.

- mean: Mean of the data
- std: Standard deviation of the data (degree of variation in the values)
- min: Minimum value of the data
- 50%: Median of the data
- max: Maximum value of the data
The next step is handling **missing values**. Missing values are entries that are absent for some reason. If nothing is done about them, errors will occur as the program progresses, so they need to be dealt with at an early stage.
If you want to know more, please refer to [here] to understand the detailed programming procedure. To check if the data contains missing values, run the following code.
df_train.isnull().sum()
This time, the "Age" and "Cabin" columns appear to contain missing values. In general, there are two types of countermeasures for missing values: "removal" and "imputation".

- Removal: If a row contains a missing value, delete the entire row and treat the data as if it never existed.
- Imputation: Fill the missing value with some number (mean, median, etc.) so that it is no longer missing (a minimal sketch follows below).
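For reference, here is a minimal sketch of the imputation approach, assuming the same `df_train` loaded above; the article itself proceeds with removal, so this is only illustrative.

#Work on a copy so the article's flow (removal with dropna) is unaffected
df_imputed = df_train.copy()
#Numerical column: fill missing Age values with the median
df_imputed['Age'] = df_imputed['Age'].fillna(df_imputed['Age'].median())
#String column: fill missing Cabin values with a placeholder label
df_imputed['Cabin'] = df_imputed['Cabin'].fillna('Unknown')
#Confirm how many missing values remain
print(df_imputed.isnull().sum())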
Removal is the easier option, so this time we will remove them with `dropna()`.
df_train = df_train.dropna()
Let's check if the missing values have been successfully removed.
df_train.isnull().sum()
We have now confirmed that no column contains missing values, so we can proceed to the next step.
After completing this step, draw various graphs to grasp the trends of the data.
There is no fixed pattern that always works, but you can get an overall picture of the required skills by checking "1) Data visualization" in Full disclosure of methods used in machine learning.
For example, using the histogram, the most commonly used type of graph, you can draw charts like the following.
■ Preparation: Import library
import matplotlib.pyplot as plt
%matplotlib inline
■ Graph 1: Check the number of survivors and deaths of passengers
First, let's compare the number of survivors with the number of deaths. Running the code and checking the histogram shows roughly **"60 survived : 120 died"**.
plt.hist(df_train['Survived'], bins=3)
■ Graph 2: Draw the age distribution
Drawing the age distribution shows that passengers aged 35 to 40 were the most numerous. Also, given that there are quite a few passengers aged 0 to 5, we can infer that **some parents were on board with infants**.
plt.hist(df_train['Age'], bins=15)
■ Graph 3: Gender distribution
Drawing the gender distribution shows that roughly equal numbers of men and women were on board.
plt.hist(df_train['Sex'], bins=3)
■ Graph 4: Draw the distribution of gender × survival rate
This is slightly more advanced, but data is often aggregated before graphing. An overview of the available methods is given in "1) Data visualization" in Full disclosure of methods used in machine learning; this time we will use a method called **cross tabulation**.
#Cross tabulation
df_survived = pd.crosstab(df_train['Sex'], df_train['Survived'])
df_survived
Checking the results of the cross tabulation, you can see that the number of survivors is clearly higher among women.
Raw counts are a little hard to compare, though, so let's divide by the totals and cross-tabulate by **survival rate**.
#Cross tabulation
df_survived = pd.crosstab(df_train['Sex'], df_train['Survived'], normalize='index')
df_survived
Graphing this makes it clear that there is a difference in survival rate between men and women. **The survival rate of women is overwhelmingly high**, so it seems that women were given priority in the rescue.
#Get the male-female breakdown of survival rate and mortality rate
#Male-female breakdown of the survival rate (column 1 of the cross tabulation)
df_survived_1 = df_survived[1].values
#Male-female breakdown of the mortality rate (column 0 of the cross tabulation)
df_survived_0 = df_survived[0].values
#Bar chart: survival rate by gender
plt.bar(x=np.array(['female','male']), height=df_survived_1)
#Bar chart: mortality rate by gender
plt.bar(x=np.array(['female','male']), height=df_survived_0)
By drawing various graphs in this way, we will grasp the characteristics of the data. There are various other methods, so if you are interested, you should study at [here].
After confirming the data, preprocessing is performed so that the data can be used in machine learning (data with good prediction accuracy).
Normally, various processing is performed based on the insights gained from visualization, but since that is too much for a first study, here we deal only with the simplest and most important step: "encoding".
Next, we need to process what are called **categorical variables**. Simply put, categorical variables are **character (string) data**.
Display the data again, as we did at the beginning. You can see that it contains **character data** such as "male" and "female".
df_train.head()
Data used in machine learning is subject to the restriction that it must be **numerical**. Therefore, character data needs to be converted to numerical data in some way.
The most popular method is **One-Hot encoding** with `get_dummies()`.
# One-Hot encoding
df_train = pd.get_dummies(df_train)
#Check the converted data
df_train.head()
As shown in the figure below, the character data has been replaced with "0" and "1", so the data can be used in machine learning. If you want to understand the detailed background, please refer to the article [here].
To extract rules from data with machine learning, an **analysis method suited to the data** is required. This analysis method is called an algorithm.
There are various types of algorithms, and the typical ones are as follows. If you would like to know more about each algorithm, please refer to [here].
- Regression (predicting numerical values such as sales or number of store visits)
  - Linear regression (simple regression, multiple regression)
  - Regression tree
  - Random forest regression
- Classification (predicting a choice, such as A or B)
  - Logistic regression
  - Decision tree
  - Random forest
  - Support Vector Machine (SVM)
- Usable for both
  - Neural network (deep learning)
  - XGBoost
  - LightGBM

It is important to understand that different algorithms produce different results, so here we will use three of them: SVM, decision tree, and random forest.
It is convenient to use `sklearn`, which covers most of the major algorithms.
#Support Vector Machine (SVM)
from sklearn.svm import SVC
#Decision Tree
from sklearn.tree import DecisionTreeClassifier
#Random Forest
from sklearn.ensemble import RandomForestClassifier
Now that we have the algorithm to use, let's train the model. First of all, it is necessary to divide the data into training data and validation data.
The reason is that after training the model, we need a validation phase to check whether the learning went well.
A method called the **holdout method** is well known for splitting data.

- Split the data into "explanatory variables" and "objective variables"
- Split each into training data : validation data = 7 : 3
First, let's divide it into "explanatory variable" and "objective variable".
- Objective variable: The target the AI should predict (here, the survival outcome, i.e. the "Survived" column)
- Explanatory variables: The information used to predict the objective variable (the columns other than "Survived")

Notice the column names in the data displayed above. "Survived" is the second column, and the explanatory variables are in the third and subsequent columns.
It is convenient to use `iloc[]` to split DataFrame-format data into explanatory and objective variables. For details, refer to the article [here].
#Explanatory variable
X = df_train.iloc[:, 2:]
#Objective variable
t = df_train.iloc[:, 1]
If you check the explanatory variables `X`, you can see that the columns from "Pclass" onward have been correctly extracted, as shown below.
X.head()
Likewise, if you check `t`, you can see that the survival status "0/1" has been extracted.
t.head()
Next, the explanatory variables and objective variables are divided into ** training data ** and ** validation data **.
Use `train_test_split` from `sklearn`.
#Library import
from sklearn.model_selection import train_test_split
#Execution of division at 7: 3
X_train, X_valid, t_train, t_valid = train_test_split(X, t, train_size=0.7, random_state=0)
You can check the amount of data with `len()`, so let's confirm that it has been properly split 7:3.
#raw data
print(len(df_train))
#Data after division
print(len(X_train), len(X_valid))
After preparing the algorithms and data, the last thing to set is the **hyperparameters**.
Hyperparameters are settings that are responsible for fine-tuning the algorithm to fit the data.
- Algorithm: Determines the overall **analytical approach** to the data
- Hyperparameters: Make **fine adjustments** so that the algorithm fits the data

Let's set three hyperparameter values for each algorithm so that we can see the effect of both the algorithm and the hyperparameters. In other words, we will train "3 algorithms × 3 hyperparameter settings = 9 models".
#Support Vector Machine (SVM)
model_svm_1 = SVC(C=0.1)
model_svm_2 = SVC(C=1.0)
model_svm_3 = SVC(C=10.0)
model_svm_1.fit(X_train, t_train)
model_svm_2.fit(X_train, t_train)
model_svm_3.fit(X_train, t_train)
#Decision Tree
model_dt_1 = DecisionTreeClassifier(max_depth=3)
model_dt_2 = DecisionTreeClassifier(max_depth=5)
model_dt_3 = DecisionTreeClassifier(max_depth=10)
model_dt_1.fit(X_train, t_train)
model_dt_2.fit(X_train, t_train)
model_dt_3.fit(X_train, t_train)
#Random Forest
model_rf_1 = RandomForestClassifier(max_depth=3)
model_rf_2 = RandomForestClassifier(max_depth=5)
model_rf_3 = RandomForestClassifier(max_depth=10)
model_rf_1.fit(X_train, t_train)
model_rf_2.fit(X_train, t_train)
model_rf_3.fit(X_train, t_train)
Now that we have learned about 9 types of models, let's check the prediction accuracy of each model.
print('Prediction accuracy of SVM_1:', round(model_svm_1.score(X_valid, t_valid) * 100, 2), '%')
print('Prediction accuracy of SVM_2:', round(model_svm_2.score(X_valid, t_valid) * 100, 2), '%')
print('Prediction accuracy of SVM_3:', round(model_svm_3.score(X_valid, t_valid) * 100, 2), '%')
print('Prediction accuracy of decision tree_1:', round(model_dt_1.score(X_valid, t_valid) * 100, 2), '%')
print('Prediction accuracy of decision tree_2:', round(model_dt_2.score(X_valid, t_valid) * 100, 2), '%')
print('Prediction accuracy of decision tree_3:', round(model_dt_3.score(X_valid, t_valid) * 100, 2), '%')
print('Prediction accuracy of random forest_1:', round(model_rf_1.score(X_valid, t_valid) * 100, 2), '%')
print('Prediction accuracy of random forest_2:', round(model_rf_2.score(X_valid, t_valid) * 100, 2), '%')
print('Prediction accuracy of random forest_3:', round(model_rf_3.score(X_valid, t_valid) * 100, 2), '%')
Running the above code gives results like the following. You can confirm that the result (prediction accuracy) changes depending on the algorithm and hyperparameters.
This time, the second decision tree has the best prediction accuracy.
When actually developing AI, you can tune hyperparameters more systematically by using **grid search**. If you want to know more, you can check the implementation method in the reference article; a minimal sketch is also shown below.
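As a rough illustration, here is a minimal sketch of grid search with scikit-learn's `GridSearchCV`, assuming the `X_train` / `t_train` / `X_valid` / `t_valid` split created above; the parameter grid is an arbitrary illustrative choice, not the article's recommended settings.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
#Candidate hyperparameter values to try (illustrative choice)
param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [50, 100, 200]}
#Try every combination with 5-fold cross-validation on the training data
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, t_train)
#Best combination found, and its accuracy on the validation data
print(grid.best_params_)
print('Prediction accuracy:', round(grid.score(X_valid, t_valid) * 100, 2), '%')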
The above is the basic sequence of machine learning steps required for developing AI (a trained model). Building on this flow, it is worth going on to learn the more involved methods that can yield better prediction accuracy.
If you also read the previous installments, you can understand the whole flow from the basics of machine learning to Python coding.

- Part 1: Understanding the purpose of machine learning
- Part 2: Overview of an AI development project
If you want to deepen your programming, please refer to the article that covers the necessary skills.
- Full disclosure of methods used in machine learning
P.S. We also share various information on social media, so if you enjoyed this article I would be grateful if you could follow the Twitter account "Saku731".
~~Also, at the end of this article we are running a "**Team Development Experience Project**" for a limited time.~~ ~~If you are interested, please check the [Application Sheet] for details.~~ (Addition) Applications are now closed because the project is full. The next round is scheduled for March 2019, so if you would like to be notified, please fill in the [Reservation Form].