Introduction to Python scikit-learn, matplotlib, single-layer algorithm (~ towards B3 ~ part3)

In part3, we will do simple machine learning.

Machine learning is divided into three types.

Type	Features
Supervised learning	Data: labeled. Purpose: outcome prediction and future prediction. Examples: email filter, stock price forecast
Unsupervised learning	Data: unlabeled. Purpose: find hidden structures in your data. Examples: customer segmentation, anomaly detection
Reinforcement learning	Data: decision-making process. Purpose: learn a series of actions. Examples: AlphaGo, robot training

Machine learning flow

In machine learning, simply building a model and feeding it data rarely gives good results. The typical workflow is as follows.

1. Data preparation
Decide what data you want the model to learn from and collect it.
2. Model selection
Select a model that suits the format and characteristics of the data.
No single model is optimal for all data (the no free lunch theorem), so find a model that suits your data.
For a prediction model, the evaluation metric is accuracy (the correct answer rate).
3. Preprocessing
In most cases, learning does not go well with raw data, so we format the data first.
-Filling in missing data
-Extraction of features
-Scaling the features in the data to the same range
-Dimensionality reduction to compress irrelevant features (noise)
-Splitting the data for training and evaluation
4. Model training
We train the selected model.
However, the model has parameters (hyperparameters) that must be adjusted manually, so we tune them as well.
5. Model evaluation
The model is evaluated (its generalization performance is measured) using the data set aside for evaluation.

By repeating this process, a suitable learning model will be generated.
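
The workflow above can be sketched in a few lines of scikit-learn. This is only an illustrative sketch: the choice of the Iris dataset, `StandardScaler`, and `LogisticRegression` (with `C=100.0` and `max_iter=1000`) are assumptions made for the example, not a prescribed setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Data preparation: load a small labeled dataset (assumed here: Iris)
X, y = load_iris(return_X_y=True)

# 3. Preprocessing: hold out evaluation data, then scale the features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
scaler = StandardScaler().fit(X_train)  # fit the scaler on training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# 2 and 4. Model selection and training (C is a hyperparameter)
model = LogisticRegression(C=100.0, max_iter=1000, random_state=1)
model.fit(X_train_std, y_train)

# 5. Model evaluation on the held-out data
accuracy = model.score(X_test_std, y_test)
print('Accuracy: %.2f' % accuracy)
```

If the accuracy is unsatisfactory, the usual next move is to go back to an earlier step (different features, model, or hyperparameters) and repeat.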

In this seminar, we will do steps 3-5. However, since step 3 was mostly covered in part2, we will focus on steps 4 and 5 for the models introduced from here on.

Neuron

What exactly is machine learning? Warren McCulloch and Walter Pitts wondered whether artificial intelligence could be designed by imitating the human brain. They published a simplified model of the neuron, the smallest unit of the human brain, known as the McCulloch-Pitts (MCP) neuron.

Neurons receive chemical and electrical signals in the brain and generate an output signal when the accumulated signals exceed a certain threshold. McCulloch and Pitts regarded the neuron as a logic gate with binary output and designed the MCP neuron on that basis.

A few years later, Frank Rosenblatt devised an algorithm that automatically learned the optimal weighting factor and then multiplied it with the input signal to determine if the neuron would fire (exceed the threshold). This is the beginning of the "classification problem" of "supervised learning".

The definition of an artificial neuron is as follows. Let $\boldsymbol{w}$ be the weight vector for the input signals $\boldsymbol{x}$. The total input $z$ is the linear combination of each input $x_i$ and its weight $w_i$.

z = \sum_{i=1}^{N} w_i x_i = \boldsymbol{w}^T\boldsymbol{x}

The output of binary classification is either 1 (positive class) or -1 (negative class). Define a function $\phi$ (the decision function) that assigns the total input $z$ to the positive class if it is greater than or equal to the threshold $\theta$, and to the negative class otherwise.

\phi(z) = \begin{cases} 1 & (z \geq \theta) \\ -1 & (z < \theta) \end{cases}

Here, for simplicity, we move the threshold $\theta$ to the left-hand side and define $w_0 = -\theta$, $x_0 = 1$, so that $z$ now includes the term $w_0 x_0$. The negative threshold $-\theta$ is called the bias unit.

\phi(z) = \begin{cases} 1 & (z \geq 0) \\ -1 & (z < 0) \end{cases}
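
As a small worked example of the definitions above, the following sketch computes the total input and the decision function for two inputs. The weights, the threshold value `theta = 0.5`, and the helper names `net_input` and `decide` are all made up for illustration.

```python
import numpy as np

def net_input(x, w):
    """Total input z = w^T x, with x[0] = 1 and w[0] = -theta (bias unit)."""
    return np.dot(w, x)

def decide(z):
    """Decision function phi(z): 1 if z >= 0, otherwise -1."""
    return 1 if z >= 0.0 else -1

theta = 0.5                          # illustrative threshold
w = np.array([-theta, 1.0, -2.0])    # w_0 = -theta, then w_1, w_2

x_pos = np.array([1.0, 2.0, 0.5])    # x_0 = 1, then x_1, x_2
x_neg = np.array([1.0, 0.5, 1.0])

z_pos = net_input(x_pos, w)          # -0.5 + 2.0 - 1.0 = 0.5
z_neg = net_input(x_neg, w)          # -0.5 + 0.5 - 2.0 = -2.0

print(z_pos, decide(z_pos))
print(z_neg, decide(z_neg))
```

Folding the threshold into `w[0]` is what lets the comparison be made against 0 instead of against `theta`.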

Weight update (learning)

In machine learning, training means updating the weights. The learning procedure is summarized below.

1. Initialize the weights to 0 or small random numbers.
2. For each training sample, perform the following steps: ① calculate the output value $\hat{y}$, ② update the weights.

The weight $w_j$ for an input $x_j$ is updated as follows:

w_j = w_j + \Delta w_j

\Delta w_j = \eta(y^{(i)}-\hat{y}^{(i)})x_j

Here $\eta$ is the learning rate; the larger this value, the greater the impact each training sample has on the weight update. The difference $(y^{(i)}-\hat{y}^{(i)})$ between the true class label and the predicted label is called the error.
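
To make the update rule concrete, here is a single update step worked through by hand. The numbers (learning rate 0.1, all-zero initial weights, one misclassified sample) are chosen purely for illustration.

```python
import numpy as np

eta = 0.1                          # learning rate (illustrative value)
w = np.zeros(3)                    # w[0] is the bias unit, initialized to 0
x = np.array([1.0, 2.0, -1.0])     # one training sample, with x[0] = 1
y_true = -1                        # true class label of this sample

# Step 1: calculate the output value (z = 0, so the prediction is 1)
y_hat = 1 if np.dot(w, x) >= 0.0 else -1

# Step 2: update the weights; the error is y_true - y_hat = -2
delta_w = eta * (y_true - y_hat) * x    # [-0.2, -0.4, 0.2]
w_new = w + delta_w

# With the updated weights the sample is classified correctly:
# z = -0.2 - 0.8 - 0.2 = -1.2, so the prediction is -1
y_hat_new = 1 if np.dot(w_new, x) >= 0.0 else -1
print(w_new, y_hat_new)
```

When the prediction is already correct, the error is 0 and the weights are left unchanged.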

Now let's implement the perceptron.

import numpy as np

class Perceptron(object):
    def __init__(self,eta=0.01,n_iter=50,random_state=1):
        #Learning rate
        self.eta=eta
        #Number of passes over the training data (epochs)
        self.n_iter=n_iter
        #Random seed used to initialize weights
        self.random_state=random_state
    def fit(self,X,y):
        #Random number generation
        rgen=np.random.RandomState(self.random_state)
        #step1 Initialization of weight
        self.w_=rgen.normal(loc = 0.0,scale=0.01,size=1+X.shape[1])
        #Declaration of error
        self.errors_=[]
        #Perform for the number of trainings
        for _ in range(self.n_iter):
            #Error initialization
            errors=0
            #step2 Execute for each training sample
            for xi, target in zip(X,y):
                #Calculate the output value and the weight change delta_w
                delta_w = self.eta * (target - self.predict(xi))
                #Weight update
                self.w_[1:] += delta_w * xi
                self.w_[0] += delta_w
                errors += int(delta_w != 0.0)
            self.errors_.append(errors)
        return self
    
    #Definition of total input
    def net_input(self,X):
        return np.dot(X, self.w_[1:]) + self.w_[0]
    #Definition of decision function
    def predict(self,X):
        return np.where(self.net_input(X) >= 0.0,1,-1)

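Let's use this perceptron to classify the Iris data used in part2.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
#Load the Iris data
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
#Use the first 100 samples (setosa and versicolor) and encode the labels as -1 and 1
y = df.iloc[0:100,4].values
y = np.where(y == 'Iris-setosa',-1,1)
#Use sepal length and petal length as features
X = df.iloc[0:100,[0,2]].values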
ppn = Perceptron(eta=0.01,n_iter=10)
ppn.fit(X,y)
plt.plot(range(1,len(ppn.errors_)+1), ppn.errors_,marker='o')
plt.xlabel('Epochs')
plt.ylabel('Number of errors')
plt.show()

Over the 10 epochs of training, the misclassifications eventually disappeared. The model learned the dataset by updating the weights.

**Supplement** Plotting the result gives decision regions like the following.

from matplotlib.colors import  ListedColormap

def plot_decision_regions(X,y,classifier,resolution=0.02):
    markers = ('s','x','o','^','v')
    colors    = ('red','blue','lightgreen','gray','cyan')
    cmap     = ListedColormap(colors[:len(np.unique(y))])

    
    x1_min, x1_max = X[:,0].min() -1,X[:,0].max()+1
    x2_min, x2_max = X[:,1].min() -1,X[:,1].max()+1
    xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,resolution),np.arange(x2_min,x2_max,resolution))
    Z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1,xx2,Z,alpha=0.3,cmap = cmap)
    plt.xlim(xx1.min(),xx1.max())
    plt.ylim(xx2.min(),xx2.max())

    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x = X[y == cl,0],y = X[y == cl,1],alpha = 0.8, c = colors[idx],marker=markers[idx],label=cl,edgecolor='black')


plot_decision_regions(X,y,classifier=ppn)
plt.xlabel('sepal length [cm]')
plt.ylabel('petal length [cm]')
plt.legend(loc='upper left')

plt.show()

Challenges

Let's see how the model's learning changes when you change the learning rate $\eta$ and the number of epochs.

Scikit-learn

Scikit-learn is a module that contains many simple classification algorithms. It also includes the perceptron that we implemented earlier. Let's try it out.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
import numpy as np
#Acquisition of iris data, selection of data to use
iris = datasets.load_iris()
X = iris.data[:,[2,3]]
y = iris.target
#Divided into test data and train data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)
#Calling Perceptron
ppn = Perceptron(n_iter_no_change=500,eta0=0.1,random_state=1)
#Train the model on the training data
ppn.fit(X_train,y_train)
#Validate the trained model
y_predict = ppn.predict(X_test)
#Display of the number of misclassifications
print('Misclassified samples: %d' %(y_test != y_predict).sum())
#Display of correct answer rate
print('Accuracy: %.2f' %ppn.score(X_test,y_test))

Misclassified samples: 7
Accuracy: 0.84

Challenges

Use the plot_decision_regions function used in the above supplement to illustrate the decision regions this time.


Logistic regression

What kind of decision regions did you get? The perceptron is a model that performs linear separation, so it is not suitable for datasets that are not linearly separable. So let's look at an algorithm called logistic regression. Logistic regression is based on the odds (often loosely called the odds ratio).

\frac{p}{1-p}

The odds indicate how likely an event is, where $p$ is the probability of the event you want to predict. The natural logarithm of the odds is called the logit function.

logit(p)=\log\frac{p}{1-p}

Using this function, the following linear relationship is assumed between the features and the log-odds.

logit(p(y=1|x))=\boldsymbol{w}^T \boldsymbol{x}

$p(y=1|x)$ is the conditional probability that the sample belongs to class 1 ($y=1$) given the features $x$. What we want to predict this time is the probability that the sample belongs to a specific class, so

p(y=1|x) = \phi(z) = logit^{-1}(\boldsymbol{w}^T \boldsymbol{x})=\frac{1}{1+e^{-z}}

It can be expressed like this. In other words, the logistic sigmoid function (commonly just called the sigmoid function) is inserted before the decision function. The weights are also updated using the output of the logistic sigmoid as the predicted value. A function that sits between the total input and the decision function in this way is called an **activation function**. Because the perceptron updated its weights by comparing the output of the decision function with the true value, the difference used in the weight update was a discrete value. With the activation function inserted, the predicted value $\hat{y}$ used for the weight update becomes a continuous value.
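
The relationship between the logit function and the sigmoid can be checked numerically. The sketch below is illustrative only: the input values are arbitrary, and the class labels 1 and 0 with a 0.5 cut-off are an assumption made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps the total input z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds of a probability p; the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

z = np.array([-5.0, 0.0, 5.0])   # arbitrary total inputs
p = sigmoid(z)                   # continuous values, used for the weight update

# Decision function applied after the activation function
# (assumed convention: class 1 if p >= 0.5, otherwise class 0)
labels = np.where(p >= 0.5, 1, 0)

print(p)
print(labels)
print(logit(p))                  # recovers z up to rounding error
```

Note that $p \geq 0.5$ is the same condition as $z \geq 0$, so the decision boundary itself is still linear in the features.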

Let's actually use logistic regression.

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=100.0,random_state=1)

Challenges

Run the above code, then train the model on the Iris data with logistic regression. Also plot the decision regions after training.


Support Vector Machine (SVM)

With the perceptron, the goal was to minimize the misclassification rate. In contrast, the goal of a Support Vector Machine (SVM) is to maximize the **margin**. The margin is the distance between the decision boundary and the training samples closest to it. Those closest training samples are called **support vectors**. Adjusting the decision boundary to maximize the margin yields a robust classifier.

Let $y = f(\boldsymbol{x})$ be the classification function that classifies the input $\boldsymbol{x}$ into two classes. Suppose you have $n$ training samples $(\boldsymbol{x_1}, y_1), (\boldsymbol{x_2}, y_2), \dots, (\boldsymbol{x_n}, y_n)$. Here we define the linear classifier $f(\boldsymbol{x}) = \mathrm{sgn}[\boldsymbol{w}\bullet\boldsymbol{x} + b]$. This returns 1 when $\boldsymbol{w}\bullet\boldsymbol{x} + b \geq 0$, and -1 when $\boldsymbol{w}\bullet\boldsymbol{x} + b < 0$. The set of points where $\boldsymbol{w}\bullet\boldsymbol{x} + b = 0$ is called a hyperplane. The distance between the hyperplane and a sample $\boldsymbol{x_i}$ is

\frac{|\boldsymbol{w}\bullet\boldsymbol{x_i}+b|}{||\boldsymbol{w}||}

and the margin is the minimum of this distance over all samples, which is attained by the support vectors. Maximizing the margin can therefore be written as follows.

\max_{\boldsymbol{w},b}\min_{i}\left\{\frac{|\boldsymbol{w}\bullet\boldsymbol{x_i}+b|}{||\boldsymbol{w}||}\right\}

Since the hyperplane does not change when $\boldsymbol{w}$ and $b$ are multiplied by a constant, we can assume that for all samples

y_i(\boldsymbol{w}\bullet\boldsymbol{x_i}+b) \geq 1

holds, with equality for the samples closest to the hyperplane. The margin is then

\min_{i}\left\{\frac{|\boldsymbol{w}\bullet\boldsymbol{x_i}+b|}{||\boldsymbol{w}||}\right\}=\frac{1}{||\boldsymbol{w}||}

In other words, $\frac{1}{||\boldsymbol{w}||}$ should be maximized. This is equivalent to minimizing $||\boldsymbol{w}||^2$, which can be solved by quadratic programming.

This maximum margin classification is valid only when linear separation is possible. In other cases, we will introduce a slack variable as a workaround.

A slack variable $\xi^{(i)} \geq 0$ is introduced for each sample into the linear constraint

y_i(\boldsymbol{w}\bullet\boldsymbol{x_i}+b) \geq 1

This permits samples that do not satisfy the constraint, at a cost, and so also handles data that is not linearly separable.

y_i(\boldsymbol{w}\bullet\boldsymbol{x_i}+b) \geq 1-\xi^{(i)}

With the slack variables, maximum margin classification becomes the minimization of the following objective.

\frac{1}{2}||\boldsymbol{w}||^2+C\sum_{i}\xi^{(i)}

The parameter $C$ controls the penalty for misclassification. The larger $C$ is, the larger the penalty and the narrower the margin. This type of maximum margin classification is called soft-margin classification. The maximum margin classification before the introduction of $\xi$ is called hard-margin classification.

This time, we will use soft-margin classification with scikit-learn's SVC (support vector classifier).

from sklearn.svm import SVC
svm = SVC(kernel='linear',C = 1.0,random_state=1)
svm.fit(X_train,y_train)
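
The effect of $C$ on the margin can be checked on two overlapping classes. The sketch below is only an illustration: the choice of the versicolor and virginica classes, the two values of `C`, and the helper name `margin_width` are assumptions for the example. It reads the weight vector from the fitted model's `coef_` attribute, which is available for the linear kernel.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

# Two Iris classes that overlap in petal length / petal width
iris = datasets.load_iris()
mask = iris.target != 0
X_two = iris.data[mask][:, [2, 3]]
y_two = iris.target[mask]

def margin_width(C):
    """Fit a linear soft-margin SVM and return the margin 1 / ||w||."""
    clf = SVC(kernel='linear', C=C, random_state=1)
    clf.fit(X_two, y_two)
    w = clf.coef_[0]              # weight vector of the separating hyperplane
    return 1.0 / np.linalg.norm(w)

margin_small_c = margin_width(0.01)    # weak penalty on violations
margin_large_c = margin_width(100.0)   # strong penalty on violations

print('C=0.01 : margin %.3f' % margin_small_c)
print('C=100  : margin %.3f' % margin_large_c)
```

A small `C` tolerates more samples inside the margin and so gives a wider margin; a large `C` trades margin width for fewer violations.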

Challenges

Run the above code, then train the model on the Iris data with SVC. Also plot the decision regions after training.


There is an effective method for nonlinear classification problems called **kernelization**. SVMs are easier to kernelize than other classification algorithms, which is one reason SVMs are a popular classification method. Let's look at an example of nonlinear data for which kernelization is effective.


import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0,
                       X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)

plt.scatter(X_xor[y_xor == 1, 0],
            X_xor[y_xor == 1, 1],
            c='c', marker='x',
            label='1')
plt.scatter(X_xor[y_xor == -1, 0],
            X_xor[y_xor == -1, 1],
            c='m',
            marker='*',
            label='-1')

plt.xlim([-3, 3])
plt.ylim([-3, 3])
plt.legend(loc='best')
plt.tight_layout()

plt.show()

In such a case, the classes cannot be separated by a single straight line. However, projecting this data into a higher dimension changes how the data looks.

Projection function: $\phi(x_1, x_2) = (z_1, z_2, z_3) = (x_1, x_2, x_1 x_2)$
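
Without plotting anything, the benefit of this projection can be checked numerically. The following sketch regenerates the same XOR data and applies the projection; the helper name `project` is made up for the example.

```python
import numpy as np

# Same XOR data as above
np.random.seed(1)
X_xor = np.random.randn(200, 2)
y_xor = np.where(np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0), 1, -1)

def project(X):
    """phi(x_1, x_2) = (x_1, x_2, x_1 * x_2)"""
    return np.column_stack((X[:, 0], X[:, 1], X[:, 0] * X[:, 1]))

Z = project(X_xor)

# In the projected space the flat plane z_3 = 0 separates the classes:
# class 1 is where x_1 and x_2 have different signs, i.e. z_3 < 0
y_pred = np.where(Z[:, 2] < 0, 1, -1)
accuracy = np.mean(y_pred == y_xor)
print('Accuracy of the plane z_3 = 0: %.2f' % accuracy)
```

Data that no straight line can separate in two dimensions becomes linearly separable after the projection, which is the idea that kernelization exploits.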

Challenges

Let's plot the data in 3D as $z = \phi(x, y)$.

Plotted video

np.random.seed(1)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0,
                       X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 1, -1)
svm = SVC(kernel='rbf', random_state=1, gamma=0.10, C=10.0)
svm.fit(X_xor, y_xor)
plot_decision_regions(X_xor, y_xor,
                      classifier=svm)

plt.legend(loc='upper left')
plt.tight_layout()

plt.show()

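#Assumption: X_combined / y_combined are the train and test data stacked together
X_combined = np.vstack((X_train,X_test))
y_combined = np.hstack((y_train,y_test))
svm = SVC(kernel='rbf',random_state=1,gamma=0.20,C = 10.0)
svm.fit(X_train,y_train)
plot_decision_regions(X_combined,y_combined,classifier=svm)

plt.xlabel('petal length')
plt.ylabel('petal width')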
plt.legend(loc = 'upper left')
plt.tight_layout()
plt.show()

Challenges

Let's check how the decision regions change when you change the value of gamma.

Decision tree learning

The decision tree classifier is an effective model when interpretability matters. This classifier classifies data by making decisions based on a series of questions. To explain the decision tree classifier, we first explain the metric called **information gain**. Information gain is the reduction in the variability of the elements within each subset when a set is split. The decision tree grows a tree with many leaves by conditional branching until no information gain remains. If the tree is grown completely on the training data, it easily overfits (the model adapts too closely to the training data alone), so the growth of branches is stopped at a certain depth. This is called **pruning**.

The objective function of the decision tree learning algorithm is as follows.

IG(D_p,f)=\boldsymbol{I}(D_p)-\sum_{j=1}^{m}\frac{N_j}{N_p}\boldsymbol{I}(D_j)

Where $f$ is the feature used for the split, $D_p$ is the dataset of the parent node, $N_p$ is the total number of samples in the parent node, $D_j$ is the dataset of the $j$-th child node, $N_j$ is the total number of samples in the $j$-th child node, and $\boldsymbol{I}$ is the impurity. In other words, the information gain is the difference between the impurity of the parent node and the weighted sum of the impurities of the child nodes. Impurity is an index that quantifies the proportion of samples of different classes mixed in a node. The three most commonly used impurity measures are:

-Gini impurity
-Entropy
-Classification error

In the Gini impurity, $p(i|t)$ denotes the proportion of samples that belong to class $i$ at a particular node $t$. Gini impurity can therefore be understood as a criterion that minimizes the probability of misclassification.

I_G(t)=\sum_{i=1}^{c}p(i|t)(1-p(i|t)) = 1-\sum_{i=1}^{c}p(i|t)^2
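
The two formulas above can be worked through on a small example. In this sketch the class counts are invented for illustration (a parent node with 40 samples of each of two classes), and the helper names `gini` and `information_gain` are hypothetical.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, given the number of samples per class."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    """Parent impurity minus the weighted impurities of the child nodes."""
    n_parent = np.sum(parent)
    weighted = sum(np.sum(c) / n_parent * gini(c) for c in children)
    return gini(parent) - weighted

parent = [40, 40]                         # 40 samples of each class

# A split that only partly separates the classes
ig_partial = information_gain(parent, [[30, 10], [10, 30]])

# A split that separates the classes perfectly
ig_perfect = information_gain(parent, [[40, 0], [0, 40]])

print('Gini of parent  : %.3f' % gini(parent))     # 0.500
print('IG partial split: %.3f' % ig_partial)       # 0.125
print('IG perfect split: %.3f' % ig_perfect)       # 0.500
```

The perfect split has the larger information gain, so it is the one a decision tree would prefer.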

The decision tree learns which values to split the sample data on so that the resulting conditional branches produce fewer misclassifications.

Let's classify the Iris data with a decision tree classifier with a decision tree depth of 4.

from sklearn.tree import DecisionTreeClassifier
tree  = DecisionTreeClassifier(criterion='gini',max_depth=4,random_state=1)

Challenges

Run the above code, then train the decision tree on the Iris data. Also plot the decision regions after training. Try changing the depth of the decision tree (max_depth) and see how the decision regions change.


Random forest

The random forest algorithm builds a model with higher generalization performance (one that copes well with unseen data) by averaging multiple deep decision trees. Combining multiple models in this way is called **ensemble learning**. It is also common to combine different kinds of models.

Random forest is performed by the following procedure.

1. Randomly select samples from the training data.
2. Grow a decision tree based on the selected sample dataset.
3. Repeat steps 1 and 2.
4. Collect the predicted labels from each decision tree and assign the class label by **majority vote**.
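
The procedure above can be sketched by hand with plain decision trees. This is a simplified illustration, not how `RandomForestClassifier` is implemented internally: sampling with replacement, the choice of 25 trees, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data[:, [2, 3]], iris.target,
    test_size=0.3, random_state=1, stratify=iris.target)

rng = np.random.RandomState(1)
trees = []
for _ in range(25):                       # step 3: repeat steps 1 and 2
    # step 1: randomly select samples (here: with replacement)
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    # step 2: grow a decision tree on the selected samples
    tree = DecisionTreeClassifier(criterion='gini',
                                  random_state=rng.randint(10**6))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# step 4: collect the predictions and take a majority vote per test sample
votes = np.array([t.predict(X_test) for t in trees])   # shape (25, n_test)
y_pred = np.array([np.bincount(col).argmax() for col in votes.T])

accuracy = np.mean(y_pred == y_test)
print('Accuracy: %.2f' % accuracy)
```

Because each tree sees a slightly different sample of the data, the individual trees overfit in different ways, and the vote averages those differences out.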

Now let's classify the Iris data in a random forest.

from sklearn.ensemble import RandomForestClassifier
forest  = RandomForestClassifier(n_estimators=25,criterion='gini',random_state=1,n_jobs=2)

Challenges

Run the above code, then train the random forest on the Iris data. Also plot the decision regions after training.


k-nearest neighbor method

The k-nearest neighbor method is an example of lazy learning: instead of learning a model from the training dataset, it memorizes the dataset and classifies directly from it. The k-nearest neighbor algorithm takes the following steps:

1. Select the value of k and the distance metric.
2. Find the k nearest neighbors of the sample you want to classify.
3. Assign a class label by majority vote over the class labels of those neighbors.

The advantage of this method is that it requires no training and can go straight to the classification stage. However, note that the amount of computation becomes enormous when the training dataset is very large.
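
The three steps can be written out directly with NumPy. The sketch below uses a tiny made-up dataset, `k=3`, and Euclidean distance; the function name `knn_predict` is hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Step 1 is the choice of k and the metric (here: Euclidean distance)
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Step 2: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote over the neighbors' class labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: two well-separated groups
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]))
pred_b = knn_predict(X_train, y_train, np.array([4.9, 5.0]))
print(pred_a, pred_b)
```

Every prediction computes the distance to all training samples, which is why the cost grows with the size of the training data.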

Let's actually classify using the k-nearest neighbor method.

from sklearn.neighbors import KNeighborsClassifier
knn =  KNeighborsClassifier(n_neighbors=5,p=2,metric='minkowski')

Challenges

Run the above code, then classify the Iris data with the k-nearest neighbor method. Also plot the decision regions after training.


Summary

Model	Merit
Logistic regression	Can predict the probability that an event will occur
SVM	Can handle non-linear problems with the kernel trick
Decision tree	Easy to interpret
Random forest	Requires little parameter tuning; overfits less than a single decision tree
k-nearest neighbor method	No training required

Challenges

Let's use the Titanic data to predict whether a passenger survived! The columns to use this time are 'PassengerId','Age','Pclass','Sex','FamilySize'.