Machine learning algorithm (support vector machine application)

Introduction

Step-by-step on the theory, implementation in python, and analysis using scikit-learn about the algorithm previously taken up in "Classification of Machine Learning" I will study with. I'm writing it for personal learning, so I'd like you to overlook any mistakes.

Last time, I wrote about the basics of support vector machines. Last time, we dealt with SVM, which is called hard margin and can separate positive and negative cases properly, but this time

I would like to mention about.

Soft margin SVM

Consider an example where the red and blue circles cannot be subtly separated as shown in the figure below, unlike the previous time. svm_advance_1.png

Review before that

The hard margin SVM expression is a set of parameterswWhen$ \frac{1}{2}|w|^2To$ t_n(\boldsymbol{w}^Tx_n+w_0) \geq 1$$という制約条件化で最小化するという問題でした。ソフトマージンはこの制約条件To緩めた問題に変えます。

Relaxing constraints

Introduce the slack variable $ \ xi $ and the parameter $ C $ to relax the condition. The slack variable is a variable of how much error is allowed between the support vector and the boundary line, and $ C (> 0) $ indicates the strictness of the constraint condition. When these are introduced, the problem to be solved described above changes as follows.

\frac{1}{2}|w|^2+C\sum_{i=1}^{N}\xi_i \\
t_n(\boldsymbol{w}^Tx_n+w_0) \geq 1-\xi_n \\
\xi_n \geq 0

Regarding the relationship between $ C $ and $ \ xi $, if $ C $ is large, it cannot be minimized unless $ \ xi $ is small, and if $ C $ is small, it can be minimized even if $ \ xi $ is large to some extent. It means that. If $ C $ is infinite, this is the same as a hard margin SVM, as $ \ xi $ can only tolerate zero (= not allow data within the margin).

Lagrange undetermined multiplier solution

Unlike the hard margin, the constraint has been increased to two, so the Lagrange multiplier is also set to $ \ lambda $ and $ \ mu $.

L(w,w_0,\lambda, \mu)=\frac{1}{2}|w|^2+C\sum_{i=1}^n\xi_i-\sum_{i=1}^{N}\lambda_i \{ t_i(\boldsymbol{w}^Tx_i+w_0)+\xi_i-1\}-\sum_{i=1}^n\mu_i\xi_i

If this is partially differentiated with respect to $ w $, $ w_0 $, and $ \ xi $, and each is set to zero,

w=\sum_{i=1}^n\lambda_it_ix_i \\
\sum_{i=1}^n\lambda_it_i=0 \\
\lambda_i=C-\mu_i

Can be obtained, and when assigned to the Lagrange function,

L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_mx_n^Tx_m

And this is exactly the same formula as for the hard margin. However, the constraint is

\sum_{i=1}^n\lambda_it_n=0 \\
0 \leq \lambda_i \leq C

It will be. As with the hard margin, it is possible to find the parameters using SMO. (Omitted this time)

Kernel method and kernel tricks

Consider the following example, which seems to be inseparable by a straight line.

svm_advance_2.png.png

If it has such a shape, move the point to a higher dimensional space and then separate the planes, such as 2D → 3D. The method of converting to a higher order is called the ** kernel method **, and the function for converting is called the ** kernel function **.

Basis set

A data string that projects $ \ boldsymbol {x} = (x_0, x_1, \ cdots, x_ {n-1}) $ is $ \ boldsymbol {\ phi} = \ {\ phi_0 (\ boldsymbol {x) }), \ phi_1 (\ boldsymbol {x}), \ cdots, \ phi_ {m-1} (\ boldsymbol {x}) \} $. This $ \ phi (x) $ is called a ** basis set **. The basis function was equivalent to $ \ phi (x) = x $ because the previous SVM was able to handle linear separation. Other commonly used basis functions are the polynomial $ \ phi (x) = x ^ n $ and the Gaussian basis $ \ phi (x) = \ exp \ left \\ {-\ frac {(x-). There is \ mu) ^ 2} {2 \ sigma ^ 2} \ right \\} $.

By applying the basis function, the $ x_n ^ Tx_m $ part of the Lagrange function changes to $ \ phi (x) _n ^ T \ phi (x) _m $.

L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_m\phi(x)_n^T\phi(x)_m

This $ \ phi (x) _n ^ T \ phi (x) _m $ is an inner product calculation, and if there are many data points, the amount of calculation will be enormous, so we will devise a little.

Kernel functions and kernel tricks

In fact, $ \ phi (x) _n ^ T \ phi (x) _m $ can be replaced with $ k (x_n, k_m) $. $ k (x_n, k_m) $ is called a ** kernel function **. By replacing it in this way, you can omit the troublesome inner product calculation. This is called a ** kernel trick **. For details, refer to "Kernel Trick".

In particular, the kernel function using the Gaussian basis function mentioned above is called ** RBF kernel (Radial basis function kernel) **.

Finally the Lagrange function

L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_mk(x_n,x_m) \\
\text{subject.to }\sum_{i=1}^n\lambda_it_n=0,0 \leq \lambda_i \leq C

It will be. Actually, solve this formula, find $ \ lambda $, then $ \ boldsymbol {w} $ or $ w_0 $.

Try it with python

Last time, I used a simple sklearn.svm.LinearSVC to classify, but the more general [sklearn] Try using .svm.SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

View API documentation

Looking at the explanation of the API, it is as follows.

class sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)

If you understand the contents so far, you will gradually be able to understand this explanation. The kernel parameter is the parameter that determines the basis function, which is linear for linear and rbf for Gaussian kernel. The important ones here are C and gamma. C is a parameter that determines the strength of the constraint, and the larger it is, the more stringent the constraint becomes. gamma is a parameter that determines the spread of Gaussian basis functions and is the reciprocal, so the smaller it is, the smoother it becomes.

Try to implement

Use the data shown first as the data to be classified. Actually, this data uses an API called sklearn.datasets.make_moons. You can specify the number of samples and the strength of noise. The decision boundaries are also illustrated. Since the decision boundary is not linear, it is drawn as a contour line. Specifically, we use a function called contourf in matplotlib.

import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

%matplotlib inline

X, y = make_moons(n_samples=200,
                  shuffle = True,
                  noise = 0.1,
                  random_state = 2020,)
 
a0, b0 = X[y==0,0], X[y==0,1]
a1, b1 = X[y==1,0], X[y==1,1]

model = svm.SVC(C=1.0, kernel='rbf', gamma=1)
model.fit(X, y)

x1_min,x1_max = X[:,0].min() - 0.1,X[:,0].max() + 0.1
x2_min,x2_max = X[:,1].min() - 0.1,X[:,1].max() + 0.1

xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,0.02),
                        np.arange(x2_min,x2_max,0.02))

Z = model.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
Z = Z.reshape(xx1.shape)

plt.figure(figsize=(8, 7))
 
plt.contourf(xx1,xx2,Z,alpha = 0.4)
plt.xlim(xx1.min(),xx1.max())
plt.ylim(xx2.min(),xx2.max())

plt.scatter(a0, b0, marker='o', s=25, label="y = 0")
plt.scatter(a1, b1, marker='o', s=25, label="y = 1")
plt.legend()
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
svm_advance_3.png

It seems that they can be separated. With the API, you can also get support vectors, but even if you compare it with the actual number of data, you can approximate it with a small amount of data, which contributes to memory saving and faster calculation.

Hyperparameter adjustment

In the above, I decided C and gamma appropriately, but what if I change this? Let's actually draw it.

list_C = [0.1, 1, 20]
list_gamma = [0.05, 0.5, 20]

x1_min,x1_max = X[:,0].min() - 0.1,X[:,0].max() + 0.1
x2_min,x2_max = X[:,1].min() - 0.1,X[:,1].max() + 0.1

xx1,xx2 = np.meshgrid(np.arange(x1_min,x1_max,0.02),
                        np.arange(x2_min,x2_max,0.02))

plt.figure(figsize=(11, 11))
plt.xlim(xx1.min(),xx1.max())
plt.ylim(xx2.min(),xx2.max())
plt.xlabel("x1")
plt.ylabel("x2")

for i in range(len(list_C)):
  for j in range(len(list_gamma)):
    model = svm.SVC(C=list_C[i], kernel='rbf', gamma=list_gamma[j])
    model.fit(X, y)

    Z = model.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)


    ax = plt.subplot(len(list_C), len(list_gamma), i*len(list_C)+j+1)
    ax.set_title("C={}, gamma={}".format(list_C[i], list_gamma[j]))
    ax.contourf(xx1,xx2,Z,alpha = 0.4)

    ax.scatter(a0, b0, marker='o', s=25, label="y = 0")
    ax.scatter(a1, b1, marker='o', s=25, label="y = 1")

plt.show()

The result is as follows.

svm_advance_4.png

As explained above, the larger the C, the better the separation, and the larger the gamma, the more complex the curve. However, it seems that it is overfitting at the bottom right, and it seems that parameter tuning is required.

In order to tune the parameters, it is necessary to divide the sample into training data and verification data and search for parameters that have a high degree of agreement when predicted by the verification data. This is called ** Cross Validation **, but I'll try to summarize it at another time.

Summary

The support vector machine has been extended from hard margin to soft margin to handle non-linear separation. Looking at it in this way, I felt that I could reject even quite complicated classifications. It is understandable that it was popular before neural networks.

Recommended Posts

Machine learning algorithm (support vector machine application)
Machine learning algorithm (support vector machine)
Machine learning support vector machine
Machine Learning: Supervised --Support Vector Machine
Machine learning ① SVM (Support Vector Machine) Summary
<Course> Machine Learning Chapter 7: Support Vector Machine
Machine learning algorithm (simple perceptron)
Machine learning algorithm (logistic regression)
<Course> Machine Learning Chapter 6: Algorithm 2 (k-means)
Machine learning algorithm (multiple regression analysis)
Machine learning algorithm (simple regression analysis)
Machine learning
Machine learning algorithm (gradient descent method)
Application development using Azure Machine Learning
[Translation] scikit-learn 0.18 User Guide 1.4. Support Vector Machine
Machine learning algorithm (implementation of multi-class classification)
Machine learning algorithm classification and implementation summary
[Python] Web application design for machine learning
Support Vector Machine (for beginners) -Code Edition-
Machine learning algorithm (linear regression summary & regularization)
Dictionary learning algorithm
[Memo] Machine learning
Machine learning classification
Machine Learning sample
Gaussian mixed model EM algorithm [statistical machine learning]
Calculation of support vector machine (SVM) (using cvxopt)
About machine learning overfitting
Machine learning ⑤ AdaBoost Summary
Machine Learning: Supervised --AdaBoost
Tool MALSS (application) that supports machine learning in Python
Machine learning logistic regression
Studying Machine Learning ~ matplotlib ~
Machine learning linear regression
Machine learning course memo
Build a machine learning application development environment with Python
Machine learning library dlib
Machine learning (TensorFlow) + Lotto 6
Somehow learn machine learning
Tech-Circle Let's start application development using machine learning (self-study)
Machine learning library Shogun
Machine learning rabbit challenge
Introduction to machine learning
Machine Learning: k-Nearest Neighbors
What is machine learning?
Talk about improving machine learning algorithm bottlenecks with Cython
Python Machine Learning Programming Chapter 2 Classification Problems-Machine Learning Algorithm Training Summary
Support Kaggle / MNIST Do your best with a vector machine
Machine learning model considering maintainability
Machine learning learned with Pokemon
Data set for machine learning
Japanese preprocessing for machine learning
Machine learning in Delemas (practice)
Machine learning / classification related techniques
Machine Learning: Supervised --Linear Regression
Basics of Machine Learning (Notes)
Machine learning beginners tried RBM
[Machine learning] Understanding random forest
Machine learning with Python! Preparation
Machine Learning Study Resource Notepad
Machine learning ② Naive Bayes Summary
Understand machine learning ~ ridge regression ~.