Continuing from the algorithms taken up in "Classification of Machine Learning", I will go through the theory step by step, implement it in Python, and analyze it with scikit-learn. I am writing this for my own study, so please forgive any mistakes.
This time the topic is the **Support Vector Machine**. Although its history is long, it is still a popular technique in the field of machine learning. It has high generalization performance, and it was arguably the strongest algorithm until deep learning overtook it at ILSVRC 2012. I think it is well worth knowing.
I referred to the following sites this time. Thank you very much. There seem to be many other good resources, so please refer to those as well.
I tried to understand the theory properly, but it turned out to be quite involved, and there are already articles that cover the theory thoroughly, so here I will describe only the essence.
A support vector machine is a supervised binary classifier, like the perceptron or logistic regression. By introducing the concept of the support vector, it achieves high generalization performance on unknown data, and by using the kernel method it can also handle problems that cannot be separated linearly (XOR and the like). Roughly speaking, it is a classifier obtained by bringing margin maximization and the kernel method to the perceptron.
Consider a binary classification problem on the following two features. The blue and red circles are the training data: blue points are labeled 1 (**positive examples**) and red points -1 (**negative examples**). Let the green line drawn on the boundary be $y=ax+b$, and call the distance from this line to the nearest training points the margin.
The task of the support vector machine is to find, from the training data, the $a$ and $b$ that maximize this margin. In reality the data are rarely this convenient and often cannot be split cleanly, but let us start with this simple case.
The reason support vector machines classify well in a variety of situations is that, thanks to this margin maximization, they generalize nicely to unknown data, and because only the data near the boundary (the support vectors) are used, outliers do not affect the result very much.
Given $N$ training data $\boldsymbol{x}=(x_0,x_1,\cdots,x_{N-1})$ with classification labels $\boldsymbol{t}=(t_0,t_1,\cdots,t_{N-1})$, write the boundary (called the discriminant function) as $g(x)=\boldsymbol{w}^Tx+w_0$, where $\boldsymbol{w}$ is the weight vector for $\boldsymbol{x}$. The distance from a point $x_n$ to the boundary is $\frac{t_ng(x_n)}{\|\boldsymbol{w}\|}$, and the margin is the smallest of these distances.
To maximize the margin, a smaller denominator $\|\boldsymbol{w}\|$ looks better, but rescaling $\boldsymbol{w}$ and $w_0$ together does not change the boundary, so the solution would not be uniquely determined; we therefore add a constraint to fix the scale.
Requiring $g(x)=1$ for the positive examples closest to the boundary and $g(x)=-1$ for the closest negative examples,
the constraints can be written as $t_i(\boldsymbol{w}^Tx_i+w_0)\geq 1$ for all $i$, and the problem becomes
\min_{\boldsymbol{w},w_0}\ \frac{1}{2}\|\boldsymbol{w}\|^2 \quad \mathrm{s.t.}\quad t_i(\boldsymbol{w}^Tx_i+w_0)\geq 1
since maximizing the margin $1/\|\boldsymbol{w}\|$ is the same as minimizing $\frac{1}{2}\|\boldsymbol{w}\|^2$.
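Before moving on to the Lagrangian, here is a tiny numerical sanity check of the discriminant function and the distance formula above. The values of $\boldsymbol{w}$, $w_0$ and $x$ are made up purely for illustration, not learned from anything:

```python
import numpy as np

# Made-up weight vector, bias and point, purely for illustration
w = np.array([1.0, 2.0])
w0 = -1.0
x = np.array([3.0, 0.5])

g = w @ x + w0                    # discriminant value g(x); its sign is the predicted class
dist = g / np.linalg.norm(w)      # signed distance from x to the boundary g(x) = 0
print(np.sign(g), dist)
```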
To minimize $\frac{1}{2}\|\boldsymbol{w}\|^2$ under inequality constraints like these, we use the method of Lagrange (undetermined) multipliers. The idea is that solving a rewritten problem, called the **dual problem**, is equivalent to solving the original one.
Introducing a Lagrange multiplier $\lambda_i\geq 0$ for each constraint, the Lagrangian is
L(\boldsymbol{w},w_0,\boldsymbol{\lambda})=\frac{1}{2}\|\boldsymbol{w}\|^2-\sum_{i=1}^{N}\lambda_i \{ t_i(\boldsymbol{w}^Tx_i+w_0)-1\}
Partially differentiating this with respect to $\boldsymbol{w}$ and $w_0$, setting the derivatives to zero, and substituting back leaves an expression in $\lambda$ only:
L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_mx_n^Tx_m
is derived. The constraints become
\lambda_n\geq 0 \\
\sum_{i=1}^N\lambda_it_i=0
With this, the problem has been replaced by finding the $\lambda$ that maximizes $L(\lambda)$.
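To make the dual objective a bit more concrete, here is a small sketch that simply evaluates $L(\lambda)$ for a toy data set. The $\lambda$ values are an arbitrary feasible-looking guess, not an optimum:

```python
import numpy as np

# Toy data: two positive and two negative points
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, -0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
lam = np.array([0.3, 0.1, 0.3, 0.1])   # satisfies lam >= 0 and sum(lam * t) == 0

K = X @ X.T                            # Gram matrix of x_n^T x_m (linear kernel)
L = lam.sum() - 0.5 * (lam * t) @ K @ (lam * t)
print(L)                               # value of the dual objective L(lambda)
```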
The weight vector is recovered from $\lambda$: setting the partial derivative of the Lagrangian with respect to $\boldsymbol{w}$ to zero gives
w=\sum_{i=1}^{N}\lambda_it_ix_i
Now that we have $\boldsymbol{w}$, we still need $w_0$. For any support vector $x_n$ (a point lying exactly on the margin),
t_n(\sum_{m} \lambda_mt_mx_m^Tx_n+w_0)=1
holds. Rearranging this and averaging over the $N_M$ support vectors gives
w_0=\frac{1}{N_M}\sum_{n}(t_n-\sum_{m}\lambda_mt_mx_m^Tx_n)
To summarize, once $\lambda$ is found,
w=\sum_{i=1}^{N}\lambda_it_ix_i \\
w_0=\frac{1}{N_M}\sum_{n}(t_n-\sum_{m}\lambda_mt_mx_m^Tx_n)
are computed, and the classifier is complete. So how do we find that $\lambda$ in the first place? It can be computed with a method called SMO (Sequential Minimal Optimization).
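Before getting into SMO, here is what that final step looks like numerically. The $\lambda$ values below are just placeholders standing in for an SMO result, chosen so the formulas have something to act on:

```python
import numpy as np

# Two toy points, one per class; lam is a stand-in for an SMO result
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
t = np.array([1.0, -1.0])
lam = np.array([0.25, 0.25])

w = (lam * t) @ X                      # w = sum_i lambda_i t_i x_i
sv = lam > 1e-8                        # support vectors are the points with lambda > 0
w0 = np.mean(t[sv] - X[sv] @ w)        # average of t_n - w^T x_n over the support vectors
print(w, w0)                           # -> [0.5 0.5] 0.0 for this toy data
```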
SMO
First, let me restate the problem to be solved.
L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_mx_n^Tx_m
subject to the constraints
\lambda_n\geq 0 ,\quad \sum_{i=1}^N\lambda_it_i=0
The $\lambda$ that maximizes this constrained $L(\lambda)$ is what we find with SMO.
According to [Wikipedia](https://ja.wikipedia.org/wiki/%E9%80%90%E6%AC%A1%E6%9C%80%E5%B0%8F%E5%95%8F%E9%A1%8C%E6%9C%80%E9%81%A9%E5%8C%96%E6%B3%95), SMO is called the **sequential minimal problem optimization method** in Japanese. Like gradient methods it approaches the solution iteratively, selecting two variables at a time and repeating the update until convergence. These two variables are called the working set, and how to choose them is the key to the method.
The flow is roughly: pick one multiplier $\lambda_1$ that violates the KKT conditions described below, pick a second multiplier $\lambda_2$, and update the pair analytically. This process is repeated until no variable violates the KKT conditions.
The Karush-Kuhn-Tucker conditions, hereafter the KKT conditions, are the optimality conditions that the first-order derivatives must satisfy.
In fact, the KKT conditions already appeared implicitly in the Lagrange multiplier method above.
For this problem they are the following five conditions:
1. $\frac{\partial L}{\partial \boldsymbol{w}}=0$
2. $\frac{\partial L}{\partial w_0}=0$
3. $t_i(\boldsymbol{w}^Tx_i+w_0)-1\geq 0$
4. $\lambda_i\geq 0$
5. $\lambda_i\{t_i(\boldsymbol{w}^Tx_i+w_0)-1\}=0$
In particular, the fifth condition is called the **complementarity condition**.
From the complementarity condition, $\lambda_n\{t_n(\boldsymbol{w}^Tx_n+w_0)-1\}=0$ must hold for every $n$ at the optimum, so we can pick a $\lambda$ that does not satisfy it; conversely, once no such violating $\lambda$ remains, we regard the solution as found. Call the multiplier selected here $\lambda_1$.
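As a concrete picture of "violating the complementarity condition", a sketch like the following (with made-up intermediate values for $\lambda$ and $w_0$) computes $\lambda_i\{t_ig(x_i)-1\}$ for every point; non-zero entries mark the candidates for $\lambda_1$:

```python
import numpy as np

# Made-up intermediate state of the multipliers (satisfies sum(lam * t) == 0)
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
t = np.array([1.0, 1.0, -1.0])
lam = np.array([0.25, 0.10, 0.35])

w = (lam * t) @ X
w0 = 0.0                               # bias also made up for this check
g = X @ w + w0
violation = lam * (t * g - 1)          # complementarity: should be 0 at the optimum
print(violation)                       # non-zero entries violate the KKT conditions
```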
The second variable is chosen as follows. First, compute $\boldsymbol{w}$ and $w_0$ from the current $\lambda$, together with the prediction error $E_n=g(x_n)-t_n$ for each point; $\lambda_2$ is then typically taken to be the multiplier whose error differs most from $E_1$, so that the update step is as large as possible.
Having decided to update $\lambda_1$ and $\lambda_2$, the constraint $\sum_i\lambda_it_i=0$ must keep holding, so
\lambda_1^{new}t_1+\lambda_2^{new}t_2=\lambda_1t_1+\lambda_2t_2
must be satisfied. The derivation splits into the cases $t_1=t_2$ and $t_1\ne t_2$ (the distinction mainly affects the clipping range mentioned below), and the update comes out as
\lambda_1^{new}=\lambda_1+\frac{t_1(E_2-E_1)}{x_1^Tx_1-2x_1^Tx_2+x_2^Tx_2} \\
\lambda_2^{new}=\lambda_2+t_1t_2(\lambda_1-\lambda_1^{new})
Strictly speaking, a clipping step is also needed to keep $\lambda_1$ and $\lambda_2$ within their feasible range, but I ran out of steam before working that out in detail.
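To make the whole procedure a bit more tangible, here is a rough, simplified SMO sketch for the hard-margin case, written under my own simplifying assumptions (random choice of the second variable, naive clipping so every $\lambda$ stays non-negative, a fixed iteration cap). It is only meant to mirror the update and clipping ideas above, not to be a faithful or efficient implementation:

```python
import numpy as np

def simplified_smo(X, t, n_iter=1000, tol=1e-8):
    """Crude hard-margin SMO: repeatedly pick two multipliers, apply the
    analytic update above, and clip so that every lambda stays >= 0."""
    N = len(t)
    lam = np.zeros(N)
    K = X @ X.T                                    # linear-kernel Gram matrix
    w0 = 0.0
    for _ in range(n_iter):
        changed = False
        for i in range(N):
            E_i = (lam * t) @ K[:, i] + w0 - t[i]  # E_i = g(x_i) - t_i
            j = np.random.choice([k for k in range(N) if k != i])
            E_j = (lam * t) @ K[:, j] + w0 - t[j]
            eta = K[i, i] - 2 * K[i, j] + K[j, j]  # ||x_i - x_j||^2
            if eta <= 0:
                continue
            lam_i = lam[i] + t[i] * (E_j - E_i) / eta      # unclipped update
            # clip so that both lambda_i and lambda_j remain non-negative
            if t[i] == t[j]:
                lam_i = np.clip(lam_i, 0.0, lam[i] + lam[j])
            else:
                lam_i = max(lam_i, max(0.0, lam[i] - lam[j]))
            if abs(lam_i - lam[i]) < tol:
                continue
            lam[j] += t[i] * t[j] * (lam[i] - lam_i)       # keeps sum(lam * t) fixed
            lam[i] = lam_i
            changed = True
            # refresh w0 from the current support vectors
            w = (lam * t) @ X
            sv = lam > tol
            w0 = np.mean(t[sv] - X[sv] @ w) if sv.any() else 0.0
        if not changed:
            break
    w = (lam * t) @ X
    return w, w0, lam
```

A serious implementation would of course use smarter working-set selection and a proper KKT-based convergence check.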
That is about as far as my energy for the theory goes, and this has already gotten long, so for the implementation I will simply use scikit-learn. Sorry. I will implement it from scratch myself someday.
The Python side is quite simple. In scikit-learn, classification with a linear support vector machine is done with sklearn.svm.LinearSVC. For clarity, this time we use a sample in which 50 positive and 50 negative examples are cleanly separated.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import svm
fig, ax = plt.subplots()
# 50 positive points scattered in the first quadrant, 50 negative points in the third
x1_1 = np.ones(50) + 10*np.random.random(50)
x1_2 = np.ones(50) + 10*np.random.random(50)
x2_1 = -np.ones(50) - 10*np.random.random(50)
x2_2 = -np.ones(50) - 10*np.random.random(50)
x1 = np.c_[x1_1, x1_2]   # positive examples (label 1)
x2 = np.c_[x2_1, x2_2]   # negative examples (label -1)
y = np.array(np.r_[np.ones(50), -np.ones(50)])
# Fit a linear SVM to the stacked training data
model = svm.LinearSVC()
model.fit(np.array(np.r_[x1, x2]), y)
ax.scatter(x1[:,0],x1[:,1],marker='o',color='g',s=100)
ax.scatter(x2[:,0],x2[:,1],marker='s',color='b',s=100)
# Draw the learned decision boundary w[0]*x + w[1]*y + intercept = 0
w = model.coef_[0]
x_fig = np.linspace(-12., 12., 100)
y_fig = [-w[0]/w[1]*xi - model.intercept_[0]/w[1] for xi in x_fig]
ax.plot(x_fig, y_fig)
plt.show()
```
The data are classified as shown above. With samples this well separated, that is only natural.
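As a quick check, the fitted model can also be asked about new points (the coordinates here are made up); `predict` returns the class and `decision_function` the signed score:

```python
X_new = np.array([[5.0, 5.0], [-5.0, -5.0]])
print(model.predict(X_new))            # should come out as [ 1. -1.]
print(model.decision_function(X_new))  # signed distance-like values of g(x)
```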
I ran out of steam toward the end, but I think I got a feel for the idea (or maybe not, lol). This time we derived only the most basic case: data that are linearly separable and where positive and negative examples can be split cleanly (the hard margin).
For more advanced, realistic support vector machines we also need to handle data that cannot be separated linearly (the kernel method) and data where positive and negative examples cannot be split cleanly (the soft margin), but the underlying theory is almost the same as what we did this time.
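For reference, in scikit-learn that territory corresponds to sklearn.svm.SVC, which takes a kernel choice and the soft-margin parameter C. A minimal, untuned example reusing the sample data above (the kernel and C values are arbitrary picks, not recommendations):

```python
from sklearn import svm

# Kernel SVM with a soft margin; kernel and C are arbitrary example values
model_rbf = svm.SVC(kernel='rbf', C=1.0)
model_rbf.fit(np.array(np.r_[x1, x2]), y)   # reusing the sample data from above
print(model_rbf.score(np.array(np.r_[x1, x2]), y))
```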
I would like to cover those topics properly from the next article onward.
**Addendum**: the follow-up article has now been written.