Continuing from the algorithms taken up in "Classification of Machine Learning", I will go through the theory step by step, implement it in Python, and analyze it with scikit-learn. I am writing this for my own study, so please forgive any mistakes.
This time the topic is the **Support Vector Machine**. Although its history is long, it is still a popular technique in the field of machine learning. It has high generalization performance, and it was arguably the strongest algorithm until deep learning overtook it at ILSVRC 2012. I think it is well worth knowing.
I referred to the following sites this time. Thank you very much. There seem to be many other good resources, so please refer to those as well.
I tried to understand the theory properly, but it turned out to be quite involved, and there are already articles that cover the theory thoroughly, so here I will describe only the essence.
A support vector machine is a supervised binary classifier, like the perceptron or logistic regression. By introducing the concept of the support vector, it achieves high generalization performance on unknown data, and by using the kernel method it can also handle problems that cannot be separated linearly (XOR and the like). Roughly speaking, it is a classifier obtained by bringing margin maximization and the kernel method to the perceptron.
Consider a binary classification problem on the following two features. The blue and red circles are the training data: blue points are labeled 1 (**positive examples**) and red points -1 (**negative examples**). Let the green line drawn on the boundary be $y=ax+b$, and call the distance from this line to the nearest training points the margin.
The task of the support vector machine is to find, from the training data, the $a$ and $b$ that maximize this margin. In reality the data are rarely this convenient and often cannot be split cleanly, but let us start with this simple case.
The reason support vector machines classify well in a variety of situations is that, thanks to this margin maximization, they generalize nicely to unknown data, and because only the data near the boundary (the support vectors) are used, outliers do not affect the result very much.
Given $N$ training data $\boldsymbol{x}=(x_0,x_1,\cdots,x_{N-1})$ with classification labels $\boldsymbol{t}=(t_0,t_1,\cdots,t_{N-1})$, write the boundary (called the discriminant function) as $g(x)=\boldsymbol{w}^Tx+w_0$, where $\boldsymbol{w}$ is the weight vector for $\boldsymbol{x}$. The distance from a point $x_n$ to the boundary is $\frac{t_ng(x_n)}{\|\boldsymbol{w}\|}$, and the margin is the smallest of these distances.
To maximize the margin, a smaller denominator $\|\boldsymbol{w}\|$ looks better, but rescaling $\boldsymbol{w}$ and $w_0$ together does not change the boundary, so the solution would not be uniquely determined; we therefore add a constraint to fix the scale.
Requiring $g(x)=1$ for the positive examples closest to the boundary and $g(x)=-1$ for the closest negative examples,
the constraints can be written as $t_i(\boldsymbol{w}^Tx_i+w_0)\geq 1$ for all $i$, and the problem becomes
\min_{\boldsymbol{w},w_0}\ \frac{1}{2}\|\boldsymbol{w}\|^2 \quad \mathrm{s.t.}\quad t_i(\boldsymbol{w}^Tx_i+w_0)\geq 1
since maximizing the margin $1/\|\boldsymbol{w}\|$ is the same as minimizing $\frac{1}{2}\|\boldsymbol{w}\|^2$.
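Before moving on to the Lagrangian, here is a tiny numerical sanity check of the discriminant function and the distance formula above. The values of $\boldsymbol{w}$, $w_0$ and $x$ are made up purely for illustration, not learned from anything:

```python
import numpy as np

# Made-up weight vector, bias and point, purely for illustration
w = np.array([1.0, 2.0])
w0 = -1.0
x = np.array([3.0, 0.5])

g = w @ x + w0                    # discriminant value g(x); its sign is the predicted class
dist = g / np.linalg.norm(w)      # signed distance from x to the boundary g(x) = 0
print(np.sign(g), dist)
```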
To minimize $\frac{1}{2}\|\boldsymbol{w}\|^2$ under inequality constraints like these, we use the method of Lagrange (undetermined) multipliers. The idea is that solving a rewritten problem, called the **dual problem**, is equivalent to solving the original one.
Introducing a Lagrange multiplier $\lambda_i\geq 0$ for each constraint, the Lagrangian is
L(\boldsymbol{w},w_0,\boldsymbol{\lambda})=\frac{1}{2}\|\boldsymbol{w}\|^2-\sum_{i=1}^{N}\lambda_i \{ t_i(\boldsymbol{w}^Tx_i+w_0)-1\}
Partially differentiating this with respect to $\boldsymbol{w}$ and $w_0$, setting the derivatives to zero, and substituting back leaves an expression in $\lambda$ only:
L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_mx_n^Tx_m
is derived. The constraints become
\lambda_n\geq 0 \\
\sum_{i=1}^N\lambda_it_i=0
With this, the problem has been replaced by finding the $\lambda$ that maximizes $L(\lambda)$.
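To make the dual objective a bit more concrete, here is a small sketch that simply evaluates $L(\lambda)$ for a toy data set. The $\lambda$ values are an arbitrary feasible-looking guess, not an optimum:

```python
import numpy as np

# Toy data: two positive and two negative points
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, -0.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
lam = np.array([0.3, 0.1, 0.3, 0.1])   # satisfies lam >= 0 and sum(lam * t) == 0

K = X @ X.T                            # Gram matrix of x_n^T x_m (linear kernel)
L = lam.sum() - 0.5 * (lam * t) @ K @ (lam * t)
print(L)                               # value of the dual objective L(lambda)
```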
The weight vector is recovered from $\lambda$: setting the partial derivative of the Lagrangian with respect to $\boldsymbol{w}$ to zero gives
w=\sum_{i=1}^{N}\lambda_it_ix_i
Now that we have $\boldsymbol{w}$, we still need $w_0$. For any support vector $x_n$ (a point lying exactly on the margin),
t_n(\sum_{m} \lambda_mt_mx_m^Tx_n+w_0)=1
holds. Rearranging this and averaging over the $N_M$ support vectors gives
w_0=\frac{1}{N_M}\sum_{n}(t_n-\sum_{m}\lambda_mt_mx_m^Tx_n)
To summarize, once $\lambda$ is found,
w=\sum_{i=1}^{N}\lambda_it_ix_i \\
w_0=\frac{1}{N_M}\sum_{n}(t_n-\sum_{m}\lambda_mt_mx_m^Tx_n)
are computed, and the classifier is complete. So how do we find that $\lambda$ in the first place? It can be computed with a method called SMO (Sequential Minimal Optimization).
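Before getting into SMO, here is what that final step looks like numerically. The $\lambda$ values below are just placeholders standing in for an SMO result, chosen so the formulas have something to act on:

```python
import numpy as np

# Two toy points, one per class; lam is a stand-in for an SMO result
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
t = np.array([1.0, -1.0])
lam = np.array([0.25, 0.25])

w = (lam * t) @ X                      # w = sum_i lambda_i t_i x_i
sv = lam > 1e-8                        # support vectors are the points with lambda > 0
w0 = np.mean(t[sv] - X[sv] @ w)        # average of t_n - w^T x_n over the support vectors
print(w, w0)                           # -> [0.5 0.5] 0.0 for this toy data
```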
SMO
First, let me restate the problem to be solved.
L(\lambda)=\sum_{n=1}^{N}\lambda_n-\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\lambda_n\lambda_mt_nt_mx_n^Tx_m
subject to the constraints
\lambda_n\geq 0 ,\quad \sum_{i=1}^N\lambda_it_i=0
The $\lambda$ that maximizes this constrained $L(\lambda)$ is what we find with SMO.
According to [Wikipedia](https://ja.wikipedia.org/wiki/%E9%80%90%E6%AC%A1%E6%9C%80%E5%B0%8F%E5%95%8F%E9%A1%8C%E6%9C%80%E9%81%A9%E5%8C%96%E6%B3%95), SMO is called the **sequential minimal problem optimization method** in Japanese. Like gradient methods it approaches the solution iteratively, selecting two variables at a time and repeating the update until convergence. These two variables are called the working set, and how to choose them is the key to the method.
The flow is roughly: pick one multiplier $\lambda_1$ that violates the KKT conditions described below, pick a second multiplier $\lambda_2$, and update the pair analytically. This process is repeated until no variable violates the KKT conditions.
The Karush-Kuhn-Tucker conditions, hereafter the KKT conditions, are the optimality conditions that the first-order derivatives must satisfy.
In fact, the KKT conditions already appeared implicitly in the Lagrange multiplier method above.
For this problem they are the following five conditions:
1. $\frac{\partial L}{\partial \boldsymbol{w}}=0$
2. $\frac{\partial L}{\partial w_0}=0$
3. $t_i(\boldsymbol{w}^Tx_i+w_0)-1\geq 0$
4. $\lambda_i\geq 0$
5. $\lambda_i\{t_i(\boldsymbol{w}^Tx_i+w_0)-1\}=0$
In particular, the fifth condition is called the **complementarity condition**.
From the complementarity condition, $\lambda_n\{t_n(\boldsymbol{w}^Tx_n+w_0)-1\}=0$ must hold for every $n$ at the optimum, so we can pick a $\lambda$ that does not satisfy it; conversely, once no such violating $\lambda$ remains, we regard the solution as found. Call the multiplier selected here $\lambda_1$.
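As a concrete picture of "violating the complementarity condition", a sketch like the following (with made-up intermediate values for $\lambda$ and $w_0$) computes $\lambda_i\{t_ig(x_i)-1\}$ for every point; non-zero entries mark the candidates for $\lambda_1$:

```python
import numpy as np

# Made-up intermediate state of the multipliers (satisfies sum(lam * t) == 0)
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
t = np.array([1.0, 1.0, -1.0])
lam = np.array([0.25, 0.10, 0.35])

w = (lam * t) @ X
w0 = 0.0                               # bias also made up for this check
g = X @ w + w0
violation = lam * (t * g - 1)          # complementarity: should be 0 at the optimum
print(violation)                       # non-zero entries violate the KKT conditions
```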
The second variable is chosen as follows. First, compute $\boldsymbol{w}$ and $w_0$ from the current $\lambda$, together with the prediction error $E_n=g(x_n)-t_n$ for each point; $\lambda_2$ is then typically taken to be the multiplier whose error differs most from $E_1$, so that the update step is as large as possible.
Having decided to update $\lambda_1$ and $\lambda_2$, the constraint $\sum_i\lambda_it_i=0$ must keep holding, so
\lambda_1^{new}t_1+\lambda_2^{new}t_2=\lambda_1t_1+\lambda_2t_2
must be satisfied. The derivation splits into the cases $t_1=t_2$ and $t_1\ne t_2$ (the distinction mainly affects the clipping range mentioned below), and the update comes out as
\lambda_1^{new}=\lambda_1+\frac{t_1(E_2-E_1)}{x_1^Tx_1-2x_1^Tx_2+x_2^Tx_2} \\
\lambda_2^{new}=\lambda_2+t_1t_2(\lambda_1-\lambda_1^{new})
Strictly speaking, a clipping step is also needed to keep $\lambda_1$ and $\lambda_2$ within their feasible range, but I ran out of steam before working that out in detail.
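To make the whole procedure a bit more tangible, here is a rough, simplified SMO sketch for the hard-margin case, written under my own simplifying assumptions (random choice of the second variable, naive clipping so every $\lambda$ stays non-negative, a fixed iteration cap). It is only meant to mirror the update and clipping ideas above, not to be a faithful or efficient implementation:

```python
import numpy as np

def simplified_smo(X, t, n_iter=1000, tol=1e-8):
    """Crude hard-margin SMO: repeatedly pick two multipliers, apply the
    analytic update above, and clip so that every lambda stays >= 0."""
    N = len(t)
    lam = np.zeros(N)
    K = X @ X.T                                    # linear-kernel Gram matrix
    w0 = 0.0
    for _ in range(n_iter):
        changed = False
        for i in range(N):
            E_i = (lam * t) @ K[:, i] + w0 - t[i]  # E_i = g(x_i) - t_i
            j = np.random.choice([k for k in range(N) if k != i])
            E_j = (lam * t) @ K[:, j] + w0 - t[j]
            eta = K[i, i] - 2 * K[i, j] + K[j, j]  # ||x_i - x_j||^2
            if eta <= 0:
                continue
            lam_i = lam[i] + t[i] * (E_j - E_i) / eta      # unclipped update
            # clip so that both lambda_i and lambda_j remain non-negative
            if t[i] == t[j]:
                lam_i = np.clip(lam_i, 0.0, lam[i] + lam[j])
            else:
                lam_i = max(lam_i, max(0.0, lam[i] - lam[j]))
            if abs(lam_i - lam[i]) < tol:
                continue
            lam[j] += t[i] * t[j] * (lam[i] - lam_i)       # keeps sum(lam * t) fixed
            lam[i] = lam_i
            changed = True
            # refresh w0 from the current support vectors
            w = (lam * t) @ X
            sv = lam > tol
            w0 = np.mean(t[sv] - X[sv] @ w) if sv.any() else 0.0
        if not changed:
            break
    w = (lam * t) @ X
    return w, w0, lam
```

A serious implementation would of course use smarter working-set selection and a proper KKT-based convergence check.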
That is about as far as my energy for the theory goes, and this has already gotten long, so for the implementation I will simply use scikit-learn. Sorry. I will implement it from scratch myself someday.
The Python side is quite simple. In scikit-learn, classification with a linear support vector machine is done with sklearn.svm.LinearSVC. For clarity, this time we use a sample in which 50 positive and 50 negative examples are cleanly separated.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import svm
fig, ax = plt.subplots()
# 50 positive points scattered in the first quadrant, 50 negative points in the third
x1_1 = np.ones(50) + 10*np.random.random(50)
x1_2 = np.ones(50) + 10*np.random.random(50)
x2_1 = -np.ones(50) - 10*np.random.random(50)
x2_2 = -np.ones(50) - 10*np.random.random(50)
x1 = np.c_[x1_1, x1_2]   # positive examples (label 1)
x2 = np.c_[x2_1, x2_2]   # negative examples (label -1)
y = np.array(np.r_[np.ones(50), -np.ones(50)])
# Fit a linear SVM to the stacked training data
model = svm.LinearSVC()
model.fit(np.array(np.r_[x1, x2]), y)
ax.scatter(x1[:,0],x1[:,1],marker='o',color='g',s=100)
ax.scatter(x2[:,0],x2[:,1],marker='s',color='b',s=100)
# Draw the learned decision boundary w[0]*x + w[1]*y + intercept = 0
w = model.coef_[0]
x_fig = np.linspace(-12., 12., 100)
y_fig = [-w[0]/w[1]*xi - model.intercept_[0]/w[1] for xi in x_fig]
ax.plot(x_fig, y_fig)
plt.show()
```
The data are classified as shown above. With samples this well separated, that is only natural.
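As a quick check, the fitted model can also be asked about new points (the coordinates here are made up); `predict` returns the class and `decision_function` the signed score:

```python
X_new = np.array([[5.0, 5.0], [-5.0, -5.0]])
print(model.predict(X_new))            # should come out as [ 1. -1.]
print(model.decision_function(X_new))  # signed distance-like values of g(x)
```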
I ran out of steam toward the end, but I think I got a feel for the idea (or maybe not, lol). This time we derived only the most basic case: data that are linearly separable and where positive and negative examples can be split cleanly (the hard margin).
For more advanced, realistic support vector machines we also need to handle data that cannot be separated linearly (the kernel method) and data where positive and negative examples cannot be split cleanly (the soft margin), but the underlying theory is almost the same as what we did this time.
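For reference, in scikit-learn that territory corresponds to sklearn.svm.SVC, which takes a kernel choice and the soft-margin parameter C. A minimal, untuned example reusing the sample data above (the kernel and C values are arbitrary picks, not recommendations):

```python
from sklearn import svm

# Kernel SVM with a soft margin; kernel and C are arbitrary example values
model_rbf = svm.SVC(kernel='rbf', C=1.0)
model_rbf.fit(np.array(np.r_[x1, x2]), y)   # reusing the sample data from above
print(model_rbf.score(np.array(np.r_[x1, x2]), y))
```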
I would like to cover those topics properly from the next article onward.
**Addendum**: the follow-up article has now been written.