This article works through the mathematics of support vector machines and then tries them out with scikit-learn.
It assumes you have already learned calculus and linear algebra.
A support vector machine determines the decision boundary from a small subset of the training data, called the support vectors, and is trained to maximize the margin between those support vectors and the decision boundary.
Support vector machines are linear discriminative models, but thanks to the kernel trick they extend naturally to nonlinear classification problems and are widely used as excellent classifiers.
Originally a binary classification model, they are also used for multiclass classification, regression, and anomaly detection.
First, consider the binary linear classification problem. Let the training data be $x = (x_1, ..., x_n)$ and the target values be $t = (t_1, ..., t_n)$ with $t_i \in \{-1, 1\}$. With parameters $w$ and bias $b$, the output $y$ of the linear discriminative model is
y(x) = w^T x + b
Here we write the bias $b$ explicitly. Assuming the data are linearly separable, $y(x_i) > 0$ when $t_i = 1$ and $y(x_i) < 0$ when $t_i = -1$, so $t_i y(x_i) > 0$ holds for all training data.
The support vector machine maximizes the margin, i.e. the distance from the decision boundary $y = w^T x + b$ to the nearest training data point, as shown in the figure below.
From the formula for the distance between a point and a plane, the distance between a training data point and the decision boundary can be expressed as
\frac{|w^Tx + b|}{||w||}
Furthermore, considering only training data that are correctly classified, i.e. $t_i y(x_i) > 0$, this distance becomes
\frac{t_i ( w^Tx_i + b )}{||w||}
Since the margin is the distance to the training data point closest to the decision boundary, and we want the parameters $w, b$ that maximize it, the problem can be written as
\newcommand{\argmax}{\mathop{\rm argmax}\limits}
\argmax_{w, b} \frac{1}{||w||} \min_i \{ t_i (w^Tx_i + b) \}
Now, if we scale $w$ and $b$ so that $t_i(w^T x_i + b) = 1$ for the training point closest to the boundary, all of the training data satisfy the following constraint.
t_i(w^Tx_i + b) \geq 1
Furthermore, since maximizing $1/||w||$ is equivalent to minimizing $\frac{1}{2}||w||^2$, the problem becomes the following constrained minimization:
\newcommand{\argmin}{\mathop{\rm argmin}\limits}
\argmin_{w, b} \frac{1}{2}||w||^2 \\
t_i(w^Tx_i + b) \geq 1
To find the parameters of the support vector machine, we therefore have to solve a constrained optimization problem. Introducing Lagrange multipliers $a = (a_1, ..., a_n)$ with $a_i \geq 0$, the Lagrangian is
L(w, b, a) = \frac{1}{2} ||w||^2 - \sum^n_{i=1} a_i \{ t_i(w^Tx_i + b) - 1 \}
Differentiating this $L$ with respect to the parameters $w$ and $b$ and setting the derivatives to zero gives
\begin{align}
\frac{\partial L}{\partial w} &= w - \sum^n_{i=1}a_i t_i x_i = 0\\
w &= \sum^n_{i=1}a_i t_i x_i \\
\frac{\partial L}{\partial b} &= \sum^n_{i=1}a_i t_i = 0 \\
0 &= \sum^n_{i=1}a_i t_i
\end{align}
Substituting these back into $L(w, b, a)$ to eliminate $w$ and $b$ gives
\begin{align}
L(a) &= \frac{1}{2} w^T w - \sum^n_{i=1} a_i t_i w^T x_i - b \sum^n_{i=1} a_i t_i + \sum^n_{i=1} a_i \\
&= \frac{1}{2} w^T w - w^T w + \sum^n_{i=1} a_i \\
&= \sum^n_{i=1} a_i - \frac{1}{2} w^T w \\
&= \sum^n_{i=1} a_i - \frac{1}{2} \sum^n_{i,j=1} a_i a_j t_i t_j x^T_i x_j
\end{align}
Therefore, we have to solve the following constrained quadratic programming problem, which is called the dual problem of the primal problem. Since the primal and dual problems correspond one-to-one, it is sufficient to solve the dual problem instead of the primal one.
L(a) = \sum^n_{i=1} a_i - \frac{1}{2} \sum^n_{i,j = 1} a_i a_j t_i t_j k(x_i, x_j) \\
a_i \geq 0 \\
\sum^n_{i=1} a_i t_i = 0
In the above equation, $k(x_i, x_j) = x^T_i x_j$ is the kernel function, here a linear kernel. Solving this problem yields a sparse solution: $a_i \neq 0$ only for the small number of data points that become support vectors, and $a_i = 0$ for all the others.
If, to enforce the equality constraint of the dual problem, we subtract a penalty term $\frac{1}{2} \lambda \sum^n_{i,j=1} a_i a_j t_i t_j$ with hyperparameter $\lambda$ from the objective, we get
L(a) = \sum^n_{i=1} a_i - \frac{1}{2} \sum^n_{i,j = 1} a_i a_j t_i t_j k(x_i, x_j) - \frac{1}{2} \lambda \sum^n_{i,j = 1} a_i a_j t_i t_j
Differentiating $L(a)$ with respect to $a_i$ gives
\frac{\partial L}{\partial a_i} = 1 - \sum^n_{j=1} a_j t_i t_j k(x_i, x_j) - \lambda \sum^n_{j=1} a_j t_i t_j
Solving this maximization problem by gradient ascent with learning rate $\eta$, the update rule becomes
\begin{align}
\hat{a_i} &= a_i + \eta \frac{\partial L}{\partial a_i} \\
&= a_i + \eta \left( 1 - \sum^n_{j=1} a_j t_i t_j k(x_i, x_j) - \lambda \sum^n_{j=1} a_j t_i t_j \right)
\end{align}
Since this is a convex quadratic optimization problem, convergence is guaranteed. The trained support vector machine stores the support vectors and, at prediction time, classifies new data based on them.
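As a concrete illustration, the update rule above can be written in a few lines of NumPy. This is a minimal sketch assuming a linear kernel; the function name, the number of iterations, and the clipping step that keeps $a_i \geq 0$ (a projection onto the constraint) are additions for the example, not part of the derivation.

```python
import numpy as np

def train_dual_svm(X, t, eta=0.001, lam=1.0, n_iter=1000):
    """Gradient ascent on the penalized dual objective L(a).

    X: (n, d) training data, t: (n,) labels in {-1, +1}.
    eta is the learning rate, lam the penalty coefficient lambda.
    """
    n = X.shape[0]
    K = X @ X.T                  # linear kernel Gram matrix k(x_i, x_j)
    a = np.zeros(n)
    for _ in range(n_iter):
        # dL/da_i = 1 - sum_j a_j t_i t_j k(x_i, x_j) - lam * t_i * sum_j a_j t_j
        grad = 1.0 - t * (K @ (a * t)) - lam * t * np.sum(a * t)
        a = a + eta * grad       # gradient ascent step
        a = np.maximum(a, 0.0)   # keep the dual constraint a_i >= 0
    return a
```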
Since the parameter $w$ is obtained from $a$, all that remains is to derive the bias $b$. Given new data $x$, we classify with the following formula, where the sum runs over the $m$ support vectors:
y(x) = \sum^m_{j=1} a_j t_j k(x, x_j) + b
Since the support vectors satisfy $t_i y(x_i) = 1$,
t_i \left( \sum^m_{j=1} a_j t_j k(x_i, x_j) + b \right) = 1
holds. Here, using $t^2_i = 1$,
\begin{align}
t_i \left( \sum^m_{j=1} a_j t_j k(x_i, x_j) + b \right) &= t^2_i \\
\sum^m_{j=1} a_j t_j k(x_i, x_j) + b &= t_i \\
\end{align}
Therefore, averaging over the $n$ support vectors for numerical stability, the bias $b$ is
b = \frac{1}{n} \sum^n_{i=1} \left( t_i - \sum^m_{j=1} a_j t_j k(x_i, x_j) \right)
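Continuing the sketch above, the bias and the prediction formula can be written as follows. The small threshold used to pick out the support vectors and the averaging over the support vectors only are assumptions of this illustration.

```python
import numpy as np

def compute_bias(a, X, t):
    """Bias b averaged over the support vectors (those with a_i != 0)."""
    K = X @ X.T                              # linear kernel Gram matrix
    sv = a > 1e-6                            # support vector mask (threshold is an assumption)
    y_no_bias = K[:, sv] @ (a[sv] * t[sv])   # sum_j a_j t_j k(x_i, x_j) for every i
    return np.mean(t[sv] - y_no_bias[sv])

def predict(x_new, a, X, t, b):
    """Classify new points with y(x) = sum_j a_j t_j k(x, x_j) + b."""
    sv = a > 1e-6
    k = x_new @ X[sv].T                      # linear kernel between x_new and the support vectors
    return np.sign(k @ (a[sv] * t[sv]) + b)
```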
For datasets with 100,000 or more samples, it is recommended to set the loss argument of SGDClassifier to 'hinge', i.e. to train a linear support vector machine by stochastic gradient descent.
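For example, a linear SVM trained by stochastic gradient descent can be set up as follows; the use of the breast cancer data and the standardization pipeline are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# loss='hinge' makes SGDClassifier a linear SVM trained by
# stochastic gradient descent, which scales to very large datasets.
X, y = load_breast_cancer(return_X_y=True)
sgd_svm = make_pipeline(StandardScaler(), SGDClassifier(loss='hinge'))
sgd_svm.fit(X, y)
```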
The support vector machine described so far is a hard-margin SVM; a soft-margin SVM that allows a small number of misclassifications has also been proposed. In the soft-margin formulation, we introduce slack variables $\xi_i$ and a hyperparameter $C$ and solve the following problem (a minimal scikit-learn sketch follows the constraints below).
\argmin_{w, b, \xi} \frac{1}{2} ||w||^2 + C \sum^n_{i=1} \xi_i \\
\xi_i \geq 0 \\
t_i (w^T x_i + b) \geq 1 - \xi_i \\
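In scikit-learn, the soft-margin trade-off appears as the C parameter of SVC; a minimal sketch, with the value of C chosen arbitrarily:

```python
from sklearn.svm import SVC

# Smaller C allows more margin violations (wider margin); larger C
# penalizes violations more strongly (narrower margin).
soft_margin_svm = SVC(kernel='linear', C=1.0)
```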
Up to the previous section we considered the linearly separable case, but real-world problems are rarely linearly separable. To perform nonlinear classification with a support vector machine, we can solve exactly the same optimization problem as in the linear case, simply replacing the kernel function $K(x_i, x_j)$. Computing inner products in the feature space through the kernel function, without ever carrying out the nonlinear transformation explicitly, is called the kernel trick.
A kernel function used in a support vector machine must satisfy the following condition for any choice of points and coefficients $a_i$, i.e. the Gram matrix it produces must be positive semi-definite:
\sum_{i,j} a_i a_j K(x_i, x_j) \geq 0
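As a quick numerical check of this condition, the Gram matrix produced by a kernel can be tested for positive semi-definiteness with NumPy; the helper name and tolerance below are assumptions for the example.

```python
import numpy as np

def gram_is_psd(K, tol=1e-10):
    """Return True if the symmetric Gram matrix K is positive semi-definite."""
    eigvals = np.linalg.eigvalsh(K)        # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals >= -tol))   # allow tiny negative values from round-off
```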
The typical kernel functions are listed below.
The linear kernel corresponds to the linear support vector machine described in the previous sections.
k(x_i, x_j) = x_i \cdot x_j
The polynomial kernel maps the data to a higher-dimensional space of degree $p$. The hyperparameters $p$, $\gamma$, and $c$ are added.
k(x_i, x_j) = (\gamma x_i \cdot x_j + c)^p
The radial basis function (RBF) kernel theoretically maps the data to an infinite-dimensional space and is the most commonly used nonlinear kernel. Its hyperparameter $\gamma$ needs to be tuned.
k(x_i, x_j) = \exp \left( -\gamma || x_i - x_j ||^2 \right)
The sigmoid kernel is positive semi-definite rather than positive definite, but it has similarities to neural networks.
k(x_i, x_j) = \tanh (\gamma x_i \cdot x_j + c)
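In scikit-learn, the four kernels above correspond to the kernel argument of SVC, where degree, gamma, and coef0 play the roles of $p$, $\gamma$, and $c$ in the formulas. The parameter values below are illustrative, not tuned.

```python
from sklearn.svm import SVC

svm_linear  = SVC(kernel='linear')
svm_poly    = SVC(kernel='poly', degree=3, gamma='scale', coef0=1.0)
svm_rbf     = SVC(kernel='rbf', gamma='scale')
svm_sigmoid = SVC(kernel='sigmoid', gamma='scale', coef0=0.0)
```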
The execution environment was as follows:
- CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
- OS: Windows 10 Pro 1909
- Python 3.6.6
- Matplotlib 3.1.1
- Numpy 1.19.2
- Scikit-learn 0.23.2
The implemented programs are published on GitHub.
svm_classification.py
svm_regression.py
svm_anomaly.py
For binary classification, I used the breast cancer dataset, which was also used in the Logistic Regression article. The evaluation results are shown below, followed by a minimal evaluation sketch.
Accuracy: 94.74%
Precision (positive predictive value, PPV): 96.92%
Recall (sensitivity, true positive rate, TPR): 94.03%
Specificity (true negative rate, TNR): 95.74%
Negative predictive value (NPV): 91.84%
F-score: 95.45%
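The metrics above can be reproduced along the following lines; the train/test split ratio, random_state, and kernel settings are assumptions, so the exact numbers will differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = SVC(kernel='rbf', gamma='scale').fit(X_train, y_train)
y_pred = clf.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('Accuracy   ', accuracy_score(y_test, y_pred))
print('Precision  ', tp / (tp + fp))
print('Recall     ', tp / (tp + fn))
print('Specificity', tn / (tn + fp))
print('NPV        ', tn / (tn + fn))
print('F-Score    ', f1_score(y_test, y_pred))
```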
The smaller the hyperparameter $C$, the wider the margin around the decision boundary; the larger $C$, the narrower the margin.
The decision boundary is a straight line with the linear kernel, roughly circular with the polynomial kernel, curved with the sigmoid kernel, and more complex with the RBF kernel.
For the RBF kernel, which is most often used for nonlinear classification, smaller values of the hyperparameter $\gamma$ give looser decision boundaries and larger values give tighter ones. Since $\gamma$ can be regarded as the reciprocal of the variance, this is a convincing result.
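In practice, $C$ and $\gamma$ are usually tuned together, for example with a grid search; the grid values below are illustrative assumptions rather than the settings used in the experiments above.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
# search.fit(X_train, y_train); search.best_params_ then gives the chosen C and gamma.
```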
For multiclass classification, I used the iris dataset.
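A minimal sketch of the multiclass case; SVC handles it internally with a one-vs-one scheme, and the kernel choice here is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
multi_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)  # one-vs-one under the hood
```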
For the regression problem, the data were generated by adding random noise to a sine wave, and the experiment was run with $k = 5$.
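A minimal sketch of the regression setup, assuming an RBF-kernel SVR; the noise scale and SVR hyperparameters are assumptions, not the exact settings used for the figure.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 100))[:, None]     # inputs over one period of the sine wave
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)  # sine wave plus random noise

svr = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X, y)
y_pred = svr.predict(X)
```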
Anomaly detection uses a one-class support vector machine. In the figure below, regions other than the blue region are treated as outliers.
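A minimal sketch of one-class SVM anomaly detection; the generated data and the nu and gamma values are assumptions for the example.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # "normal" training data

oc_svm = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_normal)
labels = oc_svm.predict(X_normal)   # +1 = inlier, -1 = outlier (anomaly)
```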
Reference: Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.