This article works through the mathematics of support vector machines and then tries them out with scikit-learn.
It assumes you have already learned calculus and linear algebra.
A support vector machine determines the decision boundary from a small subset of the training data, called the support vectors, and is trained to maximize the margin between those support vectors and the decision boundary.
Support vector machines are linear discriminative models, but thanks to the kernel trick they extend naturally to nonlinear classification problems and are widely used as excellent classifiers.
Originally a binary classification model, they are also used for multiclass classification, regression, and anomaly detection.
First, consider the binary linear classification problem. Let the training data be $x = (x_1, ..., x_n)$ and the target values be $t = (t_1, ..., t_n)$ with $t_i \in \{-1, 1\}$. With parameters $w$ and bias $b$, the output $y$ of the linear discriminative model is
y(x) = w^T x + b
Here we write the bias $b$ explicitly. Assuming the data are linearly separable, $y(x_i) > 0$ when $t_i = 1$ and $y(x_i) < 0$ when $t_i = -1$, so $t_i y(x_i) > 0$ holds for all training data.
The support vector machine maximizes the margin, i.e. the distance from the decision boundary $y = w^T x + b$ to the nearest training data point, as shown in the figure below.
From the formula for the distance between a point and a plane, the distance between a training data point and the decision boundary can be expressed as
\frac{|w^Tx + b|}{||w||}
Furthermore, considering only training data that are correctly classified, i.e. $t_i y(x_i) > 0$, this distance becomes
\frac{t_i ( w^Tx_i + b )}{||w||}
Since the margin is the distance to the training data point closest to the decision boundary, and we want the parameters $w, b$ that maximize it, the problem can be written as
\newcommand{\argmax}{\mathop{\rm argmax}\limits}
\argmax_{w, b} \frac{1}{||w||} \min_i \{ t_i (w^Tx_i + b) \}
Now, if we scale $w$ and $b$ so that $t_i(w^T x_i + b) = 1$ for the training point closest to the boundary, all of the training data satisfy the following constraint.
t_i(w^Tx_i + b) \geq 1
Furthermore, since maximizing $1/||w||$ is equivalent to minimizing $\frac{1}{2}||w||^2$, the problem becomes the following constrained minimization:
\newcommand{\argmin}{\mathop{\rm argmin}\limits}
\argmin_{w, b} \frac{1}{2}||w||^2 \\
t_i(w^Tx_i + b) \geq 1
To find the parameters of the support vector machine, we therefore have to solve a constrained optimization problem. Introducing Lagrange multipliers $a = (a_1, ..., a_n)$ with $a_i \geq 0$, the Lagrangian is
L(w, b, a) = \frac{1}{2} ||w||^2 - \sum^n_{i=1} a_i \{ t_i(w^Tx_i + b) - 1 \}
Differentiating this $L$ with respect to the parameters $w$ and $b$ and setting the derivatives to zero gives
\begin{align}
\frac{\partial L}{\partial w} &= w - \sum^n_{i=1}a_i t_i x_i = 0\\
w &= \sum^n_{i=1}a_i t_i x_i \\
\frac{\partial L}{\partial b} &= \sum^n_{i=1}a_i t_i = 0 \\
0 &= \sum^n_{i=1}a_i t_i
\end{align}
Substituting these back into $L(w, b, a)$ to eliminate $w$ and $b$ gives
\begin{align}
L(a) &= \frac{1}{2} w^T w - \sum^n_{i=1} a_i t_i w^T x_i - b \sum^n_{i=1} a_i t_i + \sum^n_{i=1} a_i \\
&= \frac{1}{2} w^T w - w^T w + \sum^n_{i=1} a_i \\
&= \sum^n_{i=1} a_i - \frac{1}{2} w^T w \\
&= \sum^n_{i=1} a_i - \frac{1}{2} \sum^n_{i,j=1} a_i a_j t_i t_j x^T_i x_j
\end{align}
Therefore, we have to solve the following constrained quadratic programming problem, which is called the dual problem of the primal problem. Since the primal and dual problems correspond one-to-one, it is sufficient to solve the dual problem instead of the primal one.
L(a) = \sum^n_{i=1} a_i - \frac{1}{2} \sum^n_{i,j = 1} a_i a_j t_i t_j k(x_i, x_j) \\
a_i \geq 0 \\
\sum^n_{i=1} a_i t_i = 0
In the above equation, $k(x_i, x_j) = x^T_i x_j$ is the kernel function, here a linear kernel. Solving this problem yields a sparse solution: $a_i \neq 0$ only for the small number of data points that become support vectors, and $a_i = 0$ for all the others.
If, to enforce the equality constraint of the dual problem, we subtract a penalty term $\frac{1}{2} \lambda \sum^n_{i,j=1} a_i a_j t_i t_j$ with hyperparameter $\lambda$ from the objective, we get
L(a) = \sum^n_{i=1} a_i - \frac{1}{2} \sum^n_{i,j = 1} a_i a_j t_i t_j k(x_i, x_j) - \frac{1}{2} \lambda \sum^n_{i,j = 1} a_i a_j t_i t_j
Differentiating $L(a)$ with respect to $a_i$ gives
\frac{\partial L}{\partial a_i} = 1 - \sum^n_{j=1} a_j t_i t_j k(x_i, x_j) - \lambda \sum^n_{j=1} a_j t_i t_j
Solving this maximization problem by gradient ascent with learning rate $\eta$, the update rule becomes
\begin{align}
\hat{a_i} &= a_i + \eta \frac{\partial L}{\partial a_i} \\
&= a_i + \eta \left( 1 - \sum^n_{j=1} a_j t_i t_j k(x_i, x_j) - \lambda \sum^n_{j=1} a_j t_i t_j \right)
\end{align}
Since this is a convex quadratic optimization problem, convergence is guaranteed. The trained support vector machine stores the support vectors and, at prediction time, classifies new data based on them.
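As a concrete illustration, the update rule above can be written in a few lines of NumPy. This is a minimal sketch assuming a linear kernel; the function name, the number of iterations, and the clipping step that keeps $a_i \geq 0$ (a projection onto the constraint) are additions for the example, not part of the derivation.

```python
import numpy as np

def train_dual_svm(X, t, eta=0.001, lam=1.0, n_iter=1000):
    """Gradient ascent on the penalized dual objective L(a).

    X: (n, d) training data, t: (n,) labels in {-1, +1}.
    eta is the learning rate, lam the penalty coefficient lambda.
    """
    n = X.shape[0]
    K = X @ X.T                  # linear kernel Gram matrix k(x_i, x_j)
    a = np.zeros(n)
    for _ in range(n_iter):
        # dL/da_i = 1 - sum_j a_j t_i t_j k(x_i, x_j) - lam * t_i * sum_j a_j t_j
        grad = 1.0 - t * (K @ (a * t)) - lam * t * np.sum(a * t)
        a = a + eta * grad       # gradient ascent step
        a = np.maximum(a, 0.0)   # keep the dual constraint a_i >= 0
    return a
```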
Since the parameter $w$ is obtained from $a$, all that remains is to derive the bias $b$. Given new data $x$, we classify with the following formula, where the sum runs over the $m$ support vectors:
y(x) = \sum^m_{j=1} a_j t_j k(x, x_j) + b
Since the support vectors satisfy $t_i y(x_i) = 1$,
t_i \left( \sum^m_{j=1} a_j t_j k(x_i, x_j) + b \right) = 1
holds. Here, using $t^2_i = 1$,
\begin{align}
t_i \left( \sum^m_{j=1} a_j t_j k(x_i, x_j) + b \right) &= t^2_i \\
\sum^m_{j=1} a_j t_j k(x_i, x_j) + b &= t_i \\
\end{align}
Therefore, averaging over the $n$ support vectors for numerical stability, the bias $b$ is
b = \frac{1}{n} \sum^n_{i=1} \left( t_i - \sum^m_{j=1} a_j t_j k(x_i, x_j) \right)
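Continuing the sketch above, the bias and the prediction formula can be written as follows. The small threshold used to pick out the support vectors and the averaging over the support vectors only are assumptions of this illustration.

```python
import numpy as np

def compute_bias(a, X, t):
    """Bias b averaged over the support vectors (those with a_i != 0)."""
    K = X @ X.T                              # linear kernel Gram matrix
    sv = a > 1e-6                            # support vector mask (threshold is an assumption)
    y_no_bias = K[:, sv] @ (a[sv] * t[sv])   # sum_j a_j t_j k(x_i, x_j) for every i
    return np.mean(t[sv] - y_no_bias[sv])

def predict(x_new, a, X, t, b):
    """Classify new points with y(x) = sum_j a_j t_j k(x, x_j) + b."""
    sv = a > 1e-6
    k = x_new @ X[sv].T                      # linear kernel between x_new and the support vectors
    return np.sign(k @ (a[sv] * t[sv]) + b)
```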
For datasets with 100,000 or more samples, it is recommended to set the loss argument of SGDClassifier to 'hinge', i.e. to train a linear support vector machine by stochastic gradient descent.
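For example, a linear SVM trained by stochastic gradient descent can be set up as follows; the use of the breast cancer data and the standardization pipeline are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# loss='hinge' makes SGDClassifier a linear SVM trained by
# stochastic gradient descent, which scales to very large datasets.
X, y = load_breast_cancer(return_X_y=True)
sgd_svm = make_pipeline(StandardScaler(), SGDClassifier(loss='hinge'))
sgd_svm.fit(X, y)
```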
The support vector machine described so far is a hard-margin SVM; a soft-margin SVM that allows a small number of misclassifications has also been proposed. In the soft-margin formulation, we introduce slack variables $\xi_i$ and a hyperparameter $C$ and solve the following problem (a minimal scikit-learn sketch follows the constraints below).
\argmin_{w, b, \xi} \frac{1}{2} ||w||^2 + C \sum^n_{i=1} \xi_i \\
\xi_i \geq 0 \\
t_i (w^T x_i + b) \geq 1 - \xi_i \\
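In scikit-learn, the soft-margin trade-off appears as the C parameter of SVC; a minimal sketch, with the value of C chosen arbitrarily:

```python
from sklearn.svm import SVC

# Smaller C allows more margin violations (wider margin); larger C
# penalizes violations more strongly (narrower margin).
soft_margin_svm = SVC(kernel='linear', C=1.0)
```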
Up to the previous section we considered the linearly separable case, but real-world problems are rarely linearly separable. To perform nonlinear classification with a support vector machine, we can solve exactly the same optimization problem as in the linear case, simply replacing the kernel function $K(x_i, x_j)$. Computing inner products in the feature space through the kernel function, without ever carrying out the nonlinear transformation explicitly, is called the kernel trick.
A kernel function used in a support vector machine must satisfy the following condition for any choice of points and coefficients $a_i$, i.e. the Gram matrix it produces must be positive semi-definite:
\sum_{i,j} a_i a_j K(x_i, x_j) \geq 0
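As a quick numerical check of this condition, the Gram matrix produced by a kernel can be tested for positive semi-definiteness with NumPy; the helper name and tolerance below are assumptions for the example.

```python
import numpy as np

def gram_is_psd(K, tol=1e-10):
    """Return True if the symmetric Gram matrix K is positive semi-definite."""
    eigvals = np.linalg.eigvalsh(K)        # eigenvalues of a symmetric matrix
    return bool(np.all(eigvals >= -tol))   # allow tiny negative values from round-off
```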
The typical kernel functions are listed below.
The linear kernel corresponds to the linear support vector machine described in the previous sections.
k(x_i, x_j) = x_i \cdot x_j
The polynomial kernel maps the data to a higher-dimensional space of degree $p$. The hyperparameters $p$, $\gamma$, and $c$ are added.
k(x_i, x_j) = (\gamma x_i \cdot x_j + c)^p
The radial basis function (RBF) kernel theoretically maps the data to an infinite-dimensional space and is the most commonly used nonlinear kernel. Its hyperparameter $\gamma$ needs to be tuned.
k(x_i, x_j) = \exp \left( -\gamma || x_i - x_j ||^2 \right)
The sigmoid kernel is positive semi-definite rather than positive definite, but it has similarities to neural networks.
k(x_i, x_j) = \tanh (\gamma x_i \cdot x_j + c)
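In scikit-learn, the four kernels above correspond to the kernel argument of SVC, where degree, gamma, and coef0 play the roles of $p$, $\gamma$, and $c$ in the formulas. The parameter values below are illustrative, not tuned.

```python
from sklearn.svm import SVC

svm_linear  = SVC(kernel='linear')
svm_poly    = SVC(kernel='poly', degree=3, gamma='scale', coef0=1.0)
svm_rbf     = SVC(kernel='rbf', gamma='scale')
svm_sigmoid = SVC(kernel='sigmoid', gamma='scale', coef0=0.0)
```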
The execution environment was as follows:
- CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
- OS: Windows 10 Pro 1909
- Python 3.6.6
- Matplotlib 3.1.1
- Numpy 1.19.2
- Scikit-learn 0.23.2
The implemented programs are published on GitHub.
svm_classification.py
svm_regression.py
svm_anomaly.py
For binary classification, I used the breast cancer dataset, which was also used in the Logistic Regression article. The evaluation results are shown below, followed by a minimal evaluation sketch.
Accuracy: 94.74%
Precision (positive predictive value, PPV): 96.92%
Recall (sensitivity, true positive rate, TPR): 94.03%
Specificity (true negative rate, TNR): 95.74%
Negative predictive value (NPV): 91.84%
F-score: 95.45%
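The metrics above can be reproduced along the following lines; the train/test split ratio, random_state, and kernel settings are assumptions, so the exact numbers will differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = SVC(kernel='rbf', gamma='scale').fit(X_train, y_train)
y_pred = clf.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('Accuracy   ', accuracy_score(y_test, y_pred))
print('Precision  ', tp / (tp + fp))
print('Recall     ', tp / (tp + fn))
print('Specificity', tn / (tn + fp))
print('NPV        ', tn / (tn + fn))
print('F-Score    ', f1_score(y_test, y_pred))
```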
The smaller the hyperparameter $C$, the wider the margin around the decision boundary; the larger $C$, the narrower the margin.
The decision boundary is a straight line with the linear kernel, roughly circular with the polynomial kernel, curved with the sigmoid kernel, and more complex with the RBF kernel.
For the RBF kernel, which is most often used for nonlinear classification, smaller values of the hyperparameter $\gamma$ give looser decision boundaries and larger values give tighter ones. Since $\gamma$ can be regarded as the reciprocal of the variance, this is a convincing result.
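In practice, $C$ and $\gamma$ are usually tuned together, for example with a grid search; the grid values below are illustrative assumptions rather than the settings used in the experiments above.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
# search.fit(X_train, y_train); search.best_params_ then gives the chosen C and gamma.
```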
For multiclass classification, I used the iris dataset.
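A minimal sketch of the multiclass case; SVC handles it internally with a one-vs-one scheme, and the kernel choice here is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
multi_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)  # one-vs-one under the hood
```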
For the regression problem, the data were generated by adding random noise to a sine wave, and the experiment was run with $k = 5$.
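A minimal sketch of the regression setup, assuming an RBF-kernel SVR; the noise scale and SVR hyperparameters are assumptions, not the exact settings used for the figure.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 100))[:, None]     # inputs over one period of the sine wave
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)  # sine wave plus random noise

svr = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X, y)
y_pred = svr.predict(X)
```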
Anomaly detection uses a one-class support vector machine. In the figure below, regions other than the blue region are treated as outliers.
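A minimal sketch of one-class SVM anomaly detection; the generated data and the nu and gamma values are assumptions for the example.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # "normal" training data

oc_svm = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_normal)
labels = oc_svm.predict(X_normal)   # +1 = inlier, -1 = outlier (anomaly)
```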
Reference: Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.