I implemented an SVM (Support Vector Machine) in Python. I used ["First Pattern Recognition"](http://www.amazon.co.jp/dp/4627849710) (Yuzo Hirai) as a textbook.
SVM
SVM is a two-class linear classification method that learns the discriminant function with the maximum margin. According to the textbook, classification into $K$ classes is often realized by combining $K-1$ one-versus-rest SVMs.
Consider the following dataset, where $N$ is the total number of data points and $i = 1, ..., N$.
\begin{align}
& D_L = \bigl\{(t_i, {\boldsymbol{x}}_i)\bigr\} \\
& t_i \in \{-1, +1\}
\end{align}
Assuming that the margin of the linear discriminant function is $\kappa$, the following inequality holds for all the training data.
|\boldsymbol{w}^T{\boldsymbol{x}}_i + b| \geq \kappa
Normalizing by $\kappa$ and absorbing the scale into $\boldsymbol{w}$ and $b$, and then multiplying by the teacher label $t_i$ to remove the absolute value, gives
t_i(\boldsymbol{w}^T{\boldsymbol{x}}_i + b) \geq 1 \tag{1}
As shown in the figure, the interclass margin can be calculated as the difference of the lengths of the projections onto the $\boldsymbol{w}$ direction of the data points closest to the separating hyperplane.
In the class with $t_i = +1$, the data point closest to the separating hyperplane has the smallest projection length onto $\boldsymbol{w}$ within that class; in the class with $t_i = -1$, the closest point has the largest projection length within that class. For these closest points the equality in (1) holds, that is, $\boldsymbol{w}^T\boldsymbol{x} + b = \pm 1$. The interclass margin $\rho(\boldsymbol{w}, b)$ is therefore expressed by the following equation.
\begin{align}
\rho(\boldsymbol{w}, b) &= \min_{\boldsymbol{x} \in C_{t=+1}}\frac{{\boldsymbol{w}}^T\boldsymbol{x}}{||\boldsymbol{w}||} - \max_{\boldsymbol{x} \in C_{t=-1}}\frac{{\boldsymbol{w}}^T\boldsymbol{x}}{||\boldsymbol{w}||} \\
\\
&= \frac{1-b}{||\boldsymbol{w}||} - \frac{-1-b}{||\boldsymbol{w}||} \\
\\
&= \frac{2}{||\boldsymbol{w}||} \tag{2}
\end{align}
From equation (2), we consider maximizing the interclass margin under the constraints of equation (1). Since maximizing $2/||\boldsymbol{w}||$ is equivalent to minimizing $\frac{1}{2}{\boldsymbol{w}}^T\boldsymbol{w}$, this gives the following primal problem.
Evaluation function (minimization)
L_p(\boldsymbol{w}) = \frac{1}{2}{\boldsymbol{w}}^{T}\boldsymbol{w}
Inequality constraints
t_i(\boldsymbol{w}^T{\boldsymbol{x}}_i + b) - 1 \geq 0
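To make the primal problem concrete, here is a minimal sketch (not part of the script later in this article; the data, $\boldsymbol{w}$, and $b$ are chosen by hand for illustration) that checks the inequality constraint (1) and computes the margin (2) for a candidate hyperplane.

```python
import numpy

# Toy data: two points per class, teacher labels in {-1, +1}.
X = numpy.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
t = numpy.array([1, 1, -1, -1])

# Candidate separating hyperplane w^T x + b = 0.
w = numpy.array([0.5, 0.5])
b = 0.0

# Constraint (1): t_i (w^T x_i + b) >= 1 for all training data.
feasible = numpy.all(t * (X.dot(w) + b) >= 1)

# Interclass margin (2): 2 / ||w||.
margin = 2.0 / numpy.linalg.norm(w)

print(feasible, margin)  # -> True 2.828... for this candidate
```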
We use the method of Lagrange multipliers to solve the primal problem.
The method of Lagrange multipliers is a mathematical (analytical) method for optimization under constraints. Consider the problem of finding the extrema of one function under the constraint that the values of certain other functions are fixed. By introducing a constant for each constraint (an undetermined multiplier, or Lagrange multiplier) and treating the linear combination with these constants as coefficients as a new function (the multipliers being new variables as well), the constrained problem can be solved as an ordinary extremum problem. Quoted from [Wikipedia](https://ja.wikipedia.org/wiki/ラグランジュの未定乗数法)
Since this alone is hard to grasp, I will explain it with a concrete example. Consider maximizing $f(x, y) = y - x$ under the condition $g(x, y) = x^2 + y^2 - 2 = 0$. In other words, among the points $(x, y)$ on the circle of radius $\sqrt{2}$, we look for the one that maximizes the intercept of the line $y = x + f(x, y)$.
From the figure, we can see that $f(x, y)$ takes its maximum value of $2$ under the constraint when the circle and the line are tangent. Let's confirm this using the method of Lagrange multipliers.
\begin{align}
L(x, y, \lambda) &= f(x, y) + \lambda g(x, y) \\
\\
&= y - x + \lambda(x^2 + y^2 - 2) \tag{3}
\end{align}
Partially differentiate the Lagrange function of Eq. (3) with respect to $x$, $y$, and $\lambda$, and set each derivative equal to $0$.
\begin{align}
& \frac{\partial L}{\partial x} = -1 + 2\lambda x = 0 \\
& \frac{\partial L}{\partial y} = 1 + 2\lambda y = 0 \\
& \frac{\partial L}{\partial \lambda} = x^2 + y^2 - 2 = 0
\end{align}
Substituting the first and second equations (which give $x = \frac{1}{2\lambda}$ and $y = -\frac{1}{2\lambda}$) into the third:
\begin{align}
& \Bigl(\frac{1}{2\lambda}\Bigr)^2 + \Bigl(-\frac{1}{2\lambda}\Bigr)^2 - 2 = 0 \\
\\
& \lambda = \pm \frac{1}{2}, x = \pm 1, y = \mp 1
\end{align}
Therefore, under the constraint $g(x, y) = 0$, $f(x, y) = y - x$ takes the maximum value of $2$. Let's view this maximum in three dimensions. In 3D space, $z = f(x, y) = y - x$ is an inclined plane and $g(x, y) = x^2 + y^2 - 2 = 0$ is a cylinder. Since we want the maximum of $z = y - x$ under the constraint $g(x, y) = 0$, we only need the maximum $z$ coordinate on the curve where the plane and the cylinder intersect. In other words, the value we want is the maximum in the $z$ direction when the circle $x^2 + y^2 - 2 = 0$ is projected onto the plane $z = y - x$. This is shown in the figure. Viewed from the side, the maximum in the $z$ direction along the intersection of the plane and the cylinder is $2$, which agrees with the calculation.
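As a quick check of this example, the stationarity conditions can also be solved symbolically. Below is a minimal sketch using sympy (not used in the rest of this article):

```python
import sympy

x, y, lam = sympy.symbols('x y lam', real=True)

# Lagrange function of equation (3): L = f + lam * g.
f = y - x
g = x**2 + y**2 - 2
L = f + lam * g

# Set the partial derivatives with respect to x, y, lam to zero and solve.
solutions = sympy.solve([sympy.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)

for sol in solutions:
    print(sol, 'f =', f.subs(sol))
# The two stationary points give f = -2 and f = 2; the maximum is 2.
```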
Let me explain why the method of Lagrange multipliers yields the extremum. (This may contain mistakes.) Under the constraint $g(\boldsymbol{x}) = 0$ (hereafter, the constraint surface), when the evaluation function $f(\boldsymbol{x})$ takes its maximum at $\boldsymbol{x}^\star$, $\boldsymbol{\nabla} f(\boldsymbol{x})$ must be perpendicular to the constraint surface. If it were not, $f(\boldsymbol{x})$ could be made even larger by moving along the constraint surface. Also, $\boldsymbol{\nabla} g(\boldsymbol{x})$ is perpendicular to the constraint surface. Therefore, at $\boldsymbol{x}^\star$, $\boldsymbol{\nabla} f$ and $\boldsymbol{\nabla} g$ are parallel vectors, and the following equation holds.
\boldsymbol{\nabla}f + \lambda\boldsymbol{\nabla}g = 0 \tag{4}
Here, the Lagrange function is defined as follows.
L(\boldsymbol{x}, \lambda) \equiv f(\boldsymbol{x}) + \lambda g(\boldsymbol{x})
Eq. (4) and the constraint are recovered by partially differentiating the Lagrange function $L(\boldsymbol{x}, \lambda)$ with respect to $\boldsymbol{x}$ and $\lambda$ and setting the derivatives to $0$. Solving these equations gives the stationary points $\boldsymbol{x}^\star$, among which lies the $\boldsymbol{x}$ that maximizes $f(\boldsymbol{x})$.
Now, let's apply the method of Lagrange multipliers to the primal problem. The Lagrange function is defined as follows.
\tilde{L}_{p}(\boldsymbol{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}{\boldsymbol{w}}^T\boldsymbol{w} - \sum_{i=1}^{N}\alpha_i(t_i(\boldsymbol{w}^T{\boldsymbol{x}}_i + b) - 1) \tag{5}
Here, $\boldsymbol{\alpha} = (\alpha_1, ..., \alpha_N)^T$ with $\alpha_i \geq 0$, and the $\alpha_i$ are the Lagrange multipliers. The solution $\boldsymbol{w}_0, b_0$ of this optimization problem is obtained as a solution that satisfies the following KKT conditions.
\begin{align}
& \left.\frac{\partial \tilde{L}_p(\boldsymbol{w}, b, \boldsymbol{\alpha})}{\partial \boldsymbol{w}}\right|_{\boldsymbol{w} = \boldsymbol{w}_0} = \boldsymbol{w}_0 - \sum_{i=1}^{N}\alpha_{i}t_i{\boldsymbol{x}}_i = 0 \tag{6} \\
& \frac{\partial \tilde{L}_p(\boldsymbol{w}, b, \boldsymbol{\alpha})}{\partial b} = \sum_{i=1}^{N}\alpha_{i}t_i = 0 \tag{7} \\
& \\
& t_i(\boldsymbol{w}^T{\boldsymbol{x}}_i + b) - 1 \geq 0 \tag{8} \\
& \\
& \alpha_i \geq 0 \tag{9} \\
& \\
& \alpha_i(t_i(\boldsymbol{w}^T{\boldsymbol{x}}_i + b) - 1) = 0 \tag{10}
\end{align}
Equations (6) and (7) set the partial derivatives of the Lagrange function to $0$, equation (8) is the constraint of the primal problem, equation (9) is the non-negativity of the multipliers, and equation (10) is the complementarity condition. From the complementarity condition (10),
t_i(\boldsymbol{w}^T\boldsymbol{x}_i + b) - 1 > 0 \Rightarrow \alpha_i = 0 \\
t_i(\boldsymbol{w}^T\boldsymbol{x}_i + b) - 1 = 0 \Rightarrow \alpha_i \geq 0
The complementarity condition shows that the data other than the support vectors do not affect the solution, and that the separating hyperplane is determined only by the support vectors. (Looking into it, the KKT conditions seem to be quite deep, so I will move on with this loose understanding.)
Expand Eq. (5) and substitute the optimal solution together with Eqs. (6) and (7).
\begin{align}
L_{d}(\boldsymbol{\alpha}) &= \frac{1}{2}{\boldsymbol{w}_{0}}^{T}\boldsymbol{w}_{0} - \sum_{i=1}^{N}\alpha_{i}t_{i}{\boldsymbol{w}_{0}}^{T}{\boldsymbol{x}}_{i} - b\sum_{i=1}^{N}\alpha_{i}t_{i} + \sum_{i=1}^{N}\alpha_{i} \\
&= \frac{1}{2}{\boldsymbol{w}_{0}}^{T}\boldsymbol{w}_{0} - {\boldsymbol{w}_{0}}^{T}\boldsymbol{w}_{0} - 0 + \sum_{i=1}^{N}\alpha_{i} \\
&= \sum_{i=1}^{N}\alpha_{i} - \frac{1}{2}{\boldsymbol{w}_{0}}^{T}\boldsymbol{w}_{0} \\
&= \sum_{i=1}^{N}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i}\alpha_{j}t_{i}t_{j}{\boldsymbol{x}_i}^T{\boldsymbol{x}}_j
\end{align}
From Eq. (6), the optimal solution is a linear combination of the Lagrange multipliers and the training data, so the problem can be replaced with that of finding the $\alpha_i$. The optimal Lagrange multipliers are given by the $\boldsymbol{\alpha}$ that maximizes $L_{d}(\boldsymbol{\alpha})$. This is called the dual problem of the primal problem. Let $\boldsymbol{1}$ be the vector of $N$ ones, $\boldsymbol{H}$ the matrix built from the training data, and $\boldsymbol{t}$ the vector of teacher labels.
\begin{align}
& \boldsymbol{1} = (1,...,1)^T \\
\\
& \boldsymbol{H} = (H_{ij}), \quad H_{ij} = t_{i}t_{j}{\boldsymbol{x}_i}^T\boldsymbol{x}_j \\
\\
& \boldsymbol{t} = (t_1,...,t_N)^T
\end{align}
Evaluation function (maximization)
L_{d}(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T\boldsymbol{1} - \frac{1}{2}\boldsymbol{\alpha}^T\boldsymbol{H}\boldsymbol{\alpha}
Equality constraint
\boldsymbol{\alpha}^T\boldsymbol{t} = 0
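As a bridge to the implementation, here is a minimal numpy sketch (the function name and argument names are illustrative) of how $\boldsymbol{H}$ and $\boldsymbol{1}$ can be built from a data matrix `X` and a label vector `T`, and how the dual objective is evaluated for a given $\boldsymbol{\alpha}$:

```python
import numpy

def dual_objective(alpha, X, T):
    # H_ij = t_i t_j x_i^T x_j, built from the data matrix X and the label vector T.
    G = T[:, None] * X      # row i is t_i x_i
    H = G.dot(G.T)
    ones = numpy.ones(len(T))
    # L_d(alpha) = alpha^T 1 - (1/2) alpha^T H alpha,
    # to be maximized subject to alpha^T t = 0 and alpha_i >= 0.
    return alpha.dot(ones) - 0.5 * alpha.dot(H).dot(alpha)
```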
Given the optimal Lagrange multipliers $\tilde{\boldsymbol{\alpha}} = (\tilde{\alpha}_1, ..., \tilde{\alpha}_N)^T$ obtained by solving the dual problem, the optimal solution of the primal problem is found as follows.
\begin{align}
& \boldsymbol{w}_0 = \sum_{i=1}^{N}\tilde{\alpha}_{i}t_i{\boldsymbol{x}}_i \\
& b_0 = \frac{1}{t_s} - {\boldsymbol{w}_0}^T\boldsymbol{x}_s
\end{align}
Here $\boldsymbol{x}_s$ and $t_s$ denote a support vector and its teacher label.
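In code, recovering $\boldsymbol{w}_0$ and $b_0$ from the optimal multipliers looks like the following minimal sketch (illustrative names; $b_0$ is averaged over all support vectors, which is also what the script later in this article does):

```python
import numpy

def recover_hyperplane(alpha, X, T, tol=1e-6):
    # w_0 = sum_i alpha_i t_i x_i  (from equation (6))
    w = (alpha * T).dot(X)
    # By the complementarity condition, support vectors are the points with alpha_i > 0.
    sv = alpha > tol
    # b_0 = 1/t_s - w_0^T x_s = t_s - w_0^T x_s, averaged over the support vectors.
    b = (T[sv] - X[sv].dot(w)).mean()
    return w, b
```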
We have thus shown that margin maximization can be achieved by replacing the primal problem with the dual problem and finding the optimal $\tilde{\boldsymbol{\alpha}}$. To find the $\boldsymbol{\alpha}$ that maximizes $L_{d}(\boldsymbol{\alpha})$ programmatically, we use a gradient (steepest ascent) method. A penalty term is added so that the equality constraint of the dual problem is satisfied, and $L_{d}(\boldsymbol{\alpha})$ is redefined as follows.
L_{d}(\boldsymbol{\alpha}) = \boldsymbol{\alpha}^T\boldsymbol{1} - \frac{1}{2}\boldsymbol{\alpha}^T\boldsymbol{H}\boldsymbol{\alpha} - \frac{1}{2}\beta\boldsymbol{\alpha}^T\boldsymbol{t}\boldsymbol{t}^T\boldsymbol{\alpha}
Now, let's actually update $\alpha_i$ with this gradient method.
\begin{align}
{\alpha_i}' &= {\alpha_i} + \eta\frac{\partial L_{d}(\boldsymbol{\alpha})}{\partial \alpha_i} \\
&= {\alpha_i} + \eta(1 - \sum_{j=1}^{N}\alpha_{j}t_{i}t_{j}{\boldsymbol{x}_i}^T{\boldsymbol{x}}_j - \beta\sum_{j=1}^{N}\alpha_{j}t_{i}t_j)
\end{align}
The calculations so far were somewhat involved, but the final update formula is simple. Iterating it gives the $\boldsymbol{\alpha}$ that maximizes $L_{d}(\boldsymbol{\alpha})$.
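For reference, the same update can also be written in vectorized form. A minimal sketch (illustrative only; the full script below updates each $\alpha_i$ in an explicit loop instead):

```python
import numpy

def update_alpha(alpha, X, T, beta, eta):
    # Gradient of the penalized dual objective:
    #   dL_d/dalpha = 1 - H alpha - beta * t (t^T alpha)
    G = T[:, None] * X
    H = G.dot(G.T)
    grad = 1.0 - H.dot(alpha) - beta * T * T.dot(alpha)
    # Ascent step, since the dual objective is being maximized.
    return alpha + eta * grad
```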
I implemented it as follows. The update of $\alpha_i$ is straightforward, and thanks to the penalty term the updates drive $\boldsymbol{\alpha}$ toward satisfying the constraint $\boldsymbol{\alpha}^T\boldsymbol{t} = 0$.
svm.py

```python
import numpy
from matplotlib import pyplot
import sys


# True decision boundary used to generate the teacher labels.
def f(x, y):
    return x - y


if __name__ == '__main__':
    param = sys.argv

    numpy.random.seed()
    N = 30  # number of training data
    d = 2   # dimension of the data
    X = numpy.random.randn(N, d)
    T = numpy.array([1 if f(x, y) > 0 else -1 for x, y in X])

    alpha = numpy.zeros(N)  # Lagrange multipliers
    beta = 1.0              # penalty coefficient
    eta_al = 0.0001  # update ratio of alpha
    eta_be = 0.1     # update ratio of beta
    itr = 1000

    for _itr in range(itr):
        # Gradient ascent on the penalized dual objective L_d(alpha).
        for i in range(N):
            delta = 1 - (T[i] * X[i]).dot(alpha * T * X.T).sum() - beta * T[i] * alpha.dot(T)
            alpha[i] += eta_al * delta
        # Increase the penalty coefficient according to the constraint violation (alpha^T t)^2.
        for i in range(N):
            beta += eta_be * alpha.dot(T) ** 2 / 2

    # Support vectors are the points with alpha_i > 0.
    index = alpha > 0
    # w_0 = sum_i alpha_i t_i x_i, and b_0 averaged over the support vectors.
    w = (alpha * T).T.dot(X)
    b = (T[index] - X[index].dot(w)).mean()

    if '-d' in param or '-s' in param:
        seq = numpy.arange(-3, 3, 0.02)
        pyplot.figure(figsize=(6, 6))
        pyplot.xlim(-3, 3)
        pyplot.ylim(-3, 3)
        # Learned separating hyperplane w^T x + b = 0.
        pyplot.plot(seq, -(w[0] * seq + b) / w[1], 'k-')
        pyplot.plot(X[T == 1, 0], X[T == 1, 1], 'ro')
        pyplot.plot(X[T == -1, 0], X[T == -1, 1], 'bo')
        if '-s' in param:
            pyplot.savefig('graph.png')
        if '-d' in param:
            pyplot.show()
```
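As a usage note: the script only plots when a command-line option is given; for example, `python svm.py -d` displays the figure, and `-s` saves it to `graph.png`.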
The results shown in the figure below were obtained. The color of the data points represents the class, and the straight line drawn with them is the separating hyperplane obtained by the calculation.
In summary, the primal problem for margin maximization is posed under inequality constraints; by using the method of Lagrange multipliers it is replaced with the dual problem, from which the optimal $\boldsymbol{w}_0$ and $b_0$ are obtained.