In a previous article, "SVM implementation with python", I summarized the hard margin SVM. Last time we optimized by adding the constraint conditions to the objective as a penalty term; this time we optimize using a method called the active set method. I used ["Support Vector Machine"](https://www.amazon.co.jp/Support Vector Machine-Machine Learning Professional Series-Ichiro Takeuchi / dp / 4061529064) as a textbook.
Structure of this article
Please refer to "Implementing SVM with python" for an introduction to SVM; it explains margin maximization, the method of Lagrange multipliers, and the KKT conditions.
In the soft margin SVM, non-negative variables $\xi_i \geq 0, \ i \in [n]$ are introduced, and the constraints are defined as follows.
y_i(\boldsymbol w^T \boldsymbol x_i + b) \geq 1 - \xi_i, \ \ i \in [n]
$\xi_{i}$ allows the margin to be smaller than $1$, so that $\boldsymbol x_i$ may cross the margin and enter the other class's side. When a misclassification occurs, $\xi_{i} > 1$, so if $\sum_{i \in [n]} \xi_{i}$ is at most an integer $K$, the number of misclassifications is also at most $K$. From this, it can be seen that misclassification can be suppressed by making $\sum_{i \in [n]} \xi_{i}$ as small as possible. Based on the above, the primal problem of the soft margin SVM is defined as follows.
\begin{align}
& \min_{\boldsymbol w, b, \boldsymbol \xi} \frac{1}{2}\|\boldsymbol w\|^2 + C\sum_{i \in [n]} \xi_{i} \\
& \\
& s.t. \ \ y_i(\boldsymbol w^T \boldsymbol x_i + b) \geq 1 - \xi_i, \ \ i \in [n] \\
& \\
& \ \ \ \ \ \ \ \ \xi_{i} \geq 0, \ \ i \in [n]
\end{align}
The coefficient $C$ is a positive constant called the regularization coefficient. The first term of the objective function corresponds to maximizing the margin. The second term keeps $\xi_{i}$, which measures the degree of violation of the original constraint $y_i(\boldsymbol w^T \boldsymbol x_i + b) \geq 1$, as small as possible. The regularization coefficient $C$ is a parameter that adjusts the strength of this suppression. Increasing $C$ brings the problem closer to the hard margin; at $C = \infty$, any nonzero element of $\boldsymbol \xi$ adds an infinite value to the objective function, so $\boldsymbol \xi$ is always $\boldsymbol 0$. Conversely, making $C$ smaller tolerates more misclassification.
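As a side note, at the optimum the slack variables satisfy $\xi_i = \max(0, 1 - y_i(\boldsymbol w^T \boldsymbol x_i + b))$, so the primal objective can be evaluated directly from $\boldsymbol w$ and $b$. Below is a minimal numpy sketch of this evaluation; the function and variable names are my own, not from the textbook.

```python
import numpy

def primal_objective(w, b, X, y, C):
    # slack: xi_i = max(0, 1 - y_i (w^T x_i + b))
    xi = numpy.maximum(0.0, 1.0 - y * (X.dot(w) + b))
    # first term maximizes the margin, second term penalizes constraint violations
    return 0.5 * numpy.dot(w, w) + C * xi.sum()
```

As $C$ grows, any positive $\xi_i$ dominates this value, which is why the solution approaches the hard margin.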
Now, let us consider the dual representation of the primal problem. We introduce Lagrange multipliers $\alpha_i \geq 0, \ i \in [n]$ and $\mu_i \geq 0, \ i \in [n]$. The Lagrange function is defined as follows.
L(\boldsymbol w, b, \boldsymbol \xi, \boldsymbol \alpha, \boldsymbol \mu)
= \frac{1}{2}\|\boldsymbol w\|^2 + C\sum_{i \in [n]} \xi_{i} - \sum_{i \in [n]} \alpha_i(y_i(\boldsymbol w^T \boldsymbol x_i + b) - 1 + \xi_i) - \sum_{i \in [n]} \mu_i \xi_i \tag{1}
Setting the partial derivatives of $L$ with respect to $\boldsymbol w, b, \boldsymbol \xi$ to $0$ gives the following.
\begin{align}
& \frac{\partial L}{\partial \boldsymbol w} = \boldsymbol w - \sum_{i \in [n]}\alpha_{i}y_i{\boldsymbol{x}}_i = \boldsymbol 0 \tag{2} \\
& \\
& \frac{\partial L}{\partial b} = \sum_{i \in [n]}\alpha_{i}y_i = 0 \tag{3} \\
& \\
& \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \tag{4} \\
\end{align}
Substituting Eqs. (2), (3), and (4) into Eq. (1), the result is as follows.
L = -\frac{1}{2} \sum_{i, \ j \in [n]}\alpha_{i}\alpha_{j}y_{i}y_{j}{\boldsymbol{x}_i}^T\boldsymbol x_j + \sum_{i \in [n]}\alpha_i \tag{5}
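To spell out the intermediate step (my own supplement): by equation (4), the coefficient of each $\xi_i$ in equation (1) is $C - \alpha_i - \mu_i = 0$, and the term $b \sum_{i \in [n]} \alpha_i y_i$ vanishes by equation (3). Substituting equation (2) into the remaining terms,
\begin{align}
L &= \frac{1}{2}\|\boldsymbol w\|^2 - \sum_{i \in [n]} \alpha_i y_i \boldsymbol w^T \boldsymbol x_i + \sum_{i \in [n]} \alpha_i \\
& \\
&= \frac{1}{2} \sum_{i, \ j \in [n]}\alpha_{i}\alpha_{j}y_{i}y_{j}{\boldsymbol{x}_i}^T\boldsymbol x_j - \sum_{i, \ j \in [n]}\alpha_{i}\alpha_{j}y_{i}y_{j}{\boldsymbol{x}_i}^T\boldsymbol x_j + \sum_{i \in [n]}\alpha_i
\end{align}
which is exactly equation (5).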
Combining equation (5) with the constraints from equations (3) and (4) (equation (4), together with $\mu_i \geq 0$, gives $0 \leq \alpha_i \leq C$), the dual problem can be expressed as follows.
\begin{align}
& \max_{\boldsymbol \alpha} -\frac{1}{2} \sum_{i, \ j \in [n]}\alpha_{i}\alpha_{j}y_{i}y_{j}{\boldsymbol{x}_i}^T\boldsymbol x_j + \sum_{i \in [n]}\alpha_i \\
& \\
& s.t. \ \ \sum_{i \in [n]}\alpha_{i}y_i = 0 \\
& \\
& \ \ \ \ \ \ \ \ \ 0 \leq \alpha_i \leq C, \ \ i \in [n]
\end{align}
Finally, if we define $\boldsymbol Q \in \mathbb{R}^{n \times n}$ whose $(i, j)$ element is $y_{i} y_{j} {\boldsymbol x_i}^T \boldsymbol x_j$, the dual problem can be expressed in vector form.
\begin{align}
& \max_{\boldsymbol \alpha} -\frac{1}{2} \boldsymbol \alpha^T\boldsymbol{Q\alpha} + \boldsymbol 1^T \boldsymbol \alpha \tag{6} \\
& \\
& s.t. \ \ \boldsymbol y^T \boldsymbol \alpha = 0 \tag{7} \\
& \\
& \ \ \ \ \ \ \ \ \boldsymbol 0 \leq \boldsymbol \alpha \leq C \boldsymbol 1 \tag{8}
\end{align}
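For reference, with the linear kernel $\boldsymbol Q$ can be built in a couple of lines of numpy. This is a minimal sketch; the function and variable names are my own.

```python
import numpy

def gram_matrix(X, y):
    # Q_ij = y_i y_j x_i^T x_j (linear kernel)
    Yx = y.reshape(-1, 1) * X
    return Yx.dot(Yx.T)
```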
The objective function is maximized under constraints (7) and (8). This time, we use the active set method, a standard method for constrained optimization problems. For the constraints of the soft margin SVM, the set of indices $i$ with $\alpha_i = 0$ or $\alpha_i = C$ is called the active set. Depending on the state of the inequality constraints on the dual variables, we define the following three sets.
\begin{align}
& O = \bigl\{ \ i \in [n] \ \ | \ \ \alpha_i = 0 \ \bigr\} \\
& M = \bigl\{ \ i \in [n] \ \ | \ \ 0 < \alpha_i < C \ \bigr\} \\
& I = \bigl\{ \ i \in [n] \ \ | \ \ \alpha_i = C \ \bigr\}
\end{align}
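In code, this partition can be read directly from the current $\boldsymbol \alpha$. A minimal sketch follows; the helper name and the small tolerance used for the boundary comparisons are my own additions.

```python
import numpy

def partition(alpha, C, tol=1e-8):
    O = numpy.where(alpha <= tol)[0]                       # alpha_i = 0
    M = numpy.where((alpha > tol) & (alpha < C - tol))[0]  # 0 < alpha_i < C
    I = numpy.where(alpha >= C - tol)[0]                   # alpha_i = C
    return O, M, I
```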
If we knew which set each sample belongs to, then $\alpha_i = 0$ for $i \in O$ and $\alpha_i = C$ for $i \in I$, so we would only have to solve the dual problem for $i \in M$. The dual problem restricted to the set $M$ is as follows (the derivation is omitted).
\begin{align}
& \max_{\boldsymbol \alpha_M} -\frac{1}{2} {\boldsymbol \alpha}_M^T{\boldsymbol Q}_M \boldsymbol \alpha_M^{ \ } - C \boldsymbol 1^T {\boldsymbol Q}_{I, M} \boldsymbol \alpha_M^{ \ } + \boldsymbol 1^T \boldsymbol \alpha_M^{ \ }
& \\
& s.t. \boldsymbol y_M^T \boldsymbol \alpha_M^{ \ } = -C \boldsymbol 1^T \boldsymbol y_I^{ \ }
\end{align}
For this problem, we introduce a new dual variable $\nu$ and obtain the following Lagrange function.
L = -\frac{1}{2} {\boldsymbol \alpha}_M^T{\boldsymbol Q}_M \boldsymbol \alpha_M^{ \ } - C \boldsymbol 1^T {\boldsymbol Q}_{I, M} \boldsymbol \alpha_M^{ \ } + \boldsymbol 1^T \boldsymbol \alpha_M^{ \ } - \nu \left(\boldsymbol y_M^T \boldsymbol \alpha_M^{ \ } + C \boldsymbol 1^T \boldsymbol y_I^{ \ }\right)
Setting the derivative with respect to $\boldsymbol \alpha_M$ to $\boldsymbol 0$ and arranging it together with the equality constraint, the problem reduces to the following system of linear equations.
\begin{bmatrix}
\boldsymbol Q_M & \boldsymbol y_M^{ \ } \\
\boldsymbol y_M^T & 0
\end{bmatrix}
\begin{bmatrix}
\boldsymbol \alpha_M^{ \ } \\
\nu
\end{bmatrix} = -C
\begin{bmatrix}
\boldsymbol Q_{M, I} \boldsymbol 1 \\
\boldsymbol 1^T \boldsymbol y_I^{ \ }
\end{bmatrix} +
\begin{bmatrix}
\boldsymbol 1 \\
0
\end{bmatrix} \tag{9}
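Equation (9) is just a dense linear system, so it can be solved directly with numpy. The sketch below assembles the blocks for the linear kernel and assumes the coefficient matrix is invertible; the function name and the use of index arrays `M` and `I` are my own.

```python
import numpy

def solve_eq9(X, y, C, M, I):
    # Q_M and Q_{M,I} for the linear kernel: Q_ij = y_i y_j x_i^T x_j
    Ym, Yi = y[M].reshape(-1, 1) * X[M], y[I].reshape(-1, 1) * X[I]
    Q_M, Q_MI = Ym.dot(Ym.T), Ym.dot(Yi.T)
    # left-hand side [[Q_M, y_M], [y_M^T, 0]]
    A = numpy.block([[Q_M, y[M].reshape(-1, 1)],
                     [y[M].reshape(1, -1), numpy.zeros((1, 1))]])
    # right-hand side -C [Q_{M,I} 1, 1^T y_I]^T + [1, 0]^T
    rhs = -C * numpy.concatenate((Q_MI.sum(axis=1), [y[I].sum()]))
    rhs[:-1] += 1.0
    sol = numpy.linalg.solve(A, rhs)
    return sol[:-1], sol[-1]  # alpha_M and nu (= bias b)
```

The `solver` function in the full script below does essentially the same thing.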
$\nu$ represents the bias $b$ when the points on the margin form the set $M$. In other words, the $\nu$ obtained along with the optimal $\boldsymbol \alpha_M$ is the optimal bias $b$. Since the sets $\{O, M, I\}$ cannot be known in advance, the active set method solves equation (9) from appropriate initial values and then updates the active set. Finally, the optimization algorithm based on the active set method is shown.
In the optimization of the soft margin SVM by the active set method, the data point in $I$ or $O$ that most violates its condition is added to $M$, and $\boldsymbol \alpha_M$ is computed in that state. Data points are moved between $I$ or $O$ and $M$, and learning ends when no violating data points remain.
I implemented it as follows (sorry for the messy code). The learning may not converge, so I may have misunderstood the algorithm or made a mistake in the implementation. I have not been able to fix it myself, so if anyone is kind enough to debug it, please do (bows in apology).
activeset.py
```python
import numpy
from matplotlib import pyplot
import sys


# true decision boundary used to label the toy data
def f(x, y):
    return y - x


# margin values y_i (w^T x_i + b)
def g(T, X, w, b):
    return T * (X.dot(w) + b)


# index (within the list li) of the sample maximizing T_i (w^T x_i + b)
def argmax(T, X, w, b, li):
    return numpy.argmax(g(T[li], X[li], w, b))


# solve the linear system (9) for alpha_M and nu (= bias b)
def solver(T, X, C, li, lm):
    Ti, Tm = T[li], T[lm]
    Xi, Xm = X[li], X[lm]
    Qm = (Tm.reshape(-1, 1) * Xm).dot(Xm.T * Tm)
    Qmi = (Tm.reshape(-1, 1) * Xm).dot(Xi.T * Ti)
    A = numpy.vstack((numpy.hstack((Qm, Tm.reshape(-1, 1))), numpy.hstack((Tm, 0))))
    B = -C * numpy.vstack((Qmi.sum(axis = 1).reshape(-1, 1), Ti.sum()))
    B[:-1] += 1
    return numpy.linalg.inv(A).dot(B)


# step sizes toward the bounds 0 and C (relies on the globals lm and C)
def calc_eta(d, alpha):
    index = (d != 0) * (alpha[lm] >= 0) * (alpha[lm] <= C)
    eta_z = (0 - alpha[lm][index]) / d[index]
    eta_c = (C - alpha[lm][index]) / d[index]
    return numpy.hstack((eta_z[eta_z > 0], eta_c[eta_c > 0]))


def set_to_list(_set):
    return list(_set)


def move(_from, _to, value):
    _from.remove(value)
    _to.add(value)


# plot the data and the learned separating line (relies on the global b)
def graph(T, X, w, param):
    seq = numpy.arange(-5, 5, 0.02)
    pyplot.figure(figsize = (6, 6))
    pyplot.xlim(-3, 3)
    pyplot.ylim(-3, 3)
    pyplot.plot(seq, -(w[0] * seq + b) / w[1], 'k-')
    pyplot.plot(X[T == 1, 0], X[T == 1, 1], 'ro')
    pyplot.plot(X[T == -1, 0], X[T == -1, 1], 'b^')
    if '-s' in param:
        pyplot.savefig('graph.png')
    if '-d' in param:
        pyplot.show()


if __name__ == '__main__':
    param = sys.argv
    numpy.random.seed()
    # toy data: two slightly separated classes
    N = 20
    dim = 2
    X = numpy.random.randn(N, dim)
    T = numpy.array([1 if f(x, y) > 0 else -1 for x, y in X])
    X[T == 1, 1] += 0.5
    X[T == -1, 1] -= 0.5
    # initialize
    alpha = numpy.zeros(N)
    w = numpy.zeros(dim)
    b = 0
    C = 0.1
    I, M, O = set(), set(), set(range(N))
    li, lm, lo = set_to_list(I), set_to_list(M), set_to_list(O)
    # outer loop: continue while some sample in O or I violates its condition
    while numpy.argwhere(g(T[lo], X[lo], w, b) < 1).size > 0 or numpy.argwhere(g(T[li], X[li], w, b) > 1).size > 0:
        flag = 0
        if I == set():
            io = argmax(-1 * T, X, w, b, lo)
            move(O, M, lo[io])
            flag = 1
        if O == set():
            ii = argmax(T, X, w, b, li)
            move(I, M, li[ii])
            flag = 1
        if flag == 0:
            # move the most violating sample from O or I into M
            io = argmax(-1 * T, X, w, b, lo)
            ii = argmax(T, X, w, b, li)
            if abs(1 - g(T[lo[io]], X[lo[io]], w, b)) >= abs(1 - g(T[li[ii]], X[li[ii]], w, b)):
                move(O, M, lo[io])
            else:
                move(I, M, li[ii])
        # inner loop: solve subproblem (9) on M and push converged alphas back to O or I
        while M != set():
            li, lm, lo = set_to_list(I), set_to_list(M), set_to_list(O)
            solve = solver(T, X, C, li, lm)
            alpha_new = solve[:-1, 0]
            if numpy.array_equal(alpha[lm], alpha_new):
                break
            if alpha_new[(alpha_new < 0) + (alpha_new > C)].size == 0:
                alpha[lm] = alpha_new
            else:
                # infeasible solution: step toward it only as far as the box [0, C] allows
                d = alpha_new - alpha[lm]
                eta = calc_eta(d, alpha)
                if eta.size == 0 or d[d != 0].sum() == 0:
                    break
                alpha[lm] += eta.min() * d
            for i in numpy.argwhere(alpha[lm] <= 0)[:, 0]:
                move(M, O, lm[i])
            for j in numpy.argwhere(alpha[lm] >= C)[:, 0]:
                move(M, I, lm[j])
            w = ((alpha * T).reshape(-1, 1) * X).sum(axis = 0)
            b = solve[-1, 0]
        li, lm, lo = set_to_list(I), set_to_list(M), set_to_list(O)
    if '-d' in param or '-s' in param:
        graph(T, X, w, param)
```
I drew a straight line using the learned $\boldsymbol w$ and saved the result as an image. The result is shown below.
We have optimized the dual problem using the active set method. Please note that there is a high possibility that the implementation code is incorrect.