In a previous article, "SVM implementation with python", I summarized the hard margin SVM. Last time we optimized by adding the constraint conditions to the objective as a penalty term; this time we optimize using a method called the active set method. I used ["Support Vector Machine"](https://www.amazon.co.jp/Support Vector Machine-Machine Learning Professional Series-Ichiro Takeuchi / dp / 4061529064) as a textbook.
Structure of this article
Please refer to "Implementing SVM with python" for an introduction to SVM; it explains margin maximization, the method of Lagrange multipliers, and the KKT conditions.
In the soft margin SVM, non-negative variables $\xi_i \geq 0, \ i \in [n]$ are introduced, and the constraints are defined as follows.
y_i(\boldsymbol w^T \boldsymbol x_i + b) \geq 1 - \xi_i, \ \ i \in [n]
$\xi_{i}$ allows the margin to be smaller than $1$, so that $\boldsymbol x_i$ may cross the margin and enter the other class's side. When a misclassification occurs, $\xi_{i} > 1$, so if $\sum_{i \in [n]} \xi_{i}$ is at most an integer $K$, the number of misclassifications is also at most $K$. From this, it can be seen that misclassification can be suppressed by making $\sum_{i \in [n]} \xi_{i}$ as small as possible. Based on the above, the primal problem of the soft margin SVM is defined as follows.
\begin{align}
& \min_{\boldsymbol w, b, \boldsymbol \xi} \frac{1}{2}\|\boldsymbol w\|^2 + C\sum_{i \in [n]} \xi_{i} \\
& \\
& s.t. \ \ y_i(\boldsymbol w^T \boldsymbol x_i + b) \geq 1 - \xi_i, \ \ i \in [n] \\
& \\
& \ \ \ \ \ \ \ \ \xi_{i} \geq 0, \ \ i \in [n]
\end{align}
The coefficient $C$ is a positive constant called the regularization coefficient. The first term of the objective function corresponds to maximizing the margin. The second term keeps $\xi_{i}$, which measures the degree of violation of the original constraint $y_i(\boldsymbol w^T \boldsymbol x_i + b) \geq 1$, as small as possible. The regularization coefficient $C$ is a parameter that adjusts the strength of this suppression. Increasing $C$ brings the problem closer to the hard margin; at $C = \infty$, any nonzero element of $\boldsymbol \xi$ adds an infinite value to the objective function, so $\boldsymbol \xi$ is always $\boldsymbol 0$. Conversely, making $C$ smaller tolerates more misclassification.
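As a side note, at the optimum the slack variables satisfy $\xi_i = \max(0, 1 - y_i(\boldsymbol w^T \boldsymbol x_i + b))$, so the primal objective can be evaluated directly from $\boldsymbol w$ and $b$. Below is a minimal numpy sketch of this evaluation; the function and variable names are my own, not from the textbook.

```python
import numpy

def primal_objective(w, b, X, y, C):
    # slack: xi_i = max(0, 1 - y_i (w^T x_i + b))
    xi = numpy.maximum(0.0, 1.0 - y * (X.dot(w) + b))
    # first term maximizes the margin, second term penalizes constraint violations
    return 0.5 * numpy.dot(w, w) + C * xi.sum()
```

As $C$ grows, any positive $\xi_i$ dominates this value, which is why the solution approaches the hard margin.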
Now, let us consider the dual representation of the primal problem. We introduce Lagrange multipliers $\alpha_i \geq 0, \ i \in [n]$ and $\mu_i \geq 0, \ i \in [n]$. The Lagrange function is defined as follows.
L(\boldsymbol w, b, \boldsymbol \xi, \boldsymbol \alpha, \boldsymbol \mu)
= \frac{1}{2}\|\boldsymbol w\|^2 + C\sum_{i \in [n]} \xi_{i} - \sum_{i \in [n]} \alpha_i(y_i(\boldsymbol w^T \boldsymbol x_i + b) - 1 + \xi_i) - \sum_{i \in [n]} \mu_i \xi_i \tag{1}
Setting the partial derivatives of $L$ with respect to $\boldsymbol w, b, \boldsymbol \xi$ to $0$ gives the following.
\begin{align}
& \frac{\partial L}{\partial \boldsymbol w} = \boldsymbol w - \sum_{i \in [n]}\alpha_{i}y_i{\boldsymbol{x}}_i = \boldsymbol 0 \tag{2} \\
& \\
& \frac{\partial L}{\partial b} = \sum_{i \in [n]}\alpha_{i}y_i = 0 \tag{3} \\
& \\
& \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \tag{4} \\
\end{align}
Substituting Eqs. (2), (3), and (4) into Eq. (1), the result is as follows.
L = -\frac{1}{2} \sum_{i, \ j \in [n]}\alpha_{i}\alpha_{j}y_{i}y_{j}{\boldsymbol{x}_i}^T\boldsymbol x_j + \sum_{i \in [n]}\alpha_i \tag{5}
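To spell out the intermediate step (my own supplement): by equation (4), the coefficient of each $\xi_i$ in equation (1) is $C - \alpha_i - \mu_i = 0$, and the term $b \sum_{i \in [n]} \alpha_i y_i$ vanishes by equation (3). Substituting equation (2) into the remaining terms,
\begin{align}
L &= \frac{1}{2}\|\boldsymbol w\|^2 - \sum_{i \in [n]} \alpha_i y_i \boldsymbol w^T \boldsymbol x_i + \sum_{i \in [n]} \alpha_i \\
& \\
&= \frac{1}{2} \sum_{i, \ j \in [n]}\alpha_{i}\alpha_{j}y_{i}y_{j}{\boldsymbol{x}_i}^T\boldsymbol x_j - \sum_{i, \ j \in [n]}\alpha_{i}\alpha_{j}y_{i}y_{j}{\boldsymbol{x}_i}^T\boldsymbol x_j + \sum_{i \in [n]}\alpha_i
\end{align}
which is exactly equation (5).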
Combining equation (5) with the constraints from equations (3) and (4) (equation (4), together with $\mu_i \geq 0$, gives $0 \leq \alpha_i \leq C$), the dual problem can be expressed as follows.
\begin{align}
& \max_{\boldsymbol \alpha} -\frac{1}{2} \sum_{i, \ j \in [n]}\alpha_{i}\alpha_{j}y_{i}y_{j}{\boldsymbol{x}_i}^T\boldsymbol x_j + \sum_{i \in [n]}\alpha_i \\
& \\
& s.t. \ \ \sum_{i \in [n]}\alpha_{i}y_i = 0 \\
& \\
& \ \ \ \ \ \ \ \ \ 0 \leq \alpha_i \leq C, \ \ i \in [n]
\end{align}
Finally, if we define $\boldsymbol Q \in \mathbb{R}^{n \times n}$ whose $(i, j)$ element is $y_{i} y_{j} {\boldsymbol x_i}^T \boldsymbol x_j$, the dual problem can be expressed in vector form.
\begin{align}
& \max_{\boldsymbol \alpha} -\frac{1}{2} \boldsymbol \alpha^T\boldsymbol{Q\alpha} + \boldsymbol 1^T \boldsymbol \alpha \tag{6} \\
& \\
& s.t. \ \ \boldsymbol y^T \boldsymbol \alpha = 0 \tag{7} \\
& \\
& \ \ \ \ \ \ \ \ \boldsymbol 0 \leq \boldsymbol \alpha \leq C \boldsymbol 1 \tag{8}
\end{align}
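For reference, with the linear kernel $\boldsymbol Q$ can be built in a couple of lines of numpy. This is a minimal sketch; the function and variable names are my own.

```python
import numpy

def gram_matrix(X, y):
    # Q_ij = y_i y_j x_i^T x_j (linear kernel)
    Yx = y.reshape(-1, 1) * X
    return Yx.dot(Yx.T)
```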
The objective function is maximized under constraints (7) and (8). This time, we use the active set method, a standard method for constrained optimization problems. For the constraints of the soft margin SVM, the set of indices $i$ with $\alpha_i = 0$ or $\alpha_i = C$ is called the active set. Depending on the state of the inequality constraints on the dual variables, we define the following three sets.
\begin{align}
& O = \bigl\{ \ i \in [n] \ \ | \ \ \alpha_i = 0 \ \bigr\} \\
& M = \bigl\{ \ i \in [n] \ \ | \ \ 0 < \alpha_i < C \ \bigr\} \\
& I = \bigl\{ \ i \in [n] \ \ | \ \ \alpha_i = C \ \bigr\}
\end{align}
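In code, this partition can be read directly from the current $\boldsymbol \alpha$. A minimal sketch follows; the helper name and the small tolerance used for the boundary comparisons are my own additions.

```python
import numpy

def partition(alpha, C, tol=1e-8):
    O = numpy.where(alpha <= tol)[0]                       # alpha_i = 0
    M = numpy.where((alpha > tol) & (alpha < C - tol))[0]  # 0 < alpha_i < C
    I = numpy.where(alpha >= C - tol)[0]                   # alpha_i = C
    return O, M, I
```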
If we knew which set each sample belongs to, then $\alpha_i = 0$ for $i \in O$ and $\alpha_i = C$ for $i \in I$, so we would only have to solve the dual problem for $i \in M$. The dual problem restricted to the set $M$ is as follows (the derivation is omitted).
\begin{align}
& \max_{\boldsymbol \alpha_M} -\frac{1}{2} {\boldsymbol \alpha}_M^T{\boldsymbol Q}_M \boldsymbol \alpha_M^{ \ } - C \boldsymbol 1^T {\boldsymbol Q}_{I, M} \boldsymbol \alpha_M^{ \ } + \boldsymbol 1^T \boldsymbol \alpha_M^{ \ }
& \\
& s.t. \boldsymbol y_M^T \boldsymbol \alpha_M^{ \ } = -C \boldsymbol 1^T \boldsymbol y_I^{ \ }
\end{align}
For this problem, we introduce a new dual variable $\nu$ and obtain the following Lagrange function.
L = -\frac{1}{2} {\boldsymbol \alpha}_M^T{\boldsymbol Q}_M \boldsymbol \alpha_M^{ \ } - C \boldsymbol 1^T {\boldsymbol Q}_{I, M} \boldsymbol \alpha_M^{ \ } + \boldsymbol 1^T \boldsymbol \alpha_M^{ \ } - \nu \left(\boldsymbol y_M^T \boldsymbol \alpha_M^{ \ } + C \boldsymbol 1^T \boldsymbol y_I^{ \ }\right)
Setting the derivative with respect to $\boldsymbol \alpha_M$ to $\boldsymbol 0$ and arranging it together with the equality constraint, the problem reduces to the following system of linear equations.
\begin{bmatrix}
\boldsymbol Q_M & \boldsymbol y_M^{ \ } \\
\boldsymbol y_M^T & 0
\end{bmatrix}
\begin{bmatrix}
\boldsymbol \alpha_M^{ \ } \\
\nu
\end{bmatrix} = -C
\begin{bmatrix}
\boldsymbol Q_{M, I} \boldsymbol 1 \\
\boldsymbol 1^T \boldsymbol y_I^{ \ }
\end{bmatrix} +
\begin{bmatrix}
\boldsymbol 1 \\
0
\end{bmatrix} \tag{9}
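Equation (9) is just a dense linear system, so it can be solved directly with numpy. The sketch below assembles the blocks for the linear kernel and assumes the coefficient matrix is invertible; the function name and the use of index arrays `M` and `I` are my own.

```python
import numpy

def solve_eq9(X, y, C, M, I):
    # Q_M and Q_{M,I} for the linear kernel: Q_ij = y_i y_j x_i^T x_j
    Ym, Yi = y[M].reshape(-1, 1) * X[M], y[I].reshape(-1, 1) * X[I]
    Q_M, Q_MI = Ym.dot(Ym.T), Ym.dot(Yi.T)
    # left-hand side [[Q_M, y_M], [y_M^T, 0]]
    A = numpy.block([[Q_M, y[M].reshape(-1, 1)],
                     [y[M].reshape(1, -1), numpy.zeros((1, 1))]])
    # right-hand side -C [Q_{M,I} 1, 1^T y_I]^T + [1, 0]^T
    rhs = -C * numpy.concatenate((Q_MI.sum(axis=1), [y[I].sum()]))
    rhs[:-1] += 1.0
    sol = numpy.linalg.solve(A, rhs)
    return sol[:-1], sol[-1]  # alpha_M and nu (= bias b)
```

The `solver` function in the full script below does essentially the same thing.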
$\nu$ represents the bias $b$ when the points on the margin form the set $M$. In other words, the $\nu$ obtained along with the optimal $\boldsymbol \alpha_M$ is the optimal bias $b$. Since the sets $\{O, M, I\}$ cannot be known in advance, the active set method solves equation (9) from appropriate initial values and then updates the active set. Finally, the optimization algorithm based on the active set method is shown.
In the optimization of the soft margin SVM by the active set method, the data point in $I$ or $O$ that most violates its condition is added to $M$, and $\boldsymbol \alpha_M$ is computed in that state. Data points are moved between $I$ or $O$ and $M$, and learning ends when no violating data points remain.
I implemented it as follows (sorry for the messy code). The learning may not converge, so I may have misunderstood the algorithm or made a mistake in the implementation. I have not been able to fix it myself, so if anyone is kind enough to debug it, please do (bows in apology).
activeset.py
```python
import numpy
from matplotlib import pyplot
import sys


# true decision boundary used to label the toy data
def f(x, y):
    return y - x


# margin values y_i (w^T x_i + b)
def g(T, X, w, b):
    return T * (X.dot(w) + b)


# index (within the list li) of the sample maximizing T_i (w^T x_i + b)
def argmax(T, X, w, b, li):
    return numpy.argmax(g(T[li], X[li], w, b))


# solve the linear system (9) for alpha_M and nu (= bias b)
def solver(T, X, C, li, lm):
    Ti, Tm = T[li], T[lm]
    Xi, Xm = X[li], X[lm]
    Qm = (Tm.reshape(-1, 1) * Xm).dot(Xm.T * Tm)
    Qmi = (Tm.reshape(-1, 1) * Xm).dot(Xi.T * Ti)
    A = numpy.vstack((numpy.hstack((Qm, Tm.reshape(-1, 1))), numpy.hstack((Tm, 0))))
    B = -C * numpy.vstack((Qmi.sum(axis = 1).reshape(-1, 1), Ti.sum()))
    B[:-1] += 1
    return numpy.linalg.inv(A).dot(B)


# step sizes toward the bounds 0 and C (relies on the globals lm and C)
def calc_eta(d, alpha):
    index = (d != 0) * (alpha[lm] >= 0) * (alpha[lm] <= C)
    eta_z = (0 - alpha[lm][index]) / d[index]
    eta_c = (C - alpha[lm][index]) / d[index]
    return numpy.hstack((eta_z[eta_z > 0], eta_c[eta_c > 0]))


def set_to_list(_set):
    return list(_set)


def move(_from, _to, value):
    _from.remove(value)
    _to.add(value)


# plot the data and the learned separating line (relies on the global b)
def graph(T, X, w, param):
    seq = numpy.arange(-5, 5, 0.02)
    pyplot.figure(figsize = (6, 6))
    pyplot.xlim(-3, 3)
    pyplot.ylim(-3, 3)
    pyplot.plot(seq, -(w[0] * seq + b) / w[1], 'k-')
    pyplot.plot(X[T == 1, 0], X[T == 1, 1], 'ro')
    pyplot.plot(X[T == -1, 0], X[T == -1, 1], 'b^')
    if '-s' in param:
        pyplot.savefig('graph.png')
    if '-d' in param:
        pyplot.show()


if __name__ == '__main__':
    param = sys.argv
    numpy.random.seed()
    # toy data: two slightly separated classes
    N = 20
    dim = 2
    X = numpy.random.randn(N, dim)
    T = numpy.array([1 if f(x, y) > 0 else -1 for x, y in X])
    X[T == 1, 1] += 0.5
    X[T == -1, 1] -= 0.5
    # initialize
    alpha = numpy.zeros(N)
    w = numpy.zeros(dim)
    b = 0
    C = 0.1
    I, M, O = set(), set(), set(range(N))
    li, lm, lo = set_to_list(I), set_to_list(M), set_to_list(O)
    # outer loop: continue while some sample in O or I violates its condition
    while numpy.argwhere(g(T[lo], X[lo], w, b) < 1).size > 0 or numpy.argwhere(g(T[li], X[li], w, b) > 1).size > 0:
        flag = 0
        if I == set():
            io = argmax(-1 * T, X, w, b, lo)
            move(O, M, lo[io])
            flag = 1
        if O == set():
            ii = argmax(T, X, w, b, li)
            move(I, M, li[ii])
            flag = 1
        if flag == 0:
            # move the most violating sample from O or I into M
            io = argmax(-1 * T, X, w, b, lo)
            ii = argmax(T, X, w, b, li)
            if abs(1 - g(T[lo[io]], X[lo[io]], w, b)) >= abs(1 - g(T[li[ii]], X[li[ii]], w, b)):
                move(O, M, lo[io])
            else:
                move(I, M, li[ii])
        # inner loop: solve subproblem (9) on M and push converged alphas back to O or I
        while M != set():
            li, lm, lo = set_to_list(I), set_to_list(M), set_to_list(O)
            solve = solver(T, X, C, li, lm)
            alpha_new = solve[:-1, 0]
            if numpy.array_equal(alpha[lm], alpha_new):
                break
            if alpha_new[(alpha_new < 0) + (alpha_new > C)].size == 0:
                alpha[lm] = alpha_new
            else:
                # infeasible solution: step toward it only as far as the box [0, C] allows
                d = alpha_new - alpha[lm]
                eta = calc_eta(d, alpha)
                if eta.size == 0 or d[d != 0].sum() == 0:
                    break
                alpha[lm] += eta.min() * d
            for i in numpy.argwhere(alpha[lm] <= 0)[:, 0]:
                move(M, O, lm[i])
            for j in numpy.argwhere(alpha[lm] >= C)[:, 0]:
                move(M, I, lm[j])
            w = ((alpha * T).reshape(-1, 1) * X).sum(axis = 0)
            b = solve[-1, 0]
        li, lm, lo = set_to_list(I), set_to_list(M), set_to_list(O)
    if '-d' in param or '-s' in param:
        graph(T, X, w, param)
```
I drew a straight line using the learned $\boldsymbol w$ and saved the result as an image. The result is shown below.
We have optimized the dual problem using the active set method. Please note that there is a high possibility that the implementation code is incorrect.