Introduction to Machine Learning-Hard Margin SVM Edition-

1. What is Support Vector Machine (SVM)?

In the previous article, An Introduction to Machine Learning from a Simple Perceptron, the simple perceptron was trained by following the gradient of the error function so as to make the error smaller and smaller. However, that method has the following two problems.

- The generalization ability of the model is not guaranteed.
- It cannot handle problems that are not linearly separable.

Here, "model" refers to the function $f(x)$ obtained once learning is complete. "Generalization ability" refers to the ability to correctly predict class labels or function values not only for the training data given at learning time but also for unknown new data. After all, the whole point of using machine learning such as the simple perceptron is, in the spam-mail example, to classify well emails that were not used during training. It is by no means enough to classify only emails that have been seen before; for that, signature-based filtering is sufficient and machine learning is not needed.

SVM is an algorithm that addresses both of these problems.

The two problems are solved by introducing margin maximization and the kernel trick. The margin is the distance between the separating line $f(x) = 0$ and the two classes. Margin maximization tries to maximize generalization ability by maximizing the distance corresponding to M in the figure below. (Figure: margin.png)

The kernel trick maps the data from the original data space into a higher-dimensional space and performs linear analysis in that higher-dimensional space. It is a somewhat involved topic, but you can picture the kernel trick as doing something like the image below. (Figure: kernel.png)
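As a small numerical sketch of this idea (the explicit feature map phi below is just one illustrative choice, not something taken from the article), the polynomial kernel $(x^Ty)^2$ can be evaluated in the original two-dimensional space while implicitly working with a three-dimensional feature map:

import numpy as np

def phi(x):
    # Explicit 3-dimensional feature map corresponding to the kernel (x^T y)^2
    return np.array([x[0]**2, np.sqrt(2.0)*x[0]*x[1], x[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(x, y)**2)         # kernel value computed in the original 2D space: 1.0
print(np.dot(phi(x), phi(y)))  # same value via the explicit higher-dimensional map: 1.0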

"Why can two issues be solved by using margin maximization and kernel tricks?" Will be described later.

2. SVM formulation

Neither the previous article on the simple perceptron nor the discussion so far has said anything about what to do when the training data contain noise. In the simple perceptron implementation, learning continued "until the error became zero". In that case, if the data look like the figure below, the previous implementation would keep adjusting the parameters forever. (Figure: noise.png) Of course, the implementation can be adjusted to stop learning once the error is judged to have reached its minimum, but even then the presence of noise reduces classification accuracy. Within SVM, the variant called hard margin SVM shares this weakness against noisy data. The soft margin SVM was therefore devised to remain reasonably strong even when the data contain noise. A noise-tolerant model such as the soft margin SVM is called a robust model. Real-world data are often noisy, so soft margin SVMs are used for most practical tasks. However, the hard margin SVM is easier to grasp than the soft margin SVM, so this article explains only the hard margin SVM.

2.1. Hesse's formula

Here I explain Hesse's formula, which may be familiar from high school Mathematics II textbooks and is the formula used to derive the margin. No formal proof is given; only how it is used is described. Hesse's formula computes the distance between a point and a straight line as follows.

The length $d$ of the perpendicular drawn from a point $A(x_0,y_0)$ to the straight line $ax+by+c=0$ is

d=\frac{|ax_0+by_0+c|}{\sqrt{a^2+b^2}}

(Figure: hesse.png)

In deriving the SVM, the following, which is a multidimensional extension of this, is used.

The distance $d$ from a data point $A=(x_1,x_2,...,x_m)$ to the hyperplane $f(x)=w^Tx+\theta=0$ is

d=\frac{|w^Tx+\theta|}{||w||}
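As a minimal numpy sketch (the numbers below are arbitrary), this multidimensional formula can be evaluated directly:

import numpy as np

w = np.array([3.0, 4.0])    # normal vector of the hyperplane
theta = -5.0                # bias term
x0 = np.array([2.0, 1.0])   # data point

# d = |w^T x0 + theta| / ||w||
d = abs(np.dot(w, x0) + theta) / np.linalg.norm(w)
print(d)  # |3*2 + 4*1 - 5| / 5 = 1.0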

2.2. Lagrange's undetermined multiplier method

The method of Lagrange multipliers is used to maximize the margin under the constraints described later. It is a method for treating an optimization problem under constraints as an ordinary extremum problem. As a two-dimensional example, consider the problem of finding the point $(a, b)$ at which $f(x, y)$ takes its maximum value under the constraint $g(x, y) = 0$:

\left\{
\begin{array}{ll}
Maximize& f(x,y) \\
Constraints& g(x,y)=0 
\end{array}
\right.
At this time, introducing an undetermined multiplier $\lambda$ called a Lagrange multiplier, the following function can be defined:

L(x,y,\lambda)=f(x,y)-\lambda g(x,y) \\

At a point $(a,b)$ where $\frac{\partial g}{\partial x}\neq 0$ and $\frac{\partial g}{\partial y}\neq 0$, the point $(x,y)$ giving the maximum value can be found as a stationary point satisfying

\frac{\partial L}{\partial x}=\frac{\partial L}{\partial y}=\frac{\partial L}{\partial \lambda}=0

Here, a stationary point is a point where the derivative becomes 0. For the SVM, the following generalization to the multidimensional case is used: under $M$ constraints $g_k(x_1,...,x_n)=0, (k=1,...,M)$, consider the problem of finding the n-dimensional point $x=(x_1,...,x_n)$ at which $f(x_1,...,x_n)$ takes its maximum value:

\left\{
\begin{array}{ll}
Maximize& f(x_1,...,x_n) \\
Constraints& g_k(x_1,...,x_n)=0,(k=1,...,M)
\end{array}
\right.
At this time, introducing undetermined multipliers (Lagrange multipliers) $\lambda=(\lambda_1,...,\lambda_M)$, the following function can be defined:

L(x_1,...,x_n,\lambda)=f(x_1,...,x_n)-\sum_{k=1}^M\lambda_k g_k(x_1,...,x_n) \\

At a point $(x_1,...,x_n)$ where $\nabla g_k=(\frac{\partial g_k}{\partial x_1},...,\frac{\partial g_k}{\partial x_n})\neq (0,...,0)$, the point $(x_1,...,x_n)$ giving the maximum value is found as a stationary point satisfying the $n+M$ conditions

\frac{\partial L}{\partial \alpha}=0 \quad (\alpha=x_1,...,x_n,\lambda_1,...,\lambda_M)
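As a toy illustration (this particular problem has nothing to do with the SVM, and the sympy package is assumed to be available), maximizing $f(x,y)=xy$ under $g(x,y)=x+y-1=0$ can be solved exactly this way:

import sympy as sp

x, y, lam = sp.symbols('x y lambda')
f = x * y
g = x + y - 1

# Lagrange function L = f - lambda * g
L = f - lam * g

# Stationary point: set all partial derivatives of L to zero
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)], [x, y, lam], dict=True)
print(sol)  # [{x: 1/2, y: 1/2, lambda: 1/2}]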

2.3. KKT (Karush-Kuhn-Tucker) conditions

Consider a mathematical programming problem with the following inequality constraints for differentiable functions $f$ and $g_i$:

\left\{
\begin{array}{ll}
Minimize& f(x) \\
Constraints& g_i(x) \leq 0 (i=1,...,m)
\end{array}
\right.

The following formula is the KKT condition for the above.

\left\{
\begin{array}{l}
\nabla f(\hat{x})+\sum_{i=1}^m \hat{\lambda_i} \nabla g_i(\hat{x})=0  \\
\hat{\lambda_i} \geq 0,g_i(\hat{x})\leq 0,\hat{\lambda_i} g_i(\hat{x}) = 0 (i=1,...,m)
\end{array}
\right.

Of the conditions above,

\hat{\lambda_i} \geq 0,\quad g_i(\hat{x})\leq 0,\quad \hat{\lambda_i} g_i(\hat{x}) = 0 \quad (i=1,...,m)

states that for each $i$ at least one of $\hat{\lambda_i}$ and $g_i(\hat{x})$ is 0; this is generally called the complementarity condition.
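As a toy check of these conditions (the problem below is chosen only for illustration), take $f(x)=x^2$ and the single constraint $g(x)=1-x\leq 0$; the optimum is $\hat{x}=1$ with $\hat{\lambda}=2$:

# KKT check for: minimize x^2 subject to 1 - x <= 0 (toy example)
x_hat, lam_hat = 1.0, 2.0

grad_f = 2 * x_hat        # f(x) = x^2   -> f'(x) = 2x
grad_g = -1.0             # g(x) = 1 - x -> g'(x) = -1
g_val = 1 - x_hat

print(grad_f + lam_hat * grad_g)   # stationarity: 0.0
print(lam_hat >= 0, g_val <= 0)    # feasibility of lambda and g: True True
print(lam_hat * g_val)             # complementarity: 0.0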

2.4. Derivation of hard margin SVM

When the total number of training samples is $n$, "2.1. Hesse's formula" gives the minimum distance $d$ between the training data and the hyperplane as follows.

d=min_{i=1,...,n}\{\frac{|w^Tx_i+\theta|}{||w||}\} \\

Here, $min_{i=1,...,n}$ denotes the smallest value of the expression inside the braces over $i = 1, ..., n$.

We want to find the $w$ and $\theta$ that maximize this minimum distance $d$. Written as a formula, this is

d_{max}=max_{w,\theta}[min_{i=1,...,n}\{\frac{|w^Tx_i+\theta|}{||w||}\}] 

Here, $max_{w,\theta}$ denotes the largest value of the expression over all possible $w$ and $\theta$. Since $||w||$ does not depend on $i$, it can be taken outside the $min$, so the expression can be rewritten as

d_{max}=max_{w,\theta}[\frac{1}{||w||}min_{i=1,...,n}\{|w^Tx_i+\theta|\}] \\

When $w^Tx+\theta=0$, $d_{max}$ becomes 0 no matter what value $\frac{1}{||w||}$ takes, and $w$ and $\theta$ are not uniquely determined, so the following constraint is added:

min_{i=1,2,...,n}|w^Tx_i+\theta|=1 \\

With this constraint, the problem becomes

max_{w,\theta}\frac{1}{||w||} \\

and the resulting discriminant function has the nearest training data on both sides of the hyperplane at distance $M=\frac{1}{||w||}$, i.e. a margin of width $\frac{2}{||w||}$.

Computationally, $max_{w,\theta}\frac{1}{||w||}$ is equivalent to $min_{w,\theta}\frac{1}{2}||w||^2$.

Furthermore, using the teacher labels $t_i$, the constraint $min_{i=1,2,...,n}|w^Tx_i+\theta|=1$ can be written as follows:

t_i(w^Tx_i+\theta)\geq 1,\quad(i=1,2,...,n) \\

If the teacher label $t_i$ and the sign of the discriminant function $w^Tx_i+\theta$ disagree, then $t_i(w^Tx_i+\theta) < 1$. In other words, satisfying the above constraint means that no misclassification has occurred. The problem therefore reduces to minimizing $\frac{1}{2}||w||^2$ under the condition that no misclassification occurs. Formulated, it looks like this:

\left\{
\begin{array}{l}
min_{w,\theta}\frac{1}{2}||w||^2 \\
Constraints:t_i(w^Tx_i+\theta)\geq 1,(i=1,2,...,n)
\end{array}
\right. 
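As a small sanity check (the numbers below are arbitrary toy values), $t_i(w^Tx_i+\theta)\geq 1$ holds exactly when every point is classified correctly and lies on or outside the margin:

import numpy as np

w, theta = np.array([1.0, 1.0]), -1.0                # toy hyperplane
X = np.array([[2.0, 1.0], [0.0, 0.0], [1.5, 0.5]])   # toy points
t = np.array([1.0, -1.0, 1.0])                       # toy labels

margins = t * (X.dot(w) + theta)
print(margins)               # [2. 1. 1.]
print(np.all(margins >= 1))  # True: no misclassification, all points at or beyond the margin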

For this constrained problem, introducing Lagrange multipliers $\lambda$ ($\lambda_i \geq 0$), the following function can be defined; the problem can then be replaced with finding the stationary point of this function.

L(w,\theta,\lambda)=\frac{1}{2}||w||^2-\sum_{i=1}^{N}\lambda_i \{t_i(w^Tx_i+\theta)-1\} \\

At the minimum, the gradient of $L$ must be 0:

\frac{\partial L}{\partial \theta}=-\sum_{i=1}^{N}\lambda_it_i=0 \\
\frac{\partial L}{\partial w}=w-\sum_{i=1}^{N}\lambda_i t_i x_i=0 \\

Rearranging $\frac{\partial L}{\partial w}=0$ gives

w=\sum_{i=1}^{N}\lambda_i t_i x_i \\

Expanding $L(w,\theta,\lambda)$ makes it easier to handle:

L(w,\theta,\lambda)=\frac{1}{2}||w||^2-\sum_{i=1}^{N}(\lambda_it_iw^Tx_i+\lambda_it_i\theta-\lambda_i) \\
L(w,\theta,\lambda)=\frac{1}{2}||w||^2-\sum_{i=1}^{N}\lambda_it_iw^Tx_i-\theta\sum_{i=1}^{N}\lambda_it_i+\sum_{i=1}^{N}\lambda_i \\

Here $\theta$ is taken outside the sum because it is a constant.

Substituting $\sum_{i=1}^{N}\lambda_it_i=0$ and $w=\sum_{i=1}^{N}\lambda_i t_i x_i$ into $L(w,\theta,\lambda)$:

L(w,\theta,\lambda)=\frac{1}{2}||w||^2-\sum_{i=1}^{N}\lambda_it_ix_i^T\sum_{j=1}^{N}\lambda_j t_j x_j-\theta\times 0+\sum_{i=1}^{N}\lambda_i \\
L(w,\theta,\lambda)=\frac{1}{2}w^Tw-\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j t_it_j x_i^Tx_j+\sum_{i=1}^{N}\lambda_i \\
L(w,\theta,\lambda)=\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j t_it_j x_i^Tx_j-\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j t_it_j x_i^Tx_j+\sum_{i=1}^{N}\lambda_i \\
L(w,\theta,\lambda)=-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j t_it_j x_i^Tx_j+\sum_{i=1}^{N}\lambda_i \\
This can be organized as the following dual problem:

\left\{
\begin{array}{l}
max_{\lambda}-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j t_it_j x_i^Tx_j+\sum_{i=1}^{N}\lambda_i \\
Constraints:\sum_{i=1}^{N}\lambda_it_i=0,\lambda_i\geq 0
\end{array}
\right. \\

Such a problem is called a convex quadratic programming problem in $\lambda_i$. Once the solution $\hat{\lambda_i} (> 0)$ of this problem is found, $\hat{w}$ is obtained from $w = \sum_{i=1}^{N}\lambda_i t_i x_i$. Because the problem is convex quadratic programming, a local optimum is always the global optimum. The optimal solution $\hat{w}, \hat{\theta}, \hat{\lambda}$ must satisfy the KKT complementarity condition $\hat{\lambda_i}\{t_i(w^Tx_i+\theta)-1\} = 0$. When the inequality constraint $t_i(w^Tx_i+\theta)-1>0$ holds strictly, the complementarity condition forces $\lambda_i = 0$, so that constraint is inactive. Conversely, if $t_i(w^Tx_i+\theta)-1=0$, then $\lambda_i>0$ is possible and the constraint is active. Thus only the constraints from the points that determine the margin, i.e. the points satisfying $t_i(w^Tx_i+\theta)-1=0$ (hereafter, support vectors), remain active. Since the number of support vectors is usually much smaller than the number of training samples, this saves a large amount of computation at prediction time.
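For reference, and assuming the cvxopt package is available (it is not used in the implementation of section 5, which relies on the steepest descent method instead), the dual problem above can also be handed to a generic convex QP solver, for example along these lines:

import numpy as np
from cvxopt import matrix, solvers

def solve_dual(X, t):
    # Dual: minimize (1/2) lam^T Q lam - 1^T lam with Q_ij = t_i t_j x_i^T x_j,
    # subject to lam_i >= 0 and sum_i lam_i t_i = 0
    N = X.shape[0]
    t = t.astype(float)
    Q = (t[:, None] * t[None, :]) * X.dot(X.T)
    P, q = matrix(Q), matrix(-np.ones(N))
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))   # -lam_i <= 0
    A, b = matrix(t.reshape(1, N)), matrix(0.0)      # sum_i lam_i t_i = 0
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])

# Usage sketch: lam = solve_dual(X, t) with X an (N, d) array and t labels in {+1, -1}.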

Finally, the SVM is derived as follows.

From $f(x)=w^Tx+\theta$ and $w=\sum_{i=1}^{N}\lambda_i t_i x_i$,

\hat{f}(x)=\sum_{i=1}^{N}\hat{\lambda_i}t_i x_i^T x +\theta

The value of $\theta$ is obtained from the complementarity condition $t_i(w^Tx_i+\theta)-1=0$ as follows.

From $t_i(w^Tx_i+\theta)-1=0$ and $w=\sum_{j=1}^{N}\lambda_j t_j x_j$,

t_i(\sum_{j=1}^{N}\hat{\lambda_j}t_j x_j^T x_i+\theta)-1=0 \\

Dividing both sides by $t_i$:

\sum_{j=1}^{N}\hat{\lambda_j}t_j x_j^T x_i+\theta - \frac{1}{t_i}=0\\

Moving all terms other than $\theta$ to the right-hand side:

\theta=\frac{1}{t_i} -\sum_{j=1}^{N}\hat{\lambda_j}t_j x_j^T x_i\\

Since $t_i\in \{1,-1\}$, $\frac{1}{t_i}$ equals $t_i$, so

\theta=t_i -\sum_{j=1}^{N}\hat{\lambda_j}t_j x_j^T x_i

In an actual implementation, $\theta$ is often computed as an average over all support vectors satisfying the condition, for the sake of numerical stability. When averaging, $\theta$ is calculated by the following formula.

\theta = \frac{1}{N_S} \sum_{n \in S}(t_n - \sum_{m \in S}\hat{\lambda}_mt_m x_n^T x_m) \\
S:Set of support vectors(=\{s_1,s_2,...,s_{N_S}\}) \\
N_S:Number of support vectors

Finally, the decision boundary is drawn using the following formula.

\hat{f}(x)=\sum_{i=1}^{N}\hat{\lambda_i}t_i x_i^T x +  \frac{1}{N_S} \sum_{n \in S}(t_n - \sum_{m \in S}\hat{\lambda}_mt_m x_n^T x_m)

3. Kernel trick

By maximizing the margin, better generalization ability than the simple perceptron can be expected. However, the implementation so far can solve only linearly separable problems. The kernel trick was devised to address this. As explained at the beginning, what the kernel trick does is "map the data from the original data space into a higher-dimensional space and perform linear analysis in that higher-dimensional space". Only how to implement it is described here; why it works is too advanced a topic for this article and does not affect the implementation itself, so it is omitted. The formula obtained at the end of the previous section,

\hat{f}(x)=\sum_{i=1}^{N}\hat{\lambda_i}t_i x_i^T x +  \frac{1}{N_S} \sum_{n \in S}(t_n - \sum_{m\in S}\hat{\lambda}_mt_m x_n^T x_m)

is transformed into

\hat{f}(x)=\sum_{i=1}^{N}\hat{\lambda_i}t_i K(x_i,x) +  \frac{1}{N_S} \sum_{n \in S}(t_n - \sum_{m \in S}\hat{\lambda}_mt_m K(x_n,x_m))

Here $K(\cdot,\cdot)$ is a kernel function. To learn more about the kernel trick, study kernel methods; specifically, the reproducing kernel Hilbert space, the representer theorem, and Mercer's theorem.
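For reference, the two kernels that appear in the implementation of section 5 can be written as plain functions (the RBF form below is the standard one; the code in section 5 follows the same idea with a slightly different scaling of sigma):

import numpy as np

def poly_kernel(x, y, sigma=1.0, degree=3):
    # Polynomial kernel; degree=1 corresponds to the (shifted) linear case
    return (1.0 + np.dot(x, y) / sigma) ** degree

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))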

4. SVM learning algorithm

There are three typical methods for optimizing $\lambda$.

4.1. Learning by the steepest descent method

The steepest descent method is an algorithm that searches for the minimum of a function by following its slope. Let $f(x)$ be a function that takes an $n$-dimensional vector $x=(x_1, x_2, ..., x_n)$ as its argument, and consider finding its minimum. The gradient method brings $x$ closer to the solution iteratively. With $x^{(k)}$ denoting the current estimate at the $k$-th iteration, the steepest descent method updates it as follows. When minimizing a function:

x^{(k+1)} = x^{(k)}-\alpha \nabla f(x^{(k)})

When maximizing a function

x^{(k+1)} = x^{(k)}+\alpha \nabla f(x^{(k)})

Learning behaves as shown in the figure below, rolling down a slope with each update. The figure is one-dimensional for ease of visualization, but the same holds in higher dimensions. (Figure: steepest.png) The value of $\alpha$ can be any number (generally between 0 and 1). If $\alpha$ is too large, each update moves a long way down the slope, fine adjustments become impossible, and the gap between the true minimum and the computed approximation grows. If $\alpha$ is too small, each update moves only a short distance and the amount of computation increases significantly. Setting $\alpha$ appropriately is therefore part of the craft of a machine learning engineer. This time, the steepest descent method is implemented because it is the easiest to implement.
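As a minimal sketch of the minimization update (the function and alpha below are chosen arbitrarily), gradient descent on $f(x)=(x-3)^2$ rolls down to the minimizer $x=3$:

# Gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
alpha = 0.1   # learning rate, chosen arbitrarily
x = 0.0       # initial value
for k in range(100):
    x = x - alpha * 2 * (x - 3)
print(x)      # approximately 3.0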

The application of the steepest descent method to the SVM is shown below. We look for the $\lambda$ that maximizes $L(w,\theta,\lambda)$.

\lambda_i^{(k+1)}=\lambda_i^{(k)}+\alpha \nabla L(w,\theta,\lambda) \\
\lambda_i^{(k+1)}=\lambda_i^{(k)}+\alpha \frac{\partial L(w,\theta,\lambda)}{\partial \lambda_i} \\

From $L(w,\theta,\lambda)=-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j t_it_j x_i^Tx_j+\sum_{i=1}^{N}\lambda_i$,

\lambda_i^{(k+1)}=\lambda_i^{(k)}+\alpha (1-\sum_{j=1}^{N} \lambda_j t_i t_j x_i^T x_j) \\
\lambda_i^{(k+1)}=\lambda_i^{(k)}+\alpha (1-\sum_{j=1}^{N} \lambda_j t_i t_j K(x_i,x_j))

The variable $ \ lambda $ is updated by the above formula.

4.2. Learning by SMO algorithm

The SMO algorithm is an optimization method that repeatedly selects two variables at a time and updates them. It is used in the well-known SVM library LIBSVM. A C++ implementation is described in the pages below. http://web.cs.iastate.edu/~honavar/smo-svm.pdf http://www.cs.mcgill.ca/~hv/publications/99.04.McGill.thesis.gmak.pdf It is not covered further here; I hope to publish a commented implementation at some point.

4.3. Learning by stochastic gradient descent

Stochastic gradient descent (SGD) is, roughly speaking, a modification of the steepest descent method (which is batch learning) into online learning. A relatively recent paper proposes an implementation called Pegasos: http://ttic.uchicago.edu/%7Enati/Publications/PegasosMPB.pdf With stochastic gradient descent, SVMs can apparently be trained online. The details of the algorithm are not covered here. I would like to implement it eventually.

5. Hard margin SVM implementation example in Python

The implementation mostly works, but sometimes no support vectors are found, or strange support vectors are found. The cause is probably the steepest descent implementation, which stops after a fixed number of updates. If the Lagrange multipliers are iterated until convergence, the support vectors seem to be found correctly. If you know the cause, please let me know. When I implemented an SVM as a university student I did not hit this strange stumbling block, so it is something of a mystery (the source code from back then has been lost). The support vectors also fail to be computed when the steepest descent update overflows, so if I find the motivation I will put an upper bound on the values.

# -*- coding: utf-8 -*-
#File name is svm.py
import numpy as np
import random
from math import fabs
from scipy.linalg import norm
import matplotlib.pyplot as plt

#SVM class
class svm:
  #Initial value setting
  def __init__(self,kernel='linear',sigma=1.,degree=1.,threshold=1e-5,loop=500,LR=0.2):
    self.kernel = kernel
    #RBF and Poly(Polynomial) kernel parameters
    self.sigma = sigma
    self.degree = degree
    #A linear kernel is equivalent to a polynomial kernel of degree 1
    #(set degree after the parameters above so it is not overwritten)
    if self.kernel == 'linear':
      self.kernel = 'poly'
      self.degree = 1.
    #Threshold for finding support vectors
    self.threshold = threshold
    #Number of updates
    self.loop = loop
    #Learning rate
    self.LR = LR

  #Methods to calculate the kernel
  def build_kernel(self,X):
    self.K = np.dot(X,X.T)
    if self.kernel == 'poly':
      self.K = (1. + 1. / self.sigma * self.K)**self.degree

    if self.kernel == 'rbf':
      self.xsquared = (np.diag(self.K)*np.ones((1,self.N))).T
      b = np.ones((self.N,1))
      self.K -= 0.5*(np.dot(self.xsquared,b.T) + np.dot(b,self.xsquared.T))
      self.K = np.exp(self.K/(2.*self.sigma**2))

  #Methods of the learning part of SVM
  def train_svm(self,X,t):
  #Start of steepest descent
    # self.Substitute the number of rows of X for N
    self.N = np.shape(X)[0]
    #Calculate kernel
    self.build_kernel(X)
    count = 0
    #Lagrange multiplier initialization
    lam = np.ones((self.N,1))
    #Loop for the specified number of updates
    while(count < self.loop):
      for i in range(self.N):
        ans = 0
        for j in range(self.N):
          #Part of the renewal of the steepest descent method
          ans += lam[j]*t[i]*t[j]*self.K[i,j]
        #Renewal formula of the steepest descent method
        lam[i] += self.LR * (1-ans)
        #Since the Lagrange multiplier is 0 or more, if a negative value is taken, 0 is substituted.
        if lam[i] < 0:
          lam[i] = 0
      count += 1
  #The steepest descent method ends
    #Extract the Lagrange multiplier that exceeds the threshold value as a support vector
    self.sv = np.where(lam>self.threshold)[0]
    #Number of support vectors
    self.nsupport = len(self.sv)
    print(self.nsupport, "support vectors found")

    #Store only the data of each support vector
    self.X = X[self.sv,:]
    self.lam = lam[self.sv]
    self.t = t[self.sv]

    #formula for w
    self.w = np.zeros(np.shape(self.X)[1])
    for i in range(self.nsupport):
      self.w += self.lam[i]*self.t[i]*self.X[i]

    #Calculation formula for θ
    self.b = np.sum(self.t)
    for n in range(self.nsupport):
      self.b -= np.sum(self.lam*self.t*np.reshape(self.K[self.sv[n],self.sv],(self.nsupport,1)))
    self.b /= len(self.lam)

    #Classification part for linear or polynomial kernels
    if self.kernel == 'poly':
      def classifier(Y):
        K = (1. + 1./self.sigma*np.dot(Y,self.X.T))**self.degree
        self.y = np.zeros((np.shape(Y)[0],1))
        for j in range(np.shape(Y)[0]):
          for i in range(self.nsupport):
            self.y[j] += self.lam[i]*self.t[i]*K[j,i]
          #Add the bias term, as in the RBF classifier below
          self.y[j] += self.b
        return self.y

    #Classification part for RBF kernel
    elif self.kernel == 'rbf':
      def classifier(Y):
        K = np.dot(Y,self.X.T)
        c = (1./self.sigma * np.sum(Y**2,axis=1)*np.ones((1,np.shape(Y)[0]))).T
        c = np.dot(c,np.ones((1,np.shape(K)[1])))
        aa = np.dot(self.xsquared[self.sv],np.ones((1,np.shape(K)[0]))).T
        K = K - 0.5*c - 0.5*aa
        K = np.exp(K/(2.*self.sigma**2))
        self.y = np.zeros((np.shape(Y)[0],1))
        for j in range(np.shape(Y)[0]):
          for i in range(self.nsupport):
            self.y[j] += self.lam[i]*self.t[i]*K[j,i]
          self.y[j] += self.b
        return self.y
    else:
      print "[Error] There is not this Kernel --",self.kernel
      return

    self.classifier = classifier

if __name__ == "__main__":
  from svm import svm
 #Define training data
  cls1 = [] #Class 1
  cls2 = [] #Class 2

  mean1 = [-1, 2]
  mean2 = [1, -1]
  cov = [[1.0,0.8], [0.8, 1.0]]
  N = 100 
  #Create class 1 training data with random numbers
  cls1.extend(np.random.multivariate_normal(mean1, cov, N//2))
  #Create class 2 training data with random numbers
  cls2.extend(np.random.multivariate_normal(mean2, cov, N//2))

  #Create data matrix X
  X = np.vstack((cls1, cls2))

  #Create label t
  t = []
  for i in range(N//2):
    t.append(1.0)   #Class 1
  for i in range(N//2):
    t.append(-1.0)  #Class 2
  t = np.array(t)

  #Draw training data
  x1, x2 = np.array(cls1).transpose()
  plt.plot(x1, x2, 'rx')

  x1, x2 = np.array(cls2).transpose()
  plt.plot(x1, x2, 'bx')
  
  sv = svm(kernel='linear',degree=3)
  #Learn svm
  sv.train_svm(X,t)
  #Draw support vector
  for n in sv.sv:
    plt.scatter(X[n,0], X[n,1], s=80, c='c', marker='o')
  #Draw identification boundaries
  step = 0.1
  f0,f1  = np.meshgrid(np.arange(np.min(X[:,0])-0.5, np.max(X[:,0])+0.5, step), np.arange(np.min(X[:,1])-0.5, np.max(X[:,1])+0.5, step))
  out = sv.classifier(np.c_[np.ravel(f0),np.ravel(f1)]).T
  out = out.reshape(f0.shape)
  plt.contour(f0,f1,out,levels=[0.0])

  plt.show()

6. References

6.1. Hesse's formula

Two-dimensional case: http://www004.upp.so-net.ne.jp/s_honma/urawaza/distance.htm
Three-dimensional case: http://yosshy.sansu.org/distance1.htm

6.2. Lagrange's undetermined multiplier method

http://fd.kuaero.kyoto-u.ac.jp/sites/default/files/Lagrange1.pdf https://ja.wikipedia.org/wiki/%E3%83%A9%E3%82%B0%E3%83%A9%E3%83%B3%E3%82%B8%E3%83%A5%E3%81%AE%E6%9C%AA%E5%AE%9A%E4%B9%97%E6%95%B0%E6%B3%95

6.3. KKT conditions

http://www.msi.co.jp/nuopt/glossary/term_da98193bc878eb73f1624989004cfa809e13590a.html

6.4. SVM

Machine Learning: An Algorithmic Perspective, Second Edition (Chapman & Hall/CRC Machine Learning & Pattern Recognition)
Data Science 6: Machine Learning Learned with R, Masaaki Tsujitani (Author), Kunio Takezawa (Author), Akitetsu Kim (Editor)
Support Vector Machine, Ichiro Takeuchi (Author), Masayuki Karasuyama (Author)
First Pattern Recognition, Yuzo Hirai (Author)
Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro
