The Support Vector Machine (SVM) minimizes the following objective function:
\min_{\bf w}\frac{\lambda}{2}\|{\bf w}\|^2+\frac{1}{m}\sum_{({\bf x},y)\in S}l({\bf w}; ({\bf x}, y)) \\
l({\bf w}; ({\bf x},y))=\max(0,1-y\left<{\bf w},{\bf x} \right>)
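As a quick sanity check of the notation, here is a minimal NumPy sketch that evaluates the hinge loss and the regularized objective for a given weight vector (the function names `hinge_loss` and `objective` are my own, not from the paper):

```python
import numpy as np

def hinge_loss(w, x, y):
    # l(w; (x, y)) = max(0, 1 - y <w, x>)
    return max(0.0, 1.0 - y * np.dot(w, x))

def objective(w, X, y, lmd):
    # (lambda / 2) * ||w||^2 + (1 / m) * sum of hinge losses over S
    m = len(y)
    reg = 0.5 * lmd * np.dot(w, w)
    loss = sum(hinge_loss(w, x_i, y_i) for x_i, y_i in zip(X, y)) / m
    return reg + loss
```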
*Online Machine Learning* presents an algorithm that minimizes this objective function by stochastic gradient descent, but that algorithm does not support the kernel trick.
So, while looking for an SVM that uses the kernel trick and is trained by stochastic gradient descent, I found the paper [Pegasos: Primal Estimated sub-GrAdient SOlver for SVM](http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf).
In this article, I will implement the algorithm shown in the paper.
I implement Fig. 3 of the paper (the kernelized algorithm). The inputs of the algorithm are the training set $S$, the regularization parameter $\lambda$, and the number of iterations $T$. At each iteration $t$, an index $i_t \in \lbrace 1,\dots,|S| \rbrace$ is chosen uniformly at random.
The output of the algorithm is a vector $\alpha_{T+1}$ with $m$ elements. Using the obtained $\alpha_{T+1}$, a prediction is made as follows:
\hat{y}={\bf w}_{T+1}^{\rm T}\phi({\bf x})=\frac{1}{\lambda T}\sum_{j=1}^{m}\alpha_{T+1}[j]y_jK({\bf x}_j,{\bf x})
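For reference, the per-iteration update in Fig. 3 boils down to the following margin check (my paraphrase of the paper; $\alpha_t[j]$ counts how many times example $j$ has triggered an update so far, and all entries other than $i_t$ are carried over unchanged):
\alpha_{t+1}[i_t]=\begin{cases}\alpha_t[i_t]+1 & {\rm if}\ \ y_{i_t}\frac{1}{\lambda t}\sum_{j=1}^{m}\alpha_t[j]y_jK({\bf x}_{i_t},{\bf x}_j)<1 \\ \alpha_t[i_t] & {\rm otherwise}\end{cases}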
I implemented it with a scikit-learn-like interface. I also modified the algorithm to add a bias term. The source code has been uploaded here.
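Concretely (this is my reading of the code below, not something stated in the paper), the bias is handled by appending a constant feature $b$ to every input and running the algorithm on the augmented vectors $\tilde{\bf x}=({\bf x},b)$; for the linear kernel this gives
\left<\tilde{\bf x},\tilde{\bf x}'\right>=\left<{\bf x},{\bf x}'\right>+b^2
so the extra coordinate of ${\bf w}$ plays the role of the bias term.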
SGDSVC.py
```python
import numpy as np


class SGDSVC(object):

    def __init__(self,
                 kernel="rbf", lmd=1e-1, gamma=0.1, bias=1.0, max_iter=100):
        if kernel not in self.__kernel_dict:
            print(kernel + " kernel does not exist!\nUse rbf kernel.")
            kernel = "rbf"
        if kernel == "rbf":
            def kernel_func(x, y):
                return self.__kernel_dict[kernel](x, y, gamma=gamma)
        else:
            kernel_func = self.__kernel_dict[kernel]
        self.kernel = kernel_func
        self.lmd = lmd
        self.bias = bias
        self.max_iter = max_iter

    def __linear_kernel(x, y):
        return np.dot(x, y)

    def __gaussian_kernel(x, y, gamma):
        diff = x - y
        return np.exp(-gamma * np.dot(diff, diff))

    __kernel_dict = {"linear": __linear_kernel, "rbf": __gaussian_kernel}

    def fit(self, X, y):
        def update_alpha(alpha, t):
            data_size, feature_size = np.shape(self.X_with_bias)
            new_alpha = np.copy(alpha)
            # Choose i_t uniformly at random.
            it = np.random.randint(low=0, high=data_size)
            x_it = self.X_with_bias[it]
            y_it = self.y[it]
            # Margin check of Fig. 3: increment alpha[i_t] only on a violation.
            if (y_it *
                    (1. / (self.lmd * t)) *
                    sum([alpha_j * y_j * self.kernel(x_it, x_j)
                         for x_j, y_j, alpha_j in zip(
                             self.X_with_bias, self.y, alpha)])) < 1.:
                new_alpha[it] += 1
            return new_alpha

        # Append the bias term as an extra constant feature.
        self.X_with_bias = np.c_[X, np.ones(np.shape(X)[0]) * self.bias]
        self.y = np.asarray(y)
        alpha = np.zeros(np.shape(self.X_with_bias)[0])
        for t in range(1, self.max_iter + 1):
            alpha = update_alpha(alpha, t)
        self.alpha = alpha

    def decision_function(self, X):
        X_with_bias = np.c_[X, np.ones(np.shape(X)[0]) * self.bias]
        # w_{T+1}^T phi(x) = (1 / (lambda * T)) * sum_j alpha[j] * y_j * K(x_j, x)
        y_score = [(1. / (self.lmd * self.max_iter)) *
                   sum([alpha_j * y_j * self.kernel(x_j, x)
                        for (x_j, y_j, alpha_j) in zip(
                            self.X_with_bias, self.y, self.alpha)])
                   for x in X_with_bias]
        return np.array(y_score)

    def predict(self, X):
        y_score = self.decision_function(X)
        return np.where(y_score >= 0., 1, -1)
```
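For reference, the class can also be used with the linear kernel. Here is a small toy example of my own (not from the original article), using only the constructor parameters defined above:

```python
import numpy as np
import SGDSVC

# Toy linearly separable data: the class is the sign of the first feature.
rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] >= 0., 1, -1)

clf = SGDSVC.SGDSVC(kernel="linear", lmd=0.1, max_iter=200)
clf.fit(X, y)
print(clf.predict(X[:5]))
```

Since the decision function sums over all training points, prediction cost grows with the size of the training set, which is the usual price of the kernel expansion.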
Here is an example prediction for the iris dataset.
To sanity-check the implementation, the training data itself is reused as the test data.
The parameters are $\lambda = 0.1$ and $T = 100$.
main.py
```python
# -*- coding: utf-8 -*-
import numpy as np
import SGDSVC
from sklearn import datasets
from sklearn import preprocessing
from sklearn import metrics

if __name__ == '__main__':
    iris = datasets.load_iris()
    X = iris.data[:100]
    y = iris.target[:100]
    # Relabel to 1 / -1
    y = np.where(y == 1, 1, -1)
    # Scaling
    scaler = preprocessing.StandardScaler()
    X = scaler.fit_transform(X)

    clf = SGDSVC.SGDSVC()
    clf.fit(X, y)
    y_pred = clf.predict(X)
    print("Confusion Matrix")
    print(metrics.confusion_matrix(y, y_pred))
    print(metrics.classification_report(y, y_pred))
```
The execution result is as follows.
```
Confusion Matrix
[[50  0]
 [ 0 50]]
             precision    recall  f1-score   support

         -1       1.00      1.00      1.00        50
          1       1.00      1.00      1.00        50

avg / total       1.00      1.00      1.00       100
```
As mentioned above, this is the result of reusing the training data as the test data; you can see that every data point is classified correctly.
Passive Aggressive is well known as an online variant of SVM; Section 1 of the paper describes how Pegasos differs from PA. Also, the name Pegasos is really cool.
(2015/7/8) Corrected the prediction formula because it was slightly wrong. (2015/7/10) Corrected Japanese. (2016/1/15) Corrected the link of the cited paper.