The kernel trick in support vector machines was hard to understand (Part 2: kernel regression, ridge regression, etc.)

Introduction

I am working through the models used in machine learning. This time, I summarize the technique called the kernel trick. Last time, I gave a rough overview together with implementation examples:

I tried to understand the support vector machine carefully (Part 1: I tried the polynomial / RBF kernel using MakeMoons as an example) https://qiita.com/Fumio-eisan/items/f2553266a509d6f2d3a3

The kernel trick was very difficult for me to understand, and it took quite a bit of effort; it felt like it required more mathematical background than logistic regression or neural networks. In this article, I try to explain it as clearly as I can, working through the points that tripped me up, in order.

The outline is below.

How can we create a non-linear function?

A support vector machine is a classifier that can draw non-linear decision boundaries (by separating the data with a hyperplane in a feature space). If there is a boundary you want to draw, as in the left figure below, you formulate a function along that boundary to separate the classes. As far as I know, there are three main techniques for building non-linear functions. The first is the Taylor expansion, which **approximates** functions such as $\sin$ and $\cos$ with polynomials. The second is the Fourier series, a technique often used for wave-related problems, which expresses a periodic signal as a sum of $\sin$ and $\cos$ terms, which are themselves periodic. The last is the kernel method, which expresses the function with building blocks such as exponential functions and polynomials.

image.png
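As a small aside (my own illustration, not from the original article), here is roughly what the "approximate with polynomials" idea looks like for $\cos$, using its 4th-order Taylor polynomial around 0:

import numpy as np
import matplotlib.pyplot as plt

# Illustration only: 4th-order Taylor polynomial of cos(x) around 0
x = np.linspace(-np.pi, np.pi, 200)
taylor = 1 - x**2 / 2 + x**4 / 24

plt.plot(x, np.cos(x), label='cos(x)')
plt.plot(x, taylor, label='4th-order Taylor polynomial')
plt.legend()
plt.show()

Near the origin the polynomial follows $\cos$ closely but drifts away towards the edges, which hints at why a more flexible method is attractive for fitting arbitrary shapes.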

Non-linear regression is then performed with this kernel method. The goal is to build a function that fits the data, as in the figure below. However, the fitted curve may end up looking nothing like the function we actually want to find; this is called overfitting. As a countermeasure, we add a regularization term to suppress it.

image.png

This is how the optimal function is built. In the following sections, I would like to break it down further and understand it by implementing it.

Understand kernel regression

Now let's move on to kernel regression. As mentioned briefly above, I understand it as a means of drawing non-linear curves. A similar idea (in my mind) is the Fourier series: expressing a wave by adding up $\sin$ and $\cos$ waves. At work I have used that technique to analyze pulsations flowing through gas pipes, so I suspected kernel regression followed similar logic.

image.png

Kernel regression is represented as the sum of $N$ kernel functions, where $N$ is the number of data samples.

f({\bf x}) = \sum_{i=1}^{N} \alpha_i k({\bf x}^{(i)}, {\bf x})

The part written $k({\bf x}^{(i)}, {\bf x})$ is called a kernel function. There are several types of kernel function; typical ones are the Gaussian kernel and the polynomial kernel, expressed by the following formulas.

Gaussian kernel:
k({\bf x}, {\bf x}') = \exp(-\beta \|{\bf x} - {\bf x}'\|^2)

Polynomial kernel:
k({\bf x}, {\bf x}') = ({\bf x}^{\mathrm{T}} {\bf x}'+c)^p
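For reference, here is a minimal sketch of the two kernels as plain Python functions (the function names and default hyperparameter values are my own choices; the article itself only implements the Gaussian kernel below):

import numpy as np

# Gaussian kernel: depends only on the squared distance between xi and xj
def gaussian_kernel(xi, xj, beta=1.0):
    return np.exp(-beta * np.sum((np.asarray(xi) - np.asarray(xj))**2))

# Polynomial kernel: inner product of xi and xj, shifted by c, raised to the power p
def polynomial_kernel(xi, xj, c=1.0, p=2):
    return (np.dot(xi, xj) + c)**p

print(gaussian_kernel(0.5, 0.0, beta=10))           # exp(-10 * 0.25), roughly 0.08
print(polynomial_kernel([1, 2], [3, 4], c=1, p=2))  # (1*3 + 2*4 + 1)^2 = 144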

Play around with the Gaussian kernel

This time I will use the Gaussian kernel. In the Gaussian kernel, $\beta$ controls how sharply the function is peaked, and ${\bf x}'$ determines where the peak sits. The functions obtained when $\beta$ and ${\bf x}'$ are varied are shown below.

kernel.ipynb


import numpy as np
import matplotlib.pyplot as plt

# Gaussian kernel between xi and xj
def kernel(xi, xj, beta=1):
    return np.exp(- beta * np.sum((xi - xj)**2))

# beta=10, centered at x'=0
X = np.arange(-1.5, 1.5, 0.05)
Y = np.zeros(len(X))
for i in range(len(X)):
    Y[i] = kernel(X[i], 0, 10)

# beta=1, centered at x'=0
X1 = np.arange(-1.5, 1.5, 0.05)
Y1 = np.zeros(len(X1))
for i in range(len(X1)):
    Y1[i] = kernel(X1[i], 0, 1)

# beta=1, centered at x'=-1
X2 = np.arange(-1.5, 1.5, 0.05)
Y2 = np.zeros(len(X2))
for i in range(len(X2)):
    Y2[i] = kernel(X2[i], -1, 1)

plt.figure(figsize=(7, 4))
plt.plot(X2, Y2, label='beta=1,x=-1')
plt.plot(X1, Y1, label='beta=1')
plt.plot(X, Y, label='beta=10')
plt.legend()
plt.show()

015.png

You can see that increasing $\beta$ gives a sharper, more localized function, and that changing ${\bf x}'$ shifts the position of the peak.

Next, let's look at combining kernel functions. To represent an actual non-linear function, these Gaussian kernels have to be added together. The program below sums several Gaussian kernels and plots the result.

kernel.ipynb


# Gaussian kernel (same as above)
def kernel(xi, xj, beta=1):
    return np.exp(- beta * np.sum((xi - xj)**2))

X = np.arange(-1.5, 1.5, 0.01)
Y = np.zeros(len(X))

# Kernel centers and their weights (chosen arbitrarily)
centers = [0, -1, 1]
weights = [1, -1, 2]
for i in range(len(X)):
    for weight, center in zip(weights, centers):
        Y[i] += weight * kernel(X[i], center, 10)

plt.figure(figsize=(7, 4))
plt.plot(X, Y)
plt.show()

016.png

I simply plugged in some arbitrary numbers, but the result is a suitably messy non-linear shape.

Create an optimized kernel function that fits the data

Now that we can build non-linear functions, let's understand the procedure for fitting one to arbitrary data. In linear regression, the coefficients are optimized by minimizing an error function, and the idea is the same for kernel regression. Given $N$ data samples (e.g. $(x^{(1)}, y^{(1)})$), optimizing the $N$ coefficients $\alpha_1, \dots, \alpha_N$ yields a function that can express the non-linearity.

\hat{y} =
\left(\begin{matrix}
k({\bf x}^{(1)}, {\bf x}) & k({\bf x}^{(2)}, {\bf x}) & \cdots & k({\bf x}^{(N)}, {\bf x})
\end{matrix}\right)
\left(\begin{matrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_N
\end{matrix}\right)

where we define

{\bf y} = 
\left(\begin{matrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(N)}
\end{matrix}\right)
,\;
K=\left(\begin{matrix}
k({\bf x}^{(1)}, {\bf x}^{(1)}) & k({\bf x}^{(2)}, {\bf x}^{(1)}) & \cdots & k({\bf x}^{(N)}, {\bf x}^{(1)}) \\
k({\bf x}^{(1)}, {\bf x}^{(2)}) & k({\bf x}^{(2)}, {\bf x}^{(2)}) & & \vdots \\
\vdots & & \ddots & \vdots \\
k({\bf x}^{(1)}, {\bf x}^{(N)}) & \cdots & \cdots & k({\bf x}^{(N)}, {\bf x}^{(N)})
\end{matrix}\right) 
,\;
{\bf \alpha}
=
\left(\begin{matrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_N
\end{matrix}\right)

Let the error function $R({\bf \alpha})$ be the sum of squared differences between the observed values $y$ and the model $f({\bf x})$.

\begin{align}
R({\bf \alpha}) &= \sum_{i=1}^{N}\left\{y^{(i)}- f({\bf x}^{(i)})\right\}^2
=\sum_{i=1}^{N}\left\{y^{(i)}- \sum_{j=1}^{N} \alpha_j k({\bf x}^{(j)}, {\bf x}^{(i)})\right\}^2 \\
&=\left(\begin{matrix}
y^{(1)} - \sum_{j=1}^{N}\alpha_j k({\bf x}^{(j)}, {\bf x}^{(1)}) & \cdots & y^{(N)} - \sum_{j=1}^{N}\alpha_j k({\bf x}^{(j)}, {\bf x}^{(N)})
\end{matrix}\right)
\left(\begin{matrix}
y^{(1)} - \sum_{j=1}^{N}\alpha_j k({\bf x}^{(j)}, {\bf x}^{(1)}) \\
\vdots \\
y^{(N)} - \sum_{j=1}^{N}\alpha_j k({\bf x}^{(j)}, {\bf x}^{(N)})
\end{matrix}\right) \\
&=({\bf y}- K {\bf \alpha})^{\mathrm{T}}({\bf y}- K {\bf \alpha})
\end{align}

This is the matrix form of the error function. As for ${\bf \alpha}$, setting the gradient of $R({\bf \alpha})$ to zero and using the fact that $K$ is invertible gives

{\bf y} = K {\bf \alpha} \\
\therefore {\bf \alpha} = K^{-1} {\bf y}

so ${\bf \alpha}$ can be obtained with the inverse of the Gram matrix built from the kernel function.

Now, as a suitable data set, I would like to reproduce the curve drawn by the $\cos$ function.

kernel.ipynb


import pandas as pd

# Sample x uniformly in xlim, take cos(x), and optionally add Gaussian noise
def wave_dataset(size, xlim=[0, 2], scale=None):
    x = np.random.uniform(xlim[0], xlim[1], size)
    y = np.cos(x)
    if scale is not None:
        noise = np.random.normal(0, scale, size)
        y = y + noise
    df = pd.DataFrame()
    df['x'] = x
    df['y'] = y
    return df

data = wave_dataset(20, xlim=[-3.14, 3.14], scale=0.05)

plt.figure(figsize=(7, 4))
plt.scatter(data['x'].values, data['y'].values)
plt.ylim((-1.5, 1.5))
plt.show()

017.png

The np.random.uniform() call generates the random $x$ positions, and np.random.normal() adds a small amount of noise to the values obtained from the $\cos$ function, so the points deviate slightly from the true curve.

Next, we evaluate the kernel function on the data to build the Gram matrix, and then calculate ${\bf \alpha}$ from its inverse.

kernel.ipynb


from itertools import combinations_with_replacement

X = data['x'].values
Y = data['y'].values

# Hyperparameter of the Gaussian kernel
beta = 1

# Number of data points
N = X.shape[0]

# Gram matrix calculation (K is symmetric, so fill both halves at once)
K = np.zeros((N, N))
for i, j in combinations_with_replacement(range(N), 2):
    K[i][j] = kernel(X[i], X[j], beta)
    K[j][i] = K[i][j]

# Calculate the weights by inverting the Gram matrix
alpha = np.linalg.inv(K).dot(Y)
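As an aside (my own note, not in the original notebook), explicitly inverting $K$ can be numerically fragile when the matrix is nearly singular; solving the linear system directly gives the same ${\bf \alpha}$ and is generally more stable:

# Equivalent to np.linalg.inv(K).dot(Y), but avoids forming the inverse explicitly
alpha = np.linalg.solve(K, Y)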

Now everything is ready. Finally, let's perform kernel regression on the data set and draw the resulting function.

kernel.ipynb


#Kernel regression
def kernel_predict(X, x, alpha, beta):
    Y = 0
    for i in range(len(X)):
        Y += alpha[i] * kernel(X[i], x, beta)
    return Y

#Predict results by regression
X_axis = np.arange(-3.14, 3.14, 0.01)
Y_predict = np.zeros(len(X_axis))
for i in range(len(X_axis)):
    Y_predict[i] = kernel_predict(X, X_axis[i], alpha, beta)

#Draw result
plt.figure(figsize=(7, 4))

##Observation data
plt.scatter(data['x'].values, data['y'].values)
##True function
plt.plot(X_axis, np.cos(X_axis), c='red')
##Predicted function
plt.plot(X_axis, Y_predict)

plt.ylim((-1.5, 1.5))
plt.show()

019.png

The blue curve is the function generated by kernel regression. It certainly passes close to the data points, but it looks quite different from the $\cos$ function. This is overfitting. In fact, the values of ${\bf \alpha}$ look like this:

image.png

You can see that it contains very large values, on the order of $10^{9}$ to $10^{13}$. As a result, the fitted function itself can also take very large values.
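A quick way to confirm this in the notebook (my own check, not in the original) is to print the largest coefficient directly:

# Largest absolute weight; without regularization this becomes huge
print(np.abs(alpha).max())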

As a countermeasure, this is resolved by adding a regularization term.

Understand ridge regression

Regularization means adding a penalty term, called the regularization term, to the squared error before minimizing it.

\begin{align}
R({\bf \alpha})
&=({\bf y}- K {\bf \alpha})^{\mathrm{T}}({\bf y}- K {\bf \alpha}) + \lambda {\bf \alpha}^{\mathrm{T}} K {\bf \alpha}
\end{align}

Minimizing the loss function with the added term $\lambda {\bf \alpha}^{\mathrm{T}} K {\bf \alpha}$ is called **ridge regression**, and the added term is its regularization term. Since we want to minimize the loss, the coefficients ${\bf \alpha}$ are also pushed towards smaller values. Solving for ${\bf \alpha}$ under this condition gives the following expression.

{\bf \alpha} = (K+\lambda I_N)^{-1}{\bf y}
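For completeness, here is the gradient calculation behind this expression (my own filling-in of the standard derivation; the article only states the result). Differentiating $R({\bf \alpha})$ and setting the gradient to zero gives

\frac{\partial R}{\partial {\bf \alpha}}
= -2K({\bf y}- K {\bf \alpha}) + 2\lambda K {\bf \alpha} = {\bf 0}
\;\Longrightarrow\;
K\left\{(K+\lambda I_N){\bf \alpha} - {\bf y}\right\} = {\bf 0}
\;\Longrightarrow\;
{\bf \alpha} = (K+\lambda I_N)^{-1}{\bf y}

where I use the symmetry of $K$ and assume it is invertible, which holds for the Gaussian kernel when the data points are distinct.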

Using this idea, I would like to recalculate ${\bf \alpha}$ and plot the result again.

kernel.ipynb



#Coefficient of regularization term
lam = 0.5
alpha_r = np.linalg.inv(K + lam * np.eye(K.shape[0])).dot(Y)

#Predict results by regression
Y_predict_r = np.zeros(len(X_axis))
for i in range(len(X_axis)):
    Y_predict_r[i] = kernel_predict(X, X_axis[i], alpha_r, beta)

#Draw result
plt.figure(figsize=(6.472, 4))

##Observation data
plt.scatter(data['x'].values, data['y'].values)
##True function
plt.plot(X_axis, np.cos(X_axis), c='red')
##Predicted function
plt.plot(X_axis, Y_predict_r)

plt.ylim((-1.5, 1.5))
plt.show()

020.png

It worked. I think it is fair to say the $\cos$ curve is reproduced (although the fit bends away around $x = 3$ because there are no data points in that region).
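As a quick numerical check (my own addition, not in the original notebook), one can compare both fits against the true $\cos$ curve on the plotting grid:

# Mean squared error of each fit against the true cos curve
mse_plain = np.mean((Y_predict - np.cos(X_axis))**2)
mse_ridge = np.mean((Y_predict_r - np.cos(X_axis))**2)
print(mse_plain, mse_ridge)  # the ridge fit should come out much smaller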

In conclusion

It may have felt rushed, but I have summarized the procedure for creating the non-linear functions used in support vector machines. It was hard because much of it had to be understood mathematically, but on the other hand, understanding the idea made the appeal of this algorithm clear to me.

Here is an article that I have referred to a lot.

Linear method and kernel method (regression analysis) https://qiita.com/wsuzume/items/09a59036c8944fd563ff

The full program is here. https://github.com/Fumio-eisan/SVM_20200417
