I am trying to understand the inference models used in machine learning. This time I have summarized the technique known as the kernel trick. Last time, I gave a rough overview and some implementation examples.
I tried to understand the support vector machine carefully (Part 1: I tried the polynomial / RBF kernel using MakeMoons as an example) https://qiita.com/Fumio-eisan/items/f2553266a509d6f2d3a3
This kernel trick was very difficult for me to understand, and I struggled with it a lot. I felt it requires more mathematical background than logistic regression or neural networks do. Here I would like to summarize it in an easy-to-understand way, working through the points that are hard to grasp, in order, as much as possible.
The outline is below.
The support vector machine is a classifier that can draw non-linear decision boundaries (separation by a hyperplane in a transformed space). If there is a boundary you want to draw, as in the figure on the left below, you formulate a function along that boundary to separate the classes. There are (as far as I know) three techniques for building non-linear functions. The first is the Taylor expansion, which **approximates** functions such as the trigonometric functions $\sin$ and $\cos$ with higher-order polynomials. The second is the Fourier series, which is often used for wave-related problems and expresses a periodic signal as a sum of $\sin$ and $\cos$ terms, which are themselves periodic. The last is what is called the kernel method, which expresses the function using kernels such as exponential functions and polynomials.
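For reference, the first two take forms such as the following (the Taylor expansion of $\sin x$ around $0$, and the general Fourier series of a $2\pi$-periodic function):

\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots
,\;
f(x) \approx \frac{a_0}{2} + \sum_{n=1}^{\infty}\left(a_n \cos nx + b_n \sin nx\right)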
A non-linear regression is then performed by this kernel method. Here, consider fitting a function to the data as shown in the figure below. Sometimes the fitted curve looks nothing like the function we actually want to find; this is called overfitting. As a countermeasure, we add a regularization term to suppress overfitting.
This is how the optimal function is created. In what follows, I would like to deepen the understanding by actually implementing it step by step.
Now let's move on to kernel regression. As mentioned briefly above, I understand it as a means of drawing non-linear curves. A similar idea (at least in my mind) is the Fourier series, which expresses a waveform as a sum of sine and cosine waves. At work I have used that technique to analyze pulsations flowing through gas pipes, so I guessed that kernel regression follows similar logic.
Now, kernel regression is expressed as the sum of $N$ kernel functions, where $N$ is the number of data samples.
f({\bf x}) = \sum_{i=1}^{N} \alpha_i k({\bf x}^{(i)}, {\bf x})
The part written as $k({\bf x}^{(i)}, {\bf x})$ is called a kernel function. There are several types of kernel function; typical ones are the Gaussian kernel and the polynomial kernel, which are expressed by the following formulas.
Gaussian kernel:
k({\bf x}, {\bf x}') = \exp(-\beta \|{\bf x} - {\bf x}'\|^2)\\
Polynomial kernel:
k({\bf x}, {\bf x}') = ({\bf x}^{\mathrm{T}} {\bf x}'+c)^p
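As a minimal sketch (the function names and the parameter names beta, c, and p below are my own choices, not fixed terminology), these two kernels can be written in Python as follows:

import numpy as np

def gaussian_kernel(x, x_dash, beta=1.0):
    # k(x, x') = exp(-beta * ||x - x'||^2)
    x, x_dash = np.asarray(x, dtype=float), np.asarray(x_dash, dtype=float)
    return np.exp(-beta * np.sum((x - x_dash)**2))

def polynomial_kernel(x, x_dash, c=1.0, p=2):
    # k(x, x') = (x . x' + c)^p
    return (np.dot(x, x_dash) + c)**p

print(gaussian_kernel([0.0, 0.0], [1.0, 1.0], beta=1.0))      # exp(-2) ~ 0.135
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], c=1.0, p=2))  # (11 + 1)^2 = 144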
This time, I would like to use the Gaussian kernel. In the Gaussian kernel, $\beta$ controls how sharply the value falls off around the peak, and ${\bf x}'$ determines where the peak is located. The functions obtained when $\beta$ and ${\bf x}'$ are changed are shown below.
kernel.ipynb
import numpy as np
import matplotlib.pyplot as plt

def kernel(xi, xj, beta=1):
    # Gaussian kernel: exp(-beta * ||xi - xj||^2)
    return np.exp(-beta * np.sum((xi - xj)**2))

# beta=10, centered at x'=0
X = np.arange(-1.5, 1.5, 0.05)
Y = np.zeros(len(X))
for i in range(len(X)):
    Y[i] = kernel(X[i], 0, 10)

# beta=1, centered at x'=0
X1 = np.arange(-1.5, 1.5, 0.05)
Y1 = np.zeros(len(X1))
for i in range(len(X1)):
    Y1[i] = kernel(X1[i], 0, 1)

# beta=1, centered at x'=-1
X2 = np.arange(-1.5, 1.5, 0.05)
Y2 = np.zeros(len(X2))
for i in range(len(X2)):
    Y2[i] = kernel(X2[i], -1, 1)

plt.figure(figsize=(7, 4))
plt.plot(X2, Y2, label="beta=1, x'=-1")
plt.plot(X1, Y1, label="beta=1, x'=0")
plt.plot(X, Y, label="beta=10, x'=0")
plt.legend()
plt.show()
You can see that increasing $\beta$ produces a sharper peak, and that changing ${\bf x}'$ shifts the position of the peak.
Now, let's look at composing these kernel functions. To represent an actual non-linear function, the Gaussian kernels have to be added together. The program below adds up the generated Gaussian kernels and plots the result.
kernel.ipynb
def kernel(xi, xj, beta=1):
    return np.exp(-beta * np.sum((xi - xj)**2))

X = np.arange(-1.5, 1.5, 0.01)
Y = np.zeros(len(X))
centers = [0, -1, 1]   # centers x^(i) of the Gaussian kernels
weights = [1, -1, 2]   # weights alpha_i of each kernel
for i in range(len(X)):
    for weight, center in zip(weights, centers):
        Y[i] += weight * kernel(X[i], center, 10)

plt.figure(figsize=(7, 4))
plt.plot(X, Y)
plt.show()
I just put in some arbitrary weights and centers, but the result is a suitably messy non-linear shape. This is exactly the form $f({\bf x}) = \sum_i \alpha_i k({\bf x}^{(i)}, {\bf x})$ from above, with weights playing the role of $\alpha_i$ and centers the role of ${\bf x}^{(i)}$.
Now that we can see this produces a non-linear function, let's look at the procedure for fitting it to an arbitrary non-linear target. In linear regression, the coefficients are optimized by minimizing an error function; the idea is the same for kernel regression. Given $N$ data samples (e.g. $(x^{(1)}, y^{(1)})$), we optimize the $N$ coefficients $\alpha_1, \ldots, \alpha_N$, which yields a function that can express the non-linearity:
\hat{y} =
\left(\begin{matrix}
k({\bf x}^{(1)}, {\bf x}) & k({\bf x}^{(2)}, {\bf x}) & \cdots & k({\bf x}^{(N)}, {\bf x})
\end{matrix}\right)
\left(\begin{matrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_N
\end{matrix}\right)
Here we also define
{\bf y} =
\left(\begin{matrix}
y^{(1)} \\
y^{(2)} \\
\vdots \\
y^{(N)}
\end{matrix}\right)
,\;
K=\left(\begin{matrix}
k({\bf x}^{(1)}, {\bf x}^{(1)}) & k({\bf x}^{(2)}, {\bf x}^{(1)}) & \cdots & k({\bf x}^{(N)}, {\bf x}^{(1)}) \\
k({\bf x}^{(1)}, {\bf x}^{(2)}) & k({\bf x}^{(2)}, {\bf x}^{(2)}) & & \vdots \\
\vdots & & \ddots & \vdots \\
k({\bf x}^{(1)}, {\bf x}^{(N)}) & \cdots & \cdots & k({\bf x}^{(N)}, {\bf x}^{(N)})
\end{matrix}\right)
,\;
{\bf \alpha}
=
\left(\begin{matrix}
\alpha_1 \\
\alpha_2 \\
\vdots \\
\alpha_N
\end{matrix}\right)
Let's write the error function as $R({\bf \alpha})$ and define it as the sum of squared differences between the observed values $y^{(i)}$ and the model outputs $f({\bf x}^{(i)})$.
\begin{align}
R({\bf \alpha}) &= \sum_{i=1}^{N}\left\{y^{(i)}- f({\bf x}^{(i)})\right\}^2
=\sum_{i=1}^{N}\left\{y^{(i)}- \sum_{j=1}^{N} \alpha_j k({\bf x}^{(j)}, {\bf x}^{(i)})\right\}^2 \\
&=\left(\begin{matrix}
y^{(1)} - \sum_{j}\alpha_j k({\bf x}^{(j)}, {\bf x}^{(1)}) & \cdots & y^{(N)} - \sum_{j}\alpha_j k({\bf x}^{(j)}, {\bf x}^{(N)})
\end{matrix}\right)
\left(\begin{matrix}
y^{(1)} - \sum_{j}\alpha_j k({\bf x}^{(j)}, {\bf x}^{(1)}) \\
\vdots \\
y^{(N)} - \sum_{j}\alpha_j k({\bf x}^{(j)}, {\bf x}^{(N)})
\end{matrix}\right)\\
&=({\bf y}- K {\bf \alpha})^{\mathrm{T}}({\bf y}- K {\bf \alpha})
\end{align}
This is the matrix form of the error function. Also, since the Gram matrix $K$ is generally invertible, the error can be driven to zero by solving
{\bf y} = K {\bf \alpha} \\
\therefore {\bf \alpha} = K^{-1} {\bf y}
That is, ${\bf \alpha}$ is obtained using the inverse of the Gram matrix built from the kernel function.
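In NumPy this can be done in one line (a sketch assuming the Gram matrix K and the target vector y have already been built as arrays; np.linalg.solve avoids forming the inverse explicitly and is numerically more stable):

# solves K @ alpha = y without explicitly computing K^{-1}
alpha = np.linalg.solve(K, y)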
Now, I would like to prepare a suitable dataset and reproduce the curve drawn by the $\cos$ function.
kernel.ipynb
import pandas as pd

def wave_dataset(size, xlim=[0, 2], scale=None):
    # sample x uniformly and add Gaussian noise to cos(x)
    x = np.random.uniform(xlim[0], xlim[1], size)
    y = np.cos(x)
    if scale is not None:
        noise = np.random.normal(0, scale, size)
        y = y + noise
    df = pd.DataFrame()
    df['x'] = x
    df['y'] = y
    return df

data = wave_dataset(20, xlim=[-3.14, 3.14], scale=0.05)

plt.figure(figsize=(7, 4))
plt.scatter(data['x'].values, data['y'].values)
plt.ylim((-1.5, 1.5))
plt.show()
The np.random.uniform() method generates the random $x$ positions, and noise drawn with np.random.normal() is added to the $\cos$ values, producing points that deviate slightly from the true curve.
Next, we fill the kernel (Gram) matrix with these values, and then calculate $\alpha$ using the inverse matrix.
kernel.ipynb
from itertools import combinations_with_replacement

X = data['x'].values
Y = data['y'].values

# Hyperparameter
beta = 1
# Number of data points
N = X.shape[0]

# Gram matrix calculation (symmetric, so fill both (i, j) and (j, i))
K = np.zeros((N, N))
for i, j in combinations_with_replacement(range(N), 2):
    K[i][j] = kernel(X[i], X[j], beta)
    K[j][i] = K[i][j]

# Calculate weights
alpha = np.linalg.inv(K).dot(Y)
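For reference, the same Gram matrix can be built without the double loop by using NumPy broadcasting (a sketch assuming X is the 1-D array above and beta the same hyperparameter):

# pairwise squared differences via broadcasting, then the Gaussian kernel
K_vec = np.exp(-beta * (X[:, None] - X[None, :])**2)
print(np.allclose(K, K_vec))  # should print True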
Now we are ready. Finally, let's perform kernel regression on this dataset and draw the resulting function.
kernel.ipynb
# Kernel regression prediction: f(x) = sum_i alpha_i * k(x^(i), x)
def kernel_predict(X, x, alpha, beta):
    Y = 0
    for i in range(len(X)):
        Y += alpha[i] * kernel(X[i], x, beta)
    return Y

# Predict results by regression over a grid of x values
X_axis = np.arange(-3.14, 3.14, 0.01)
Y_predict = np.zeros(len(X_axis))
for i in range(len(X_axis)):
    Y_predict[i] = kernel_predict(X, X_axis[i], alpha, beta)

# Draw result
plt.figure(figsize=(7, 4))
## Observation data
plt.scatter(data['x'].values, data['y'].values)
## True function
plt.plot(X_axis, np.cos(X_axis), c='red')
## Predicted function
plt.plot(X_axis, Y_predict)
plt.ylim((-1.5, 1.5))
plt.show()
The blue curve is the function generated by kernel regression. It does pass close to the data points, but it looks nothing like the $\cos$ function. This is a problem of model overfitting. In fact, let's look at the values of $\alpha$.
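A quick way to inspect them (this check is not in the original notebook):

print(alpha)
print('max |alpha| =', np.abs(alpha).max())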
You can see that it contains very large values, on the order of $10^{9}$ to $10^{13}$. As a result, the predicted function itself can also swing to very large values between the data points.
As a countermeasure, this is solved by adding a regularization term. Regularization means minimizing the squared error with an extra penalty term added to it:
\begin{align}
R({\bf \alpha})
&=({\bf y}- K {\bf \alpha})^{\mathrm{T}}({\bf y}- K {\bf \alpha}) + \lambda {\bf \alpha}^{\mathrm{T}} K {\bf \alpha}
\end{align}
Minimizing the loss function with the added term $\lambda {\bf \alpha}^{\mathrm{T}} K {\bf \alpha}$ is called **ridge regression** (here, kernel ridge regression), and this term is the ridge regularization term. Since we want to minimize the loss function, the magnitude of $\alpha$ must also be kept small. Solving for ${\bf \alpha}$ under this condition gives the following expression.
{\bf \alpha} = (K+\lambda I_N)^{-1}{\bf y}
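For completeness, this follows by setting the gradient of $R({\bf \alpha})$ to zero and using the symmetry of $K$ (assuming $K$ is invertible):

\begin{align}
\frac{\partial R}{\partial {\bf \alpha}} &= -2K({\bf y} - K{\bf \alpha}) + 2\lambda K{\bf \alpha} = 0 \\
\Rightarrow \; K{\bf y} &= K(K + \lambda I_N){\bf \alpha} \\
\Rightarrow \; {\bf \alpha} &= (K + \lambda I_N)^{-1}{\bf y}
\end{align}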
Using this result, I would like to recalculate $\alpha$ and plot again.
kernel.ipynb
# Coefficient of the regularization term
lam = 0.5
alpha_r = np.linalg.inv(K + lam * np.eye(K.shape[0])).dot(Y)

# Predict results by regression
Y_predict_r = np.zeros(len(X_axis))
for i in range(len(X_axis)):
    Y_predict_r[i] = kernel_predict(X, X_axis[i], alpha_r, beta)

# Draw result
plt.figure(figsize=(6.472, 4))
## Observation data
plt.scatter(data['x'].values, data['y'].values)
## True function
plt.plot(X_axis, np.cos(X_axis), c='red')
## Predicted function
plt.plot(X_axis, Y_predict_r)
plt.ylim((-1.5, 1.5))
plt.show()
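As a cross-check (not part of the original notebook), the same kernel ridge regression can be reproduced with scikit-learn's KernelRidge. The sketch below assumes the X, Y, X_axis, beta, and lam defined above; beta maps to the gamma parameter and lam to the alpha parameter of the scikit-learn API.

from sklearn.kernel_ridge import KernelRidge

# fits the RBF-kernel ridge model, i.e. alpha = (K + lam*I)^{-1} y internally
model = KernelRidge(alpha=lam, kernel='rbf', gamma=beta)
model.fit(X.reshape(-1, 1), Y)
Y_sklearn = model.predict(X_axis.reshape(-1, 1))

plt.figure(figsize=(7, 4))
plt.plot(X_axis, Y_predict_r, label='own implementation')
plt.plot(X_axis, Y_sklearn, '--', label='sklearn KernelRidge')
plt.legend()
plt.show()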
It works. I think we can say the $\cos$ curve has been reproduced (although the fit bends away around $x = 3$, where there are few data points).
It may have been a bit rushed, but I have summarized the procedure for building the non-linear functions used in support vector machines. It was hard because much of it has to be understood mathematically, but on the other hand, I feel that understanding the underlying idea made the appeal of this algorithm clear.
Here is an article that I have referred to a lot.
Linear method and kernel method (regression analysis) https://qiita.com/wsuzume/items/09a59036c8944fd563ff
The full program is here. https://github.com/Fumio-eisan/SVM_20200417