This time, I will summarize the theory behind the support vector machine, one of the standard machine learning algorithms.
I hope you will bear with me to the end.
Let's first summarize the theory of support vector machines.
The support vector machine (SVM) is a machine learning algorithm that is widely used in data analysis because of its good generalization performance and broad range of applications.
It is mainly used for binary classification problems based on the idea of margin maximization, but it can also be applied to multi-class classification and regression problems.
Its weakness is that it does not scale well to large datasets, because its computational cost is high compared to other machine learning algorithms.
A margin that assumes linearly separable data (data that can be split into two groups by a single straight line) is called a hard margin, while a margin that tolerates some misclassification for data that is not linearly separable is called a soft margin.
I wrote above that linearly separable means "can be divided into two by a single straight line", but that only applies to two-dimensional data. Generalizing the concept: a set of data in n-dimensional space is said to be linearly separable if it can be separated by an (n-1)-dimensional hyperplane.
Data on a two-dimensional plane is linearly separable when it can be classified by a one-dimensional line, and data in three-dimensional space is linearly separable when it can be classified by a two-dimensional plane.
The (n-1)-dimensional surface (not strictly a plane) that classifies n-dimensional data in this way is called the separating hyperplane, and the distance between the separating hyperplane and the data closest to it is called the margin. The goal of this algorithm is to maximize this margin.
Also, the data closest to the separating hyperplane is called a support vector. This is illustrated below.
It is intuitively clear that accuracy improves by creating the hyperplane that maximizes the margin shown in the figure.
This time, the data is represented in two dimensions for the sake of illustration, but think of the data in n-dimensional space as being divided by an n-1 dimensional hyperplane.
In a two-dimensional vector space, a straight line that divides the two classes of data can be written as $ax + by + c = 0$, as in the figure above, and by adjusting the parameters $a, b, c$ we can represent every straight line.
This time we are assuming a hyperplane in an n-dimensional vector space, so the hyperplane is given by the following formula. Now, consider the case where there are N data points in total.

W^TX_i + b = 0
Now let's use this hyperplane equation to derive the equation used to optimize the hard margin (linearly separable problem).
Expanding $W^TX_i + b = 0$ gives $w_1x_1 + w_2x_2 + ... + w_nx_n + b = 0$, and you can see intuitively that this is the two-dimensional line equation $ax + by + c = 0$ extended to n dimensions.
If the triangular data points in the figure belong to the set $K_1$ and the star-shaped data points belong to the set $K_2$, we can see that the following inequalities hold.
W^TX_i + b > 0 \quad (X_i \in K_1)\\
W^TX_i + b < 0 \quad (X_i \in K_2)
To write these two conditions as one, introduce a label variable t.
Let $t_i = 1$ when the i-th data point $X_i$ belongs to class 1 and $t_i = -1$ when it belongs to class 2.
t_i = \left\{
\begin{array}{ll}
1 & (X_i \in K_1) \\
-1 & (X_i \in K_2)
\end{array}
\right.
Using the $t_i$ defined in this way, the condition can be expressed as follows.

t_i(W^TX_i + b) > 0 \quad (i = 1, 2, 3, ...N)

In this way, the condition can be written in a single line.
The margin is the distance between a point in n-dimensional space and a hyperplane, so let's review the distance between a point and a straight line. The distance between a two-dimensional point and a straight line is expressed by the following formula, where $ A (x_0, y_0) $ is the point and $ l: ax + by + c = 0 $ is the straight line.
d = \frac{|ax_0 + by_0 + c|}{\sqrt{a^2+b^2}}
The distance between one point in n-dimensional space and the hyperplane is expressed by the following formula.
d = \frac{|w_1x_1 + w_2x_2 + ... + w_nx_n + b|}{\sqrt{w_1^2+w_2^2+...+w_n^2}} = \frac{|W^TX_i + b|}{||W||}
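As a quick numerical check of this distance formula, here is a minimal NumPy sketch. The hyperplane parameters and the point below are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical hyperplane parameters W, b and a data point X (made-up values)
W = np.array([3.0, 4.0])   # normal vector of the hyperplane
b = -5.0                   # bias term
X = np.array([2.0, 1.0])   # a point in the space

# d = |W^T X + b| / ||W||
d = np.abs(W @ X + b) / np.linalg.norm(W)
print(d)  # |3*2 + 4*1 - 5| / 5 = 1.0
```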
Therefore, the condition for maximizing the margin M from the above equations is expressed by the following equation.
max_{W, b}M, \quad \frac{t_i(W^TX_i + b)}{||W||} \geq M \quad (i = 1, 2, 3, ...N)
This may not be immediately clear, so let me explain. For any data point $X_i$, the distance from the separating hyperplane is $\frac{|W^TX_i + b|}{||W||}$, and multiplying by the label $t_i$ lets us drop the absolute value, because $t_i$ and $W^TX_i + b$ have the same sign for correctly classified data.
The constraint therefore says that every data point lies at a distance of at least M from the hyperplane.
Therefore, finding the W, b, and M that satisfy this formula while making M as large as possible optimizes the support vector machine.
Here, divide both sides of $\frac{t_i(W^TX_i + b)}{||W||} \geq M$ by M and introduce the following new variables.
\frac{W}{M||W||} = \tilde{W}\\
\frac{b}{M||W||} = \tilde{b}
Then, the conditional expression of the optimization problem is expressed as follows.
t_i(\tilde{W^T}X_i + \tilde{b}) \geq 1
The above inequality holds for all data, and the equality holds exactly for the data closest to the separating hyperplane.
In other words, the rescaled margin $\tilde{M}$ is expressed by the following formula.
\tilde{M} = \frac{t_i(\tilde{W^T}X_i + \tilde{b})}{||\tilde{W}||} = \frac{1}{||\tilde{W}||}
With this equation transformation, the optimization problem becomes as follows.
max_{\tilde{W}, \tilde{b}}\frac{1}{||\tilde{W}||}, \quad t_i(\tilde{W^T}X_i + \tilde{b}) \geq 1 \quad (i = 1, 2, 3, ...N)
It is getting fairly involved, but bear with me a little longer.
The tildes were introduced by the change of variables along the way, but let's drop them from here on to keep the notation simple. In addition, maximizing $\frac{1}{||W||}$ is equivalent to minimizing $||W||$, and hence to minimizing $\frac{1}{2}||W||^2$ (the square and the factor of one half make the later differentiation cleaner), so the optimization problem becomes the following.
min_{W, b}\frac{1}{2}||W||^2, \quad t_i(W^TX_i + b)\geq 1 \quad (i = 1, 2, 3, ...N)
Solving the above problem, that is, finding the W and b that minimize $\frac{1}{2}||W||^2$ under these constraints, gives the maximum-margin separating hyperplane.
However, this condition can only solve linearly separable problems. That is, it can only be applied to hard margins.
Let's relax the constraints so that this formulation can also be applied to soft margins.
By loosening the constraint $t_i(W^TX_i + b) \geq 1$ in the above equation, we can deal with data that is not linearly separable (the soft margin case).
Illustrated below.
Consider the problem of linear separability as shown in this figure. As shown by the red arrow in the figure, the data has entered the inside of the margin.
Given the discussion so far, the support vectors (the data closest to the hyperplane) naturally lie on the hyperplanes satisfying $t_i(W^TX_i + b) = 1$.
The data indicated by the red arrow in the figure does not satisfy $t_i(W^TX_i + b) \geq 1$, but it may still satisfy a looser condition such as $t_i(W^TX_i + b) \geq 0.5$.
Therefore, let's relax the constraint by introducing a slack variable $\xi$, defined as follows.
t_i(W^TX_i + b)\geq 1 - \xi_i \\
\xi_i = max\Bigl\{0, M - \frac{t_i(W^TX_i + b)}{||W||}\Bigr\}
From the above formulas, the constraint is loosened only for data that falls inside the margin.
By introducing this slack variable, the margin optimization problem becomes as follows.
min_{W, \xi}\Bigl\{\frac{1}{2}||W||^2 + C\sum_{i=1}^{N} \xi_i\Bigr\} \quad \mathrm{subject\ to} \quad
t_i(W^TX_i + b)\geq 1 - \xi_i\\
\xi_i = max\Bigl\{0, M - \frac{t_i(W^TX_i + b)}{||W||}\Bigr\}\\
i = 1, 2, 3, ... N
Maximizing the margin, that is, minimizing $\frac{1}{2}||W||^2$, and keeping the total slack $\sum_{i=1}^{N}\xi_i$ small are competing goals, and the constant C controls the trade-off between them.
C is a hyperparameter, and we build the model while adjusting it.
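To make the role of C concrete, here is a minimal NumPy sketch (my own toy example, not part of the derivation above) that evaluates the soft-margin objective for a few values of C. The data, labels, W, and b are all made-up values, and the slack is written in the rescaled units where the margin is 1, i.e. $\xi_i = max\{0, 1 - t_i(W^TX_i + b)\}$, which matches the constraint above.

```python
import numpy as np

# Made-up 2-D data, labels in {-1, +1}, and candidate parameters W, b
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [0.5, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
W = np.array([0.5, 0.5])
b = -1.2

def soft_margin_objective(W, b, X, t, C):
    # Slack in the rescaled units: xi_i = max(0, 1 - t_i (W^T X_i + b))
    xi = np.maximum(0.0, 1.0 - t * (X @ W + b))
    return 0.5 * np.dot(W, W) + C * xi.sum()

# A larger C penalizes margin violations more heavily
for C in [0.1, 1.0, 100.0]:
    print(C, soft_margin_objective(W, b, X, t, C))
```

The larger C is, the more the objective is dominated by the margin violations, which is why a very large C behaves almost like a hard margin.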
So far, we have derived the formulas for the optimization problem in hard and soft margins. It is summarized below.
min_{W, b}\frac{1}{2}||W||^2, \quad t_i(W^TX_i + b)\geq 1 \quad (i = 1, 2, 3, ...N)
min_{W, \xi}\Bigl\{\frac{1}{2}||W||^2 + C\sum_{i=1}^{N} \xi_i\Bigr\} \quad \quad
t_i(W^TX_i + b)\geq 1 - \xi_i\\
\xi_i = max\Bigl\{0, M - \frac{t_i(W^TX_i + b)}{||W||}\Bigr\}\\
i = 1, 2, 3, ... N
The hard margin applies to linearly separable problems, and the soft margin applies to problems that are not linearly separable.
Now let's think about solving the optimization problem.
When solving this optimization problem, it is rare to solve the above equation directly.
The above formulation is called the primal problem of the optimization. In many cases, instead of solving the primal problem directly, we convert it into an equivalent formulation called the dual problem and solve that instead.
Now let's use Lagrange's undetermined multiplier method to solve this optimization problem.
Please refer to the article here for the Lagrange's undetermined multiplier method.
Please note that I do not fully understand it myself, so some parts may lack rigor. I will explain it briefly.
Lagrange's undetermined multiplier method is a typical method for constrained optimization problems.
Consider minimizing an objective function $f(X)$ subject to n inequality constraints $g_i(X) \leq 0,\ i = 1, 2, 3, ... n$.
First, define the following Lagrange function.
L(X, α) = f(X) + \sum_{i=1}^{n}α_ig_i(X)
This inequality-constrained optimization problem then reduces to finding the $(\tilde{X}, \tilde{α})$ that satisfies the following four conditions (known as the KKT conditions) for the Lagrange function.
\frac{\partial L(X, α)}{\partial X}=0\\
\frac{\partial L(X, α)}{\partial α_i} = g_i(X)\leqq 0, \quad (i=1, 2,... n)\\
0 \leqq α_i, \quad (i = 1,2, ...n)\\
α_ig_i(X) = 0, \quad (i = 1, 2,...n)
In this way, instead of solving the optimization problem directly, Lagrange's method of undetermined multipliers lets us solve it through another set of equations. This alternative formulation is the dual problem mentioned earlier.
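As a toy illustration of these four conditions (my own example, separate from the SVM itself), the sketch below applies them to minimizing $f(x_1, x_2) = x_1^2 + x_2^2$ subject to the single constraint $g(x) = 1 - x_1 - x_2 \leq 0$, using sympy to solve the stationarity and complementarity equations.

```python
import sympy as sp

x1, x2, a = sp.symbols('x1 x2 alpha', real=True)

f = x1**2 + x2**2   # objective function
g = 1 - x1 - x2     # constraint g(x) <= 0
L = f + a * g       # Lagrange function

# Stationarity (dL/dx = 0) and complementarity (alpha * g = 0)
eqs = [sp.diff(L, x1), sp.diff(L, x2), a * g]
candidates = sp.solve(eqs, [x1, x2, a], dict=True)

# Keep only the candidates that also satisfy g <= 0 and alpha >= 0
solutions = [s for s in candidates if g.subs(s) <= 0 and s[a] >= 0]
print(solutions)  # expect x1 = x2 = 1/2, alpha = 1
```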
Now let's apply Lagrange's undetermined multiplier method to the soft margin equation of the support vector machine.
The objective function is:
min_{W, \xi}\Bigl\{\frac{1}{2}||W||^2 + C\sum_{i=1}^{N} \xi_i\Bigr\}
The inequality constraints are:
t_i(W^TX_i + b)\geq 1 - \xi_i \quad \xi_i \geq 0 \quad i = 1, 2,...N
This time, each of the N data points has two inequality constraints, so writing the Lagrange multipliers as α and β, the Lagrange function is as follows.
L(W,b,\xi,α,β)=\frac{1}{2}||W||^2 + C\sum_{i=1}^{N} \xi_i-\sum_{i=1}^{N}α_i\bigl\{t_i(W^TX_i+b)-1+\xi_i\bigl\}-\sum_{i=1}^{N}β_i\xi_i
At the solution of this optimization problem, the following conditions hold:
\frac{\partial L(W,b,\xi,α,β)}{\partial W}= W - \sum_{i=1}^{N}α_it_iX_i=0\\
\frac{\partial L(W,b,\xi,α,β)}{\partial b}= -\sum_{i=1}^{N}α_it_i = 0\\
\frac{\partial L(W,b,\xi,α,β)}{\partial \xi_i} = C - α_i -β_i = 0
The following is a summary of these three formulas.
W =\sum_{i=1}^{N}α_it_iX_i\\
\sum_{i=1}^{N}α_it_i = 0\\
C = α_i + β_i
Substituting these three formulas into the Lagrange function and grinding through the algebra, we obtain an expression in which only the variable α remains, as shown below.
\tilde{L}(α) = \sum_{i=1}^{N}α_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}α_iα_jt_it_j{X_i}^TX_j
Also, since each $α_i$ is at least 0 and, from $C = α_i + β_i$ with $β_i \geq 0$, at most C, the dual problem is to find the α that satisfies the following conditions.
max_{α}\Bigl\{\tilde{L}(α) = \sum_{i=1}^{N}α_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}α_iα_jt_it_j{X_i}^TX_j\Bigr\}\\
\sum_{i=1}^{N}α_it_i = 0, \quad 0 \leqq α_i \leqq C, i = 1,2,...N
In this way, we have derived the dual problem of the soft-margin support vector machine.
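As a rough numerical sketch of what this dual problem looks like, the snippet below solves it for a tiny, made-up 2-D dataset with a general-purpose optimizer from scipy. Real SVM solvers use specialized methods (such as SMO), so treat this purely as an illustration of the formula; the data, C, and tolerances are my own choices.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny made-up dataset: two points per class in 2-D
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [0.5, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0

K = X @ X.T  # Gram matrix of inner products X_i^T X_j

def neg_dual(alpha):
    # Negative of L~(alpha), because scipy minimizes rather than maximizes
    return -(alpha.sum() - 0.5 * np.sum(np.outer(alpha * t, alpha * t) * K))

res = minimize(neg_dual, np.zeros(len(t)),
               bounds=[(0.0, C)] * len(t),
               constraints={'type': 'eq', 'fun': lambda a: a @ t})
alpha = res.x

# Recover W from the stationarity condition, and b from an interior support vector
W = (alpha * t) @ X
sv = (alpha > 1e-6) & (alpha < C - 1e-6)
b = np.mean(t[sv] - X[sv] @ W)
print('alpha:', alpha.round(3), 'W:', W.round(3), 'b:', round(b, 3))
```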
Now, let's summarize the kernel method, which is one of the methods to easily solve this dual problem.
Let's explain the kernel method.
Now let's take a look at the quote from Wikipedia.
The kernel method is one of the methods used in pattern recognition, and is used in combination with algorithms such as classifiers. A well-known example is its combination with the support vector machine. The purpose of pattern recognition is generally to find and study the structure of data (e.g. clusters, rankings, principal components, correlations, classifications). To achieve this goal, the kernel method maps the data onto a high-dimensional feature space. Each coordinate of the feature space corresponds to one feature of the data elements, and the set of data is transformed into a set of points in a Euclidean space by the mapping to the feature space (the feature map). Various methods are used in combination with the kernel method to analyze the structure of the data in the feature space. A wide variety of maps can be used as feature maps (generally nonlinear maps are used), and correspondingly a wide variety of structures can be found in the data.
You can think of the kernel method as a way of mapping low-dimensional data into a higher-dimensional space so that it can be separated there.
Strictly speaking this is not exact, but a rough understanding is fine here.
Now let's see why the kernel method is used in support vector machines.
Consider the case of classifying the following two types of data.
In the case of such two-dimensional data, it is not possible to separate two types of data with a one-dimensional straight line.
Let's extend this data to multidimensional data to address this linearly inseparable problem.
Specifically, to expand the two-dimensional data $X = (x_1, x_2)$ into five dimensions, we map it through a function such as $ψ(X) = (x_1, x_2, x_1^2, x_2^2, x_1x_2)$ (one possible choice of mapping).
The higher-dimensional space into which the data is expanded in this way is called the (high-dimensional) feature space, while the space of the original input data is called the input space.
Let's generalize the above formula. The function that maps the data in the n-dimensional input space to the higher-dimensional r-dimensional feature space is defined as follows.
ψ(X) = (φ_1(X), φ_2(X), φ_3(X), ...φ_r(X))
Functions such as $φ_1(X)$ combine the components of the original data to produce new features.
When the data is expanded into a high-dimensional feature space with such a function, at some point it becomes separable by a hyperplane. In the extreme case, if each data point is given its own dimension, that is, N data points are mapped into an N-dimensional space, they can always be separated by an (N-1)-dimensional hyperplane.
In other words, the data becomes linearly separable.
After that, by mapping this separating hyperplane back into the input space, we obtain the boundary that separates the data there (strictly speaking, a surface whose dimension is one less than that of the input space, rather than a curve), as the sketch below illustrates.
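Here is a minimal illustration of this idea with my own toy example: one-dimensional data whose classes are interleaved, so no single threshold separates them, becomes linearly separable once each point x is mapped to the two-dimensional feature (x, x²).

```python
import numpy as np

# 1-D data: the two classes are interleaved, so no single threshold separates them
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1, 0, 0, 1])

# Map each point into a 2-D feature space: psi(x) = (x, x^2)
psi = np.column_stack([x, x**2])
print(psi)

# In feature space the classes are split by the horizontal line x^2 = 2.5,
# which maps back to the two thresholds |x| = sqrt(2.5) in the input space.
print(psi[:, 1] > 2.5)  # [ True False False  True ] matches the labels y
```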
Now let's consider the formula of the optimization problem in the high-dimensional feature space.
Before that, let's review the optimization problem in the input space.
max_{α}\Bigl\{\tilde{L}(α) = \sum_{i=1}^{N}α_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}α_iα_jt_it_j{X_i}^TX_j\Bigr\}\\
\sum_{i=1}^{N}α_it_i = 0, \quad 0 \leqq α_i \leqq C, i = 1,2,...N
The data in the high-dimensional feature space is obtained by mapping the input-space data $X_i$ and $X_j$ with the function $ψ(X)$, so the optimization problem in the high-dimensional feature space becomes the following.
max_{α}\Bigl\{\tilde{L}(α) = \sum_{i=1}^{N}α_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}α_iα_jt_it_j{ψ(X_i)}^Tψ(X_j)\Bigr\}\\
\sum_{i=1}^{N}α_it_i = 0, \quad 0 \leqq α_i \leqq C, i = 1,2,...N
This is the optimization problem we need to solve in the high-dimensional feature space.
The problematic part is this term:
{ψ(X_i)}^Tψ(X_j)
The higher the dimension of the feature space, the more enormous the cost of computing this term becomes.
The technique that simplifies the computation of this part is called the kernel trick.
Define the kernel function as follows:
K(X_i, X_j) = {ψ(X_i)}^Tψ(X_j)
It may seem a little magical, but this kernel function lets us compute the inner product without ever computing $ψ(X)$ directly.
For this to be possible, the kernel function has to satisfy certain conditions; I do not fully understand them myself and they are difficult to explain here, so I will only point to some helpful references.
Mercer's theorem, [positive definite kernel](http://ibisforest.org/index.php?%E6%AD%A3%E5%AE%9A%E5%80%A4%E3%82%AB%E3%83%BC%E3%83%8D%E3%83%AB)
This method is very useful because $ ψ (X) $ only appears in the form of an inner product in the dual problem.
The following three kernel functions are commonly used.
Gaussian kernel (RBF kernel)
K(X_i, X_j) = exp\bigl\{-\frac{||X_i -X_j||^2}{2σ^2}\bigr\}
Polynomial kernel
K(X_i, X_j) = (X_i^TX_j + c)^d
Sigmoid kernel
K(X_i, X_j) = tanh(bX_i^TX_j + c)
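Written as plain NumPy functions, these three kernels look as follows. The parameter values σ, c, d, and b are hyperparameters you would choose yourself; the two test vectors are arbitrary.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def polynomial_kernel(x, y, c=1.0, d=3):
    # (x^T y + c)^d
    return (x @ y + c) ** d

def sigmoid_kernel(x, y, b=1.0, c=0.0):
    # tanh(b x^T y + c)
    return np.tanh(b * (x @ y) + c)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(gaussian_kernel(x, y), polynomial_kernel(x, y), sigmoid_kernel(x, y))
```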
Now let's look at a concrete example where the inner product can be easily calculated by the polynomial kernel.
Consider a function that maps a 2D input space to a 3D feature space as shown below.
ψ(X) = ψ(x_1, x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)
Using this function, the two 2D vectors X and Y are as follows.
ψ(X) = ψ(x_1, x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)\\
ψ(Y) = ψ(y_1, y_2) = (y_1^2, \sqrt{2}y_1y_2, y_2^2)
Let's consider the inner product of these.
\begin{align}
ψ(X)^Tψ(Y) & = (x_1^2, \sqrt{2}x_1x_2, x_2^2)^T(y_1^2, \sqrt{2}y_1y_2, y_2^2)\\
&=x_1^2y_1^2 + 2x_1y_1x_2y_2 + x_2^2y_2^2\\
&= (x_1y_1 + x_2y_2)^2\\
&=((x_1,x_2)^T(y_1,y_2))^2\\
&=(X^TY)^2
\end{align}
In this way, $ψ(X)^Tψ(Y)$ can be computed simply by squaring the inner product of the original vectors, without ever computing $ψ(X)$ or $ψ(Y)$ directly.
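We can check this identity numerically with a short NumPy snippet; the two vectors below are arbitrary made-up values.

```python
import numpy as np

def psi(v):
    # psi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

X = np.array([1.0, 2.0])
Y = np.array([3.0, -1.0])

lhs = psi(X) @ psi(Y)   # inner product in the 3-D feature space
rhs = (X @ Y) ** 2      # squared inner product in the original 2-D space
print(lhs, rhs)         # both are 1.0 for these vectors
```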
If you want to know more about the kernel method, please refer to the article here.
Now, let's summarize the implementation of the support vector machine.
We will implement an SVM that separates linearly separable data.
The data used is the iris dataset.
iris data is data of a flower variety called iris.
There are 50 samples for each of the three iris varieties, Setosa, Versicolor, and Virginica, for a total of 150 samples.
Let's actually look at the contents.
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(iris_df.head())
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
Since the column names are stored in iris.feature_names
, you can produce the output above by passing it to the columns argument of the pandas DataFrame.
The sepal length (cm) column stores the sepal length, sepal width (cm) the sepal width, petal length (cm) the petal length, and petal width (cm) the petal width.
You can display the correct label as shown below.
print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
In this way, the iris varieties setosa
, versicolor
, and virginica
are set to 0, 1, and 2, respectively.
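If you want to confirm this mapping and the class counts yourself, the dataset also carries the variety names (this assumes the iris object loaded above):

```python
import numpy as np

print(iris.target_names)         # ['setosa' 'versicolor' 'virginica']
print(np.bincount(iris.target))  # [50 50 50]
```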
This is the end of the explanation of iris data.
Let's create a dataset with the following code.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
import mglearn
iris = load_iris()
X = iris.data[:100, 2:]
Y = iris.target[:100]
print(X.shape)
print(Y.shape)
(100, 2)
(100,)
This time, we will classify using the data of petal length
and petal width
of setosa
and versicolor
.
Draw the data with the following code.
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.legend(['setosa', 'versicolor'], loc='best')
plt.show()
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
takes the x-axis values as its first argument, the y-axis values as its second, and the correct labels as its third, and draws a scatter plot.
loc='best'
places the legend where it does not get in the way of the plot.
From the data above, you can clearly see that it can be separated by a straight line. It's rather too easy.
Let's create a model with the following code.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=0)
svm = LinearSVC()
svm.fit(X_train, Y_train)
The model creation itself ends with this code. It's easy.
Let's illustrate what the model looks like with the code below.
plt.figure(figsize=(10, 6))
mglearn.plots.plot_2d_separator(svm, X)
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(['setosa', 'versicolor'], loc='best')
plt.show()
You can see that the boundaries that separate the data are created.
The part mglearn.plots.plot_2d_separator (svm, X)
is a little confusing, so I will explain it. Let's check the definition code.
plot_2d_separator(classifier, X, fill=False, ax=None, eps=None, alpha=1,cm=cm2, linewidth=None, threshold=None,linestyle="solid"):
It is a function that draws a boundary line when you pass the classification model as the first argument and the original data as the second argument.
This concludes the implementation of the svm model for linear separable problems.
This time we will deal with the issue of soft margins.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
import mglearn
iris = load_iris()
X = iris.data[50:, 2:]
Y = iris.target[50:] - 1
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.legend(['versicolor', 'virginica'], loc='best')
plt.show()
Now we are plotting the petal length
and petal width
data for versicolor
and virginica
.
Achieving a perfect linear separation is clearly impossible for this data.
Here is a review of the soft margin formula. For derivation, refer to the article here.
min_{W, \xi}\Bigl\{\frac{1}{2}||W||^2 + C\sum_{i=1}^{N} \xi_i\Bigr\} \quad \quad
t_i(W^TX_i + b)\geq 1 - \xi_i\\
\xi_i = max\Bigl\{0, M - \frac{t_i(W^TX_i + b)}{||W||}\Bigr\}\\
i = 1, 2, 3, ... N
Since some of the data falls inside the margin, we relaxed the constraint with the term $C\sum_{i=1}^{N}\xi_i$.
The value of C defaults to 1.0 in sklearn
. Let's change this number and see how the figure changes. The following code defines a function that plots the decision boundary of the model passed as an argument.
def make_separate(model):
    # Draw the decision boundary of the model passed as an argument
    mglearn.plots.plot_2d_separator(model, X)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
    plt.xlabel('petal length')
    plt.ylabel('petal width')
    plt.legend(['versicolor', 'virginica'], loc='best')
    plt.show()
Let's draw the figure with the following code. Let C = 0.1
.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=0)
svm = LinearSVC(C=0.1)
svm.fit(X_train, Y_train)
make_separate(svm)
print(svm.score(X_test, Y_test))
0.96
Next is C = 1.0
.
svm = LinearSVC(C=1.0)
svm.fit(X_train, Y_train)
make_separate(svm)
print(svm.score(X_test, Y_test))
1.0
Next is C = 100
.
svm = LinearSVC(C=100)
svm.fit(X_train, Y_train)
make_separate(svm)
print(svm.score(X_test, Y_test))
1.0
It is important to choose an appropriate C. A good approach is to try several values and observe how the boundary and the score change.
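One systematic way to do this is cross-validation. Here is a sketch using GridSearchCV with the X_train and Y_train from above; the candidate values for C are arbitrary choices of mine.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Search over a few candidate values of C with 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(LinearSVC(), param_grid, cv=5)
grid.fit(X_train, Y_train)
print(grid.best_params_, grid.best_score_)
```

grid.best_params_ then tells you which C performed best under cross-validation.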
This is the end of the soft margin implementation.
Now, let's summarize the implementation when the kernel method is used and the implementation when it is not used.
This time, we will classify problems that are not linearly separable without using the kernel method.
Here, the method that does not use the kernel function is defined as not using the kernel method.
Let's prepare the data with the following code and illustrate it.
import mglearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
moons = make_moons(n_samples=300, noise=0.2, random_state=0)
X = moons[0]
Y = moons[1]
plt.figure(figsize=(12, 8))
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.plot()
plt.show()
make_moons
is a function that creates two-dimensional data shaped like two interleaving half moons.
You can set the number of samples and the noise level.
As you can see from the figure, it is obviously not linearly separable.
To transform this non-linearly separable data into linearly separable data, let's map the data in this input space to the data in the higher dimensional feature space.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=0)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
Now you can map the data in the input space to the higher dimensional feature space.
Let's check what kind of data was mapped.
print(poly.get_feature_names())
print(X_train_poly.shape)
['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
(225, 6)
In this way, the 2D input space is extended to the 6D feature space.
Standardize the data with the following code.
scaler = StandardScaler()
X_train_poly_scaled = scaler.fit_transform(X_train_poly)
X_test_poly_scaled = scaler.transform(X_test_poly)
Data standardization is to set the mean of the data to 0 and the variance to 1 by subtracting the mean for all the data and then dividing by the standard deviation.
I wrote it in an easy-to-understand article here, so please refer to it.
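As a quick check of what StandardScaler is doing, the same transformation can be written by hand in NumPy, using the training statistics (this assumes the X_train_poly and X_train_poly_scaled arrays from above):

```python
import numpy as np

# Recompute the standardization by hand, using the training statistics
mean = X_train_poly.mean(axis=0)
std = X_train_poly.std(axis=0)
std[std == 0] = 1.0  # guard for constant columns, such as the bias column '1'

X_train_scaled_by_hand = (X_train_poly - mean) / std
print(np.allclose(X_train_scaled_by_hand, X_train_poly_scaled))  # expected: True
```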
Now let's implement and evaluate the model with the following code.
lin_svm = LinearSVC()
lin_svm.fit(X_train_poly_scaled, Y_train)
print(lin_svm.score(X_test_poly_scaled, Y_test))
0.84
It's a little low. Let's map it to a higher dimension.
However, the process of mapping to a higher dimension and standardizing is troublesome, so let's use something called Pipeline
.
poly_scaler_svm = Pipeline([
('poly', PolynomialFeatures(degree=3)),
('scaler', StandardScaler()),
('svm', LinearSVC())
])
poly_scaler_svm.fit(X_train, Y_train)
print(poly_scaler_svm.score(X_test, Y_test))
0.9733333333333334
In this way, Pipeline
can be used to simplify the task of mapping data to a higher dimension, standardizing it, and putting it in the svm model. By setting degree = 3
, it is mapped to a higher dimensional feature space.
The accuracy is pretty good. It is quite effective when mapped to a higher dimension.
Next, let's draw this figure. Below is the code.
plt.figure(figsize=(12, 8))
_x0 = np.linspace(-1.5, 2.7, 100)
_x1 = np.linspace(-1.5, 1.5, 100)
x0, x1 = np.meshgrid(_x0, _x1)
X_grid = np.hstack((x0.ravel().reshape(-1, 1), x1.ravel().reshape(-1, 1)))
y_decision = poly_scaler_svm.decision_function(X_grid).reshape(x0.shape)
plt.contourf(x0, x1, y_decision, levels=[y_decision.min(), 0, y_decision.max()], alpha=0.3)
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.show()
You can see that the lines are pretty clean. Let's explain the code.
_x0 = np.linspace(-1.5, 2.7, 100)
_x1 = np.linspace(-1.5, 1.5, 100)
x0, x1 = np.meshgrid(_x0, _x1)
The grid points are created with the code here. Please refer to the article here for easy understanding.
np.linspace
creates a numpy array by specifying the start point in the first argument, the end point in the second argument, and the number of points in the third argument. By passing it to np.meshgrid
, we are creating 100x100 grid points.
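A tiny example makes the shapes easier to see; the numbers here are arbitrary.

```python
import numpy as np

a = np.linspace(0.0, 1.0, 3)  # array([0. , 0.5, 1. ])
b = np.linspace(0.0, 2.0, 3)  # array([0., 1., 2.])
g0, g1 = np.meshgrid(a, b)

print(g0.shape, g1.shape)     # (3, 3) (3, 3)
print(g0)
# [[0.  0.5 1. ]
#  [0.  0.5 1. ]
#  [0.  0.5 1. ]]
```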
X_grid = np.hstack((x0.ravel().reshape(-1, 1), x1.ravel().reshape(-1, 1)))
x0.ravel() flattens the 100 × 100 array into a one-dimensional array, reshape(-1, 1) turns it into a 10000 × 1 column vector, and np.hstack concatenates the two columns horizontally along axis=1, so X_grid is a 10000 × 2 matrix.
y_decision = poly_scaler_svm.decision_function(X_grid).reshape(x0.shape)
plt.contourf(x0, x1, y_decision, levels=[y_decision.min(), 0, y_decision.max()], alpha=0.3)
The signed distance between each of the 10000 grid points and the separating hyperplane is computed by poly_scaler_svm.decision_function(X_grid)
and reshaped into a 100 × 100 array.
plt.contourf
is a function that illustrates contour lines, and you can specify in levels
where to change the color.
This completes the implementation that does not use the kernel method.
Now, let's implement it using the kernel method.
Let's prepare the data. So far the same.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
moons = make_moons(n_samples=300, noise=0.2, random_state=0)
X = moons[0]
Y = moons[1]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=0)
Let's implement the model with the following code.
karnel_svm = Pipeline([
('scaler', StandardScaler()),
('svm', SVC(kernel='poly', degree=3, coef0=1))
])
karnel_svm.fit(X_train, Y_train)
By passing kernel='poly'
to SVC
you select the polynomial kernel, and degree=3
sets its degree; you can think of this as implicitly mapping the data into the feature space of all polynomial terms up to degree 3. coef0=1 corresponds to the constant c in the polynomial kernel formula.
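For comparison, here is a sketch of the same pipeline with the Gaussian (RBF) kernel introduced earlier; gamma plays the role of $\frac{1}{2σ^2}$, and the value below is just an arbitrary starting point.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rbf_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', gamma=1.0, C=1.0))  # RBF kernel instead of poly
])
rbf_svm.fit(X_train, Y_train)
print(rbf_svm.score(X_test, Y_test))
```

In what follows, though, we will stick with the polynomial-kernel model karnel_svm.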
You have now created a model. Next, let's illustrate this model. I do the same thing again, but it's a hassle, so I'll make it a function.
def plot_decision_function(model):
_x0 = np.linspace(-1.7, 2.7, 100)
_x1 = np.linspace(-1.5, 1.7, 100)
x0, x1 = np.meshgrid(_x0, _x1)
X = np.hstack((x0.ravel().reshape(-1, 1), x1.ravel().reshape(-1, 1)))
y_decision = model.decision_function(X).reshape(x0.shape)
plt.contourf(x0, x1, y_decision, levels=[y_decision.min(), 0, y_decision.max()], alpha=0.3)
def plot_dataset(x, y):
plt.plot(x[:, 0][y == 0], x[:, 1][y == 0], 'bo', ms=15)
plt.plot(x[:, 0][y == 1], x[:, 1][y == 1], 'r^', ms=15)
plt.xlabel('$x_1$', fontsize=20)
plt.ylabel('$x_2$', fontsize=20, rotation=0)
plt.figure(figsize=(12, 8))
plot_decision_function(karnel_svm)
plot_dataset(X, Y)
plt.show()
I could have plotted it with mglearn
, but this time I plotted it with plt.plot
. The ones with Y = 0
are drawn with blue circles, and the ones with Y = 1
are drawn with red triangles.
As you can see, the same result is returned with or without the kernel method. However, the kernel method is much easier to calculate internally, so I think it is better to use the kernel method as much as possible.
Please refer to the article here for how to make it easier.
Thank you for staying with me through this very long article, and thank you for your hard work reading this far.