This time, I will summarize the theory behind the support vector machine, one of the standard machine learning algorithms.
I hope you will bear with me to the end.
Let's first summarize the theory of support vector machines.
The support vector machine (SVM) is a machine learning algorithm that is widely used in data analysis because of its good generalization performance and broad range of applications.
It is mainly used for binary classification problems based on the idea of margin maximization, but it can also be applied to multi-class classification and regression problems.
Its weakness is that it does not scale well to large datasets, because its computational cost is high compared to other machine learning algorithms.
A margin that assumes linearly separable data (data that can be split into two groups by a single straight line) is called a hard margin, while a margin that tolerates some misclassification for data that is not linearly separable is called a soft margin.
I wrote above that linearly separable means "can be divided into two by a single straight line", but that only applies to two-dimensional data. Generalizing the concept: a set of data in n-dimensional space is said to be linearly separable if it can be separated by an (n-1)-dimensional hyperplane.
Data on a two-dimensional plane is linearly separable when it can be classified by a one-dimensional line, and data in three-dimensional space is linearly separable when it can be classified by a two-dimensional plane.
The (n-1)-dimensional surface (not strictly a plane) that classifies n-dimensional data in this way is called the separating hyperplane, and the distance between the separating hyperplane and the data closest to it is called the margin. The goal of this algorithm is to maximize this margin.
Also, the data closest to the separating hyperplane is called a support vector. This is illustrated below.
It is intuitively clear that accuracy improves by creating the hyperplane that maximizes the margin shown in the figure.
This time, the data is represented in two dimensions for the sake of illustration, but think of the data in n-dimensional space as being divided by an n-1 dimensional hyperplane.
In a two-dimensional vector space, a straight line that divides the two classes of data can be written as $ax + by + c = 0$, as in the figure above, and by adjusting the parameters $a, b, c$ we can represent every straight line.
This time we are assuming a hyperplane in an n-dimensional vector space, so the hyperplane is given by the following formula. Now, consider the case where there are N data points in total.

W^TX_i + b = 0
Now let's use this hyperplane equation to derive the equation used to optimize the hard margin (linearly separable problem).
Expanding $W^TX_i + b = 0$ gives $w_1x_1 + w_2x_2 + ... + w_nx_n + b = 0$, and you can see intuitively that this is the two-dimensional line equation $ax + by + c = 0$ extended to n dimensions.
If the triangular data points in the figure belong to the set $K_1$ and the star-shaped data points belong to the set $K_2$, we can see that the following inequalities hold.
W^TX_i + b > 0 \quad (X_i \in K_1)\\
W^TX_i + b < 0 \quad (X_i \in K_2)
To write these two conditions as one, introduce a label variable t.
Let $t_i = 1$ when the i-th data point $X_i$ belongs to class 1 and $t_i = -1$ when it belongs to class 2.
t_i = \left\{
\begin{array}{ll}
1 & (X_i \in K_1) \\
-1 & (X_i \in K_2)
\end{array}
\right.
Using the $t_i$ defined in this way, the condition can be expressed as follows.

t_i(W^TX_i + b) > 0 \quad (i = 1, 2, 3, ...N)

In this way, the condition can be written in a single line.
The margin is the distance between a point in n-dimensional space and a hyperplane, so let's review the distance between a point and a straight line. The distance between a two-dimensional point and a straight line is expressed by the following formula, where $ A (x_0, y_0) $ is the point and $ l: ax + by + c = 0 $ is the straight line.
d = \frac{|ax_0 + by_0 + c|}{\sqrt{a^2+b^2}}
The distance between one point in n-dimensional space and the hyperplane is expressed by the following formula.
d = \frac{|w_1x_1 + w_2x_2 + ... + w_nx_n + b|}{\sqrt{w_1^2+w_2^2+...+w_n^2}} = \frac{|W^TX_i + b|}{||W||}
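As a quick numerical check of this distance formula, here is a minimal NumPy sketch. The hyperplane parameters and the point below are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical hyperplane parameters W, b and a data point X (made-up values)
W = np.array([3.0, 4.0])   # normal vector of the hyperplane
b = -5.0                   # bias term
X = np.array([2.0, 1.0])   # a point in the space

# d = |W^T X + b| / ||W||
d = np.abs(W @ X + b) / np.linalg.norm(W)
print(d)  # |3*2 + 4*1 - 5| / 5 = 1.0
```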
Therefore, the condition for maximizing the margin M from the above equations is expressed by the following equation.
max_{W, b}M, \quad \frac{t_i(W^TX_i + b)}{||W||} \geq M \quad (i = 1, 2, 3, ...N)
This may not be immediately clear, so let me explain. For any data point $X_i$, the distance from the separating hyperplane is $\frac{|W^TX_i + b|}{||W||}$, and multiplying by the label $t_i$ lets us drop the absolute value, because $t_i$ and $W^TX_i + b$ have the same sign for correctly classified data.
The constraint therefore says that every data point lies at a distance of at least M from the hyperplane.
Therefore, finding the W, b, and M that satisfy this formula while making M as large as possible optimizes the support vector machine.
Here, divide both sides of $\frac{t_i(W^TX_i + b)}{||W||} \geq M$ by M and introduce the following new variables.
\frac{W}{M||W||} = \tilde{W}\\
\frac{b}{M||W||} = \tilde{b}
Then, the conditional expression of the optimization problem is expressed as follows.
t_i(\tilde{W^T}X_i + \tilde{b}) \geq 1
The above inequality holds for all data, and the equality holds exactly for the data closest to the separating hyperplane.
In other words, the rescaled margin $\tilde{M}$ is expressed by the following formula.
\tilde{M} = \frac{t_i(\tilde{W^T}X_i + \tilde{b})}{||\tilde{W}||} = \frac{1}{||\tilde{W}||}
With this equation transformation, the optimization problem becomes as follows.
max_{\tilde{W}, \tilde{b}}\frac{1}{||\tilde{W}||}, \quad t_i(\tilde{W^T}X_i + \tilde{b}) \geq 1 \quad (i = 1, 2, 3, ...N)
It is getting fairly involved, but bear with me a little longer.
The tildes were introduced by the change of variables along the way, but let's drop them from here on to keep the notation simple. In addition, maximizing $\frac{1}{||W||}$ is equivalent to minimizing $||W||$, and hence to minimizing $\frac{1}{2}||W||^2$ (the square and the factor of one half make the later differentiation cleaner), so the optimization problem becomes the following.
min_{W, b}\frac{1}{2}||W||^2, \quad t_i(W^TX_i + b)\geq 1 \quad (i = 1, 2, 3, ...N)
Solving the above problem, that is, finding the W and b that minimize $\frac{1}{2}||W||^2$ under these constraints, gives the maximum-margin separating hyperplane.
However, this condition can only solve linearly separable problems. That is, it can only be applied to hard margins.
Let's relax the constraints so that this formulation can also be applied to soft margins.
By loosening the constraint $t_i(W^TX_i + b) \geq 1$ in the above equation, we can deal with data that is not linearly separable (the soft margin case).
Illustrated below.
Consider the problem of linear separability as shown in this figure. As shown by the red arrow in the figure, the data has entered the inside of the margin.
Given the discussion so far, the support vectors (the data closest to the hyperplane) naturally lie on the hyperplanes satisfying $t_i(W^TX_i + b) = 1$.
The data indicated by the red arrow in the figure does not satisfy $t_i(W^TX_i + b) \geq 1$, but it may still satisfy a looser condition such as $t_i(W^TX_i + b) \geq 0.5$.
Therefore, let's relax the constraint by introducing a slack variable $\xi$, defined as follows.
t_i(W^TX_i + b)\geq 1 - \xi_i \\
\xi_i = max\Bigl\{0, M - \frac{t_i(W^TX_i + b)}{||W||}\Bigr\}
From the above formulas, the constraint is loosened only for data that falls inside the margin.
By introducing this slack variable, the margin optimization problem becomes as follows.
min_{W, \xi}\Bigl\{\frac{1}{2}||W||^2 + C\sum_{i=1}^{N} \xi_i\Bigr\} \quad \mathrm{subject\ to} \quad
t_i(W^TX_i + b)\geq 1 - \xi_i\\
\xi_i = max\Bigl\{0, M - \frac{t_i(W^TX_i + b)}{||W||}\Bigr\}\\
i = 1, 2, 3, ... N
Maximizing the margin, that is, minimizing $\frac{1}{2}||W||^2$, and keeping the total slack $\sum_{i=1}^{N}\xi_i$ small are competing goals, and the constant C controls the trade-off between them.
C is a hyperparameter, and we build the model while adjusting it.
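To make the role of C concrete, here is a minimal NumPy sketch (my own toy example, not part of the derivation above) that evaluates the soft-margin objective for a few values of C. The data, labels, W, and b are all made-up values, and the slack is written in the rescaled units where the margin is 1, i.e. $\xi_i = max\{0, 1 - t_i(W^TX_i + b)\}$, which matches the constraint above.

```python
import numpy as np

# Made-up 2-D data, labels in {-1, +1}, and candidate parameters W, b
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [0.5, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
W = np.array([0.5, 0.5])
b = -1.2

def soft_margin_objective(W, b, X, t, C):
    # Slack in the rescaled units: xi_i = max(0, 1 - t_i (W^T X_i + b))
    xi = np.maximum(0.0, 1.0 - t * (X @ W + b))
    return 0.5 * np.dot(W, W) + C * xi.sum()

# A larger C penalizes margin violations more heavily
for C in [0.1, 1.0, 100.0]:
    print(C, soft_margin_objective(W, b, X, t, C))
```

The larger C is, the more the objective is dominated by the margin violations, which is why a very large C behaves almost like a hard margin.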
So far, we have derived the formulas for the optimization problem in hard and soft margins. It is summarized below.
min_{W, b}\frac{1}{2}||W||^2, \quad t_i(W^TX_i + b)\geq 1 \quad (i = 1, 2, 3, ...N)
min_{W, \xi}\Bigl\{\frac{1}{2}||W||^2 + C\sum_{i=1}^{N} \xi_i\Bigr\} \quad \quad
t_i(W^TX_i + b)\geq 1 - \xi_i\\
\xi_i = max\Bigl\{0, M - \frac{t_i(W^TX_i + b)}{||W||}\Bigr\}\\
i = 1, 2, 3, ... N
The hard margin applies to linearly separable problems, and the soft margin applies to problems that are not linearly separable.
Now let's think about solving the optimization problem.
When solving this optimization problem, it is rare to solve the above equation directly.
The above formulation is called the primal problem of the optimization. In many cases, instead of solving the primal problem directly, we convert it into an equivalent formulation called the dual problem and solve that instead.
Now let's use Lagrange's undetermined multiplier method to solve this optimization problem.
Please refer to the article here for the Lagrange's undetermined multiplier method.
Please note that I do not fully understand it myself, so some parts may lack rigor. I will explain it briefly.
Lagrange's undetermined multiplier method is a typical method for constrained optimization problems.
Consider minimizing an objective function $f(X)$ subject to n inequality constraints $g_i(X) \leq 0,\ i = 1, 2, 3, ... n$.
First, define the following Lagrange function.
L(X, α) = f(X) + \sum_{i=1}^{n}α_ig_i(X)
This inequality-constrained optimization problem then reduces to finding the $(\tilde{X}, \tilde{α})$ that satisfies the following four conditions (known as the KKT conditions) for the Lagrange function.
\frac{\partial L(X, α)}{\partial X}=0\\
\frac{\partial L(X, α)}{\partial α_i} = g_i(X)\leqq 0, \quad (i=1, 2,... n)\\
0 \leqq α_i, \quad (i = 1,2, ...n)\\
α_ig_i(X) = 0, \quad (i = 1, 2,...n)
In this way, instead of solving the optimization problem directly, Lagrange's method of undetermined multipliers lets us solve it through another set of equations. This alternative formulation is the dual problem mentioned earlier.
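As a toy illustration of these four conditions (my own example, separate from the SVM itself), the sketch below applies them to minimizing $f(x_1, x_2) = x_1^2 + x_2^2$ subject to the single constraint $g(x) = 1 - x_1 - x_2 \leq 0$, using sympy to solve the stationarity and complementarity equations.

```python
import sympy as sp

x1, x2, a = sp.symbols('x1 x2 alpha', real=True)

f = x1**2 + x2**2   # objective function
g = 1 - x1 - x2     # constraint g(x) <= 0
L = f + a * g       # Lagrange function

# Stationarity (dL/dx = 0) and complementarity (alpha * g = 0)
eqs = [sp.diff(L, x1), sp.diff(L, x2), a * g]
candidates = sp.solve(eqs, [x1, x2, a], dict=True)

# Keep only the candidates that also satisfy g <= 0 and alpha >= 0
solutions = [s for s in candidates if g.subs(s) <= 0 and s[a] >= 0]
print(solutions)  # expect x1 = x2 = 1/2, alpha = 1
```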
Now let's apply Lagrange's undetermined multiplier method to the soft margin equation of the support vector machine.
The objective function is:
min_{W, \xi}\Bigl\{\frac{1}{2}||W||^2 + C\sum_{i=1}^{N} \xi_i\Bigr\}
The inequality constraints are:
t_i(W^TX_i + b)\geq 1 - \xi_i \quad \xi_i \geq 0 \quad i = 1, 2,...N
This time, each of the N data points has two inequality constraints, so writing the Lagrange multipliers as α and β, the Lagrange function is as follows.
L(W,b,\xi,α,β)=\frac{1}{2}||W||^2 + C\sum_{i=1}^{N} \xi_i-\sum_{i=1}^{N}α_i\bigl\{t_i(W^TX_i+b)-1+\xi_i\bigl\}-\sum_{i=1}^{N}β_i\xi_i
At the solution of this optimization problem, the following conditions hold:
\frac{\partial L(W,b,\xi,α,β)}{\partial W}= W - \sum_{i=1}^{N}α_it_iX_i=0\\
\frac{\partial L(W,b,\xi,α,β)}{\partial b}= -\sum_{i=1}^{N}α_it_i = 0\\
\frac{\partial L(W,b,\xi,α,β)}{\partial \xi_i} = C - α_i -β_i = 0
The following is a summary of these three formulas.
W =\sum_{i=1}^{N}α_it_iX_i\\
\sum_{i=1}^{N}α_it_i = 0\\
C = α_i + β_i
Substituting these three formulas into the Lagrange function and grinding through the algebra, we obtain an expression in which only the variable α remains, as shown below.
\tilde{L}(α) = \sum_{i=1}^{N}α_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}α_iα_jt_it_j{X_i}^TX_j
Also, since each $α_i$ is at least 0 and, from $C = α_i + β_i$ with $β_i \geq 0$, at most C, the dual problem is to find the α that satisfies the following conditions.
max_{α}\Bigl\{\tilde{L}(α) = \sum_{i=1}^{N}α_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}α_iα_jt_it_j{X_i}^TX_j\Bigr\}\\
\sum_{i=1}^{N}α_it_i = 0, \quad 0 \leqq α_i \leqq C, i = 1,2,...N
In this way, we have derived the dual problem of the soft-margin support vector machine.
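As a rough numerical sketch of what this dual problem looks like, the snippet below solves it for a tiny, made-up 2-D dataset with a general-purpose optimizer from scipy. Real SVM solvers use specialized methods (such as SMO), so treat this purely as an illustration of the formula; the data, C, and tolerances are my own choices.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny made-up dataset: two points per class in 2-D
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [0.5, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0

K = X @ X.T  # Gram matrix of inner products X_i^T X_j

def neg_dual(alpha):
    # Negative of L~(alpha), because scipy minimizes rather than maximizes
    return -(alpha.sum() - 0.5 * np.sum(np.outer(alpha * t, alpha * t) * K))

res = minimize(neg_dual, np.zeros(len(t)),
               bounds=[(0.0, C)] * len(t),
               constraints={'type': 'eq', 'fun': lambda a: a @ t})
alpha = res.x

# Recover W from the stationarity condition, and b from an interior support vector
W = (alpha * t) @ X
sv = (alpha > 1e-6) & (alpha < C - 1e-6)
b = np.mean(t[sv] - X[sv] @ W)
print('alpha:', alpha.round(3), 'W:', W.round(3), 'b:', round(b, 3))
```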
Now, let's summarize the kernel method, which is one of the methods to easily solve this dual problem.
Let's explain the kernel method.
Now let's take a look at the quote from Wikipedia.
The kernel method is one of the methods used in pattern recognition, and is used in combination with algorithms such as classifiers. A well-known example is its combination with the support vector machine. The purpose of pattern recognition is generally to find and study the structure of data (e.g. clusters, rankings, principal components, correlations, classifications). To achieve this goal, the kernel method maps the data onto a high-dimensional feature space. Each coordinate of the feature space corresponds to one feature of the data elements, and the set of data is transformed into a set of points in a Euclidean space by the mapping to the feature space (the feature map). Various methods are used in combination with the kernel method to analyze the structure of the data in the feature space. A wide variety of maps can be used as feature maps (generally nonlinear maps are used), and correspondingly a wide variety of structures can be found in the data.
You can think of the kernel method as a way of mapping low-dimensional data into a higher-dimensional space so that it can be separated there.
Strictly speaking this is not exact, but a rough understanding is fine here.
Now let's see why the kernel method is used in support vector machines.
Consider the case of classifying the following two types of data.
In the case of such two-dimensional data, it is not possible to separate two types of data with a one-dimensional straight line.
Let's extend this data to multidimensional data to address this linearly inseparable problem.
Specifically, to expand the two-dimensional data $X = (x_1, x_2)$ into five dimensions, we map it through a function such as $ψ(X) = (x_1, x_2, x_1^2, x_2^2, x_1x_2)$ (one possible choice of mapping).
The higher-dimensional space into which the data is expanded in this way is called the (high-dimensional) feature space, while the space of the original input data is called the input space.
Let's generalize the above formula. The function that maps the data in the n-dimensional input space to the higher-dimensional r-dimensional feature space is defined as follows.
ψ(X) = (φ_1(X), φ_2(X), φ_3(X), ...φ_r(X))
Functions such as $φ_1(X)$ combine the components of the original data to produce new features.
When the data is expanded into a high-dimensional feature space with such a function, at some point it becomes separable by a hyperplane. In the extreme case, if each data point is given its own dimension, that is, N data points are mapped into an N-dimensional space, they can always be separated by an (N-1)-dimensional hyperplane.
In other words, the data becomes linearly separable.
After that, by mapping this separating hyperplane back into the input space, we obtain the boundary that separates the data there (strictly speaking, a surface whose dimension is one less than that of the input space, rather than a curve), as the sketch below illustrates.
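Here is a minimal illustration of this idea with my own toy example: one-dimensional data whose classes are interleaved, so no single threshold separates them, becomes linearly separable once each point x is mapped to the two-dimensional feature (x, x²).

```python
import numpy as np

# 1-D data: the two classes are interleaved, so no single threshold separates them
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1, 0, 0, 1])

# Map each point into a 2-D feature space: psi(x) = (x, x^2)
psi = np.column_stack([x, x**2])
print(psi)

# In feature space the classes are split by the horizontal line x^2 = 2.5,
# which maps back to the two thresholds |x| = sqrt(2.5) in the input space.
print(psi[:, 1] > 2.5)  # [ True False False  True ] matches the labels y
```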
Now let's consider the formula of the optimization problem in the high-dimensional feature space.
Before that, let's review the optimization problem in the input space.
max_{α}\Bigl\{\tilde{L}(α) = \sum_{i=1}^{N}α_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}α_iα_jt_it_j{X_i}^TX_j\Bigr\}\\
\sum_{i=1}^{N}α_it_i = 0, \quad 0 \leqq α_i \leqq C, i = 1,2,...N
The data in the high-dimensional feature space is obtained by mapping the input-space data $X_i$ and $X_j$ with the function $ψ(X)$, so the optimization problem in the high-dimensional feature space becomes the following.
max_{α}\Bigl\{\tilde{L}(α) = \sum_{i=1}^{N}α_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}α_iα_jt_it_j{ψ(X_i)}^Tψ(X_j)\Bigr\}\\
\sum_{i=1}^{N}α_it_i = 0, \quad 0 \leqq α_i \leqq C, i = 1,2,...N
This is the optimization problem we need to solve in the high-dimensional feature space.
The problematic part is this term:
{ψ(X_i)}^Tψ(X_j)
The higher the dimension of the feature space, the more enormous the cost of computing this term becomes.
The technique that simplifies the computation of this part is called the kernel trick.
Define the kernel function as follows:
K(X_i, X_j) = {ψ(X_i)}^Tψ(X_j)
It may seem a little magical, but this kernel function lets us compute the inner product without ever computing $ψ(X)$ directly.
For this to be possible, the kernel function has to satisfy certain conditions; I do not fully understand them myself and they are difficult to explain here, so I will only point to some helpful references.
Mercer's theorem, [positive definite kernel](http://ibisforest.org/index.php?%E6%AD%A3%E5%AE%9A%E5%80%A4%E3%82%AB%E3%83%BC%E3%83%8D%E3%83%AB)
This method is very useful because $ ψ (X) $ only appears in the form of an inner product in the dual problem.
The following three kernel functions are commonly used.
Gaussian kernel (RBF kernel)
K(X_i, X_j) = exp\bigl\{-\frac{||X_i -X_j||^2}{2σ^2}\bigr\}
Polynomial kernel
K(X_i, X_j) = (X_i^TX_j + c)^d
Sigmoid kernel
K(X_i, X_j) = tanh(bX_i^TX_j + c)
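Written as plain NumPy functions, these three kernels look as follows. The parameter values σ, c, d, and b are hyperparameters you would choose yourself; the two test vectors are arbitrary.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def polynomial_kernel(x, y, c=1.0, d=3):
    # (x^T y + c)^d
    return (x @ y + c) ** d

def sigmoid_kernel(x, y, b=1.0, c=0.0):
    # tanh(b x^T y + c)
    return np.tanh(b * (x @ y) + c)

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(gaussian_kernel(x, y), polynomial_kernel(x, y), sigmoid_kernel(x, y))
```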
Now let's look at a concrete example where the inner product can be easily calculated by the polynomial kernel.
Consider a function that maps a 2D input space to a 3D feature space as shown below.
ψ(X) = ψ(x_1, x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)
Using this function, the two 2D vectors X and Y are as follows.
ψ(X) = ψ(x_1, x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)\\
ψ(Y) = ψ(y_1, y_2) = (y_1^2, \sqrt{2}y_1y_2, y_2^2)
Let's consider the inner product of these.
\begin{align}
ψ(X)^Tψ(Y) & = (x_1^2, \sqrt{2}x_1x_2, x_2^2)^T(y_1^2, \sqrt{2}y_1y_2, y_2^2)\\
&=x_1^2y_1^2 + 2x_1y_1x_2y_2 + x_2^2y_2^2\\
&= (x_1y_1 + x_2y_2)^2\\
&=((x_1,x_2)^T(y_1,y_2))^2\\
&=(X^TY)^2
\end{align}
In this way, $ψ(X)^Tψ(Y)$ can be computed simply by squaring the inner product of the original vectors, without ever computing $ψ(X)$ or $ψ(Y)$ directly.
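We can check this identity numerically with a short NumPy snippet; the two vectors below are arbitrary made-up values.

```python
import numpy as np

def psi(v):
    # psi(x1, x2) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

X = np.array([1.0, 2.0])
Y = np.array([3.0, -1.0])

lhs = psi(X) @ psi(Y)   # inner product in the 3-D feature space
rhs = (X @ Y) ** 2      # squared inner product in the original 2-D space
print(lhs, rhs)         # both are 1.0 for these vectors
```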
If you want to know more about the kernel method, please refer to the article here.
Now, let's summarize the implementation of the support vector machine.
We will implement an SVM that separates linearly separable data.
The data used is the iris dataset.
iris data is data of a flower variety called iris.
There are 50 samples for each of the three iris varieties, Setosa, Versicolor, and Virginica, for a total of 150 samples.
Let's actually look at the contents.
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(iris_df.head())
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
Since the column names are stored in iris.feature_names
, you can produce the output above by passing it to the columns argument of the pandas DataFrame.
The sepal length (cm) column stores the sepal length, sepal width (cm) the sepal width, petal length (cm) the petal length, and petal width (cm) the petal width.
You can display the correct label as shown below.
print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
In this way, the iris varieties setosa
, versicolor
, and virginica
are set to 0, 1, and 2, respectively.
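If you want to confirm this mapping and the class counts yourself, the dataset also carries the variety names (this assumes the iris object loaded above):

```python
import numpy as np

print(iris.target_names)         # ['setosa' 'versicolor' 'virginica']
print(np.bincount(iris.target))  # [50 50 50]
```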
This is the end of the explanation of iris data.
Let's create a dataset with the following code.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
import mglearn
iris = load_iris()
X = iris.data[:100, 2:]
Y = iris.target[:100]
print(X.shape)
print(Y.shape)
(100, 2)
(100,)
This time, we will classify using the data of petal length
and petal width
of setosa
and versicolor
.
Draw the data with the following code.
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.legend(['setosa', 'versicolor'], loc='best')
plt.show()
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
takes the x-axis values as its first argument, the y-axis values as its second, and the correct labels as its third, and draws a scatter plot.
loc='best'
places the legend where it does not get in the way of the plot.
From the data above, you can clearly see that it can be separated by a straight line. It's rather too easy.
Let's create a model with the following code.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=0)
svm = LinearSVC()
svm.fit(X_train, Y_train)
The model creation itself ends with this code. It's easy.
Let's illustrate what the model looks like with the code below.
plt.figure(figsize=(10, 6))
mglearn.plots.plot_2d_separator(svm, X)
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(['setosa', 'versicolor'], loc='best')
plt.show()
You can see that the boundaries that separate the data are created.
The part mglearn.plots.plot_2d_separator (svm, X)
is a little confusing, so I will explain it. Let's check the definition code.
plot_2d_separator(classifier, X, fill=False, ax=None, eps=None, alpha=1,cm=cm2, linewidth=None, threshold=None,linestyle="solid"):
It is a function that draws a boundary line when you pass the classification model as the first argument and the original data as the second argument.
This concludes the implementation of the svm model for linear separable problems.
This time we will deal with the issue of soft margins.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
import mglearn
iris = load_iris()
X = iris.data[50:, 2:]
Y = iris.target[50:] - 1
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.legend(['versicolor', 'virginica'], loc='best')
plt.show()
Now we are plotting the petal length
and petal width
data for versicolor
and virginica
.
Achieving a perfect linear separation is clearly impossible for this data.
Here is a review of the soft margin formula. For derivation, refer to the article here.
min_{W, \xi}\Bigl\{\frac{1}{2}||W||^2 + C\sum_{i=1}^{N} \xi_i\Bigr\} \quad \quad
t_i(W^TX_i + b)\geq 1 - \xi_i\\
\xi_i = max\Bigl\{0, M - \frac{t_i(W^TX_i + b)}{||W||}\Bigr\}\\
i = 1, 2, 3, ... N
Since some of the data falls inside the margin, we relaxed the constraint with the term $C\sum_{i=1}^{N}\xi_i$.
The value of C defaults to 1.0 in sklearn
. Let's change this number and see how the figure changes. The following code defines a function that plots the decision boundary of the model passed as an argument.
def make_separate(model):
    # Draw the decision boundary of the model passed as an argument
    mglearn.plots.plot_2d_separator(model, X)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
    plt.xlabel('petal length')
    plt.ylabel('petal width')
    plt.legend(['versicolor', 'virginica'], loc='best')
    plt.show()
Let's draw the figure with the following code. Let C = 0.1
.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=0)
svm = LinearSVC(C=0.1)
svm.fit(X_train, Y_train)
make_separate(svm)
print(svm.score(X_test, Y_test))
0.96
Next is C = 1.0
.
svm = LinearSVC(C=1.0)
svm.fit(X_train, Y_train)
make_separate(svm)
print(svm.score(X_test, Y_test))
1.0
Next is C = 100
.
svm = LinearSVC(C=100)
svm.fit(X_train, Y_train)
make_separate(svm)
print(svm.score(X_test, Y_test))
1.0
It is important to choose an appropriate C. A good approach is to try several values and observe how the boundary and the score change.
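One systematic way to do this is cross-validation. Here is a sketch using GridSearchCV with the X_train and Y_train from above; the candidate values for C are arbitrary choices of mine.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Search over a few candidate values of C with 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid = GridSearchCV(LinearSVC(), param_grid, cv=5)
grid.fit(X_train, Y_train)
print(grid.best_params_, grid.best_score_)
```

grid.best_params_ then tells you which C performed best under cross-validation.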
This is the end of the soft margin implementation.
Now, let's summarize the implementation when the kernel method is used and the implementation when it is not used.
This time, we will classify problems that are not linearly separable without using the kernel method.
Here, the method that does not use the kernel function is defined as not using the kernel method.
Let's prepare the data with the following code and illustrate it.
import mglearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
moons = make_moons(n_samples=300, noise=0.2, random_state=0)
X = moons[0]
Y = moons[1]
plt.figure(figsize=(12, 8))
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.plot()
plt.show()
make_moons
is a function that creates two-dimensional data shaped like two interleaving half moons.
You can set the number of samples and the noise level.
As you can see from the figure, it is obviously not linearly separable.
To transform this non-linearly separable data into linearly separable data, let's map the data in this input space to the data in the higher dimensional feature space.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=0)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
Now you can map the data in the input space to the higher dimensional feature space.
Let's check what kind of data was mapped.
print(poly.get_feature_names())
print(X_train_poly.shape)
['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
(225, 6)
In this way, the 2D input space is extended to the 6D feature space.
Standardize the data with the following code.
scaler = StandardScaler()
X_train_poly_scaled = scaler.fit_transform(X_train_poly)
X_test_poly_scaled = scaler.transform(X_test_poly)
Data standardization is to set the mean of the data to 0 and the variance to 1 by subtracting the mean for all the data and then dividing by the standard deviation.
I wrote it in an easy-to-understand article here, so please refer to it.
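As a quick check of what StandardScaler is doing, the same transformation can be written by hand in NumPy, using the training statistics (this assumes the X_train_poly and X_train_poly_scaled arrays from above):

```python
import numpy as np

# Recompute the standardization by hand, using the training statistics
mean = X_train_poly.mean(axis=0)
std = X_train_poly.std(axis=0)
std[std == 0] = 1.0  # guard for constant columns, such as the bias column '1'

X_train_scaled_by_hand = (X_train_poly - mean) / std
print(np.allclose(X_train_scaled_by_hand, X_train_poly_scaled))  # expected: True
```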
Now let's implement and evaluate the model with the following code.
lin_svm = LinearSVC()
lin_svm.fit(X_train_poly_scaled, Y_train)
print(lin_svm.score(X_test_poly_scaled, Y_test))
0.84
It's a little low. Let's map it to a higher dimension.
However, the process of mapping to a higher dimension and standardizing is troublesome, so let's use something called Pipeline
.
poly_scaler_svm = Pipeline([
('poly', PolynomialFeatures(degree=3)),
('scaler', StandardScaler()),
('svm', LinearSVC())
])
poly_scaler_svm.fit(X_train, Y_train)
print(poly_scaler_svm.score(X_test, Y_test))
0.9733333333333334
In this way, Pipeline
can be used to simplify the task of mapping data to a higher dimension, standardizing it, and putting it in the svm model. By setting degree = 3
, it is mapped to a higher dimensional feature space.
The accuracy is pretty good. It is quite effective when mapped to a higher dimension.
Next, let's draw this figure. Below is the code.
plt.figure(figsize=(12, 8))
_x0 = np.linspace(-1.5, 2.7, 100)
_x1 = np.linspace(-1.5, 1.5, 100)
x0, x1 = np.meshgrid(_x0, _x1)
X_grid = np.hstack((x0.ravel().reshape(-1, 1), x1.ravel().reshape(-1, 1)))
y_decision = poly_scaler_svm.decision_function(X_grid).reshape(x0.shape)
plt.contourf(x0, x1, y_decision, levels=[y_decision.min(), 0, y_decision.max()], alpha=0.3)
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.show()
You can see that the lines are pretty clean. Let's explain the code.
_x0 = np.linspace(-1.5, 2.7, 100)
_x1 = np.linspace(-1.5, 1.5, 100)
x0, x1 = np.meshgrid(_x0, _x1)
The grid points are created with the code here. Please refer to the article here for easy understanding.
np.linspace
creates a numpy array by specifying the start point in the first argument, the end point in the second argument, and the number of points in the third argument. By passing it to np.meshgrid
, we are creating 100x100 grid points.
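A tiny example makes the shapes easier to see; the numbers here are arbitrary.

```python
import numpy as np

a = np.linspace(0.0, 1.0, 3)  # array([0. , 0.5, 1. ])
b = np.linspace(0.0, 2.0, 3)  # array([0., 1., 2.])
g0, g1 = np.meshgrid(a, b)

print(g0.shape, g1.shape)     # (3, 3) (3, 3)
print(g0)
# [[0.  0.5 1. ]
#  [0.  0.5 1. ]
#  [0.  0.5 1. ]]
```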
X_grid = np.hstack((x0.ravel().reshape(-1, 1), x1.ravel().reshape(-1, 1)))
x0.ravel() flattens the 100 × 100 array into a one-dimensional array, reshape(-1, 1) turns it into a 10000 × 1 column vector, and np.hstack concatenates the two columns horizontally along axis=1, so X_grid is a 10000 × 2 matrix.
y_decision = poly_scaler_svm.decision_function(X_grid).reshape(x0.shape)
plt.contourf(x0, x1, y_decision, levels=[y_decision.min(), 0, y_decision.max()], alpha=0.3)
The signed distance between each of the 10000 grid points and the separating hyperplane is computed by poly_scaler_svm.decision_function(X_grid)
and reshaped into a 100 × 100 array.
plt.contourf
is a function that illustrates contour lines, and you can specify in levels
where to change the color.
This completes the implementation that does not use the kernel method.
Now, let's implement it using the kernel method.
Let's prepare the data. So far the same.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
moons = make_moons(n_samples=300, noise=0.2, random_state=0)
X = moons[0]
Y = moons[1]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify=Y, random_state=0)
Let's implement the model with the following code.
karnel_svm = Pipeline([
('scaler', StandardScaler()),
('svm', SVC(kernel='poly', degree=3, coef0=1))
])
karnel_svm.fit(X_train, Y_train)
By passing kernel='poly'
to SVC
you select the polynomial kernel, and degree=3
sets its degree; you can think of this as implicitly mapping the data into the feature space of all polynomial terms up to degree 3. coef0=1 corresponds to the constant c in the polynomial kernel formula.
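For comparison, here is a sketch of the same pipeline with the Gaussian (RBF) kernel introduced earlier; gamma plays the role of $\frac{1}{2σ^2}$, and the value below is just an arbitrary starting point.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rbf_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', gamma=1.0, C=1.0))  # RBF kernel instead of poly
])
rbf_svm.fit(X_train, Y_train)
print(rbf_svm.score(X_test, Y_test))
```

In what follows, though, we will stick with the polynomial-kernel model karnel_svm.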
You have now created a model. Next, let's illustrate this model. I do the same thing again, but it's a hassle, so I'll make it a function.
def plot_decision_function(model):
_x0 = np.linspace(-1.7, 2.7, 100)
_x1 = np.linspace(-1.5, 1.7, 100)
x0, x1 = np.meshgrid(_x0, _x1)
X = np.hstack((x0.ravel().reshape(-1, 1), x1.ravel().reshape(-1, 1)))
y_decision = model.decision_function(X).reshape(x0.shape)
plt.contourf(x0, x1, y_decision, levels=[y_decision.min(), 0, y_decision.max()], alpha=0.3)
def plot_dataset(x, y):
plt.plot(x[:, 0][y == 0], x[:, 1][y == 0], 'bo', ms=15)
plt.plot(x[:, 0][y == 1], x[:, 1][y == 1], 'r^', ms=15)
plt.xlabel('$x_1$', fontsize=20)
plt.ylabel('$x_2$', fontsize=20, rotation=0)
plt.figure(figsize=(12, 8))
plot_decision_function(karnel_svm)
plot_dataset(X, Y)
plt.show()
I could have plotted it with mglearn
, but this time I plotted it with plt.plot
. The ones with Y = 0
are drawn with blue circles, and the ones with Y = 1
are drawn with red triangles.
As you can see, the same result is returned with or without the kernel method. However, the kernel method is much easier to calculate internally, so I think it is better to use the kernel method as much as possible.
Please refer to the article here for how to make it easier.
Thank you for staying with me through this very long article, and thank you for your hard work reading this far.