Anyone can implement machine learning relatively easily using libraries such as scikit-learn. However, if you want to get results at work, or simply to improve as a practitioner, an explanation like "I don't know the background, but I got this result" is clearly not enough.
This article therefore has two goals: in sections 2-3, "leave the theory aside for now and just try scikit-learn", and from section 4 onward, "understand the background mathematically".
I come from a liberal-arts background and am not good at mathematics myself, so I have tried to explain things in a way that is easy to follow even for readers who share that weakness.
Similar articles have been posted for linear simple regression and logistic regression, so please read them as well: [Machine learning] Understanding linear simple regression from both scikit-learn and mathematics, and [[Machine learning] Understanding logistic regression from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/ee2a0687ca451fe213be).
SVM is a supervised learning model that can be used for both classification and regression. Because it incorporates a mechanism for obtaining high discrimination performance on unlearned data, it demonstrates excellent recognition performance. Source: [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%B5%E3%83%9D%E3%83%BC%E3%83%88%E3%83%99%E3%82%AF%E3%82%BF%E3%83%BC%E3%83%9E%E3%82%B7%E3%83%B3)
Roughly speaking, **it tends to be a model that stays highly accurate when new data comes in**.
Suppose you are the president of an event-planning company, and that, riding the recent cat boom, you are planning a tour to see "rare cats" (a fictional setting).
There are too many candidate tour locations, so you have collected data on rare cats (= A) and so-called ordinary cats (= B). Based on that data, you will build a model that takes "body size" and "whisker length" as input and judges whether a cat is rare, and then plan the tour around the places where rare cats were detected.
The distribution of the data is as follows.
Now, where should the boundary between blue and orange be drawn in the distribution above? As shown below, both a red boundary and a green boundary are plausible given the data at hand.
Now suppose one new data point arrives; it is plotted additionally below (the point in the orange frame).
With the red boundary this point is classified correctly, but with the green boundary it is classified as a rare cat even though it is actually an ordinary cat, i.e. a misclassification.
To prevent this kind of misclassification and find a sound classification criterion, SVM uses the idea of **"margin maximization"**. The margin is the distance between a boundary, like the red or green one above, and the nearest actual data points. The idea is that if this margin is large, **misclassifications caused by slight changes in the data can be kept as small as possible**.
Data near the boundary is, so to speak, data for which it is hard to tell a "rare cat" from an "ordinary cat". Having a lot of such borderline data is risky, so the idea is that if we draw the boundary so that its distance from the data is as large as possible, the risk of misclassification is minimized.
However, few boundaries can classify everything 100% perfectly. In the real world, outliers sometimes appear in the data, as shown below.
If you try to draw a boundary that classifies this new orange point correctly as well, you can imagine that the result will probably not match reality (so-called overfitting).
To make judgments that fit reality, SVM therefore allows **"some misclassification"**.
As we will see in the scikit-learn section, how much misclassification to allow is something we have to decide ourselves when building the model; this is called the "penalty".
SVM can be said to be a model that achieves a **"good balance"** between the following two points.
・To prevent misclassification as much as possible, draw the boundary so that the margin, i.e. the distance between the boundary and the data, is maximized.
・However, allow some misclassification so that the boundary matches the actual situation.
First, import the following, which is required to run SVM.
```python
from sklearn.svm import SVC

# Below are plotting libraries, plus pandas and numpy
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
```
Set up the body-size and whisker-length data, together with the rare/ordinary labels (True for rare cats, False for ordinary cats), as shown below.
```python
data = pd.DataFrame({
    "rare":[True, True, True, True, True, False, False, False, False, False, False, False, False],
    "scale":[20, 25, 30, 24, 28, 35, 40, 38, 55, 50, 60, 32, 25],
    "hige":[10, 20, 40, 18, 30, 10, 20, 30, 25, 28, 30, 18, 25],
})
```
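As a quick optional check before plotting, you can confirm the class balance and that nothing is missing:

```python
# Count the rare (True) and ordinary (False) cats: 5 rare, 8 ordinary
print(data["rare"].value_counts())
# Confirm there are no missing values
print(data.isnull().sum())
```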
Let us plot body size / whisker length against the rare/ordinary labels. To grasp the characteristics of the data, we will not jump straight into scikit-learn, but first simply visualize the data at hand.
y = data["rare"].values
x1, x2 = data["scale"].values, data["hige"].values
#Plot the data
plt.grid(which='major',color='black',linestyle=':')
plt.grid(which='minor',color='black',linestyle=':')
plt.plot(x1[y], x2[y], 'o', color='C0', label='rare')#Blue dot: y is True(=Rare)
plt.plot(x1[~y], x2[~y], '^', color='C1', label='normal')#Orange dot: y is False(=Ordinary)
plt.xlabel("scale")
plt.ylabel("hige")
plt.legend(loc='best')
plt.show()
It looks like a boundary could be drawn somewhere between the two groups.
First, we arrange the data into the right shape for building the model.
y = data["rare"].values#It's the same as the one shown above, so you can omit it.
X = data[["scale", "hige"]].values
Since this is not an article about Python grammar, I will omit the details, but here X and y are arranged into the form that scikit-learn's SVM expects.
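For reference, scikit-learn expects X to be a 2-D array of shape (n_samples, n_features) and y to be a 1-D array of length n_samples; a quick check:

```python
print(X.shape)  # (13, 2): 13 cats, 2 features (scale, hige)
print(y.shape)  # (13,): one True/False label per cat
```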
Finally, the model-building code.
```python
C = 10
clf = SVC(C=C, kernel="linear")
clf.fit(X, y)
```
That is all it takes for a simple model. We create an SVM model in a variable called clf, and in the next line fit (= train) clf on the prepared X and y.
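As an aside, the "support vectors" that give SVM its name are the data points closest to the boundary, and scikit-learn lets you inspect them after fitting. A minimal sketch (coef_ and intercept_ are only available because we chose a linear kernel):

```python
# The data points that determine the boundary (the support vectors)
print(clf.support_vectors_)
# The learned boundary w1*x1 + w2*x2 + b = 0 (linear kernel only)
print(clf.coef_)       # [[w1, w2]]
print(clf.intercept_)  # [b]
```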
The main arguments to consider when building an SVM model are $C$ and kernel. **<About $C$>** Since we are just trying things out for now, I will omit the details, but the smaller the value of $C$, the more misclassification the model allows.
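To get a feel for this, you can fit models with different values of $C$ and compare the boundaries they produce; a sketch reusing the X and y defined above (the values 0.1 and 1000 are chosen purely for illustration):

```python
# A small C tolerates misclassification (a more forgiving boundary)
clf_soft = SVC(C=0.1, kernel="linear").fit(X, y)
# A large C tolerates almost none (and makes overfitting easier)
clf_hard = SVC(C=1000, kernel="linear").fit(X, y)
```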
**<About kernel>**
Here we introduce 'linear' and 'rbf'. Use linear to draw the boundary linearly (as a plane), and rbf (a nonlinear kernel function) to draw it nonlinearly. The result changes depending on which one you choose.
Now let's draw this boundary on the scatter plot shown earlier.
```python
fig, ax = plt.subplots(figsize=(6, 4))
# Show the data points
ax.scatter(X[:,0], X[:,1], c=y)
# Arrange 100 values in the x-coordinate direction
xs = np.linspace(np.min(X[:,0]), np.max(X[:,0]), 100)
# Arrange 100 values in the y-coordinate direction
ys = np.linspace(np.min(X[:,1]), np.max(X[:,1]), 100)
# Arrays of the x and y coordinates of the 10000 grid points combining xs and ys
x_g, y_g = np.meshgrid(xs, ys)
# Join the two coordinates with np.c_ and pass them to the SVM
z_g = clf.predict(np.c_[x_g.ravel(), y_g.ravel()])
# z_g is a flat array, so reshape it back to (100, 100) for display in the graph
z_g = z_g.reshape(x_g.shape)
# Color the two sides of the boundary
ax.contourf(x_g, y_g, z_g, cmap=plt.cm.coolwarm, alpha=0.8)
# Finally, display
plt.show()
```
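As an aside, if your scikit-learn is version 1.1 or later, the same kind of plot can be produced more directly with DecisionBoundaryDisplay; a sketch assuming that version is available (the manual meshgrid approach above works on any version):

```python
from sklearn.inspection import DecisionBoundaryDisplay

# Let scikit-learn build the grid and call predict internally
DecisionBoundaryDisplay.from_estimator(clf, X, response_method="predict",
                                       cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(X[:,0], X[:,1], c=y)
plt.show()
```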
As a result of building the model, the boundary was drawn as shown above. When new data comes in later, it will be classified as an ordinary cat if it falls in the blue area, and as a rare cat if it falls in the red area.
By the way, if the kernel introduced in <About kernel> above is set to rbf, the boundary becomes as follows.
A completely different boundary! In this case, linear seems to capture the structure of the data more appropriately, so let's use linear for the kernel.
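If you want to compare the two kernels for yourself, one rough check is the accuracy on the training data (a sketch reusing X, y, and the fitted linear clf from above; note that training accuracy alone cannot reveal overfitting):

```python
# Fit an rbf-kernel model with the same C for comparison
clf_rbf = SVC(C=10, kernel="rbf").fit(X, y)
# Accuracy on the training data (1.0 = every training point classified correctly)
print("linear:", clf.score(X, y))
print("rbf   :", clf_rbf.score(X, y))
```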
Building the model is not the end in itself. In the real world, what matters is using this predictive model to judge whether newly acquired cat data is rare or ordinary.
Suppose you obtained data on two other cats and noted it down. Store it in a variable called z as shown below.
```python
z = pd.DataFrame({
    "scale":[28, 45],
    "hige":[25, 20],
})
z2 = z[["scale", "hige"]].values
```
Comparing this data against the plot with the linear boundary, the first cat will probably fall in the red area (rare = True) and the second in the blue area (ordinary = False). Now let's actually predict.
```python
y_est = clf.predict(z2)
```
Running this and displaying y_est gives array([ True, False]), so you can see that the two cats are classified exactly as the boundary suggests.
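If you also want a rough sense of how confident each prediction is, SVC provides decision_function, which returns a signed score based on the distance from the boundary; a sketch reusing clf and z2 from above:

```python
# Positive score = rare (True) side, negative score = ordinary (False) side
print(clf.decision_function(z2))
```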
Up to section 3, we implemented the whole flow of building an SVM model with scikit-learn → visualizing it → predicting whether two new cats are rare or ordinary. From here, I would like to clarify how the SVM model behind this flow is computed mathematically.
Let me dig into the margin maximization described in "2. What is SVM (Support Vector Machine)". I explained that the optimal boundary is the one that maximizes the distance between each data point and the boundary, but what kind of state does that actually correspond to?
The scatter plot shown so far can be redrawn in three dimensions as shown below.
If you think of the green plane passing through the red boundary above as the decision surface, you can imagine that changing the **"slope"** of this plane changes the margin (= the distance between the data and the boundary).
For example, if the slope of this plane is steep, the margin becomes small, as shown below.
Conversely, if the slope of the plane is made gentle, the margin becomes large, as shown below.
In other words, the optimal boundary is the one where **"the data are classified cleanly" and "the slope of the plane passing through the decision boundary is as gentle as possible"**.
So what does **"the slope of the plane passing through the decision boundary is as gentle as possible"** actually mean? Let me keep illustrating.
Here is the boundary surface viewed from the side. Its formula is $w_1x_1 + w_2x_2$.
As mentioned earlier, maximizing the margin means that the slope of the plane passing through the decision boundary becomes as gentle as possible. The gentlest slope means that moving $x_1$ or $x_2$ a little has only a small effect on $w_1x_1 + w_2x_2$ (= because the slope is gentle, moving $x$ slightly barely changes the value of the whole expression), in other words, **$w_1$ and $w_2$ are small**. For example, if $w_1 = 2$, increasing $x_1$ by 1 changes the value by 2, whereas if $w_1 = 0.1$ it changes by only 0.1.
Written as a formula, this becomes the following; but since the concept of the norm is needed to understand it properly, at this point it is enough to read it as "the $w_1$ and $w_2$ of the boundary formula are computed so as to be as small as possible".

$$\min_{w} \|w\|_2^2$$
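For readers who want the standard textbook form, and only for reference (this article does not derive it): with labels $y_i \in \{-1, +1\}$ and an intercept $b$ that the explanation above omits for simplicity, margin maximization is usually written as the constrained problem

$$\min_{w,b} \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i(w_1x_{i1} + w_2x_{i2} + b) \ge 1 \quad (i = 1, \dots, n).$$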
The basic idea ends with (1), but as mentioned in "◆ Penalty" in "2. What is SVM (Support Vector Machine)", some misclassification is allowed so that the classification can match the actual situation.
The degree of "how much misclassification is allowed" is called the penalty.
The penalty term is written as $C\sum_{i=1}^{n} ξ_i$, and $ξ$ is called the hinge loss function.
$C$ has the same meaning as the argument described in model building: the larger $C$ is, the less misclassification is tolerated (= if it is too large, overfitting becomes more likely). Understanding this formula in depth requires more background, so I will stop here this time. (I may cover it separately later, and would like to summarize it here as well.)
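For reference, the standard definition of the hinge loss (again with labels $y_i \in \{-1, +1\}$; this is textbook material, not something derived in this article) is

$$ξ_i = \max\bigl(0,\ 1 - y_i(w_1x_{i1} + w_2x_{i2} + b)\bigr),$$

so $ξ_i = 0$ for points classified correctly with enough margin, and it grows the further a point strays toward the wrong side of the boundary.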
From (1) and (2), SVM is computed so as to minimize the following objective function. Intuitively: **to maximize the margin, the slope of the boundary surface is kept as small as possible, while a penalty term for how much misclassification is allowed is added so that the classification matches reality; the boundary is then chosen so that the overall balance feels right.**
$$\min_{w}\left(\|w\|_2^2 + C\sum_{i=1}^{n} ξ_i\right)$$
What did you think? SVM requires more mathematical background than simple regression or logistic regression, so I could not describe it very deeply, but I hope this article helps deepen your understanding further.