Explaining the classic machine learning algorithm "Support Vector Machine (SVM)" so that even high school students can understand it

Introduction

** "Support Vector Machine (SVM)" **, which is one of the standard algorithms for machine learning, Because of its practical and relatively simple algorithm, it is often covered in introductory books.

However, I felt that many books and articles were harder to understand than they needed to be, so partly as a memorandum for myself, I wrote this article to be:

・Comprehensive
・Simply explained
・Accompanied by an implementation example with real data (using the Python library scikit-learn)

With these in mind, I am aiming for an article that lets even high school students say **"I understand!"**.

Note

Part 1

・I apologize for titling this so that **even high school students can understand** it: partial differentiation, which is not taught in high school (it is usually learned in the first or second year of a science university course), does appear.

There are very easy-to-understand YouTube videos on these topics, so if you watch them you should be able to say **"I understand!"**: partial differentiation, the method of Lagrange multipliers, and Lagrange multipliers for inequality constraints (the KKT conditions).

Part 2

・In formulas in the text such as ${W^T}X$, the symbol "T" appears. It denotes the "transpose", which swaps the rows and columns of a matrix. It may look esoteric at first glance, but it is an idiomatic way of expressing the inner product of vectors as a matrix product. In the equation above, it can simply be read as the **inner product of the vectors W and X**.

Part 3

・As for the quadratic programming problem that appears at the end of the algorithm, I rely on a solver without understanding the solution method myself, so please understand that its explanation is omitted.

Positioning of support vector machines in machine learning

Machine learning algorithms can be broadly divided as shown in the figure below.

[Figure: broad categories of machine learning algorithms]

Support vector machines are algorithms primarily used for **classification**.

**・Classification examples**
Case 1: estimating the presence or absence of an illness from body temperature and the number of coughs
Case 2: estimating the type of fruit (apple, orange, grape) from length, weight, and color

In Case 1 there are only two classes, "yes" and "no", so it is called **binary classification**. Case 2 has three or more classes, so it is called **multi-class classification**.

Multi-class classification and support vector machines

This article explains the binary classification algorithm, but multi-class classification can also be handled by combining multiple binary classifiers. Click here for details. (A minimal scikit-learn sketch is shown below.)
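As a supplementary note, scikit-learn's SVC handles three or more classes by internally combining binary classifiers (a one-vs-one scheme), so multi-class data can be passed to it directly. Below is a minimal sketch using the iris dataset; this example is mine and is not part of the original article's data.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Iris has three classes; SVC combines binary classifiers (one-vs-one) internally
X, y = load_iris(return_X_y=True)
model = SVC(kernel='linear')
model.fit(X, y)
print(model.classes_)        # -> [0 1 2]
print(model.predict(X[:5]))  # multi-class prediction works out of the box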

Regression and support vector machine

There is also a method called "support vector regression" that applies this method to regression, but since it is used less frequently than classification, it is omitted in this article.

Basic concept of support vector machine

For example, suppose we want to classify Class 1 (e.g., has the illness) and Class 2 (e.g., does not have the illness) using the two explanatory variables shown below (e.g., body temperature and number of coughs).

[Figure: scatter plot of the two classes in the plane of the two explanatory variables]

In machine learning classification problems, classification is achieved by drawing straight lines or curves (planes or surfaces in three or more dimensions) that form boundaries between the classes. The straight line or curve that forms the boundary is called the **decision boundary**.

Consider the case where the decision boundary is drawn with a straight line in the above example. Which of the decision boundary A and decision boundary B in the figure below seems to have higher classification performance?

[Figure: candidate decision boundaries A and B]

I think many people find decision boundary A to be better. As a reason why decision boundary B is not good, an explanation like "it is too close to the point painted in red, so it seems likely to misclassify" probably feels natural.

This intuition is turned into an algorithm as shown in the figure below.

[Figure: margin between the decision boundary and the nearest points (the support vectors)]

** "Determine the decision boundary so that the distance to the nearest point (support vector) is long" ** Classification method, It is called ** Support Vector Machine **.

For example, in two dimensions the equation of a straight line is

ax + by + c = 0

At this time, the distance between a point (x_i, y_i) and this straight line is

\frac{|ax_i+by_i+c|}{\sqrt{a^2+b^2}}

Therefore, over all the training data (i = 1, 2, ..., n), the margin is

min_{i=1,2‥n}\frac{|ax_i+by_i+c|}{\sqrt{a^2+b^2}}

and finding the combination of a and b that maximizes this minimum distance (c is eliminated by standardization) is what training a support vector machine means in two dimensions.
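To make the distance formula above concrete, here is a minimal sketch that computes the distance from each point to a line ax + by + c = 0 and takes the minimum. The line coefficients and points are hypothetical values chosen only for illustration.

import numpy as np

# Hypothetical line ax + by + c = 0 and sample points (x_i, y_i)
a, b, c = 1.0, -1.0, 2.0
points = np.array([[1.0, 4.0], [3.0, 2.0], [5.0, 9.0]])

# Distance of each point to the line: |a*x_i + b*y_i + c| / sqrt(a^2 + b^2)
distances = np.abs(a * points[:, 0] + b * points[:, 1] + c) / np.sqrt(a**2 + b**2)
print(distances)        # distance of every point to the line
print(distances.min())  # the minimum distance, i.e. the margin for this candidate line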

The details of the algorithm are explained in the next section; first, let's try it with real data! We will try to distinguish basketball (NBA) players from American football (NFL) players by height and weight.

nba_nfl_1.csv


name,league,position,height,weight
Wilt Chambelain,NBA,C,215.9,113.4
Bill Russel,NBA,C,208.3,97.5
Kareem Abdul-Jabbar,NBA,C,218.4,102.1
Elvin Hayes,NBA,PF,205.7,106.6
Moses Malone,NBA,C,208.3,97.5
Tim Duncan,NBA,PF,210.8,113.4
Karl Malone,NBA,PF,205.7,117.5
Robert Parish,NBA,C,215.9,110.7
Kevin Garnett,NBA,PF,210.8,108.9
Nate Thurmond,NBA,C,210.8,102.1
Walt Bellamy,NBA,C,208.3,102.1
Wes Unseld,NBA,C,200.7,111.1
Hakeem Olajuwon,NBA,C,213.4,115.7
Dwight Howard,NBA,C,208.3,120.2
Shaquille O'Neal,NBA,C,215.9,147.4
John Stockton,NBA,PG,185.4,79.4
Jason Kidd,NBA,PG,193,95.3
Steve Nash,NBA,PG,190.5,80.7
Mark Jackson,NBA,PG,190.5,88.5
Magic Johnson,NBA,PG,205.7,99.8
Oscar Robertson,NBA,PG,195.6,93
Chris Paul,NBA,PG,185.4,79.4
LeBron James,NBA,SF,205.7,113.4
Isiah Thomas,NBA,PG,185.4,81.6
Gary Payton,NBA,PG,193,86.2
Andre Miller,NBA,PG,190.5,90.7
Rod Strickland,NBA,PG,190.5,83.9
Maurice Cheeks,NBA,PG,185.4,81.6
Russel Westbrook,NBA,PG,190.5,90.7
Rajon Rondo,NBA,PG,185.4,81.6
Ray Lewis,NFL,LB,185.4,108.9
London Fletcher,NFL,LB,177.8,109.8
Derrick Brooks,NFL,LB,182.9,106.6
Donnie Edwards,NFL,LB,188,100.7
Zack thomas,NFL,LB,180.3,103.4
Keith Brooking,NFL,LB,188,108.9
Karlos Dansby,NFL,LB,193,113.4
Junior Seau,NFL,LB,190.5,113.4
Brian Urlacher,NFL,LB,193,117
Ronde Barber,NFL,DB,177.8,83.5
Lawyer Milloy,NFL,DB,182.9,95.7
Takeo Spikes,NFL,LB,188,109.8
James Farrior,NFL,LB,188,110.2
Charles Woodson,NFL,DB,185.4,95.3
Antoine Bethea,NFL,DB,180.3,93.4
Derrick Johnson,NFL,LB,190.5,109.8
Lance Briggs,NFL,LB,185.4,110.7
Antoine Winfield,NFL,DB,175.3,81.6
Rodney Harrison,NFL,DB,185.4,99.8
Brian Dawkins,NFL,DB,182.9,95.3

If you plot weight on the vertical axis against height on the horizontal axis:

ml_param_instruction.py(Part of)


import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
df_athelete = pd.read_csv(f'./nba_nfl_1.csv')  #Data reading
sns.scatterplot(x='height', y='weight', data=df_athelete, hue='league')  #Plot a scatter plot of the data points of the explanatory and objective variables
plt.xlabel('height [cm]')  #X-axis label (height)
plt.ylabel('weight [kg]')  #Y-axis label (weight)

[Figure: scatter plot of height vs. weight, colored by league]

It looks like the two groups can be separated by a straight line!

Let's classify this data with a support vector machine using scikit-learn and visualize the decision boundary with mlxtend (Reference).

ml_param_instruction.py(Part of)


from sklearn.svm import SVC
from mlxtend.plotting import plot_decision_regions
def label_str_to_int(y):  # Convert the objective variable from str to int (required by plot_decision_regions)
    label_names = list(dict.fromkeys(y[:, 0]))
    label_dict = dict(zip(label_names, range(len(label_names))))
    y_int = np.vectorize(lambda x: label_dict[x])(y)
    return y_int
def legend_int_to_str(ax, y):  # Change the legend labels from int back to str (for plot_decision_regions)
    hans, labs = ax.get_legend_handles_labels()
    ax.legend(handles=hans, labels=list(dict.fromkeys(y[:, 0])))

X = df_athelete[['height','weight']].values  # Explanatory variables (height, weight)
y = df_athelete[['league']].values  # Objective variable (league)
y_int = label_str_to_int(y)  # Convert the objective variable to int
model = SVC(kernel='linear')  # Define a linear SVM
model.fit(X, y_int)  # Train the SVM
ax = plot_decision_regions(X, y_int[:, 0], clf=model)  # Visualize the decision boundary
plt.xlabel('height [cm]')  # x-axis label
plt.ylabel('weight [kg]')  # y-axis label
legend_int_to_str(ax, y)  # Change the legend from int back to str

You can see that the decision boundary is drawn by a straight line and the margin is maximized.
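If you want to check this numerically, the fitted scikit-learn model exposes the support vectors and, for a linear kernel, the coefficients of the decision boundary. Here is a minimal sketch that assumes the `model` trained in the snippet above (and numpy as `np`) is still in scope; the margin formula 1/||W|| is explained in the next section.

# Assumes `model` (linear SVC) and `np` from the previous snippets
w = model.coef_[0]        # coefficients W of the linear decision boundary
w0 = model.intercept_[0]  # bias term w_0
print('support vectors:\n', model.support_vectors_)   # the points nearest the boundary
print(f'decision boundary: {w[0]:.3f}*height + {w[1]:.3f}*weight + {w0:.3f} = 0')
print('margin (boundary to support vector):', 1 / np.linalg.norm(w))  # 1/||W|| under the standard scaling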

Support vector machine performance improvement

As mentioned in [This article](), machine learning methods generally add the following capabilities in order to improve estimation performance.

**A. Multidimensional explanatory variables:** support for any n-dimensional explanatory variables
**B. Non-linearity:** decision boundaries with flexible shapes other than a straight line (hyperplane)
**C. Improved generalization performance:** preventing overfitting to the training data and improving estimation ability on unknown data

SVMs also include algorithms corresponding to A to C, and these are explained in detail below.

A. Multidimensional explanatory variables

In the following formula, uppercase letters represent vectors (Reference)

In the two-dimensional example above, the decision boundary was drawn as a straight line; generalized to three or more dimensions, the boundary becomes a hyperplane. The equation of the hyperplane is

{W^T}X_i+w_0=0

and the distance between this hyperplane and a point **X_i** is

d = \frac{|w_1x_1 + w_2x_2... + w_nx_n + w_0|}{\sqrt{w_1^2+w_2^2...+w_n^2}} = \frac{|W^TX_i + w_0|}{||W||}

Taking the minimum value of this distance over the training data as the margin m, and defining y_i as

y_i= \left\{
\begin{array}{ll}
1 & (\text{when } X_i \text{ is in Class 1}) \\
-1 & (\text{when } X_i \text{ is in Class 2})
\end{array}
\right.

the condition that every data point is at least a margin m away from the boundary can be written as

\frac{y_i(W^TX_i + w_0)}{||W||} \geq m  \quad (i = 1, 2, ...N)

Dividing both sides by m and rearranging gives

y_i\biggl(\biggl(\frac{W}{m||W||}\biggl)^TX_i + \frac{w_0}{m||W||}\biggl) - 1 \geq 0

Here, we define

\text{new } W = \frac{W}{m||W||}\\
\text{new } w_0 = \frac{w_0}{m||W||}

(As in the original equation, W can have any magnitude as long as the direction of the vector normal to the plane stays the same, so to simplify the calculation it is standardized so that ||W|| = 1/m.) With this standardization, the condition becomes

y_i({W^T}X_i+w_0)-1 \geq 0

Under this condition, the margin

m=\frac{1}{||W||}

should be maximized. To make subsequent calculations easier, this is replaced by the equivalent problem of minimizing

\frac{1}{2}{||W||}^2

(maximizing 1/||W|| is equivalent to minimizing ||W||, and squaring and halving simplify the later differentiation).

Then, the conditional expression for maximizing the margin is

g_i(W,w_0)=y_i({W^T}X_i+w_0)-1 \geq 0

(the equality holds for the points closest to the decision boundary, i.e., the support vectors). Under this constraint,

f(W)=\frac{1}{2}{||W||}^2

is to be minimized. Since this is a minimization problem with inequality constraints, the method of Lagrange multipliers is applied:

L(W,w_0,\alpha)=f(W)-\sum_{i=1}^{n}\alpha_i g_i(W,w_0)\\
=\frac{1}{2}{||W||}^2-\sum_{i=1}^{n}\alpha_i(y_i({W^T}X_i+w_0)-1)

We find the stationary points of this Lagrangian.

Taking partial derivatives with respect to W and w_0 to find the stationary points:

\frac{\partial L}{\partial W}=0 \quad \Rightarrow \quad W=\sum_{i=1}^{n}\alpha_i y_i X_i\\
\frac{\partial L}{\partial w_0}=0 \quad \Rightarrow \quad \sum_{i=1}^{n}\alpha_i y_i=0

Therefore, substituting these back into the formula for L and adding the KKT condition (α_i ≥ 0), we obtain the constraints

\sum_{i=1}^{n}\alpha_i y_i=0\\
\alpha_i \geq 0

Under these constraints,

L(\alpha)=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j X_i^T X_j - \sum_{i=1}^{n}\alpha_i

is to be minimized with respect to α. I will omit the details because this results in a quadratic programming problem in α, but if you are interested, [this site is easy to understand](https://satopirka.com/2018/12/theory-and-implementation-of-linear-support-vector-machine/).

By the above method, the hyperplane (i.e., the coefficients W) that maximizes the margin can be found even with **multidimensional explanatory variables**.
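As a sanity check of the relation W = Σ α_i y_i X_i obtained above, a fitted linear SVC in scikit-learn exposes `dual_coef_` (which stores y_i α_i for the support vectors), `support_vectors_`, and `coef_`. Here is a minimal sketch on hypothetical toy data (not the article's dataset):

import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data (for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (20, 2)),
               rng.normal([3.0, 3.0], 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

model = SVC(kernel='linear')
model.fit(X, y)

# W = sum_i alpha_i * y_i * X_i, where only the support vectors have alpha_i > 0
W_reconstructed = model.dual_coef_ @ model.support_vectors_
print(np.allclose(W_reconstructed, model.coef_))  # -> True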

B. Non-linear

In the previous example of distinguishing NBA and NFL players, let's replace the NFL player data with offensive (non-lineman) players, who tend to be lighter.

nba_nfl_2.csv


name,league,position,height,weight
Wilt Chambelain,NBA,C,215.9,113.4
Bill Russel,NBA,C,208.3,97.5
Kareem Abdul-Jabbar,NBA,C,218.4,102.1
Elvin Hayes,NBA,PF,205.7,106.6
Moses Malone,NBA,C,208.3,97.5
Tim Duncan,NBA,PF,210.8,113.4
Karl Malone,NBA,PF,205.7,117.5
Robert Parish,NBA,C,215.9,110.7
Kevin Garnett,NBA,PF,210.8,108.9
Nate Thurmond,NBA,C,210.8,102.1
Walt Bellamy,NBA,C,208.3,102.1
Wes Unseld,NBA,C,200.7,111.1
Hakeem Olajuwon,NBA,C,213.4,115.7
Dwight Howard,NBA,C,208.3,120.2
Shaquille O'Neal,NBA,C,215.9,147.4
John Stockton,NBA,PG,185.4,79.4
Jason Kidd,NBA,PG,193,95.3
Steve Nash,NBA,PG,190.5,80.7
Mark Jackson,NBA,PG,190.5,88.5
Magic Johnson,NBA,PG,205.7,99.8
Oscar Robertson,NBA,PG,195.6,93
Chris Paul,NBA,PG,185.4,79.4
LeBron James,NBA,SF,205.7,113.4
Isiah Thomas,NBA,PG,185.4,81.6
Gary Payton,NBA,PG,193,86.2
Andre Miller,NBA,PG,190.5,90.7
Rod Strickland,NBA,PG,190.5,83.9
Maurice Cheeks,NBA,PG,185.4,81.6
Russel Westbrook,NBA,PG,190.5,90.7
Rajon Rondo,NBA,PG,185.4,81.6
Drew Brees,NFL,QB,182.9,94.8
Tom Brady,NFL,QB,193,102.1
Payton Manning,NFL,QB,195.6,104.3
Brett Favre,NFL,QB,188,100.7
Philip Rivers,NFL,QB,195.6,103.4
Dan Marino,NFL,QB,193,101.6
Ben Roethlisberger,NFL,QB,195.6,108.9
Eli Manning,NFL,QB,195.6,99.8
Matt Ryan,NFL,QB,193,98.4
John Elway,NFL,QB,190.5,97.5
Emmitt Smith,NFL,RB,175.3,100.2
Walter Payton,NFL,RB,177.8,90.7
Frank Gore,NFL,RB,175.3,96.2
Barry Sanders,NFL,RB,172.7,92.1
Adrian Peterson,NFL,RB,185.4,99.8
Curtis Martin,NFL,RB,180.3,95.3
LaDainian Tomlinson,NFL,RB,177.8,97.5
Jerome Bettis,NFL,RB,180.3,114.3
Eric Dickerson,NFL,RB,190.5,99.8
Tony Dorsett,NFL,RB,180.3,87.1
Jerry Rice,NFL,WR,188,90.7
Larry Fitzgerald,NFL,WR,190.5,98.9
Terrell Owens,NFL,WR,190.5,101.6
Randy Moss,NFL,WR,193,95.3
Isaac Bruce,NFL,WR,182.9,85.3
Tony Gonzalez,NFL,TE,195.6,112
Tim Brown,NFL,WR,182.9,88.5
Steve Smith,NFL,WR,175.3,88.5
Marvin Harrison,NFL,WR,182.9,83.9
Reggie Wayne,NFL,WR,182.9,92.1

ml_param_instruction.py(Part of)


df_athelete = pd.read_csv(f'./nba_nfl_2.csv')
sns.scatterplot(x='height', y='weight', data=df_athelete, hue='league')  #Plot a scatter plot of the data points of the explanatory and objective variables
plt.xlabel('height [cm]')  #x-axis label
plt.ylabel('weight [kg]')  #y-axis label

[Figure: scatter plot of height vs. weight for NBA players and NFL offensive players]

Unfortunately, it doesn't seem to be separated by a straight line (linear) decision boundary.

Since the SVM described so far assumes that the data is linearly separable, when it is not, we need to apply one (or both) of the following two countermeasures:
**① Non-linear decision boundaries via the kernel trick** → corresponds to "B. Non-linearity"
**② Soft margin** → corresponds to "C. Generalization performance"
This section explains ①.

What is a kernel trick?

For example, suppose you want to classify the blue and red points in the figure below.

[Figure: blue and red points that are not linearly separable in the xy plane]

As you can see, they are hard to separate linearly in the xy coordinate system. So, defining

z=x^2+y^2

what happens if we plot the points in a coordinate system with this z-axis added?

[Figure: the same points plotted with z = x² + y² as a third axis, now linearly separable]

As shown in the figure above, adding the z-axis makes the data linearly separable. Transforming this separating plane back into the original xy coordinate system is equivalent to drawing a circular (i.e., non-linear) decision boundary.

In this way, by applying a **transformation Φ into a higher-dimensional coordinate system** that combines the original coordinate axes (features), **classification with a non-linear decision boundary can be achieved** even when the data is not linearly separable, as the sketch below illustrates.
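Here is a minimal sketch of this idea using synthetic ring-shaped data (generated with scikit-learn's make_circles, not taken from the article): a linear SVM struggles in the original (x, y) space, but after adding the feature z = x² + y² it separates the classes almost perfectly.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data that is not linearly separable: an inner ring and an outer ring
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear SVM in the original (x, y) coordinates -> poor accuracy
print(SVC(kernel='linear').fit(X, y).score(X, y))

# Add the feature z = x^2 + y^2 and fit a linear SVM again -> accuracy close to 1.0
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X_3d = np.hstack([X, z])
print(SVC(kernel='linear').fit(X_3d, y).score(X_3d, y))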

To apply this projection from **X** to **Φ(X)**, the function minimized by the Lagrange multiplier method described above,

L(\alpha)=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j X_i^T X_j - \sum_{i=1}^{n}\alpha_i

is replaced with

L(\alpha)=\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y_i y_j \phi(X_i)^T \phi(X_j) - \sum_{i=1}^{n}\alpha_i

However, because computing

\phi(X_i)^T \phi(X_j) 

directly is computationally expensive, the map Φ is generally not defined explicitly; instead, a

**kernel function**

K(X_i,X_j)=\phi(X_i)^T \phi(X_j) 

is defined and its value is computed directly. This technique of converting via a kernel function is called the **kernel trick**. The most commonly used kernel function is

K(X_i,X_j)=exp\biggl(-\frac{||X_i-X_j||^2}{2\sigma^2}\biggl)\\
=exp \Bigl(-\gamma||X_i-X_j||^2 \Bigl)

which is the **RBF kernel** (Radial Basis Function kernel). **The γ in this formula is one of the hyperparameters.**

Looking at the official scikit-learn explanation, γ is described as a parameter representing **"how far the influence of a single training sample reaches"**, and since the formula above shows that γ is the reciprocal of (twice) the variance σ², I think you can get an intuitive image of this.

The larger γ is, the smaller the range of influence of each point, and therefore the more sharply the discrimination surface can curve.
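To make the RBF kernel formula concrete, here is a minimal sketch that computes the kernel value by hand and compares it with scikit-learn's rbf_kernel; the two points and the γ value are hypothetical and chosen only for illustration.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical points and a hypothetical gamma (for illustration only)
x_i = np.array([[180.0, 90.0]])
x_j = np.array([[190.0, 100.0]])
gamma = 0.01

# Manual computation: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
k_manual = np.exp(-gamma * np.sum((x_i - x_j) ** 2))
k_sklearn = rbf_kernel(x_i, x_j, gamma=gamma)[0, 0]
print(k_manual, k_sklearn)  # both give the same value (about 0.135 here)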

Let's train the classifier of NBA players vs. NFL players from earlier with the RBF kernel, varying γ ("gamma" in scikit-learn).

ml_param_instruction.py(Part of)


X = df_athelete[['height','weight']].values  #Explanatory variable(height, weight)
y = df_athelete[['league']].values  # Objective variable (league)
y_int = label_str_to_int(y)
for gamma in [0.1, 0.01, 0.001, 0.0001]:  #Loop by changing gamma
    model = SVC(kernel='rbf', gamma=gamma)  #Define SVM of RBF kernel by changing gamma
    model.fit(X, y_int)  #Perform SVM learning
    ax = plot_decision_regions(X, y_int[:, 0], clf=model)
    plt.xlabel('height [cm]')
    plt.ylabel('weight [kg]')
    legend_int_to_str(ax, y)
    plt.text(175, 140, f'gamma={model.gamma}, C={model.C}')  #Show gamma and C
    plt.show()

You can see that the smaller gamma (γ) is, the smaller the curvature and the closer the boundary is to a straight line, while the larger gamma is, the larger the curvature. From this, **gamma can be understood as a parameter that adjusts "how non-linear" the boundary is**.

C. Generalization performance

As mentioned above, a technique called ** "soft margin" ** can be applied to SVMs to handle cases where linear separability is not possible (Details here).

The margin maximization constraint introduced in A,

y_i({W^T}X_i+w_0)-1 \geq 0

means that no misclassification is allowed at all (a hard margin). If the data is not linearly separable, this constraint cannot be satisfied and training becomes impossible. We therefore introduce the **slack variable ξ_i** defined by the following equation,

\xi_i = max\bigl\{0, 1 - y_i(W^TX_i + w_0)\bigr\}

and rewrite the constraint as shown below, allowing a certain amount of misclassification (a soft margin).

y_i({W^T}X_i+w_0)-1+\xi_i \geq 0

The properties of the slack variable ξ_i that can be read from the equations above are:
・When ξ_i = 0, the original margin condition is satisfied (the linearly separable case)
・When 0 < ξ_i < 1, the point enters the margin and approaches the decision boundary (not yet a misclassification)
・When ξ_i > 1, the point crosses the decision boundary and is misclassified
・The larger ξ_i, the greater the degree of misclassification
From the viewpoint of preventing misclassification, **the slack variables are quantities we want to keep small**.

Therefore, the function to be minimized in the margin maximization problem,

f(W)=\frac{1}{2}{||W||}^2

is augmented by adding the sum of the slack variables multiplied by a coefficient C:

f(W)=\frac{1}{2}{||W||}^2+C\sum_{i=1}^{n}\xi_i

By defining this as the new function to be minimized, training can balance "maximizing the margin" against "allowing some misclassification".

**The coefficient C, which sets this balance of misclassification tolerance, is one of the hyperparameters.**

Basically, the larger C is, the more weight the misclassification term carries in the function being minimized:
・Large C: misclassification is barely tolerated (closer to overfitting)
・Small C: misclassification is tolerated (underfitting = closer to generalization)
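As a supplementary check, the slack variables ξ_i and the soft-margin objective (1/2)||W||² + CΣξ_i can be computed from a fitted linear SVC through its decision function. The following minimal sketch uses hypothetical overlapping toy data rather than the article's dataset:

import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping toy data (for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (30, 2)),
               rng.normal([2.0, 2.0], 1.0, (30, 2))])
y_int = np.array([0] * 30 + [1] * 30)

C = 1.0
model = SVC(kernel='linear', C=C)
model.fit(X, y_int)

# Slack variables: xi_i = max(0, 1 - y_i * (W^T X_i + w_0)), with y_i in {-1, +1}
y_signed = np.where(y_int == 1, 1, -1)  # scikit-learn treats class 1 as the positive side
xi = np.maximum(0, 1 - y_signed * model.decision_function(X))
objective = 0.5 * np.sum(model.coef_ ** 2) + C * np.sum(xi)  # (1/2)||W||^2 + C * sum(xi)
print('points with xi > 0:', int(np.sum(xi > 0)))
print('soft-margin objective:', objective)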

Let's train the classifier of NBA players vs. NFL players from earlier with the RBF kernel, varying C (gamma is fixed at 0.01).

ml_param_instruction.py(Part of)


for C in [10, 1, 0.1]:  #Change C and loop
    model = SVC(kernel='rbf', gamma=0.01, C=C)  #Define SVM of RBF kernel by changing C
    model.fit(X, y_int)  #Perform SVM learning
    ax = plot_decision_regions(X, y_int[:, 0], clf=model) 
    plt.xlabel('height [cm]')
    plt.ylabel('weight [kg]')
    legend_int_to_str(ax, y)
    plt.text(175, 140, f'gamma={model.gamma}, C={model.C}')  #Show gamma and C
    plt.show()

[Figure: decision boundaries for C = 10, 1, and 0.1 (gamma = 0.01)]

You can see that the larger C is, the less misclassification occurs but the more distorted the decision boundary becomes (closer to overfitting), while the smaller C is, the more misclassification occurs but the smoother the decision boundary becomes (underfitting = closer to generalization).

Summary

**・A support vector machine is an algorithm that maximizes the distance (margin) from the decision boundary to the nearest points**

**・The kernel trick enables non-linear decision boundaries**
**・The RBF kernel is the kernel most often used with the kernel trick**
**・The hyperparameter of the RBF kernel is "gamma"; the smaller it is, the closer the boundary is to linear**

**・The soft margin allows a certain amount of misclassification to improve generalization**
**・The hyperparameter of the soft margin is "C"; the smaller it is, the closer the model is to generalization (underfitting)**

In conclusion

The code has been uploaded here (running it in VS Code is recommended; lines 101 and after correspond to the code in this article, but please execute it in order from the first line).

Many people probably wonder **what values the parameters C and gamma should be set to**, so a separate article on parameter tuning is in preparation. I hope you look forward to it.
