A linear SVM (Support Vector Machine) is a machine learning model that separates and classifies a feature space with a linear boundary. When the data cannot be separated linearly, an SVM can still separate it non-linearly by using the kernel method.
Until now I didn't really understand the kernel method, but the following article explained it very clearly:
About the kernel method in machine learning - Memomemo
I am trying this in a Jupyter Notebook environment set up according to the following article: Easy installation and startup of Jupyter Notebook using Docker (also supports nbextensions and Scala) - Qiita
In this environment, you can access port 8888 in a browser and use Jupyter Notebook. You can open a new notebook from the button at the upper right via New > Python 3.
I am also using a randomly generated CSV file: https://github.com/suzuki-navi/sample-data/blob/master/sample-data-1.csv
Read the data from the CSV file and make it a DataFrame object.
import pandas as pd
from sklearn import model_selection
df = pd.read_csv("sample-data-1.csv", names=["id", "target", "data1", "data2", "data3"])
df
df is a Pandas DataFrame object.
Reference: Try basic operations for Pandas DataFrame on Jupyter Notebook - Qiita
This CSV data has three feature variables, data1, data2, and data3. Let's check the state of the data with scatter plots.
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(df["data1"], df["data2"], c = df["target"])
plt.scatter(df["data1"], df["data3"], c = df["target"])
plt.scatter(df["data2"], df["data3"], c = df["target"])
Reference: Display histogram / scatter plot on Jupyter Notebook - Qiita
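Plotting all three feature pairs side by side in one figure makes them easier to compare than three separate cells. A minimal sketch with matplotlib subplots; since the CSV may not be at hand, a synthetic DataFrame with the same column layout stands in for sample-data-1.csv (an assumption, not the article's data):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for sample-data-1.csv (same column layout, 300 rows)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "target": rng.integers(0, 2, 300),
    "data1": rng.normal(size=300),
    "data2": rng.normal(size=300),
    "data3": rng.normal(size=300),
})

# One panel per feature pair, colored by target
pairs = [("data1", "data2"), ("data1", "data3"), ("data2", "data3")]
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (x, y) in zip(axes, pairs):
    ax.scatter(df[x], df[y], c=df["target"])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
fig.tight_layout()
```

With the real CSV you would replace the synthetic DataFrame with the pd.read_csv call from above.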
Looking at the scatter plots, it seems the data can be classified into two classes using data2 and data3, so let's try that.
feature = df[["data2", "data3"]]
target = df["target"]
feature
feature is a Pandas DataFrame object and target is a Pandas Series object.
There are 300 records. Split them into training data and validation data, for both the feature variables and the objective variable. You could simply cut the records in two, but model_selection.train_test_split makes it easy, and it splits the data randomly.
feature_train, feature_test, target_train, target_test = model_selection.train_test_split(feature, target, test_size=0.2)
test_size=0.2 specifies that 20% of all the data is used as validation data.
The feature variables (df[["data2", "data3"]], feature_train, feature_test) are Pandas DataFrame objects, and the objective variables (df["target"], target_train, target_test) are Series objects.
Train the model on the created training data (feature_train, target_train).
from sklearn import svm
model = svm.SVC(kernel="linear")
model.fit(feature_train, target_train)
SVC(kernel="linear") is a linearly separating SVM classifier model. Let's train it with fit.
Reference: sklearn.svm.SVC — scikit-learn 0.21.3 documentation
With the trained model, create predictions (pred_train) from the feature variables of the training data (feature_train), compare them with the objective variable (target_train), and evaluate the accuracy rate. The function metrics.accuracy_score makes this easy.
from sklearn import metrics
pred_train = model.predict(feature_train)
metrics.accuracy_score(target_train, pred_train)
Because the split is random, the result may differ each run, but it showed 0.95.
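The run-to-run variation comes from the random split; for a deterministic fit, passing random_state to train_test_split makes the numbers reproducible. A minimal sketch, with synthetic data standing in for the article's CSV (an assumption):

```python
import numpy as np
from sklearn import model_selection

# Synthetic stand-in for the feature/target data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The same random_state reproduces exactly the same split every run
a_tr, a_te, _, _ = model_selection.train_test_split(X, y, test_size=0.2, random_state=42)
b_tr, b_te, _, _ = model_selection.train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(a_tr, b_tr))  # True: the two splits are identical
```

Omitting random_state, as the article does, gives a fresh random split each run, which is why the accuracy figures below may not be reproduced exactly.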
Evaluate with the validation data to see whether the model is overfitted or generalizes well.
pred_test = model.predict(feature_test)
metrics.accuracy_score(target_test, pred_test)
It showed 0.9333333333333333. I'm not sure whether that is good enough.
Apart from scikit-learn, you can use plotting.plot_decision_regions from the mlxtend package to visualize how the data is classified in a scatter plot. plot_decision_regions needs NumPy arrays rather than Pandas objects, so convert them with the to_numpy() method.
from mlxtend import plotting
plotting.plot_decision_regions(feature.to_numpy(), target.to_numpy(), clf=model)
Looks good.
References: plot_decision_regions - Mlxtend.plotting - mlxtend; pandas.DataFrame.to_numpy — pandas 0.25.3 documentation
I would like to try nonlinear separation. Let's use the RBF kernel.
All you have to do is change svm.SVC(kernel="linear") to svm.SVC(kernel="rbf", gamma="scale"). gamma is a hyperparameter of the RBF kernel; if you specify "scale", it is computed automatically from the number of feature variables and the variance of the training data.
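Per the scikit-learn documentation, gamma="scale" resolves to 1 / (n_features * X.var()). A sketch verifying this on synthetic data (the ring-shaped labels are an illustrative assumption, not the article's CSV):

```python
import numpy as np
from sklearn import svm

# Synthetic data with a non-linear (ring-shaped) class boundary
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# gamma="scale" resolves to 1 / (n_features * X.var())
gamma_value = 1.0 / (X.shape[1] * X.var())

# Fitting with the explicit value should behave identically to "scale"
m1 = svm.SVC(kernel="rbf", gamma="scale").fit(X, y)
m2 = svm.SVC(kernel="rbf", gamma=gamma_value).fit(X, y)
print((m1.predict(X) == m2.predict(X)).all())  # True: same predictions
```

This is why "scale" adapts to the data: features with larger variance automatically get a smaller gamma.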
The code below will create, train, infer, and even evaluate the model.
model = svm.SVC(kernel="rbf", gamma="scale")
model.fit(feature_train, target_train)
pred_train = model.predict(feature_train)
metrics.accuracy_score(target_train, pred_train)
It showed 0.95.
Evaluate with the validation data to check the generalization performance.
pred_test = model.predict(feature_test)
metrics.accuracy_score(target_test, pred_test)
It showed 0.95, slightly better than the linear separation earlier.
plotting.plot_decision_regions(feature.to_numpy(), target.to_numpy(), clf=model)
Since the model is non-linear, the classes are indeed separated by a curve. This sample was easy to separate linearly, though, so the non-linear kernel may not have been necessary.
Since data2 and data3 can be separated linearly, let's try the RBF kernel on other combinations of features.
First, data1 and data2. The following code produces only the figure showing the separation.
feature = df[["data1", "data2"]]
target = df["target"]
feature_train, feature_test, target_train, target_test = model_selection.train_test_split(feature, target, test_size=0.2)
model = svm.SVC(kernel="rbf", gamma="scale")
model.fit(feature_train, target_train)
plotting.plot_decision_regions(feature.to_numpy(), target.to_numpy(), clf=model)
Let's check the accuracy rate.
pred_train = model.predict(feature_train)
metrics.accuracy_score(target_train, pred_train)
It was 0.7583333333333333.
pred_test = model.predict(feature_test)
metrics.accuracy_score(target_test, pred_test)
It was 0.7833333333333333.
By the way, trying the linear kernel (kernel="linear") on the same data gave 0.71 to 0.74. Looking at the figure, the kernel method seems to be working hard, but the numbers are not that different. Perhaps we shouldn't expect too much just because the model can be non-linear.
I also tried data1 and data3, but the result was similar, so I omit it.
That's all.
Sequel: Try clustering with a Gaussian mixture model on Jupyter Notebook - Qiita