- Classify handwritten digit image data with an SVM
- Evaluate the model score with cross-validation
- Change the hyperparameter C and see how the score changes
- Change the hyperparameter gamma and see how the score changes
The source is here.
Import the cross-validation module "cross_validation". The data is the handwritten digit dataset digits.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import svm, datasets, cross_validation
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
The following cross-validation methods are available (a construction sketch follows the list):
- KFold(n, k): Split the n samples into k batches. Use one batch for testing and the remaining (k-1) batches for training, then repeat k times, changing which batch is used for testing.
- StratifiedKFold(y, k): Split the data into k pieces while preserving the ratio of labels in each split.
- LeaveOneOut(n): Equivalent to KFold with k = n. Useful when the number of samples is small.
- LeaveOneLabelOut(labels): Split the data according to a given label. For example, for data tagged with a year, use this to test each year separately.
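For reference, a minimal sketch of how these splitters are constructed with the old sklearn.cross_validation API used in this article (LeaveOneLabelOut needs one group id per sample, so the labels variable below is hypothetical); in scikit-learn 0.20 and later the same classes live in sklearn.model_selection with slightly different signatures.
n = len(X_digits)  # number of samples, using the digits data loaded above
kf = cross_validation.KFold(n, n_folds=4)                    # plain k-fold
skf = cross_validation.StratifiedKFold(y_digits, n_folds=4)  # preserves label ratios in each fold
loo = cross_validation.LeaveOneOut(n)                        # same as KFold with k = n
# lolo = cross_validation.LeaveOneLabelOut(labels)           # labels: one group id (e.g. a year) per sample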
This time, I will use the simplest one, KFold, with 4 splits. As I noticed later, if you pass shuffle=True to KFold, the data order is shuffled randomly for you.
np.random.seed(0)  # random-number seed; it doesn't have to be 0
indices = np.random.permutation(len(X_digits))
X_digits = X_digits[indices]  # shuffle the order of the data
y_digits = y_digits[indices]
n_fold = 4  # number of cross-validation folds
k_fold = cross_validation.KFold(n=len(X_digits), n_folds=n_fold)
# k_fold = cross_validation.KFold(n=len(X_digits), n_folds=n_fold, shuffle=True)
# With shuffle=True, the four shuffling lines above are unnecessary.
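As a quick check (a sketch, not in the original article), each iteration over k_fold yields two arrays of row indices; with the 1797 digits samples and 4 folds, each test fold holds roughly 449-450 samples.
for train, test in k_fold:
    print(train.shape, test.shape)  # e.g. (1347,) (450,) on the first fold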
Change the hyperparameter C and see how the model's evaluation score changes. C controls how strongly misclassification is penalized: the smaller C is, the more misclassification is tolerated. The SVM kernel is a Gaussian (RBF) kernel. Reference: the past article "Recognizing handwritten numbers with SVM".
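For reference (not from the original article), the soft-margin SVM objective and the Gaussian (RBF) kernel show where C and gamma enter:

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad y_i(w^\top \phi(x_i)+b)\ge 1-\xi_i,\ \xi_i\ge 0,
\qquad
K(x,x') = \exp\!\left(-\gamma\|x-x'\|^2\right)
$$

A small C tolerates large slack (many margin violations), while a large gamma makes the kernel very local, which is why the decision boundary grows more complex.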
C_list = np.logspace(-8, 2, 11)  # candidate values of C
score = np.zeros((len(C_list), 3))
tmp_train, tmp_test = list(), list()
# score_train, score_test = list(), list()
i = 0
for C in C_list:
    svc = svm.SVC(C=C, kernel='rbf', gamma=0.001)
    for train, test in k_fold:
        svc.fit(X_digits[train], y_digits[train])
        tmp_train.append(svc.score(X_digits[train], y_digits[train]))
        tmp_test.append(svc.score(X_digits[test], y_digits[test]))
    score[i, 0] = C
    score[i, 1] = sum(tmp_train) / len(tmp_train)  # mean training score over the folds
    score[i, 2] = sum(tmp_test) / len(tmp_test)    # mean test score over the folds
    del tmp_train[:]  # clear for the next C
    del tmp_test[:]
    i = i + 1
If you only care about the test scores, it is simpler to use cross_val_score. You can also specify the number of CPUs to use with n_jobs; -1 uses all CPUs.
cross_validation.cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
The evaluation scores are returned as an array, one value per fold.
array([ 0.98888889, 0.99109131, 0.99331849, 0.9844098 ])
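If you want a single summary number, you can average the per-fold scores (a small sketch; the printed value is just the mean of the array above):
scores = cross_validation.cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
print(scores.mean(), scores.std())  # mean is about 0.989 for the scores shown above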
Plot the training and test scores with C on the horizontal axis. When C is small, the score does not improve, probably because too much misclassification is tolerated.
xmin, xmax = score[:, 0].min(), score[:, 0].max()
ymin, ymax = score[:, 1:3].min() - 0.1, score[:, 1:3].max() + 0.1  # cover both the train and test columns
plt.semilogx(score[:, 0], score[:, 1], c="r", label="train")
plt.semilogx(score[:, 0], score[:, 2], c="b", label="test")
plt.axis([xmin, xmax, ymin, ymax])
plt.legend(loc='upper left')
plt.xlabel('C')
plt.ylabel('score')
plt.show()
Next, fix C at 100 and run the same experiment while changing gamma. The larger gamma is, the more complex the classification boundary becomes.
g_list = np.logspace(-8, 2, 11)  # candidate values of gamma
score = np.zeros((len(g_list), 3))
tmp_train, tmp_test = list(), list()
i = 0
for gamma in g_list:
    svc = svm.SVC(C=100, gamma=gamma, kernel='rbf')
    for train, test in k_fold:
        svc.fit(X_digits[train], y_digits[train])
        tmp_train.append(svc.score(X_digits[train], y_digits[train]))
        tmp_test.append(svc.score(X_digits[test], y_digits[test]))
    score[i, 0] = gamma
    score[i, 1] = sum(tmp_train) / len(tmp_train)  # mean training score over the folds
    score[i, 2] = sum(tmp_test) / len(tmp_test)    # mean test score over the folds
    del tmp_train[:]
    del tmp_test[:]
    i = i + 1
Here are the results. As gamma increases, both the training and test scores rise at first, but beyond about 0.001 the training score stays flat while the test score drops. The model seems to be overfitting because the boundary has become too complex. Clearly, the choice of hyperparameter values matters.
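Since C and gamma interact, a natural next step is to search over both at once. A minimal sketch, assuming the old sklearn.grid_search.GridSearchCV API that matches the cross_validation module used above (the parameter ranges are just examples); in scikit-learn 0.20 and later it lives in sklearn.model_selection.
from sklearn.grid_search import GridSearchCV

param_grid = {'C': np.logspace(-2, 2, 5), 'gamma': np.logspace(-5, -1, 5)}  # example search ranges
grid = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=n_fold, n_jobs=-1)
grid.fit(X_digits, y_digits)
print(grid.best_params_, grid.best_score_)  # best hyperparameters and their mean CV score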