2016.09.14: Added a note on the variation in processing time.
I compared the processing speeds of scikit-learn's SVC (rbf kernel and linear kernel) and LinearSVC.
The data used is the spam dataset included in the R kernlab package: 4601 samples with 57 explanatory variables (dimensions), labeled spam (1813 samples) or nonspam (2788 samples).
The results of varying the number of dimensions and the number of samples are as follows.
SVC with the linear kernel is far too slow. I would like to simply grid-search over the kernel type as well, but it seems better to switch to LinearSVC for the linear case.
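For reference, one way to keep the kernel type inside a single grid search while still substituting LinearSVC for the linear case is to make the classifier itself a grid parameter. A minimal sketch, assuming a scikit-learn version where a Pipeline step can be swapped through the parameter grid (newer releases support this; the step name clf is arbitrary):

from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 0.17
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC

# The 'clf' step is itself a grid parameter, so each sub-grid can carry
# its own estimator: SVC with the RBF kernel, or LinearSVC for the linear case.
pipe = Pipeline([('clf', SVC())])
param_grid = [
    {'clf': [SVC(kernel='rbf')], 'clf__C': [1, 10, 100]},
    {'clf': [LinearSVC()], 'clf__C': [1, 10, 100]},
]
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=3)
# search.fit(X, y); search.best_params_['clf'] then tells which classifier won.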
The verification code is below. The parameter C is included in the grid only to give the timing measurement something to iterate over. For feature selection (dimension reduction) I used Random Forest feature importances, because picking features arbitrarily made the processing time longer.
test_svm.py
# -*- coding: utf-8 -*-
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from scipy.stats.mstats import mquantiles


def grid_search(X, y, estimator, params, cv, n_jobs=3):
    # Run the grid search and return only the elapsed wall-clock time.
    mdl = GridSearchCV(estimator, params, cv=cv, n_jobs=n_jobs)
    t1 = time.clock()
    mdl.fit(X, y)
    t2 = time.clock()
    return t2 - t1
if __name__ == "__main__":
    # Load the kernlab spam data ('type' column holds the spam/nonspam label).
    data = pd.read_csv('spam.txt', header=0)
    y = data['type']
    del data['type']
    data, y = shuffle(data, y, random_state=0)
    data = StandardScaler().fit_transform(data)

    # Rank the features once with a Random Forest; its importances drive
    # the dimension reduction below.
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(data, y)

    # Vary the number of dimensions by thresholding the feature importances
    # at their quantiles (from keeping only the top feature to keeping all).
    ndim, elp_rbf, elp_lnr, elp_lsvm = [], [], [], []
    for thr in mquantiles(clf.feature_importances_, prob=np.linspace(1., 0., 5)):
        print thr,
        X = data[:, clf.feature_importances_ >= thr]
        ndim.append(X.shape[1])
        cv = cross_validation.StratifiedShuffleSplit(y, test_size=0.2,
                                                     random_state=0)
        print 'rbf',
        elp_rbf.append(grid_search(X, y, SVC(random_state=0),
                                   [{'kernel': ['rbf'], 'C': [1, 10, 100]}], cv))
        print 'linear',
        elp_lnr.append(grid_search(X, y, SVC(random_state=0),
                                   [{'kernel': ['linear'], 'C': [1, 10, 100]}], cv))
        print 'lsvm'
        elp_lsvm.append(grid_search(X, y, LinearSVC(random_state=0),
                                    [{'C': [1, 10, 100]}], cv))
    plt.figure()
    plt.title('Elapsed time - # of dimensions')
    plt.ylabel('Elapsed time [sec]')
    plt.xlabel('# of dimensions')
    plt.grid()
    plt.plot(ndim, elp_rbf, 'o-', color='r', label='SVM(rbf)')
    plt.plot(ndim, elp_lnr, 'o-', color='g', label='SVM(linear)')
    plt.plot(ndim, elp_lsvm, 'o-', color='b', label='LinearSVM')
    plt.legend(loc='best')
    plt.savefig('dimensions.png', bbox_inches='tight')
    plt.close()
    # Vary the number of samples, keeping all 57 dimensions.
    nrow, elp_rbf, elp_lnr, elp_lsvm = [], [], [], []
    for r in np.linspace(0.1, 1., 5):
        print r,
        n = int(r * data.shape[0])  # slice bounds must be integers
        X = data[:n, :]
        yy = y[:n]
        nrow.append(X.shape[0])
        cv = cross_validation.StratifiedShuffleSplit(yy, test_size=0.2,
                                                     random_state=0)
        print 'rbf',
        elp_rbf.append(grid_search(X, yy, SVC(random_state=0),
                                   [{'kernel': ['rbf'], 'C': [1, 10, 100]}], cv))
        print 'linear',
        elp_lnr.append(grid_search(X, yy, SVC(random_state=0),
                                   [{'kernel': ['linear'], 'C': [1, 10, 100]}], cv))
        print 'lsvm'
        elp_lsvm.append(grid_search(X, yy, LinearSVC(random_state=0),
                                    [{'C': [1, 10, 100]}], cv))
    plt.figure()
    plt.title('Elapsed time - # of samples')
    plt.ylabel('Elapsed time [sec]')
    plt.xlabel('# of samples')
    plt.grid()
    plt.plot(nrow, elp_rbf, 'o-', color='r', label='SVM(rbf)')
    plt.plot(nrow, elp_lnr, 'o-', color='g', label='SVM(linear)')
    plt.plot(nrow, elp_lsvm, 'o-', color='b', label='LinearSVM')
    plt.legend(loc='best')
    plt.savefig('samples.png', bbox_inches='tight')
    plt.close()
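As an aside for anyone running this today: the cross_validation and grid_search modules were removed in later scikit-learn releases in favor of model_selection, and time.clock() is gone in Python 3.8+. A rough, untested sketch of the timing helper under the current API (n_splits=10 mirrors the old default n_iter):

import time

from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit


def grid_search_time(X, y, estimator, params, n_jobs=3):
    # The splitter no longer takes y at construction; it splits inside fit().
    cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    mdl = GridSearchCV(estimator, params, cv=cv, n_jobs=n_jobs)
    t1 = time.perf_counter()  # replaces time.clock()
    mdl.fit(X, y)
    return time.perf_counter() - t1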
I received a comment about the processing time of SVM (linear), so I checked it. With Python 2.7.12 and scikit-learn 0.17.1, the figure below shows the variation in processing time over 200 trials on 1000 samples with 29 features.
SVM (linear) really does look suspicious...
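For reference, a minimal sketch of how such a variation measurement could be reproduced, matching the Python 2 / scikit-learn 0.17 setup above (the 1000-sample, 29-feature data is whatever subset you carve out, so the absolute times will differ by machine):

import time

import numpy as np
from sklearn.svm import SVC


def time_fits(estimator, X, y, n_trials=200):
    # Fit the same estimator repeatedly and collect the elapsed times.
    elapsed = []
    for _ in range(n_trials):
        t1 = time.clock()
        estimator.fit(X, y)
        elapsed.append(time.clock() - t1)
    return np.asarray(elapsed)

# e.g. with X of shape (1000, 29) as in the check above:
# times = time_fits(SVC(kernel='linear', random_state=0), X, y)
# print times.mean(), times.std(), times.max()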