About the processing speed of scikit-learn's SVM (SVC)

2016.09.14: Added a postscript about the variation in processing time.

I compared the processing speeds of scikit-learn's SVC (rbf kernel and linear kernel) and LinearSVC.

The data used is the spam dataset included in R's kernlab package: 4601 samples with 57 explanatory variables (dimensions). The labels are spam (1813 samples) and nonspam (2788 samples).
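As a quick sanity check (a minimal sketch, assuming spam.txt was exported from R with write.csv and the row-name column dropped), the shape and class balance can be verified like this:

import pandas as pd

data = pd.read_csv('spam.txt', header=0)
print data.shape                    # expected: (4601, 58), i.e. 57 features plus the 'type' label
print data['type'].value_counts()   # expected: nonspam 2788, spam 1813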

The results when the number of samples and the number of dimensions are changed are as follows.

[Figure: result.png — elapsed time versus the number of dimensions and the number of samples]

The linear kernel of SVC is far too slow. I would have liked to simply include the kernel type in a single grid search, but it seems better to use LinearSVC for the linear case. (SVC is backed by libsvm, whose training time grows roughly quadratically to cubically with the number of samples, whereas LinearSVC uses liblinear, which scales much better with sample count.)
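Concretely, rather than putting 'linear' into SVC's kernel grid, I run one search per estimator. A minimal sketch of that pattern (same old-style API as the script below; cv=5 is just a placeholder here):

from sklearn.svm import SVC, LinearSVC
from sklearn.grid_search import GridSearchCV

# Separate searches instead of a single {'kernel': ['linear', 'rbf']} grid on SVC.
rbf_search = GridSearchCV(SVC(random_state=0),
                          [{'kernel': ['rbf'], 'C': [1, 10, 100]}], cv=5)
lin_search = GridSearchCV(LinearSVC(random_state=0),
                          [{'C': [1, 10, 100]}], cv=5)
# Fit both on the same data and compare best_score_ to choose the kernel afterwards.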

The verification code is below. The parameter C is included in the grid only for the convenience of the timing measurement. For feature selection (dimensionality reduction) I used Random Forest feature importances, because the processing time became longer when features were selected arbitrarily.

test_svm.py


# -*- coding: utf-8 -*-

import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from scipy.stats.mstats import mquantiles


def grid_search(X, y, estimator, params, cv, n_jobs=3):
    """Run a grid search and return the elapsed wall-clock time in seconds."""
    mdl = GridSearchCV(estimator, params, cv=cv, n_jobs=n_jobs)
    t1 = time.time()  # wall-clock time; time.clock() would miss work done in the n_jobs worker processes
    mdl.fit(X, y)
    t2 = time.time()
    return t2 - t1


if __name__=="__main__":
    data = pd.read_csv('spam.txt', header=0)
    y = data['type']
    del data['type']
    
    data, y = shuffle(data, y, random_state=0)
    data = StandardScaler().fit_transform(data)
    
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(data, y)

    # Experiment 1: fix the number of samples and vary the number of dimensions.
    ndim, elp_rbf, elp_lnr, elp_lsvm = [], [], [], []
    for thr in mquantiles(clf.feature_importances_, prob=np.linspace(1., 0., 5)):
        print thr,
        # Keep only the features whose importance reaches the threshold.
        X = data[:, clf.feature_importances_ >= thr]
        ndim.append(X.shape[1])
        
        cv = cross_validation.StratifiedShuffleSplit(y, test_size=0.2, random_state=0)

        print 'rbf',
        elp_rbf.append(grid_search(X, y, SVC(random_state=0),
            [{'kernel': ['rbf'], 'C': [1, 10, 100]}], cv))

        print 'linear',
        elp_lnr.append(grid_search(X, y, SVC(random_state=0),
            [{'kernel': ['linear'], 'C': [1, 10, 100]}], cv))

        print 'lsvm'
        elp_lsvm.append(grid_search(X, y, LinearSVC(random_state=0),
            [{'C': [1, 10, 100]}], cv))

    plt.figure()
    plt.title('Elapsed time - # of dimensions')
    plt.ylabel('Elapsed time [sec]')
    plt.xlabel('# of dimensions')
    plt.grid()
    plt.plot(ndim, elp_rbf, 'o-', color='r',
             label='SVM(rbf)')
    plt.plot(ndim, elp_lnr, 'o-', color='g',
             label='SVM(linear)')
    plt.plot(ndim, elp_lsvm, 'o-', color='b',
             label='LinearSVM')
    plt.legend(loc='best')
    plt.savefig('dimensions.png', bbox_inches='tight')
    plt.close()


    # Experiment 2: keep all dimensions and vary the number of samples.
    nrow, elp_rbf, elp_lnr, elp_lsvm = [], [], [], []
    for r in np.linspace(0.1, 1., 5):
        print r,
        n = int(r * data.shape[0])  # slice bounds must be integers, not floats
        X = data[:n, :]
        yy = y[:n]
        nrow.append(X.shape[0])
        
        cv = cross_validation.StratifiedShuffleSplit(yy, test_size=0.2, random_state=0)

        print 'rbf',
        elp_rbf.append(grid_search(X, yy, SVC(random_state=0),
            [{'kernel': ['rbf'], 'C': [1, 10, 100]}], cv))

        print 'linear',
        elp_lnr.append(grid_search(X, yy, SVC(random_state=0),
            [{'kernel': ['linear'], 'C': [1, 10, 100]}], cv))

        print 'lsvm'
        elp_lsvm.append(grid_search(X, yy, LinearSVC(random_state=0),
            [{'C': [1, 10, 100]}], cv))

    plt.figure()
    plt.title('Elapsed time - # of samples')
    plt.ylabel('Elapsed time [sec]')
    plt.xlabel('# of samples')
    plt.grid()
    plt.plot(nrow, elp_rbf, 'o-', color='r',
             label='SVM(rbf)')
    plt.plot(nrow, elp_lnr, 'o-', color='g',
             label='SVM(linear)')
    plt.plot(nrow, elp_lsvm, 'o-', color='b',
             label='LinearSVM')
    plt.legend(loc='best')
    plt.savefig('samples.png', bbox_inches='tight')
    plt.close()
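For reference, the cross_validation and grid_search modules used above were merged into model_selection in scikit-learn 0.18, so under newer versions the same measurement would be set up roughly as follows (a sketch of the API migration, reusing X, y and SVC from the script above; not part of the original experiment):

from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# The newer splitter no longer takes y at construction time; it is passed to fit instead.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
mdl = GridSearchCV(SVC(random_state=0),
                   [{'kernel': ['rbf'], 'C': [1, 10, 100]}], cv=cv, n_jobs=3)
mdl.fit(X, y)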

Postscript

I received a comment about the processing time of SVM (linear), so I checked it. The figure below shows the variation in processing time over 200 trials with Python 2.7.12 and scikit-learn 0.17.1, using 1000 samples and 29 features.

SVM (linear) looks suspicious ...

[Figure: freq.png — histogram of processing times over the 200 trials]
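A variation check like this can be reproduced with a loop of repeated fits; below is a minimal sketch (assuming the shuffled, standardized data and y from the script above; the 1000-sample, 29-feature subset is a hypothetical slice standing in for whichever importance threshold produced 29 features):

# Time 200 single fits of the linear-kernel SVC and plot the distribution.
times = []
X29 = data[:1000, :29]   # hypothetical subset: first 1000 samples, first 29 columns
yy = y[:1000]
for i in range(200):
    t1 = time.time()
    SVC(kernel='linear', random_state=0).fit(X29, yy)
    times.append(time.time() - t1)

plt.figure()
plt.hist(times, bins=30)
plt.xlabel('Elapsed time [sec]')
plt.ylabel('Frequency')
plt.savefig('freq_check.png', bbox_inches='tight')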
