I tried scikit-learn's Cross Validation and Grid Search.
Cross Validation
More details can be found on Wikipedia. Cross validation is one way to verify the validity of a model. In general, the development data is split into training data and validation data. If you simply do this once, however, the amount of training data shrinks, and the generalization performance may suffer depending on how the training data happens to be selected. This simple split is the hold-out test described on Wikipedia, and it is not what is usually meant by cross validation.
Cross validation usually refers to K-fold cross validation. In K-fold cross validation, the development data is divided into K parts; K-1 parts are used for training and the remaining one is used for validation to evaluate the model. This means more of the data can be used for training, and because the training folds rotate, the generalization performance can also be assessed.
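To make the splitting itself concrete, here is a minimal sketch of K-fold splitting (assuming the same older scikit-learn API used later in this post, where KFold lives in sklearn.cross_validation; the 10-sample array is made up purely for illustration):

# -*- coding: utf-8 -*-
import numpy as np
from sklearn import cross_validation

# Toy data: 10 samples, just to show how the folds are formed
X = np.arange(10)

# K = 5: each sample is used for validation exactly once
kf = cross_validation.KFold(len(X), n_folds=5)
for train_index, test_index in kf:
    # 8 samples train, 2 samples validate in each fold
    print "train:", X[train_index], "validate:", X[test_index]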
Below is how to do this concretely with scikit-learn. The data used for training is from Kaggle's Data Science London competition.
SVM
First, the code for classifying with a support vector machine:
# -*- coding: utf-8 -*-
import os
import sys
import csv
import numpy as np
from sklearn import svm

if __name__ == "__main__":
    # Load the training features and labels
    train_feature_file = np.genfromtxt(open("../data/train.csv", "rb"), delimiter=",", dtype=float)
    train_label_file = np.genfromtxt(open("../data/trainLabels.csv", "rb"), delimiter=",", dtype=float)

    train_features = []
    train_labels = []
    for train_feature, train_label in zip(train_feature_file, train_label_file):
        train_features.append(train_feature)
        train_labels.append(train_label)

    train_features = np.array(train_features)
    train_labels = np.array(train_labels)

    # Train an RBF-kernel SVM
    clf = svm.SVC(C=100, cache_size=200, class_weight=None, coef0=0.0, degree=3,
                  gamma=0.001, kernel="rbf", max_iter=-1, probability=False,
                  random_state=None, shrinking=True, tol=0.001, verbose=False)
    clf.fit(train_features, train_labels)

    # Predict on the test data and print in Kaggle submission format
    test_feature_file = np.genfromtxt(open("../data/test.csv", "rb"), delimiter=",", dtype=float)

    print "Id,Solution"
    i = 1
    for test_feature in test_feature_file:
        print str(i) + "," + str(int(clf.predict(test_feature)[0]))
        i += 1
Let's validate this model with Cross Validation.
from sklearn import cross_validation

def get_score(clf, train_features, train_labels):
    # Hold out 40% of the development data for validation
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(train_features, train_labels, test_size=0.4, random_state=0)
    clf.fit(X_train, y_train)
    print clf.score(X_test, y_test)
cross_validation.train_test_split is a function that splits the development data so that a given fraction becomes validation data. Since test_size=0.4 is specified here, 40% of the data is used for validation. fit is performed on the remaining 60%, and score evaluates the trained model on the held-out 40% and returns the accuracy. This is the validity of the model on this particular split, and higher is of course better, but it tells us nothing about generalization performance on its own. By dividing the data into K parts we can instead run K validations, and averaging those scores expresses the validity of the model including its generalization performance.
def get_accuracy(clf, train_features, train_labels):
    # 10-fold cross validation: returns one accuracy score per fold
    scores = cross_validation.cross_val_score(clf, train_features, train_labels, cv=10)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
You can get the score for each of these validations with cross_validation.cross_val_score. The cv argument specifies the number of folds K; here the development data is divided into 10 parts and validated 10 times. scores is an array of the 10 resulting scores, and their average is reported as Accuracy. This gives the validity of the model including generalization performance, but the model parameters still have to be tuned by hand.
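Looking at the individual fold scores, not just their mean, also shows how stable the model is across folds. A small sketch, reusing clf and the training data from above:

# Sketch: inspect the per-fold scores behind the Accuracy figure
scores = cross_validation.cross_val_score(clf, train_features, train_labels, cv=10)
print scores          # 10 accuracies, one per fold
print scores.mean()   # the value reported as Accuracy above
print scores.std()    # a large spread suggests unstable generalization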
Adjusting the parameters by hand and recomputing the Accuracy each time is tedious, but a technique called Grid Search can automate this tuning to some extent.
Grid Search
Grid Search is a method that empirically searches for the best combination of parameters within ranges you specify. With scikit-learn it looks like this:
from sklearn.grid_search import GridSearchCV

def grid_search(train_features, train_labels):
    # Candidate parameter combinations to search over
    param_grid = [
        {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
        {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
    ]
    clf = GridSearchCV(svm.SVC(C=1), param_grid, n_jobs=-1)
    clf.fit(train_features, train_labels)
    print clf.best_estimator_
The parameter ranges are specified in param_grid. n_jobs sets the number of processes that run the computation in parallel; with -1, all available cores are used. Grid Search is then performed on the given training data. It takes some time, but it picks the parameter combination that scores highest on this training data, and those parameters can then be used on the actual test data.
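For example, the result of the search can be reused directly for prediction. A minimal sketch, assuming grid_search is changed to return clf and that test_feature_file is the test array loaded earlier:

# Sketch: reuse the result of the grid search
# (assumes grid_search() is modified to return clf)
clf = grid_search(train_features, train_labels)
print clf.best_params_   # the parameter combination that scored best
print clf.best_score_    # its cross-validated score

# By default GridSearchCV refits the best estimator on all of the
# training data, so the fitted object can be applied to the test data
predictions = clf.predict(test_feature_file)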