Introduction to Machine Learning with scikit-learn: From Data Acquisition to Parameter Optimization

Overview

Let's work through an example: **"automatically identifying the handwritten zip code on a postcard"**.

This article is for beginners. It is mostly a digest of the scikit-learn tutorials and documentation, with some additional material. We use the digits dataset and an SVM (SVC, to be exact) as the machine learning method.

Dataset: digits

digits is a dataset of handwritten digit images paired with numeric labels. Later we will train a model on these label-image pairs. Because the dataset ships with scikit-learn, anyone can try it easily.

Read data

You can load the digits dataset with datasets.load_digits().

from sklearn import datasets
from matplotlib import pyplot as plt

digits = datasets.load_digits()

View the contents of the data

Each image is a handwritten digit from 0 to 9. In code, each image is an 8x8 two-dimensional array of grayscale values between 0 and 16. digits.data holds the same images flattened into rows of 64 values.

#Image array data
print(digits.data)
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
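To check the layout of the data directly, a short sketch like the following prints the array shapes and the value range. The variable names pixel_min and pixel_max are only illustrative.

```python
from sklearn import datasets

digits = datasets.load_digits()

# digits.images keeps each sample as a 2-D image; digits.data is the same data flattened
print(digits.images.shape)  # (1797, 8, 8)
print(digits.data.shape)    # (1797, 64)

# Smallest and largest pixel values in the whole dataset
pixel_min = digits.data.min()
pixel_max = digits.data.max()
print(pixel_min, pixel_max)  # 0.0 16.0
```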

The image data is hard to interpret as a raw array, so let's display it as images.

Before doing so, check the label data. As shown below, each image already has a label from 0 to 9 assigned.

#label
print(digits.target)
[0 1 2 ..., 8 9 8]

In the result above, for example, the first three images (indices 0, 1, 2) are labeled 0, 1, 2, and the second image from the end is labeled 9. You can use matplotlib to display these images.

#Image display
# number 0
plt.subplot(141), plt.imshow(digits.images[0], cmap = 'gray')
plt.title('number 0'), plt.xticks([]), plt.yticks([])

# number 1
plt.subplot(142), plt.imshow(digits.images[1], cmap = 'gray')
plt.title('number 1'), plt.xticks([]), plt.yticks([])

# number 2
plt.subplot(143), plt.imshow(digits.images[2], cmap = 'gray')
plt.title('number 2'), plt.xticks([]), plt.yticks([])

# number 9
plt.subplot(144), plt.imshow(digits.images[-2], cmap = 'gray')
plt.title('number 9'), plt.xticks([]), plt.yticks([])

plt.show()

[Figure: the four digit images 0, 1, 2, and 9 displayed side by side]

As you can see, each image appears to have the correct label.

Image classification by SVM

What is SVM

**SVM (Support Vector Machine)** is a supervised learning method with excellent recognition performance. At its core it is a two-class classifier that chooses the boundary maximizing the margin between the classes. It can also be applied to multi-class classification by combining several two-class classifiers.

A strict (hard margin) SVM cannot find a proper classification boundary when the classes overlap, that is, when the data cannot be separated completely. An SVM that tolerates errors is called a **soft margin SVM**. By assigning a penalty C to misclassification, it can draw a boundary that keeps misclassification small even for data that cannot be separated completely.

Note that the larger the penalty C, the more heavily errors are punished, and the more likely the model is to **overfit**.

(Note) Overfitting means that the model has fit random quirks of the training data that are unrelated to the features you actually want it to learn. When a model overfits, its performance on the training data improves, but its results on other data get worse. (Reference: Overfitting-Wikipedia)
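One way to see the effect of C is to fit the same model with several penalties and compare accuracy on the training data with accuracy on held-out data. The following is a minimal sketch. The values of C and gamma and the 50/50 split are arbitrary choices for illustration, not recommendations.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# Fit the same RBF model with a small, a medium, and a large penalty C
results = {}
for C in [0.01, 1, 100]:
    model = SVC(kernel='rbf', gamma=0.001, C=C).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
    test_acc = model.score(X_test, y_test)     # accuracy on data it has not seen
    results[C] = (train_acc, test_acc)
    print("C=%g  train accuracy=%.3f  test accuracy=%.3f" % (C, train_acc, test_acc))
```

A training accuracy that keeps rising while the test accuracy stalls or drops is the typical sign of overfitting.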

SVM with scikit-learn

scikit-learn actually provides several SVM classifiers: SVC, NuSVC, and LinearSVC. NuSVC and SVC are very similar, but they take slightly different parameter sets and use different mathematical formulations. LinearSVC is an SVM restricted to a linear kernel; no other kernel can be specified.

This time we use SVC with a soft margin. There are only two steps: **(1) create a classifier and (2) apply it to your data**.

(1) Creation of classifier

from sklearn import svm

# SVM
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-10], digits.target[:-10])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Although we specified only gamma and C here, the output above shows that SVC has quite a few parameters (C, cache_size, class_weight, coef0, ...). Don't worry about them at first; the default settings are fine.

(2) Image classification by classifier

Now use the classifier to estimate labels from images. Let's try it on the last 10 samples, which were not used to train the model.

clf.predict(digits.data[-10:])
array([5, 4, 8, 8, 4, 9, 0, 8, 9, 8])

The actual labels are

print(digits.target[-10:])
[5 4 8 8 4 9 0 8 9 8]

which agree with the predictions.

This confirms that the trained model can estimate the labels largely correctly. Next, let's try different parameters.
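As one way to try different parameters, the sketch below refits the classifier with a few (gamma, C) pairs and measures how many of the 10 held-out images are labeled correctly. The candidate values are arbitrary and chosen only for illustration.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
X_train, y_train = digits.data[:-10], digits.target[:-10]
X_held, y_held = digits.data[-10:], digits.target[-10:]

# Candidate (gamma, C) pairs; arbitrary values chosen for illustration
candidates = [(0.001, 100.), (0.001, 1.), (0.0001, 100.), (0.01, 100.)]

accuracies = {}
for gamma, C in candidates:
    clf = svm.SVC(gamma=gamma, C=C)
    clf.fit(X_train, y_train)
    # Fraction of the 10 held-out images whose predicted label matches the true label
    accuracies[(gamma, C)] = (clf.predict(X_held) == y_held).mean()
    print("gamma=%g C=%g accuracy=%.2f" % (gamma, C, accuracies[(gamma, C)]))
```

Ten samples are too few to rank parameters reliably, which is why the parameter optimization section relies on cross-validation instead.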

Accuracy evaluation of classifier

Evaluation index of the classifier

There are several indexes for evaluating the classification accuracy of a classifier. The basic ones are precision, recall, and the F value (the harmonic mean of precision and recall).

The accuracy of a classifier is usually evaluated with the F value. In practice, however, precision and recall often differ in importance depending on the application.
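Precision, recall, and the F value can be computed with the functions in sklearn.metrics. The following sketch uses made-up binary labels (they are not taken from the digits dataset) so that the counts are easy to verify by hand: 3 true positives, 1 false positive, and 1 false negative.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy binary labels (1 = positive, 0 = negative), made up for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# precision = TP / (TP + FP): how many predicted positives are really positive
precision = precision_score(y_true, y_pred)
# recall = TP / (TP + FN): how many real positives were found
recall = recall_score(y_true, y_pred)
# F value = harmonic mean of precision and recall
f_value = f1_score(y_true, y_pred)

print(precision, recall, f_value)  # 0.75 0.75 0.75
```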

Precision and recall

For example, consider parts inspection in a factory, treating "unbroken (OK)" as the positive class. Mistakenly classifying a sound part as "broken (error)" is not a big problem. However, if a broken part is mistakenly classified as "unbroken (OK)", it may lead to complaints and recalls, and depending on the product it may even be life-threatening. In such cases, precision is more important than recall. For example, comparing "precision 99% + recall 70%" with "precision 80% + recall 99%", the latter has the higher F value, but the former may well be far more practical.

On the other hand, when searching a database, recall is often more important than precision. Getting many wrong search results is much better than having a lot of data that the search cannot find.
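The F values of the two cases in the inspection example can be checked with a few lines of arithmetic. The helper f_value below is a hypothetical name for this sketch, and it assumes the F value means the usual F1 score, the harmonic mean of precision and recall.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (the F value, also called F1)."""
    return 2 * precision * recall / (precision + recall)

f_strict = f_value(0.99, 0.70)   # high precision, lower recall
f_lenient = f_value(0.80, 0.99)  # lower precision, high recall

print("%.3f" % f_strict)   # 0.820
print("%.3f" % f_lenient)  # 0.885
```

The second case scores higher on the F value even though the first one lets far fewer broken parts through, which is why a single index should not be trusted blindly.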

Parameter optimization

So far, the parameters were simply set to plausible values. This often fails to deliver the required classification accuracy, so in practice parameter optimization is essential. Which parameters should be tuned, and how, to improve the accuracy of the classifier? You can tune them one at a time by hand, but this is very laborious. For a given dataset and method there may be values known from experience to work well, but such knowledge does not carry over to unfamiliar datasets.

Therefore a method called **grid search** is often used. Put simply, the model is trained repeatedly while varying the parameters over a search range, and the parameter combination with the best accuracy is selected.

In addition, **cross-validation** is used to confirm that the model trained with the chosen parameters is not overfitting. k-fold cross-validation first divides the data into k parts. The model is trained on k-1 of them and evaluated on the remaining one. This is repeated k times, changing which part serves as test data each time, and the model is judged by the average score. This lets you evaluate the **generalization performance** of the model.

(Note) Good generalization performance simply means that the model can correctly classify unseen data. Recall that an overfitting model classifies the training data with high accuracy but is less accurate on unseen data.
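The k-fold procedure described above can be sketched with cross_val_score from sklearn.model_selection. Here k = 5, and the SVC parameters are example values only.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = datasets.load_digits()

# 5-fold cross-validation: train on 4 parts, evaluate on the remaining one, 5 times
fold_scores = cross_val_score(SVC(gamma=0.001, C=10),
                              digits.data, digits.target, cv=5)

print(fold_scores)         # one accuracy value per fold
print(fold_scores.mean())  # the average used to judge the model
```

If the scores differ widely between folds, the average alone may be a shaky estimate of generalization performance.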

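With scikit-learn, you can easily perform grid search with cross-validation using GridSearchCV(). For example, this article searches the following parameter combinations: for the rbf kernel, gamma in {1e-3, 1e-4} and C in {1, 10, 100, 1000}; for the linear kernel, C in {1, 10, 100, 1000}.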

Preparation

Before optimizing the parameters, convert the loaded data into the format the classifier expects.

from sklearn import datasets
# train_test_split and GridSearchCV live in sklearn.model_selection;
# the old sklearn.cross_validation and sklearn.grid_search modules were removed
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import SVC

#Loading Digits dataset
digits = datasets.load_digits()
print(len(digits.images))
print(digits.images.shape)
1797
(1797, 8, 8)
# To apply a classifier to this data, we need to flatten each image,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))  # n_samples rows; -1 lets NumPy compute the column count (8 * 8 = 64)
y = digits.target
print(X.shape)
print(y)
(1797, 64)
[0 1 2 ..., 8 9 8]
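As a quick sanity check on the reshape step, the sketch below confirms that each row of X is one image read row by row, and that the result matches the flattened array that scikit-learn already provides as digits.data. The variable names first_row_matches and same_as_data are illustrative.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
n_samples = len(digits.images)

# Flatten each 8x8 image into one row of 64 values
X = digits.images.reshape((n_samples, -1))

# The first row of X is the first image read row by row
first_row_matches = np.array_equal(X[0], digits.images[0].ravel())
# scikit-learn already ships the flattened form as digits.data
same_as_data = np.array_equal(X, digits.data)
print(first_row_matches, same_as_data)  # True True
```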

Grid search and cross-validation method

The code below may look daunting, but what it does is simple.

It tries every parameter combination listed in tuned_parameters and finds the parameters (best_params_) that maximize precision and recall in turn. (gamma is a parameter of the rbf kernel, so it is irrelevant when the kernel is linear.)

It then prints the grid search scores in detail, followed by a detailed report from classification_report().

# GridSearchCV and train_test_split live in sklearn.model_selection
from sklearn.model_selection import GridSearchCV, train_test_split

#Divide the dataset into training data and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

#Set the parameters you want to optimize with cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    #Grid search and cross-validation method
    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    # cv_results_ replaces grid_scores_, which was removed from scikit-learn
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
# Tuning hyper-parameters for precision

Best parameters set found on development set:

{'gamma': 0.001, 'kernel': 'rbf', 'C': 10}

Grid scores on development set:

0.987 (+/-0.018) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1}
0.959 (+/-0.030) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1}
0.988 (+/-0.018) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 10}
0.982 (+/-0.027) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 10}
0.988 (+/-0.018) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 100}
0.982 (+/-0.026) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 100}
0.988 (+/-0.018) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1000}
0.982 (+/-0.026) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1000}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 1}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 10}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 100}
0.974 (+/-0.014) for {'kernel': 'linear', 'C': 1000}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      0.99       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92

avg / total       0.99      0.99      0.99       899


# Tuning hyper-parameters for recall

Best parameters set found on development set:

{'gamma': 0.001, 'kernel': 'rbf', 'C': 10}

Grid scores on development set:

0.986 (+/-0.021) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1}
0.958 (+/-0.029) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1}
0.987 (+/-0.021) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 10}
0.981 (+/-0.029) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 10}
0.987 (+/-0.021) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 100}
0.981 (+/-0.027) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 100}
0.987 (+/-0.021) for {'gamma': 0.001, 'kernel': 'rbf', 'C': 1000}
0.981 (+/-0.027) for {'gamma': 0.0001, 'kernel': 'rbf', 'C': 1000}
0.973 (+/-0.015) for {'kernel': 'linear', 'C': 1}
0.973 (+/-0.015) for {'kernel': 'linear', 'C': 10}
0.973 (+/-0.015) for {'kernel': 'linear', 'C': 100}
0.973 (+/-0.015) for {'kernel': 'linear', 'C': 1000}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      0.99       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92

avg / total       0.99      0.99      0.99       899

From the output of print(clf.best_params_), we can see that 'gamma': 0.001, 'kernel': 'rbf', 'C': 10 is the best combination for both precision and recall. The parameters are now optimized.

If necessary, try optimizing over other kernels as well, or compare the result with learning methods other than SVM.
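A comparison with another learning method could look like the following sketch. k-nearest neighbors is an arbitrary example of a non-SVM method, its setting k=5 is untuned, and both models are scored with the same 5-fold cross-validation.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

digits = datasets.load_digits()

# Candidate models; k-nearest neighbors is an arbitrary example of a non-SVM method
models = {
    'SVC (rbf, gamma=0.001, C=10)': SVC(kernel='rbf', gamma=0.001, C=10),
    'k-nearest neighbors (k=5)': KNeighborsClassifier(n_neighbors=5),
}

mean_scores = {}
for name, model in models.items():
    # Mean accuracy over 5 folds for each model
    mean_scores[name] = cross_val_score(
        model, digits.data, digits.target, cv=5).mean()
    print("%s: %.3f" % (name, mean_scores[name]))
```

For a fair comparison, the other method's parameters should be tuned with grid search as well.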

References

[1] An introduction to machine learning with scikit-learn - scikit-learn 0.18.1 documentation http://scikit-learn.org/stable/tutorial/basic/tutorial.html#introduction
[2] Parameter estimation using grid search with cross-validation - scikit-learn 0.18.1 documentation http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html#example-model-selection-grid-search-digits-py
[3] 1.4. Support Vector Machines - scikit-learn 0.18.1 documentation http://scikit-learn.org/stable/modules/svm.html
[4] F value - Machine learning "Toki no Mori Wiki" http://ibisforest.org/index.php?F%E5%80%A4
[5] Master SVM! 8 checkpoints - Qiita http://qiita.com/pika_shi/items/5e59bcf69e85fdd9edb2
[6] Parameter optimization by grid search from Scikit learn http://qiita.com/SE96UoC5AfUt7uY/items/c81f7cea72a44a7bfd3a
[7] Introduction to Bayesian Optimization for Machine Learning | Tech Book Zone Manatee https://book.mynavi.jp/manatee/detail/id=59393
[8] 3.3. Model evaluation: quantifying the quality of predictions - scikit-learn 0.18.1 documentation http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
[9] Overfitting - Wikipedia https://ja.wikipedia.org/wiki/%E9%81%8E%E5%89%B0%E9%81%A9%E5%90%88
