This article is a polished write-up of the hands-on study session Tech-Circle Let's Start Application Development Using Machine Learning. It is a customized version of the tutorial content presented at PyCon 2015.

Goal

Understand the outline of machine learning
Get hints on how to use machine learning
Get a machine learning development environment

Advance preparation

The following article summarizes the procedure for building an environment, so please set it up according to your environment.

Building a machine learning application development environment with Python

Commentary on machine learning

First, I will explain the outline of machine learning.

Machine Learning Bootstrap

The hands-on starts from P37 of the slides. The slides provide the explanations and this article provides the steps to carry out, so proceed in this order: read the explanation on the slides, then try the procedure here.

Hands-on procedure

**The hands-on is based on Python 3**

# 0 Preparation

# 0-1 Download source code

**Fork** the following GitHub repository and download the source from your fork: icoxfog417/number_recognizer

You can also clone it as it is, but if you do not fork, you will have no repository of your own to push the results of your training to.

# 0-2 Application operation check

Activate the virtual environment

(The following assumes that you prepared the environment as described in the advance preparation, that is, you created a virtual environment named ml_env with conda. If you used a different name, adjust the commands accordingly.)

Windows

activate ml_env

Mac/Linux

# To avoid a conflict with pyenv, look up the path of the virtual environment and run its activate script directly
conda info -e
# conda environments:
#
ml_env                   /usr/local/pyenv/versions/miniconda3-3.4.2/envs/ml_env
root                  *  /usr/local/pyenv/versions/miniconda3-3.4.2

source /usr/local/pyenv/versions/miniconda3-3.4.2/envs/ml_env/bin/activate ml_env
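
Once the environment is active, a quick sanity check can confirm that the interpreter is Python 3 and that the libraries can be imported. This is only a sketch: the default package list below (numpy, scipy, sklearn, matplotlib) is an assumption about what the preparation article installs, and check_environment is a hypothetical helper, not part of the repository.

```python
import importlib.util
import sys


def check_environment(packages=("numpy", "scipy", "sklearn", "matplotlib")):
    """Return a dict mapping each package name to True if it can be imported.

    The default package list is an assumption made for illustration.
    """
    if sys.version_info[0] < 3:
        raise RuntimeError("This hands-on expects Python 3")
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, available in check_environment().items():
        print("{0}: {1}".format(name, "ok" if available else "MISSING"))
```

If any package is reported as MISSING, the virtual environment is probably not the one that was activated.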

Operation check

Application (starts at localhost:3000)

Run run_application.py, which is directly under the number_recognition folder.

python run_application.py

IPython notebook for building machine learning models (starts at localhost:8888)

Run the following directly under number_recognition/machines/number_recognizer.

ipython notebook

At first the application's predictions are really poor. We will make it smarter in the steps below.

# 1 Experience the process of creating a machine learning model

Open the IPython notebook. It lists each step of machine learning in order. The code in the notebook can actually be executed there, so read the explanations and run the cells in order from the top (see here for detailed usage).

When you reach the final save step, the model file (number_recognition/machines/number_recognizer/machine.pkl) should actually be updated.
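
How the notebook writes machine.pkl is not reproduced in this article. As a rough illustration of what saving a model means, the sketch below assumes a plain pickle round trip of an SGDClassifier to a temporary file, and checks that the reloaded model predicts exactly like the original. The model type, the helper save_and_reload, and the file name are all assumptions.

```python
import os
import pickle
import tempfile

from sklearn import datasets
from sklearn.linear_model import SGDClassifier


def save_and_reload(model, path):
    """Pickle the model to path, then load it back and return the copy."""
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        return pickle.load(f)


digits = datasets.load_digits()
model = SGDClassifier(random_state=0)  # assumed model type, for illustration
model.fit(digits.data, digits.target)

with tempfile.TemporaryDirectory() as tmp_dir:
    # "machine_demo.pkl" is a hypothetical file name, not the repository's file
    restored = save_and_reload(model, os.path.join(tmp_dir, "machine_demo.pkl"))

same = (restored.predict(digits.data) == model.predict(digits.data)).all()
print("reloaded model gives identical predictions:", bool(same))
```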

# 2 Divide the training data

Hands-on #2 explanation

Here, we will do the following two things.

Divide the data into training data and evaluation data
Calculate the accuracy on the training data and on the evaluation data separately

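Use scikit-learn's train_test_split function to split the training data (it was in the sklearn.cross_validation module when this hands-on was written and has lived in sklearn.model_selection since scikit-learn 0.18). Put the following processing before Training the Model.

def split_dataset(dataset, test_size=0.3):
    # train_test_split moved to sklearn.model_selection in scikit-learn 0.18;
    # the old sklearn.cross_validation module was removed in 0.20.
    from sklearn.model_selection import train_test_split
    from collections import namedtuple

    DataSet = namedtuple("DataSet", ["data", "target"])
    # random_state=0 makes the split reproducible
    train_d, test_d, train_t, test_t = train_test_split(dataset.data, dataset.target, test_size=test_size, random_state=0)

    left = DataSet(train_d, train_t)
    right = DataSet(test_d, test_t)

    return left, right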

# use 30% of data to test the model
training_set, test_set = split_dataset(digits, 0.3)
print("dataset is split to train/test = {0} -> {1}, {2}".format(
    len(digits.data), len(training_set.data), len(test_set.data)))

Since we have split the data into training_set and test_set above, modify Training the Model as follows.

classifier.fit(training_set.data, training_set.target)

Training is now complete. Thanks to the split, 30% of the data is held back for evaluation. You can use it to measure the accuracy on data the model was not trained on.

Modify the accuracy calculation part of Evaluate the Model as follows.

print(calculate_accuracy(classifier, training_set))
print(calculate_accuracy(classifier, test_set))
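
calculate_accuracy is defined earlier in the notebook and its body is not reproduced in this article. A minimal equivalent, assuming it simply compares the model's predictions with the targets, might look like the following. The name calculate_accuracy_sketch and the SGDClassifier are stand-ins chosen for illustration.

```python
from collections import namedtuple

from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

DataSet = namedtuple("DataSet", ["data", "target"])


def calculate_accuracy_sketch(model, dataset):
    """Fraction of samples in dataset that the model labels correctly."""
    predicted = model.predict(dataset.data)
    return accuracy_score(dataset.target, predicted)


digits = datasets.load_digits()
train_d, test_d, train_t, test_t = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)
training_set = DataSet(train_d, train_t)
test_set = DataSet(test_d, test_t)

model = SGDClassifier(random_state=0)  # assumed model type
model.fit(training_set.data, training_set.target)

train_acc = calculate_accuracy_sketch(model, training_set)
test_acc = calculate_accuracy_sketch(model, test_set)
print("training accuracy: {0:.3f}".format(train_acc))
print("test accuracy:     {0:.3f}".format(test_acc))
```

A training accuracy that is much higher than the test accuracy is a first hint that the model is overfitting.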

# 3 Evaluate the model

Hands-on #3 explanation

Measure how the accuracy on the training data and on the evaluation data changes, to check whether the model is overfitting.
Check precision and recall, so that an imbalance in the data does not fool you into thinking the model is highly accurate.
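
To see why accuracy alone can mislead on imbalanced data, consider the following toy case. The labels are made up for illustration: 95 negatives, 5 positives, and a "model" that always answers negative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical, made-up labels: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5
# A "model" that always answers 0, whatever the input
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when there are no positive predictions
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred)

print("accuracy:  {0:.2f}".format(accuracy))   # 0.95
print("precision: {0:.2f}".format(precision))  # 0.00
print("recall:    {0:.2f}".format(recall))     # 0.00
```

The accuracy looks excellent, yet the model never finds a single positive sample, which is exactly what precision and recall reveal.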

Confirmation of accuracy for training / evaluation data

Check how the accuracy on the training / evaluation data changes with the following script. A graph with the number of training samples on the horizontal axis and the accuracy on the vertical axis is called a learning curve. In scikit-learn, you can draw one easily with the learning_curve function (in the sklearn.learning_curve module when this hands-on was written, in sklearn.model_selection since scikit-learn 0.18).

def plot_learning_curve(model_func, dataset):
    # learning_curve moved to sklearn.model_selection in scikit-learn 0.18;
    # the old sklearn.learning_curve module was removed in 0.20.
    from sklearn.model_selection import learning_curve
    import matplotlib.pyplot as plt
    import numpy as np

    # train on 10%, 20%, ..., 100% of the available training data
    sizes = [i / 10 for i in range(1, 11)]
    train_sizes, train_scores, valid_scores = learning_curve(model_func(), dataset.data, dataset.target, train_sizes=sizes, cv=5)

    # average the scores of the 5 cross-validation folds for each size
    take_means = lambda s: np.mean(s, axis=1)
    plt.plot(train_sizes, take_means(train_scores), label="training")
    plt.plot(train_sizes, take_means(valid_scores), label="test")
    plt.ylim(0, 1.1)
    plt.title("learning curve")
    plt.legend(loc="lower right")
    plt.show()

plot_learning_curve(make_model, digits)

When you have finished adding it, try running it. A figure with two lines, one for the training data and one for the evaluation data, should be plotted.
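
If you prefer numbers to a plot, the same information can be printed as a table. The sketch below assumes an SGDClassifier as the model and uses only three training sizes to keep it quick; the gap between the two scores is what you would read off the figure.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

digits = datasets.load_digits()
train_sizes, train_scores, valid_scores = learning_curve(
    SGDClassifier(random_state=0),  # assumed model type
    digits.data, digits.target,
    train_sizes=[0.2, 0.6, 1.0], cv=5)

# Average over the 5 cross-validation folds for each training size
train_mean = np.mean(train_scores, axis=1)
valid_mean = np.mean(valid_scores, axis=1)

for n, tr, va in zip(train_sizes, train_mean, valid_mean):
    print("{0:5d} samples: training={1:.3f} validation={2:.3f} gap={3:.3f}".format(
        int(n), tr, va, tr - va))
```

A gap that stays large as the number of samples grows suggests overfitting; two low scores that sit close together suggest the model is too simple for the data.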

Confirmation of precision and recall

In scikit-learn, you can check these easily with the classification_report function. If you want to analyze in detail how many predictions were correct for each label (#0 to #9), you can do so with confusion_matrix (sklearn.metrics.confusion_matrix).

def show_confusion_matrix(model, dataset):
    from sklearn.metrics import classification_report
    
    predicted = model.predict(dataset.data)
    target_names = ["#{0}".format(i) for i in range(0, 10)]

    print(classification_report(dataset.target, predicted, target_names=target_names))

show_confusion_matrix(classifier, digits)
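
The function above prints only the classification report. For the per-label breakdown, a minimal sketch with confusion_matrix could look like this. It assumes an SGDClassifier and evaluates on a held-out 30% split rather than on the full dataset; both choices are for illustration only.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
train_d, test_d, train_t, test_t = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

model = SGDClassifier(random_state=0)  # assumed model type
model.fit(train_d, train_t)

# Rows are the true labels, columns the predicted labels (both 0 to 9)
matrix = confusion_matrix(test_t, model.predict(test_d), labels=list(range(10)))
print(matrix)

# Diagonal cells count correct predictions; everything else is a mistake
correct = matrix.trace()
print("correct: {0} / {1}".format(correct, matrix.sum()))
```

Large off-diagonal cells show which pairs of digits the model confuses with each other.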

Hands-on Advanced

Deploy to Heroku

Try pressing the Heroku button.

By using conda-buildpack, you can build an application environment with conda on Heroku. This makes it easy to run machine learning applications on Heroku. Please refer to here for details.

Model tuning

Use grid search to find which parameter values give the best score by training the model with each candidate combination in turn. In scikit-learn, this search is available as GridSearchCV.

Please try tuning by inserting the following code before Evaluate the Model.

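def tuning_model(model_func, dataset):
    # GridSearchCV moved to sklearn.model_selection in scikit-learn 0.18;
    # the old sklearn.grid_search module was removed in 0.20.
    from sklearn.model_selection import GridSearchCV

    candidates = [
        {"loss": ["hinge", "log_loss"],  # "log_loss" was spelled "log" before scikit-learn 1.1
         "alpha": [1e-5, 1e-4, 1e-3]
        }]

    searcher = GridSearchCV(model_func(), candidates, cv=5, scoring="f1_weighted")
    searcher.fit(dataset.data, dataset.target)

    # cv_results_ replaces the grid_scores_ attribute removed in scikit-learn 0.20
    results = searcher.cv_results_
    ranked = sorted(zip(results["mean_test_score"], results["std_test_score"], results["params"]),
                    key=lambda r: r[0], reverse=True)
    for mean_score, std_score, params in ranked:
        print("%0.3f (+/-%0.03f) for %r" % (mean_score, std_score / 2, params))

    # the estimator refit on the whole dataset with the best parameters
    return searcher.best_estimator_

classifier = tuning_model(make_model, digits)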

Online machine learning

This application lets you enter the correct answer when the predicted number is wrong. That value is stored in feedback.txt in the data folder and is used to train the model (https://github.com/icoxfog417/number_recognizer/blob/master/application/server.py#L43).

Here too, please check how the learning changes.

※Caution

Nothing is less reliable than user input, so you usually don't train in this style. It is wiser to collect the data first, exclude the irrelevant data, and then train on it.
If the model learns dynamically, training can easily proceed in an unexpected direction. Therefore, when adopting online machine learning, you need to design it carefully.
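
The "collect, filter, then train" idea can be sketched as follows. The record layout used here (a list of (label, features) pairs) is hypothetical and is not the format of the application's feedback.txt; filter_feedback is an illustrative helper, and SGDClassifier is assumed because it supports incremental updates through partial_fit.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier


def filter_feedback(records, n_features=64, labels=range(10)):
    """Keep only records with a valid label and a complete feature vector.

    records is a list of (label, features) pairs; this layout is hypothetical
    and is not the format of the application's feedback.txt.
    """
    valid = set(labels)
    kept = [(label, feats) for label, feats in records
            if label in valid and len(feats) == n_features]
    targets = np.array([label for label, _ in kept])
    data = np.array([feats for _, feats in kept], dtype=float)
    return data, targets


digits = datasets.load_digits()
model = SGDClassifier(random_state=0)  # assumed model type
model.fit(digits.data[:1000], digits.target[:1000])

# Simulated user feedback: two usable records and two that should be rejected
feedback = [
    (int(digits.target[1000]), digits.data[1000].tolist()),
    (int(digits.target[1001]), digits.data[1001].tolist()),
    (42, digits.data[1002].tolist()),   # label outside 0-9
    (3, [0.0] * 10),                    # wrong number of pixels
]

data, targets = filter_feedback(feedback)
# partial_fit updates the existing model with the new samples only
model.partial_fit(data, targets)
print("records used for the update:", len(targets))
```

A real filter would need stricter checks than these, since a well-formed record can still carry a wrong label.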

Other

Effective use of datasets: [Cross Validation](http://nbviewer.ipython.org/github/icoxfog417/scikit-learn-notebook/blob/master/scikit-learn-tutorial.ipynb#Split-the-Data)
Combine multiple learning machines to improve accuracy: [Ensemble Learning](http://nbviewer.ipython.org/github/icoxfog417/scikit-learn-notebook/blob/master/scikit-learn-tutorial.ipynb#Ensemble-Lerning)
For deeper evaluation of model accuracy: [Confusion Matrix](http://nbviewer.ipython.org/github/icoxfog417/scikit-learn-notebook/blob/master/scikit-learn-tutorial.ipynb#Evaluate-Training-Result)
Learn more about how to organize your data: [Arrange the Data](http://nbviewer.ipython.org/github/icoxfog417/scikit-learn-notebook/blob/master/scikit-learn-tutorial.ipynb#Arrange-the-Data)

Tech-Circle Let's start application development using machine learning (self-study)