This article is a refined write-up of the hands-on study session Tech-Circle Let's Start Application Development Using Machine Learning ... It is a customized version of the material presented in the PyCon 2015 tutorial.
The following article summarizes how to build the environment; please follow it and adapt the steps to your own machine.
Building a machine learning application development environment with Python
First, I will explain the outline of machine learning.
The hands-on part starts on slide P37. The slides give the explanations and this article gives the steps to carry out, so read the explanation on the slides first and then try the corresponding steps here.
**The hands-on is based on Python 3.**
**Fork** the following GitHub repository and download your fork from there: icoxfog417/number_recognizer
The following assumes that you completed the advance preparation, that is, created a virtual environment named ml_env with conda. If you used a different name, substitute it as appropriate.
Windows
activate ml_env
Mac/Linux
# To avoid a conflict with pyenv's own activate, look up the path of the virtual environment and run its activate script directly
conda info -e
# conda environments:
#
ml_env /usr/local/pyenv/versions/miniconda3-3.4.2/envs/ml_env
root * /usr/local/pyenv/versions/miniconda3-3.4.2
source /usr/local/pyenv/versions/miniconda3-3.4.2/envs/ml_env/bin/activate ml_env
Application (starts at localhost:3000)
Run run_application.py directly under the number_recognition folder.
python run_application.py
IPython notebook for building machine learning models (starts at localhost:8888)
Directly under number_recognition/machines/number_recognizer, run the following.
ipython notebook
At first the application recognizes digits quite poorly. We will make it smarter in the steps below.
Open the IPython notebook. It lays out each step of machine learning in order. Because the code embedded in the text can be executed in the notebook, read each explanation and run the cells in order from the top (see here for detailed usage).
Once you have run through to the final save step, the model file (number_recognition/machines/number_recognizer/machine.pkl) should be updated.
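One way to confirm that the save step really rewrote the model file is to look at its modification time. The helper below is a minimal sketch, not part of the repository: the name `updated_recently` and the five-minute window are assumptions made for illustration, and the relative path assumes you run it from the folder that contains machine.pkl.

```python
import os
import time


def updated_recently(path, within_seconds=300):
    """Return True if the file at path exists and was modified within the window."""
    if not os.path.isfile(path):
        return False
    age = time.time() - os.path.getmtime(path)
    return age <= within_seconds


# Hypothetical usage: run from the folder that holds the saved model.
print(updated_recently("machine.pkl"))
```

If this prints False right after the save cell has run, the notebook is probably writing to a different working directory than the one you are checking.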
Hands-on #2 Explanation
Here, we will carry out the following two things.
Use cross_validation.train_test_split to split the data into a training set and a test set.
Add the following code before the Training the Model step.
def split_dataset(dataset, test_size=0.3):
    # Note: scikit-learn 0.18+ provides train_test_split in sklearn.model_selection;
    # the cross_validation module used here was removed in 0.20.
    from sklearn import cross_validation
    from collections import namedtuple

    DataSet = namedtuple("DataSet", ["data", "target"])
    # random_state=0 keeps the split reproducible between runs
    train_d, test_d, train_t, test_t = cross_validation.train_test_split(dataset.data, dataset.target, test_size=test_size, random_state=0)

    left = DataSet(train_d, train_t)   # training portion
    right = DataSet(test_d, test_t)    # held-out test portion
    return left, right

# use 30% of data to test the model
training_set, test_set = split_dataset(digits, 0.3)
print("dataset is split to train/test = {0} -> {1}, {2}".format(
    len(digits.data), len(training_set.data), len(test_set.data))
)
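To make clear what a train/test split does, here is a dependency-free sketch of the same idea: shuffle the sample indices once with a fixed seed, then set aside a share of them for testing. `simple_split` and the toy data are hypothetical names introduced for illustration; this is not how scikit-learn implements train_test_split, and it will not pick the same samples.

```python
import random
from collections import namedtuple

DataSet = namedtuple("DataSet", ["data", "target"])


def simple_split(data, target, test_size=0.3, seed=0):
    """Shuffle sample indices once, then keep the last test_size share for testing."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = int(round(len(indices) * test_size))
    n_train = len(indices) - n_test
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    train = DataSet([data[i] for i in train_idx], [target[i] for i in train_idx])
    test = DataSet([data[i] for i in test_idx], [target[i] for i in test_idx])
    return train, test


# Toy stand-in for digits: 10 samples whose label is the parity of the value.
toy_data = list(range(10))
toy_target = [x % 2 for x in toy_data]
train_set, test_set = simple_split(toy_data, toy_target, test_size=0.3)
print(len(train_set.data), len(test_set.data))
```

The important property is that every sample ends up in exactly one of the two sets and stays paired with its own label.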
Since we have split the data into training_set and test_set above, modify Training the Model as follows.
classifier.fit(training_set.data, training_set.target)
Training is now complete. Thanks to the split, 30% of the data is held out for evaluation, so you can use it to measure accuracy on data the model was not trained on.
Modify the accuracy calculation part of Evaluate the Model as follows.
print(calculate_accuracy(classifier, training_set))
print(calculate_accuracy(classifier, test_set))
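The notebook's calculate_accuracy is not reproduced in this article. As a rough sketch of what such a measure computes, assuming accuracy means the share of predictions that match the true labels, the hypothetical `accuracy` function below compares two label lists. The label values are made up to illustrate a model that does better on its training data than on held-out data.

```python
def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


# Made-up labels: perfect on the training data, one mistake on the test data.
train_acc = accuracy([0, 1, 2, 3], [0, 1, 2, 3])
test_acc = accuracy([0, 1, 2, 3], [0, 1, 7, 3])
print(train_acc, test_acc)
```

A large gap between the two numbers suggests the model fits the training data more closely than it generalizes.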
Hands-on #3 Explanation
Check the transition of accuracy for training / evaluation data with the following script.
This graph, with the number of training samples on the horizontal axis and the accuracy on the vertical axis, is called a learning curve. In scikit-learn you can draw it easily with sklearn.learning_curve (/modules/generated/sklearn.learning_curve.learning_curve.html).
def plot_learning_curve(model_func, dataset):
    # Note: scikit-learn 0.18+ provides learning_curve in sklearn.model_selection;
    # the sklearn.learning_curve module used here was removed in 0.20.
    from sklearn.learning_curve import learning_curve
    import matplotlib.pyplot as plt
    import numpy as np

    # train on 10%, 20%, ..., 100% of the available training data
    sizes = [i / 10 for i in range(1, 11)]
    train_sizes, train_scores, valid_scores = learning_curve(model_func(), dataset.data, dataset.target, train_sizes=sizes, cv=5)

    # scores hold one row per training size and one column per CV fold
    take_means = lambda s: np.mean(s, axis=1)
    plt.plot(sizes, take_means(train_scores), label="training")
    plt.plot(sizes, take_means(valid_scores), label="test")
    plt.ylim(0, 1.1)
    plt.title("learning curve")
    plt.legend(loc="lower right")
    plt.show()

plot_learning_curve(make_model, digits)
When you have finished adding it, try running it. A learning curve figure should be plotted.
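The np.mean(s, axis=1) step in plot_learning_curve is worth spelling out: learning_curve returns one row of scores per training size and one column per cross-validation fold, so averaging along each row gives a single point per training size. The sketch below shows that averaging without NumPy. `mean_per_size` is a hypothetical helper and the fold scores are made-up numbers for illustration only.

```python
def mean_per_size(scores):
    """Average the cross-validation fold scores for each training size (one row per size)."""
    return [sum(row) / len(row) for row in scores]


# The same fractions that plot_learning_curve passes as train_sizes: 0.1, 0.2, ..., 1.0
sizes = [i / 10 for i in range(1, 11)]

# Made-up fold scores for three training sizes with cv=5.
fold_scores = [
    [0.70, 0.72, 0.68, 0.71, 0.69],
    [0.80, 0.82, 0.81, 0.79, 0.78],
    [0.90, 0.91, 0.89, 0.92, 0.88],
]
means = mean_per_size(fold_scores)
print([round(m, 2) for m in means])
```

Plotting these means against the training sizes, once for the training scores and once for the validation scores, gives the two lines of the learning curve.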
In scikit-learn, you can easily check per-label results with the classification_report function. A confusion matrix breaks down, for each label (#0 to #9), how many predictions were correct and which labels the rest were mistaken for. You can compute one with sklearn.metrics.confusion_matrix (sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix).
def show_confusion_matrix(model, dataset):
    from sklearn.metrics import classification_report

    predicted = model.predict(dataset.data)
    target_names = ["#{0}".format(i) for i in range(0, 10)]  # labels #0 to #9
    # print precision, recall and F1 score for each label
    print(classification_report(dataset.target, predicted, target_names=target_names))

show_confusion_matrix(classifier, digits)
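The function above prints only the classification report. To show what a confusion matrix itself contains, here is a dependency-free sketch that counts predictions per pair of true label and predicted label, with rows as true labels and columns as predicted labels (the same orientation sklearn.metrics.confusion_matrix uses). The function and the three-class labels are hypothetical illustrations, not output from the digits model.

```python
def confusion_matrix(expected, predicted, n_labels):
    """Count predictions per (true label, predicted label) pair.

    Row index = true label, column index = predicted label.
    """
    matrix = [[0] * n_labels for _ in range(n_labels)]
    for e, p in zip(expected, predicted):
        matrix[e][p] += 1
    return matrix


# Made-up labels for a 3-class problem (the digits data would use n_labels=10).
expected = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(expected, predicted, n_labels=3)
for row in matrix:
    print(row)
```

Counts on the diagonal are correct predictions; an off-diagonal count in row i, column j means label i was mistaken for label j that many times.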
Hands-on Advanced
Try pressing the Heroku button.
By using conda-buildpack, you can build an application environment with conda on Heroku. This makes it easy to run machine learning applications on Heroku. Please refer to here for details.
Use grid search to find which parameter combination gives the highest accuracy while varying the model's parameters. In scikit-learn, GridSearchCV performs this search.
Try tuning the model by inserting the following code before Evaluate the Model.
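def tuning_model(model_func, dataset):
    # Note: scikit-learn 0.18+ provides GridSearchCV in sklearn.model_selection and
    # reports results via cv_results_; grid_search and grid_scores_ were removed in 0.20.
    from sklearn.grid_search import GridSearchCV

    # parameter values to try; every combination is evaluated
    candidates = [
        {"loss": ["hinge", "log"],
         "alpha": [1e-5, 1e-4, 1e-3]
        }]

    searcher = GridSearchCV(model_func(), candidates, cv=5, scoring="f1_weighted")
    searcher.fit(dataset.data, dataset.target)

    # print the results, best mean score first
    for params, mean_score, scores in sorted(searcher.grid_scores_, key=lambda s: s[1], reverse=True):
        print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() / 2, params))

    return searcher.best_estimator_

classifier = tuning_model(make_model, digits)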
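To see how much work the grid search does, it helps to list the combinations it has to try. The sketch below expands the same candidate values with itertools.product. `expand_grid` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
from itertools import product


def expand_grid(grid):
    """Yield one dict per combination of the candidate values in grid."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Same candidate values as the tuning code above.
candidates = {"loss": ["hinge", "log"], "alpha": [1e-5, 1e-4, 1e-3]}
combinations = list(expand_grid(candidates))
print(len(combinations))
for params in combinations:
    print(params)
```

Two loss values times three alpha values give six combinations, and with cv=5 each one is fitted five times, so the search grows quickly as you add parameters.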
This application lets you enter the correct answer when the predicted digit is wrong. That value is stored as feedback.txt in the data folder and is used to train the model (https://github.com/icoxfog417/number_recognizer/blob/master/application/server.py#L43).
Check again how the learning results change when this feedback is used.
※Caution