What I learned in the two months before releasing a product as a total machine learning amateur

Doing a little machine learning with scikit-learn is easier than building a small web app on LAMP. Below is a summary of what I learned in the two months since I started.

Highlights

  1. Three things I thought were important
  2. Machine learning overview
  3. About scikit-learn
  4. A word about Google Prediction API, Mahout, Spark, Cython

Three things I thought were important

1. Have friends who are good at machine learning

As with any technology, you can move forward with confidence if you first ask a friend for an overview. That confidence makes you less likely to lose heart. @fukkyy told me, "Introductory sites scare you off with difficult terminology, but machine learning isn't scary if you use a library," so I ignored the introductory sites out there and started working with a library right away, which made it easy to get started. I also received a lot of advice from @ysks3n, such as on dimensionality reduction and when to use each algorithm. If the basics are enough for you, I can teach you hands-on, so contact me if you are interested.

(Thank you both!)

2. Reduce the size of the training data, both vertically (rows) and horizontally (columns)

How far to go in either direction is a matter of degree, but cutting horizontally (reducing the number of features) often improves performance.

The amount of data I could comfortably handle with a library was much smaller than I expected. I had an unfounded belief that around 30GB of data could be used without any preprocessing. scikit-learn includes tools for dimensionality reduction (shrinking the horizontal size), so it is a good idea to use them. The sample code further down shows how to do it. If you start from a small number of dimensions and increase it, accuracy may improve up to a certain point, so it is best to stop there. Vertically, rows that are irrelevant to learning should also be removed.
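The idea of starting from a small number of dimensions and increasing it can be sketched as a simple sweep. This is only a minimal illustration, not the code used for the product: it assumes the iris sample data bundled with scikit-learn, and the helper name accuracy_by_dimension is hypothetical. It wraps TruncatedSVD and LinearSVC in a pipeline so that the compression is refit inside each cross-validation fold.

```python
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def accuracy_by_dimension(features, labels, dims):
    """Return {n_components: mean 5-fold accuracy} (hypothetical helper)."""
    results = {}
    for k in dims:
        # Compress to k dimensions, then classify with a linear SVM.
        model = make_pipeline(
            TruncatedSVD(n_components=k, random_state=0),
            LinearSVC(max_iter=10000, random_state=0),
        )
        scores = cross_val_score(model, features, labels, cv=5)
        results[k] = scores.mean()
    return results


iris = datasets.load_iris()
# iris has 4 features, so sweep over 1 to 3 compressed dimensions.
results = accuracy_by_dimension(iris.data, iris.target, dims=[1, 2, 3])
for k, score in sorted(results.items()):
    print("n_components=%d  mean accuracy=%.3f" % (k, score))
```

Reading the printed scores from the smallest dimension upward shows where adding dimensions stops paying off for this data set.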

3. Writing code is faster than studying theory

As I wrote above, there are plenty of intimidating introductory sites out there, but I think beginners should run sample code before studying the theory. Machine learning uses various models depending on the task, but studying each one in detail is hard going and your enthusiasm tends to cool. With scikit-learn, I think you should simply try all the candidate models and use the one with the highest accuracy. You can switch models by rewriting a single line. Sample code is in the scikit-learn section further down. If you need more accuracy and can afford the time, then study.
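To illustrate the "rewrite one line" point, here is a small sketch. It assumes the iris sample data and a 70/30 train/test split, neither of which comes from the original project. Every scikit-learn classifier exposes the same fit and score methods, so only the line that creates clf needs to change.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# Switching models means editing only this line, for example to:
#   clf = LinearSVC(max_iter=10000)
clf = RandomForestClassifier(random_state=0)

clf.fit(X_train, y_train)              # learn from the training split
accuracy = clf.score(X_test, y_test)   # accuracy on the held-out split
print("%s accuracy: %.3f" % (type(clf).__name__, accuracy))
```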

Machine learning overview

Supervised machine learning comes in two main types.

1. Regression

A method for predicting numerical values from data. For example, you can learn the relationship between temperature and past weather, location, and date, and then predict the temperature the next time you enter a weather, location, and date, as shown below.

First, train on the following data to learn the relationship between the features (weather, location, date) and temperature.

temperature  weather  place     date
19           Fine     Tokyo     111
29           Rain     Osaka     311
19           Cloudy   Fukuoka   121
29           Fine     Kumamoto  11
39           Rain     Kyoto     311
29           Cloudy   Aichi     131
19           Fine     Nara      211
9            Rain     Ishikawa  141
49           Cloudy   Hell      151

Next, use what was learned to predict the temperature when the features are given.

temperature  weather  place     date
?            Rain     Tokyo     111
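As a rough sketch, the regression example above could be fed to scikit-learn as follows. The choice of DictVectorizer to turn the text columns into numbers and of RandomForestRegressor as the model are my assumptions for illustration, as is the helper name to_features. Nine rows are far too few for a meaningful prediction, so treat the output as a demonstration of the flow only.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction import DictVectorizer

# Training rows from the table above: (temperature, weather, place, date)
rows = [
    (19, "Fine", "Tokyo", 111),
    (29, "Rain", "Osaka", 311),
    (19, "Cloudy", "Fukuoka", 121),
    (29, "Fine", "Kumamoto", 11),
    (39, "Rain", "Kyoto", 311),
    (29, "Cloudy", "Aichi", 131),
    (19, "Fine", "Nara", 211),
    (9, "Rain", "Ishikawa", 141),
    (49, "Cloudy", "Hell", 151),
]


def to_features(weather, place, date):
    """Build one feature dict (hypothetical helper)."""
    return {"weather": weather, "place": place, "date": date}


temperatures = [row[0] for row in rows]

# Text values become one 0/1 column per category; numbers pass through as-is.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform([to_features(*row[1:]) for row in rows])

model = RandomForestRegressor(random_state=0)
model.fit(X, temperatures)

# Predict the "?" row: rain, Tokyo, date 111
query = vectorizer.transform([to_features("Rain", "Tokyo", 111)])
predicted = model.predict(query)[0]
print("Predicted temperature: %.1f" % predicted)
```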

2. Classification

A method for predicting a category from data. For example, you can learn the relationship between the books a person has read and their gender, and then use it to predict a person's gender the next time you enter which books they have read.

First, train on the following data to learn the relationship between the features (whether or not each manga was read) and gender.

gender  The Legend of the Strongest Kurosawa  Hunter x Hunter  Glass Mask
Man     Read                                  Read             Not read
Man     Not read                              Read             Not read
Woman   Not read                              Read             Read
Man     Not read                              Read             Not read
Woman   Not read                              Read             Read
Man     Read                                  Read             Not read
Man     Read                                  Read             Read
Woman   Not read                              Read             Read
Man     Read                                  Read             Not read

Then use what was learned to predict the gender when the features are given.

gender  The Legend of the Strongest Kurosawa  Hunter x Hunter  Glass Mask
?       Read                                  Read             Read
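The classification example above can be sketched in a few lines. The 1/0 encoding for read/not read and the use of DecisionTreeClassifier are my assumptions for illustration; any scikit-learn classifier could be swapped in.

```python
from sklearn.tree import DecisionTreeClassifier

# Columns: The Legend of the Strongest Kurosawa, Hunter x Hunter, Glass Mask
# 1 = read, 0 = not read (rows in the same order as the table above)
X = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 1, 0],
]
y = ["Man", "Man", "Woman", "Man", "Woman", "Man", "Man", "Woman", "Man"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predict the "?" row: all three manga read
predicted = clf.predict([[1, 1, 1]])[0]
print(predicted)
```

On this tiny data set the tree answers "Man", because the only training row with all three titles read belongs to a man.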

Note: There is also unsupervised learning (clustering), but I did not work with it, so I am leaving it out this time.

About scikit-learn

It's a dream of a library that lets anyone do machine learning.

sk_learn_sample.py


# -*- coding: utf-8 -*-
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn import datasets
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20


# Prepare training data
iris     = datasets.load_iris() # Sample data bundled with the library
features = iris.data            # Feature data
                                # In the examples above, this corresponds to weather, place, date, or whether each manga was read.
labels   = iris.target          # Correct answers for the features
                                # In the examples above, this corresponds to temperature or gender.

# Compress the feature dimensions
# Treat features of a similar nature as the same
lsa = TruncatedSVD(n_components=2)
reduced_features = lsa.fit_transform(features)

# I wasn't sure which model to use, so I picked some candidates. For now, try them all with default settings.
clf_classes = [LinearSVC, AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier]
for clf_class in clf_classes:
  clf    = clf_class()  # Default settings
  scores = cross_val_score(clf, reduced_features, labels, cv=5)
  score  = sum(scores) / len(scores)  # Mean accuracy of the model across the 5 folds
  print("%s score:%s" % (clf_class.__name__, score))

# Output from the author's run:
#LinearSVC score:0.973333333333
#AdaBoostClassifier score:0.973333333333
#ExtraTreesClassifier score:0.973333333333
#GradientBoostingClassifier score:0.966666666667
#RandomForestClassifier score:0.933333333333

A word about Google Prediction API, Mahout, Spark, Cython

Google Prediction API: It takes less effort to get started with than scikit-learn, but the drawback is that it can only handle a small amount of data.

Mahout: It appears to be a tool for distributed machine learning on Hadoop, but the person who wrote a series of Japanese articles on Mahout recommended Spark over Mahout. It is an Apache project.

Spark: I don't really know how it differs from Mahout, but an expert recommended this one. scikit-learn worked well enough for me, so I only tried it briefly and stopped. It is also an Apache project.

Cython: It is blazingly fast, but sometimes code that worked in Python does not work in Cython, or parallelized processes are left running after they should have finished. If you run into low-level problems, you may want to just run the code in plain Python.

Summary

If you have a knowledgeable friend, ask that person. If you don't, ask me. Let's trade skills. For now, install scikit-learn, paste in the sample code above, and run it.

Note: The product I built this time was contract work, so I can't make it public. Sorry about that. It was a web application that solved problems like the ones above.
