What I learned in the two months before shipping a product as a complete machine learning amateur

Doing a bit of machine learning with scikit-learn is easier than building a small web app on LAMP. Below is a summary of what I learned in the two months since I started.

Highlights

Three things I thought were important
Machine learning overview
About scikit-learn
A word about Google Prediction API, Mahout, Spark, Cython

Three things I thought were important

1. Have friends who are good at machine learning

As with any technology, getting an overview from a friend first lets you move forward with confidence, and that confidence makes it harder to get discouraged. @fukkyy told me, "Introductory sites scare you with difficult terminology, but machine learning isn't scary if you use a library." So I ignored the introductory sites out there and went straight to playing with a library, which made it easy to get started. I also got all kinds of advice from @ysks3n, such as how to do dimensionality reduction and when to use which algorithm. If the basics are enough for you, I can teach you hands-on, so get in touch if you're interested.

Note: Thank you both!

2. Reduce the rows and columns of the training data

How much to cut is a matter of degree for both rows and columns, but trimming columns (features) in particular often improves performance.

The amount of data the library could handle comfortably was far smaller than I expected. I had an unfounded belief that about 30 GB of data could be used as-is, with no preprocessing. scikit-learn includes classes for dimensionality reduction (reducing the number of columns), so it is a good idea to use them. The sample code below shows how. If you start with a small number of dimensions and increase it, the accuracy tends to level off at some point, so that is a good place to stop. Rows that are irrelevant to what you want to learn should be deleted as well.
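The idea of starting from a small dimension and increasing it can be sketched roughly as follows. This is only an illustration on the iris sample data: the helper name `accuracy_by_dimension`, the choice of `LinearSVC`, and the upper limit of 3 dimensions (iris has only 4 features) are my assumptions, not part of the product described in this post.

```python
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC


def accuracy_by_dimension(features, labels, max_components):
    """Return {number of dimensions: mean cross-validated accuracy}.

    Hypothetical helper for illustration only.
    """
    results = {}
    for n in range(1, max_components + 1):
        # Compress the columns down to n dimensions
        reduced = TruncatedSVD(n_components=n, random_state=0).fit_transform(features)
        # Score one model on the compressed data with 5-fold cross-validation
        scores = cross_val_score(LinearSVC(max_iter=10000), reduced, labels, cv=5)
        results[n] = scores.mean()
    return results


iris = datasets.load_iris()
results = accuracy_by_dimension(iris.data, iris.target, max_components=3)
for n, score in sorted(results.items()):
    print("dimensions=%d accuracy=%.3f" % (n, score))
```

Reading the printed list from top to bottom, you would keep the smallest number of dimensions after which the accuracy no longer improves in any meaningful way.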

3. Writing code is faster than studying theory

As I wrote above, there are plenty of difficult introductory sites out there, but I think beginners are better off running sample code before studying the theory. Machine learning also uses different models depending on the task, and if you try to study each one in detail, the material is difficult and you tend to burn out. With scikit-learn, I think you should simply try every candidate model and use the one with the highest accuracy. You can switch models by rewriting a single line. The sample code is further down. If you need more accuracy and have the time, then study.

Machine learning overview

Supervised machine learning comes in two main types.

1. Regression

A method for predicting numerical values from data. For example, you can learn the relationship between the temperature and the weather, place, and date from past data, and then predict the temperature when you are given a new weather, place, and date, as shown below.

First, train on the following data to learn the relationship between the features (weather, place, date) and the temperature.

temperature	weather	place	date
19	Fine	Tokyo	111
29	Rain	Osaka	311
19	Cloudy	Fukuoka	121
29	Fine	Kumamoto	11
39	Rain	Kyoto	311
29	Cloudy	Aichi	131
19	Fine	Nara	211
9	Rain	Ishikawa	141
49	Cloudy	Hell	151

Next, use what was learned to predict the temperature when the features are given.

temperature	weather	place	date
?	Rain	Tokyo	111
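As a rough sketch, the regression above could be written with scikit-learn like this. The choice of `DictVectorizer` (to turn the text columns into numbers) and `DecisionTreeRegressor` is my assumption for illustration, and so is treating the date column as a plain number. With only nine rows, the predicted value should not be taken seriously.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# (temperature, weather, place, date) rows copied from the table above
rows = [
    (19, "Fine", "Tokyo", 111),
    (29, "Rain", "Osaka", 311),
    (19, "Cloudy", "Fukuoka", 121),
    (29, "Fine", "Kumamoto", 11),
    (39, "Rain", "Kyoto", 311),
    (29, "Cloudy", "Aichi", 131),
    (19, "Fine", "Nara", 211),
    (9, "Rain", "Ishikawa", 141),
    (49, "Cloudy", "Hell", 151),
]
features = [{"weather": w, "place": p, "date": d} for _, w, p, d in rows]
temperatures = [t for t, _, _, _ in rows]

# DictVectorizer one-hot encodes the string values and passes numbers through
model = make_pipeline(DictVectorizer(sparse=False), DecisionTreeRegressor(random_state=0))
model.fit(features, temperatures)

# Predict the temperature for the row marked "?" above
predicted = model.predict([{"weather": "Rain", "place": "Tokyo", "date": 111}])[0]
print("Predicted temperature:", predicted)
```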

2. Classification

A method for predicting a category from data. For example, you can learn the relationship between the manga people have read and their gender, and then use it to predict a person's gender the next time you are given the manga that person has read.

First, train on the following data to learn the relationship between the features (whether or not each manga was read) and gender.

gender	The Legend of the Strongest Kurosawa	Hunter x Hunter	Glass Mask
Man	Read	Read	Not read
Man	Not read	Read	Not read
Woman	Not read	Read	Read
Man	Not read	Read	Not read
Woman	Not read	Read	Read
Man	Read	Read	Not read
Man	Read	Read	Read
Woman	Not read	Read	Read
Man	Read	Read	Not read

Next, use what was learned to predict the gender when the features are given.

gender	The Legend of the Strongest Kurosawa	Hunter x Hunter	Glass Mask
?	Read	Read	Read
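The classification above can be sketched in a few lines as well. Encoding "read" as 1 and "not read" as 0, and picking `DecisionTreeClassifier` as the model, are my assumptions for illustration. Any other scikit-learn classifier could be dropped in on the same line.

```python
from sklearn.tree import DecisionTreeClassifier

# Columns: The Legend of the Strongest Kurosawa, Hunter x Hunter, Glass Mask
# 1 = read, 0 = not read (this numeric encoding is an assumption of the sketch)
features = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 1, 0],
]
# Gender labels, in the same row order as the table above
labels = ["Man", "Man", "Woman", "Man", "Woman", "Man", "Man", "Woman", "Man"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(features, labels)

# Predict the gender for the row marked "?" above (all three manga read)
predicted = clf.predict([[1, 1, 1]])[0]
print("Predicted gender:", predicted)
```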

Note: There is also clustering, a form of unsupervised learning, but I haven't worked with it, so I'm leaving it out this time.

About scikit-learn

It's a dream of a library that lets anyone do machine learning.

`sk_learn_sample.py`


# -*- coding: utf-8 -*-
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn import datasets
# sklearn.cross_validation was removed in scikit-learn 0.20; cross_val_score now lives in model_selection
from sklearn.model_selection import cross_val_score


# Prepare the training data
iris     = datasets.load_iris()  # Sample data bundled with the library
features = iris.data             # Feature data
                                 # In the examples above: weather, place, date, or whether each manga was read
labels   = iris.target           # Correct answers for the features
                                 # In the examples above: temperature or gender

# Reduce the feature dimensions
# Features that behave similarly are merged into one
lsa = TruncatedSVD(n_components=2)
reduced_features = lsa.fit_transform(features)

# I don't know which model is best, so I picked some likely candidates
# and try every one of them with default settings for now.
classifiers = [LinearSVC, AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier]
for clf_class in classifiers:
    clf    = clf_class()  # Instantiate the class directly instead of eval() on a name string
    scores = cross_val_score(clf, reduced_features, labels, cv=5)
    score  = scores.mean()  # Mean accuracy of the model across the 5 folds
    print("%s score:%s" % (clf_class.__name__, score))

# Output from my run (exact scores depend on the scikit-learn version and random seeds):
#LinearSVC score:0.973333333333
#AdaBoostClassifier score:0.973333333333
#ExtraTreesClassifier score:0.973333333333
#GradientBoostingClassifier score:0.966666666667
#RandomForestClassifier score:0.933333333333

A word about Google Prediction API, Mahout, Spark, Cython

Google Prediction API

It is cheaper to adopt than scikit-learn, but the drawback is that it can only handle small amounts of data.

Mahout

It appears to be a tool for distributed machine learning on Hadoop, but the person who wrote a series of Japanese articles on Mahout recommended Spark over Mahout. It is an Apache project.

Spark

I don't really know how it differs from Mahout, but someone very knowledgeable recommended this one. scikit-learn worked well enough for me, so I only touched Spark briefly and stopped right away. It is also an Apache project.

Cython

It is blazingly fast, but sometimes code that worked in Python doesn't work in Cython, or processes spawned for parallel processing are left running. If you run into a low-level problem, it may be worth trying the code in plain Python.

Summary

If you have a knowledgeable friend, ask that person. If you don't, ask me. Let's trade knowledge. For now, install scikit-learn, paste in the sample code above, and run it.

Note: The product I built this time was contract work, so I can't make it public. Sorry about that. It was a web application that solved problems like the ones above.