Doing a little machine learning with scikit-learn is easier than building a little web app with LAMP. Below is a summary of what I learned in the two months since I started.
As with any technology, it helps to ask a knowledgeable friend for an overview before diving in; that confidence keeps you from getting discouraged. @fukkyy told me, "Introductory sites scare you off with difficult terms, but machine learning isn't scary if you use a library," so I ignored the introductory sites out there, went straight at the library, and found it easy to get started. @ysks3n gave me all sorts of advice, such as dimensionality reduction and which algorithm to use when. If you don't mind skipping the basics, I can teach you hands-on too, so get in touch if you're interested.
\# Thank you both!
The amount of data the library could comfortably handle was much smaller than I expected. I had an unfounded belief that 30GB or so of data could be thrown in without any preprocessing. scikit-learn includes dimensionality reduction (shrinking the table horizontally), so it's worth using; the sample code below shows how. If you start from a small number of dimensions and increase it, the accuracy will often rise and then level off, so it's best to stop at that point. Rows of data unrelated to what you're trying to learn should also be removed.
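The "start from a small dimension and stop where accuracy peaks" idea can be sketched like this. This is a minimal sketch, not the author's exact code: it uses scikit-learn's bundled iris data, and the loop bound of 4 is only because iris happens to have 4 features.

```python
# Sweep the number of dimensions kept by TruncatedSVD and track
# which setting gives the best cross-validated accuracy.
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

iris = datasets.load_iris()

best_dim, best_score = None, 0.0
for n_components in range(1, 4):  # iris has 4 features, so try 1..3
    reduced = TruncatedSVD(n_components).fit_transform(iris.data)
    score = cross_val_score(LinearSVC(), reduced, iris.target, cv=5).mean()
    if score > best_score:
        best_dim, best_score = n_components, score

print("best dimension:", best_dim, "score:", best_score)
```

On real data you would sweep up to your own feature count and stop once the score stops improving.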
As I wrote above, there are some difficult introductory sites out there, but I think beginners can start from sample code before studying the theory. Machine learning uses various models depending on the task, and if you try to study each one in detail you'll find the material is hard and you'll burn a lot of energy. With scikit-learn, you can simply try every model and use the one with the highest accuracy: switching models is just a matter of rewriting one line. I wrote the sample code at the bottom. If you need more accuracy and can afford the time, then study the theory.
There are two main types of (supervised) machine learning: regression and classification.
Regression is a method for predicting a numerical value from data. For example, you can learn the relationship between temperature and past weather, location, and date, and then predict the temperature the next time you input a weather, location, and date, as shown below.
temperature | weather | place | date |
---|---|---|---|
19 | Fine | Tokyo | 111 |
29 | Rain | Osaka | 311 |
19 | Cloudy | Fukuoka | 121 |
29 | Fine | Kumamoto | 11 |
39 | Rain | Kyoto | 311 |
29 | Cloudy | Aichi | 131 |
19 | Fine | Nara | 211 |
9 | Rain | Ishikawa | 141 |
49 | Cloudy | hell | 151 |
temperature | weather | place | date |
---|---|---|---|
? | Rain | Tokyo | 111 |
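The regression above can be sketched in a few lines of scikit-learn. This is a hedged sketch, not the author's code: the rows are the made-up ones from the table, and the string columns are one-hot encoded with `DictVectorizer` before fitting a `LinearRegression`.

```python
# Learn temperature from (weather, place, date), then predict the "?" row.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

rows = [
    {"weather": "Fine",   "place": "Tokyo",    "date": 111},
    {"weather": "Rain",   "place": "Osaka",    "date": 311},
    {"weather": "Cloudy", "place": "Fukuoka",  "date": 121},
    {"weather": "Fine",   "place": "Kumamoto", "date": 11},
    {"weather": "Rain",   "place": "Kyoto",    "date": 311},
    {"weather": "Cloudy", "place": "Aichi",    "date": 131},
    {"weather": "Fine",   "place": "Nara",     "date": 211},
    {"weather": "Rain",   "place": "Ishikawa", "date": 141},
    {"weather": "Cloudy", "place": "hell",     "date": 151},
]
temperatures = [19, 29, 19, 29, 39, 29, 19, 9, 49]

vec = DictVectorizer()          # turns the string columns into 0/1 features
X = vec.fit_transform(rows)
model = LinearRegression().fit(X, temperatures)

# Predict the temperature for "Rain, Tokyo, date 111"
query = vec.transform([{"weather": "Rain", "place": "Tokyo", "date": 111}])
prediction = model.predict(query)
print(prediction)
```

With only nine toy rows the predicted number is meaningless, but the shape of the workflow (encode, fit, predict) is the same on real data.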
Classification is a method for predicting a category from data. For example, you can learn the relationship between the books people have read and their gender, and then predict a person's gender the next time you input which books they have read.
gender | The Legend of the Strongest Man Kurosawa | Hunter x Hunter | Glass Mask |
---|---|---|---|
Man | Read | Read | Not read |
Man | Not read | Read | Not read |
Woman | Not read | Read | Read |
Man | Not read | Read | Not read |
Woman | Not read | Read | Read |
Man | Read | Read | Not read |
Man | Read | Read | Read |
Woman | Not read | Read | Read |
Man | Read | Read | Not read |
gender | The Legend of the Strongest Man Kurosawa | Hunter x Hunter | Glass Mask |
---|---|---|---|
? | Read | Read | Read |
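The classification above fits in a few lines as well. This is a hedged sketch rather than the author's code: the rows are the made-up ones from the table, encoded as 1 = read, 0 = not read, and the choice of `RandomForestClassifier` is arbitrary (any of the classifiers in the sample code below would do).

```python
# Learn gender from which of the three manga a person has read,
# then classify the "?" row (someone who has read all three).
from sklearn.ensemble import RandomForestClassifier

# Columns: Kurosawa, Hunter x Hunter, Glass Mask (1 = read, 0 = not read)
X = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 1, 0],
]
y = ["Man", "Man", "Woman", "Man", "Woman", "Man", "Man", "Woman", "Man"]

clf = RandomForestClassifier().fit(X, y)
answer = clf.predict([[1, 1, 1]])
print(answer)
```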
\# There is also unsupervised learning (clustering), but I didn't use it this time, so I'll leave it out.
scikit-learn is a dream of a library that lets anyone do machine learning.
sk_learn_sample.py
# -*- coding: utf-8 -*-
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn import datasets
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer versions

# Prepare training data
iris = datasets.load_iris()  # Sample data bundled with the library
features = iris.data  # Feature data
# In the examples above, this corresponds to the weather, place, and date, or whether each manga was read.
labels = iris.target  # Correct answers for the features
# In the examples above, this corresponds to the temperature or the gender.

# Compress the feature dimensions
# Features of a similar nature get treated as one
lsa = TruncatedSVD(2)
reduced_features = lsa.fit_transform(features)

# I'm not sure which model to use, so for now throw everything at it with default settings.
classifiers = [LinearSVC(), AdaBoostClassifier(), ExtraTreesClassifier(),
               GradientBoostingClassifier(), RandomForestClassifier()]
for clf in classifiers:
    scores = cross_val_score(clf, reduced_features, labels, cv=5)
    score = scores.mean()  # Measure the model's accuracy
    print("%s score:%s" % (clf.__class__.__name__, score))
#LinearSVC score:0.973333333333
#AdaBoostClassifier score:0.973333333333
#ExtraTreesClassifier score:0.973333333333
#GradientBoostingClassifier score:0.966666666667
#RandomForestClassifier score:0.933333333333
Google Prediction API: lower setup cost than scikit-learn, but the drawback is that it can only handle a small amount of data.
Mahout: a tool that can do distributed machine learning on Hadoop, but the author of the Japanese Mahout book series recommends Spark over Mahout. An Apache project.
Spark: I don't know much about how it differs from Mahout, but a very capable person recommended this one. scikit-learn worked well enough for me, so I only touched it briefly and then stopped. Also an Apache project.
Cython: explosively fast, but sometimes code that works in Python doesn't work in Cython, or a parallelized process refuses to die. If you hit low-level problems, it may be worth falling back to running the code in plain Python.
If you have a knowledgeable friend, ask them. If you don't, ask me, and let's trade know-how. For a start, just install scikit-learn, paste in the sample code above, and run it.
\# The product I built this time was contract work, so I can't open it up. Too bad. It was a web application that solved problems like the ones above.