I usually use SVM, but I needed to study other learners, so I ran an experiment with Random Forest.
Reference material: http://d.hatena.ne.jp/shakezo/20121221/1356089207 and "Basics of Statistical Machine Learning" (http://www.kyoritsu-pub.co.jp/bookdetail/9784320123625)
Create multiple training sets by bootstrap sampling and build a decision tree from each.
Finally, classification is decided by a majority vote over the decision trees.
Numerical prediction is also possible if the decision trees are regression trees.
The difference from bagging is that Random Forest also samples the features (explanatory variables) at the same time.
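To make this concrete, here is a minimal sketch (not code from the original post) of the idea: bagged decision trees with per-split feature subsampling, which is essentially what Random Forest does. scikit-learn's small digits dataset stands in for MNIST here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(100):                                    # num_trees
    idx = rng.integers(0, len(X_train), len(X_train))   # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features="sqrt")  # feature subsampling at each split
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

votes = np.stack([t.predict(X_test) for t in trees])    # shape: (num_trees, num_test)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("accuracy:", (majority == y_test).mean())
```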
Generalization error = Bias + Variance + Noise
A weak learner such as a decision tree has high variance (slightly different data can produce very different predictions).
The strategy is to lower the variance by building many such trees.
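To spell out why averaging many trees helps (a standard textbook result, not a derivation from the original post): if each tree's prediction has variance σ² and the pairwise correlation between the trees is ρ, then

```math
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding trees drives the second term toward zero, and sampling features at each split lowers ρ, which is exactly what Random Forest adds on top of plain bagging.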
The main hyperparameters are:
num_trees: how many decision trees to build
max_depth: how deep each decision tree may grow
num_features: how many features (explanatory variables) to sample when choosing a split
The book says that Random Forest does not overfit, so the result does not depend much on max_depth, and that num_features should be sqrt(number of features). Is that really true? I ran experiments to check.
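For reference, in scikit-learn (which seems to be what the post uses, given n_jobs and the timing setup later), those hyperparameters map roughly as follows; this is only an illustrative sketch, not the actual experiment code.

```python
from sklearn.ensemble import RandomForestClassifier

# num_trees    -> n_estimators
# max_depth    -> max_depth (None means grow until leaves are pure)
# num_features -> max_features ("sqrt" means sqrt of the number of features)
clf = RandomForestClassifier(
    n_estimators=1000,
    max_depth=None,
    max_features="sqrt",
    n_jobs=-1,       # use all CPU cores
    random_state=0,
)
```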
I used MNIST for the data.
However, the full dataset felt large, so I sampled 2,000 examples per digit for training.
The 10,000 test examples were used as-is.
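A sketch of how this data preparation could look (assumed, not the post's actual code), using scikit-learn's fetch_openml to pull MNIST:

```python
import numpy as np
from sklearn.datasets import fetch_openml

X, y = fetch_openml("mnist_784", version=1, as_frame=False, return_X_y=True)
y = y.astype(int)
X_train, y_train = X[:60000], y[:60000]    # standard MNIST training split
X_test, y_test = X[60000:], y[60000:]      # the 10,000 test examples, used as-is

rng = np.random.default_rng(0)
idx = np.concatenate([
    rng.choice(np.flatnonzero(y_train == d), size=2000, replace=False)
    for d in range(10)
])
X_train, y_train = X_train[idx], y_train[idx]   # 2,000 examples per digit = 20,000 total
```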
num_trees
First, I varied num_trees over 10, 100, 1000, and 10000,
with max_depth unlimited (trees grown all the way)
and num_features fixed to sqrt(number of features).
As expected, accuracy is low when there are too few trees, but adding more beyond a certain point gains nothing; around 1000 seems appropriate.
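The sweep itself could look like the sketch below (assumed, not the post's actual code; it reuses X_train/X_test from the data-preparation sketch above, and the same loop covers the depth and num_features sweeps by swapping the keyword being varied):

```python
from sklearn.ensemble import RandomForestClassifier

for n in (10, 100, 1000, 10000):
    clf = RandomForestClassifier(
        n_estimators=n,          # num_trees being varied
        max_depth=None,          # unlimited depth
        max_features="sqrt",     # sqrt(number of features)
        n_jobs=-1,
        random_state=0,
    )
    clf.fit(X_train, y_train)
    print(f"num_trees={n}: accuracy = {clf.score(X_test, y_test):.3f}")
```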
depth
Next, I varied the depth
over 10 values: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20,
with num_trees fixed at 1000 and max_features fixed at sqrt(number of features).
Indeed, even with deep trees overfitting does not seem to occur, so it is fine to let them grow deep.
num_features
I varied num_features over 10, 15, 20, 25, 30, 35, 40, 45, 50, and 55,
with num_trees fixed at 1000 and depth unlimited.
The peak seems to be around 35 or 40. The red line marks the default sqrt(number of features);
this time, a value somewhat larger than the default worked better.
Finally, about execution time.
Settings: num_trees = 1000, max_depth = max, max_features = 40, n_jobs = 6.
The experiment was run on CPU only.
The average over 3 runs was 28.8 seconds.
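Measuring that could look like this sketch (assumed, not the post's actual code; the post does not say whether the 28.8 seconds covers training only or training plus prediction, so this times both):

```python
import time
from sklearn.ensemble import RandomForestClassifier

times = []
for _ in range(3):
    clf = RandomForestClassifier(n_estimators=1000, max_depth=None,
                                 max_features=40, n_jobs=6)
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    clf.predict(X_test)
    times.append(time.perf_counter() - start)
print(f"average over 3 runs: {sum(times) / len(times):.1f} seconds")
```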
As a benchmark, I also classified MNIST with an SVM.
RBF kernel with C = 1.0 and gamma = 1/784.
No grid search was done.
I tried both one-vs-one and one-vs-rest.
As a caveat, the data was normalized to the 0-1 range (training did not go well without normalization; I am not sure why).
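The benchmark could be reproduced roughly as follows (assumed, not the post's actual code): SVC is one-vs-one by default, and OneVsRestClassifier provides the one-vs-rest variant. The macro-averaged F1 is used here as the F-value; the post does not say how its F-value was averaged.

```python
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X_train01, X_test01 = X_train / 255.0, X_test / 255.0   # normalize pixel values to 0-1

models = {
    "one_versus_one": SVC(kernel="rbf", C=1.0, gamma=1.0 / 784),   # SVC is one-vs-one by default
    "one_versus_rest": OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma=1.0 / 784)),
}
for name, model in models.items():
    model.fit(X_train01, y_train)
    print(name, f1_score(y_test, model.predict(X_test01), average="macro"))
```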
Method | F-value | Time
---|---|---
one_versus_one | 0.930 | 1 min 39 s
one_versus_rest | 0.922 | -
For sparse data like this, at least when there is no class imbalance, Random Forest was both more accurate and faster!!
One open question: why was one-vs-rest slower? A package issue?
My impression after trying Random Forest: it really is fast!!
Perhaps because I normally use only SVM, it feels especially fast...
It is also attractive that there are so few parameters; num_trees and max_features seem to be enough.
Next is boosting ...