I usually use SVM, but I needed to study other learners, so I ran an experiment with Random Forest.
Reference material: http://d.hatena.ne.jp/shakezo/20121221/1356089207 and "Basics of Statistical Machine Learning" (http://www.kyoritsu-pub.co.jp/bookdetail/9784320123625)
Create multiple training sets by bootstrap sampling and build a decision tree from each.
Finally, classification is decided by a majority vote over the decision trees.
Numerical prediction is also possible if the decision trees are regression trees.
The difference from bagging is that Random Forest also samples the features (explanatory variables) at the same time.
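To make this concrete, here is a minimal sketch (not code from the original post) of the idea: bagged decision trees with per-split feature subsampling, which is essentially what Random Forest does. scikit-learn's small digits dataset stands in for MNIST here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(100):                                    # num_trees
    idx = rng.integers(0, len(X_train), len(X_train))   # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features="sqrt")  # feature subsampling at each split
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

votes = np.stack([t.predict(X_test) for t in trees])    # shape: (num_trees, num_test)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("accuracy:", (majority == y_test).mean())
```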
Generalization error = Bias + Variance + Noise
A weak learner such as a decision tree has high variance (slightly different data can produce very different predictions).
The strategy is to lower the variance by building many such trees.
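To spell out why averaging many trees helps (a standard textbook result, not a derivation from the original post): if each tree's prediction has variance σ² and the pairwise correlation between the trees is ρ, then

```math
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding trees drives the second term toward zero, and sampling features at each split lowers ρ, which is exactly what Random Forest adds on top of plain bagging.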
The main hyperparameters are:
num_trees: how many decision trees to build
max_depth: how deep each decision tree may grow
num_features: how many features (explanatory variables) to sample when choosing a split
The book says that Random Forest does not overfit, so the result does not depend much on max_depth, and that num_features should be sqrt(number of features). Is that really true? I ran experiments to check.
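For reference, in scikit-learn (which seems to be what the post uses, given n_jobs and the timing setup later), those hyperparameters map roughly as follows; this is only an illustrative sketch, not the actual experiment code.

```python
from sklearn.ensemble import RandomForestClassifier

# num_trees    -> n_estimators
# max_depth    -> max_depth (None means grow until leaves are pure)
# num_features -> max_features ("sqrt" means sqrt of the number of features)
clf = RandomForestClassifier(
    n_estimators=1000,
    max_depth=None,
    max_features="sqrt",
    n_jobs=-1,       # use all CPU cores
    random_state=0,
)
```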
I used MNIST for the data.
However, the full dataset felt large, so I sampled 2,000 examples per digit for training.
The 10,000 test examples were used as-is.
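A sketch of how this data preparation could look (assumed, not the post's actual code), using scikit-learn's fetch_openml to pull MNIST:

```python
import numpy as np
from sklearn.datasets import fetch_openml

X, y = fetch_openml("mnist_784", version=1, as_frame=False, return_X_y=True)
y = y.astype(int)
X_train, y_train = X[:60000], y[:60000]    # standard MNIST training split
X_test, y_test = X[60000:], y[60000:]      # the 10,000 test examples, used as-is

rng = np.random.default_rng(0)
idx = np.concatenate([
    rng.choice(np.flatnonzero(y_train == d), size=2000, replace=False)
    for d in range(10)
])
X_train, y_train = X_train[idx], y_train[idx]   # 2,000 examples per digit = 20,000 total
```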
num_trees
First, I varied num_trees over 10, 100, 1000, and 10000,
with max_depth unlimited (trees grown all the way)
and num_features fixed to sqrt(number of features).
As expected, accuracy is low when there are too few trees, but adding more beyond a certain point gains nothing; around 1000 seems appropriate.
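The sweep itself could look like the sketch below (assumed, not the post's actual code; it reuses X_train/X_test from the data-preparation sketch above, and the same loop covers the depth and num_features sweeps by swapping the keyword being varied):

```python
from sklearn.ensemble import RandomForestClassifier

for n in (10, 100, 1000, 10000):
    clf = RandomForestClassifier(
        n_estimators=n,          # num_trees being varied
        max_depth=None,          # unlimited depth
        max_features="sqrt",     # sqrt(number of features)
        n_jobs=-1,
        random_state=0,
    )
    clf.fit(X_train, y_train)
    print(f"num_trees={n}: accuracy = {clf.score(X_test, y_test):.3f}")
```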
depth
Next, I varied the depth
over 10 values: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20,
with num_trees fixed at 1000 and max_features fixed at sqrt(number of features).
Indeed, even with deep trees overfitting does not seem to occur, so it is fine to let them grow deep.
num_features
I varied num_features over 10, 15, 20, 25, 30, 35, 40, 45, 50, and 55,
with num_trees fixed at 1000 and depth unlimited.
The peak seems to be around 35 or 40. The red line marks the default sqrt(number of features);
this time, a value somewhat larger than the default worked better.
Finally, about execution time.
Settings: num_trees = 1000, max_depth = max, max_features = 40, n_jobs = 6.
The experiment was run on CPU only.
The average over 3 runs was 28.8 seconds.
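Measuring that could look like this sketch (assumed, not the post's actual code; the post does not say whether the 28.8 seconds covers training only or training plus prediction, so this times both):

```python
import time
from sklearn.ensemble import RandomForestClassifier

times = []
for _ in range(3):
    clf = RandomForestClassifier(n_estimators=1000, max_depth=None,
                                 max_features=40, n_jobs=6)
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    clf.predict(X_test)
    times.append(time.perf_counter() - start)
print(f"average over 3 runs: {sum(times) / len(times):.1f} seconds")
```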
As a benchmark, I also classified MNIST with an SVM.
RBF kernel with C = 1.0 and gamma = 1/784.
No grid search was done.
I tried both one-vs-one and one-vs-rest.
As a caveat, the data was normalized to the 0-1 range (training did not go well without normalization; I am not sure why).
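The benchmark could be reproduced roughly as follows (assumed, not the post's actual code): SVC is one-vs-one by default, and OneVsRestClassifier provides the one-vs-rest variant. The macro-averaged F1 is used here as the F-value; the post does not say how its F-value was averaged.

```python
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X_train01, X_test01 = X_train / 255.0, X_test / 255.0   # normalize pixel values to 0-1

models = {
    "one_versus_one": SVC(kernel="rbf", C=1.0, gamma=1.0 / 784),   # SVC is one-vs-one by default
    "one_versus_rest": OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma=1.0 / 784)),
}
for name, model in models.items():
    model.fit(X_train01, y_train)
    print(name, f1_score(y_test, model.predict(X_test01), average="macro"))
```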
Method | F-value | Time
---|---|---
one_versus_one | 0.930 | 1 min 39 s
one_versus_rest | 0.922 | -
For sparse data like this, at least when there is no class imbalance, Random Forest was both more accurate and faster!!
One open question: why was one-vs-rest slower? A package issue?
My impression after trying Random Forest: it really is fast!!
Perhaps because I normally use only SVM, it feels especially fast...
It is also attractive that there are so few parameters; num_trees and max_features seem to be enough.
Next is boosting ...