Last time we evaluated performance on uniform data (the same number of samples per class), so this time we trained on data where the sample count varies per class.
The dataset is the same as last time: MNIST.
Last time, 2000 samples were extracted for each digit;
this time, 1100, 1300, 1500, 1700, 1900, 2100, 2300, 2500, 2700, and 2900 samples were extracted for digits 0 through 9, in order.
The test data is, as before, 10000 samples with a uniform class distribution.
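For reference, a minimal sketch of how such an imbalanced training set could be assembled with scikit-learn; this is my assumption of the setup, not the original script (the fetch_openml loading, the 60000/10000 split, and the seed are all assumed).

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Assumed loading: OpenML's mnist_784, first 60000 rows as the training pool,
# last 10000 rows as the uniform test set (standard MNIST ordering).
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(int)
X_pool, y_pool = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]

rng = np.random.default_rng(0)                 # assumed seed
counts = [1100 + 200 * d for d in range(10)]   # 1100 for digit 0 ... 2900 for digit 9

# Draw the specified number of samples for each digit, without replacement.
idx = []
for digit, n in enumerate(counts):
    candidates = np.where(y_pool == digit)[0]
    idx.extend(rng.choice(candidates, size=n, replace=False))
idx = rng.permutation(np.array(idx))

X_train, y_train = X_pool[idx], y_pool[idx]
```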
The variables changed are the following three:

- Number of trees
- Search depth
- Number of features
First, the number of trees was varied over four values: 10, 100, 1000, and 10000.
The results are shown below.
Even looking at the best value, the accuracy has dropped slightly compared with last time (the best was about 0.965?), but the trend is the same.
I think around 1000 trees is enough.
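A sketch of how this tree-count sweep might look with scikit-learn's RandomForestClassifier (assumed code, not the original; other parameters are left at their defaults):

```python
from sklearn.ensemble import RandomForestClassifier

# Vary only the number of trees; everything else stays at the defaults.
for n_trees in [10, 100, 1000, 10000]:
    clf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)
    print(n_trees, clf.score(X_test, y_test))
```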
Next, the search depth.
As before, models were trained with the depth varied from 2 to 20.
The number of trees is 1000, and the number of features is sqrt(number of features).
The results are shown below.
This is also the same as last time: the accuracy is almost unchanged, and overfitting does not occur even when searching deeply.
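A corresponding sketch for the depth sweep (assumed code): 1000 trees, max_features="sqrt", and max_depth varied from 2 to 20.

```python
from sklearn.ensemble import RandomForestClassifier

# 1000 trees, sqrt(number of features) candidates per split, depth 2..20.
for depth in range(2, 21):
    clf = RandomForestClassifier(n_estimators=1000, max_depth=depth,
                                 max_features="sqrt", n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)
    print(depth, clf.score(X_test, y_test))
```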
Finally, the number of features.
It was varied from 10 to 55.
The number of trees is 1000, and the depth is fixed at the maximum.
Since sqrt(784) is 28, it seems that using fewer features than that is slightly better this time?
However, since the differences are on the order of 0.001, it is fair to say there is no real difference as long as 20 or more features are used.
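A sketch of the feature-count sweep (assumed code): 1000 trees, unlimited depth, and max_features varied; the step size of 5 is my assumption, since only the range 10 to 55 is stated.

```python
from sklearn.ensemble import RandomForestClassifier

# 1000 trees, unlimited depth; vary the number of candidate features per split.
# The step size of 5 is an assumption; only the range 10 to 55 is given.
for n_features in range(10, 56, 5):
    clf = RandomForestClassifier(n_estimators=1000, max_depth=None,
                                 max_features=n_features, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)
    print(n_features, clf.score(X_test, y_test))
```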
Finally, the SVM result for comparison.
An RBF kernel with C = 1.0 and gamma = 1/784.
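A sketch of that SVM baseline (assumed code; scaling the pixel values to [0, 1] is my assumption, and with 784 features, gamma = 1/784 corresponds to scikit-learn's "auto" setting):

```python
from sklearn.svm import SVC

# RBF kernel, C = 1.0, gamma = 1/784; pixels scaled to [0, 1] (assumption).
svm = SVC(C=1.0, kernel="rbf", gamma=1.0 / 784)
svm.fit(X_train / 255.0, y_train)
print(svm.score(X_test / 255.0, y_test))
```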
Random Forest is still more accurate, but
the SVM accuracy is higher than last time...?
That is possible, considering the samples are drawn randomly,
but given that the accuracy of Random Forest dropped by about 0.05,
perhaps SVM is more robust to variation in the data...?
The accuracy on MNIST is so high that it is hard to evaluate much from it, though...