Understand the Random Forest algorithm and try it with scikit-learn.
Random forest is an ensemble learning method based on bagging that combines multiple decision trees.
The no free lunch theorem originally comes from combinatorial optimization: averaged over all possible problems, every search algorithm performs equally well.
This is because each algorithm relies on its own assumptions, and not every problem satisfies them; an algorithm that works well on one problem will perform worse than the others on some other problem. In other words, no algorithm is superior to all the others across every possible problem.
In machine learning, the theorem is cited to argue that there is no single all-purpose learner that gives the best results on every problem.
Since the no free lunch theorem shows that no universal learner is best for every problem, it is natural to consider combining multiple learners.
The learning method of taking a majority vote over the outputs of multiple learners and using it as the final output is called ensemble learning. The individual classifiers used in ensemble learning are called weak classifiers because they only need to perform slightly better than random guessing.
Bagging (Bootstrap AGGregatING) is a typical ensemble learning method. As shown in the figure below, multiple classifiers are trained on bootstrap samples of the training data; for new data, the final output is the majority vote of the classifiers in classification and the average of their predictions in regression.
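As a concrete illustration, here is a minimal sketch of bagging with scikit-learn's BaggingClassifier using decision trees as the weak classifiers. The dataset and parameter values are chosen only for demonstration; with the scikit-learn 0.23.2 version used in this article the base learner is passed as base_estimator (newer releases rename it to estimator).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A binary classification dataset for demonstration
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train 10 decision trees on bootstrap samples of the training data;
# predictions on new data are made by majority vote of the 10 trees
bagging = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),
    n_estimators=10,
    bootstrap=True,
    random_state=0,
)
bagging.fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))
```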
Bagging allows the individual classifiers to be trained independently and in parallel. However, because bootstrap sampling draws with replacement, the samples overlap heavily, so when decision trees are used as the weak classifiers the trees tend to be highly correlated and end up looking alike, which can waste much of the benefit of the ensemble.
Random forest addresses this problem.
In random forest, when a decision tree is trained on a bootstrap sample, a specified number of features is randomly selected instead of using all of them, and the tree is built on that subset.
In bagging, each decision tree is simply constructed from a bootstrap sample; in random forest, the features used with each bootstrap sample are also randomly selected when constructing the tree, as shown in the figure below.
Randomizing the features used with each bootstrap sample in this way makes the individual decision trees diverse, and it can be expected to reduce the correlation between the trees, which was the problem with bagging.
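To make the idea concrete, the following is a simplified sketch written only for illustration, in which each tree is trained on a bootstrap sample restricted to a random subset of the features (the function names here are my own). Note that scikit-learn's implementation actually selects a random feature subset at each split rather than once per tree; the per-tree version below just mirrors the description above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_simple_random_forest(X, y, n_estimators=10, max_features=None, seed=0):
    """Train decision trees on bootstrap samples, each restricted to a
    random subset of the features (simplified, per-tree feature selection)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    if max_features is None:
        max_features = int(np.sqrt(n_features))  # default: sqrt of the number of features
    forest = []
    for _ in range(n_estimators):
        sample_idx = rng.choice(n_samples, size=n_samples, replace=True)        # bootstrap sample
        feature_idx = rng.choice(n_features, size=max_features, replace=False)  # random feature subset
        tree = DecisionTreeClassifier()
        tree.fit(X[sample_idx][:, feature_idx], y[sample_idx])
        forest.append((tree, feature_idx))
    return forest

def predict_simple_random_forest(forest, X):
    """Majority vote over the trees' predictions (assumes integer class labels)."""
    votes = np.array([tree.predict(X[:, idx]) for tree, idx in forest])
    # For each sample, pick the label predicted most often across the trees
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```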
In scikit-learn, the argument n_estimators specifies the number of weak classifiers and the argument max_features specifies the number of features to use. By default, the number of features used is the square root of the total number of features.
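For example, a minimal usage sketch with scikit-learn (the parameter values here are only illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators: number of weak classifiers (decision trees)
# max_features: number of features considered when splitting;
#               "sqrt" uses the square root of the number of features
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```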
・ CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
・ Windows 10 Pro 1909
・ Python 3.6.6
・ Matplotlib 3.3.1
・ Numpy 1.19.2
・ Scikit-learn 0.23.2
The implemented program is published on GitHub.
random_forest.py
Here is the result of applying a random forest to the breast cancer dataset that we have been using so far.
Accuracy 92.98%
Precision, Positive predictive value(PPV) 94.03%
Recall, Sensitivity, True positive rate(TPR) 94.03%
Specificity, True negative rate(TNR) 91.49%
Negative predictive value(NPV) 91.49%
F-Score 94.03%
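As a reference, the sketch below shows one way to compute the same kinds of metrics from the confusion matrix with scikit-learn. It will not reproduce the numbers above exactly, since the data split and the author's own implementation on GitHub differ from this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)            # positive predictive value (PPV)
recall      = tp / (tp + fn)            # sensitivity, true positive rate (TPR)
specificity = tn / (tn + fp)            # true negative rate (TNR)
npv         = tn / (tn + fn)            # negative predictive value
f_score     = 2 * precision * recall / (precision + recall)
print(f"Accuracy {accuracy:.2%}, Precision {precision:.2%}, Recall {recall:.2%}")
print(f"Specificity {specificity:.2%}, NPV {npv:.2%}, F-Score {f_score:.2%}")
```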
The figure below shows the decision boundaries when performing multiclass classification on the iris dataset.
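A minimal sketch for drawing such decision boundaries on two of the iris features is shown below; the feature choice and grid resolution are arbitrary here.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Use only the first two features so the boundaries can be drawn in 2D
X, y = load_iris(return_X_y=True)
X = X[:, :2]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Evaluate the classifier on a grid covering the feature space
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                  # colored regions show the decision boundaries
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")   # training points
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
```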
The data for the regression problem is a sine wave with random noise added. In regression, the average of the individual trees' outputs is the final prediction.
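A comparable sketch for the regression case, using a noisy sine wave generated here only for demonstration:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Sine wave plus random noise as the training data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 100)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# Each tree is trained on a bootstrap sample; the prediction is the
# average of the individual trees' outputs
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

X_plot = np.linspace(0, 2 * np.pi, 500).reshape(-1, 1)
plt.scatter(X, y, s=10, label="training data")
plt.plot(X_plot, reg.predict(X_plot), color="red", label="random forest")
plt.legend()
plt.show()
```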
1.11.2. Forests of randomized trees