I've summarized the random forest algorithm as a series of points, without using any formulas. The explanation assumes you already understand decision trees, so if you would like to learn the decision tree algorithm first, please refer to here.
In studying Random Forest, I referred to the following resource. Although it is in English, it explains the algorithm in detail and in an easy-to-understand way, so please refer to it if you want to understand Random Forest more deeply.
Put simply, random forest is an algorithm that builds many decision trees and takes a majority vote over them. It has the following features and performs very well.
- Very good accuracy
- Suppresses overfitting
- Can handle thousands of input variables as they are, without deleting any
- No need to scale the variables beforehand
- Accuracy on unknown data can be estimated within the algorithm itself, without cross-validation or a separate test set
- Has a built-in method for estimating missing values, so accuracy can be maintained even on data with many missing values
- Provides information about the relationships between the data points
In this article, I will explain the key points of the random forest algorithm, including why it has these characteristics.
The random forest algorithm can be broken down into four main points. If you understand these four, I think you can grasp the outline of Random Forest.
- Bootstrapping
- Narrowing down the variables used
- Bagging
- OOB verification
Random forest is a collection of decision trees. We build many decision trees to form a forest, and there is a point to how each tree is built. The first point is how the training data is selected when building a decision tree. In Random Forest, each decision tree is trained on a sample drawn from the full data set **with replacement (duplicates allowed)**, and that sample is used as the training data for that tree. This technique is called **bootstrapping**.
Because the sampling is done with replacement, the same data point may of course appear more than once in a tree's training data.
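As a concrete illustration, here is a minimal sketch of bootstrap sampling with NumPy (the toy data and the random seed are purely illustrative):

```python
import numpy as np

# Toy dataset: 6 samples with 2 variables each (illustrative values only)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 0, 1, 1, 0, 1])

rng = np.random.default_rng(0)

# Bootstrap sample: draw n indices *with replacement* from the n original rows
n = len(X)
idx = rng.integers(0, n, size=n)

X_boot, y_boot = X[idx], y[idx]
print(idx)      # some indices appear more than once, others not at all
print(X_boot)   # the training data for one decision tree
```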
It is also important to narrow down the variables used when creating each node (split) in each decision tree. If the data has $p$ variables in total, then instead of considering all of them at each node, $m$ variables are selected at random and only those are used. A value of about $m = \sqrt{p}$ is commonly used.
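As a rough sketch of what this looks like (the numbers are made up; in scikit-learn this behavior is controlled by the `max_features` parameter):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 16               # total number of variables (illustrative)
m = int(np.sqrt(p))  # roughly sqrt(p) variables are considered per split

# At each node a fresh random subset of m variable indices is drawn,
# and only these variables are candidates for the split
candidate_features = rng.choice(p, size=m, replace=False)
print(candidate_features)  # e.g. 4 of the 16 variable indices

# In scikit-learn this corresponds to RandomForestClassifier(max_features="sqrt")
```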
A large number of decision trees is created by repeating the process "① → ②". Usually 100 or more trees are built (the default in sklearn's random forest is 100). Repeating "① → ②" produces a variety of different trees, and this diversity is what makes the random forest effective.
**In Random Forest, the final output is decided by a majority vote of the individual trees.** The approach of selecting each learner's training data by bootstrapping, making predictions with each learner, and finally combining them into an ensemble is called **bagging** (short for bootstrap aggregating), and random forest is one example of it.
Bagging itself reduces variance and helps avoid overfitting, so it is used in many places besides random forests.
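To make the idea concrete, here is a minimal hand-rolled sketch of bagging with decision trees on the iris data; scikit-learn's RandomForestClassifier (or BaggingClassifier) does all of this for you, and the settings below are just illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

n_trees = 100
trees = []
for _ in range(n_trees):
    # ① bootstrap sample of the training data
    idx = rng.integers(0, len(X), size=len(X))
    # ② each tree also narrows down the candidate variables at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Final prediction: majority vote over the individual trees
all_preds = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
majority = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, all_preds)
print((majority == y).mean())
```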
When you sample data with bootstrapping, some data points are inevitably **never selected**. Roughly one third of the original data is left out, and that left-out data is called **OOB (out-of-bag) data**.
Another big point is that this OOB data lets you **estimate the accuracy on unknown data without cross-validation or a separate test set**.
The OOB data is classified with the trained random forest and the classification error is checked (for regression, a metric such as the mean squared error is used). For classification, the proportion of OOB data that is misclassified is called the OOB error (out-of-bag error).
We aim to maximize accuracy by adjusting the number of variables used at each split (point ②) according to this **OOB error rate**.
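For example, with scikit-learn you can compare the OOB accuracy (1 - OOB error) for several values of `max_features` and pick the best one; a minimal sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Try several values of m (max_features) and compare their OOB accuracy.
# oob_score_ is computed from the OOB data, so no separate test set is needed.
for m in (1, 2, 3, 4):
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"max_features={m}: OOB accuracy = {rf.oob_score_:.3f}")
```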
The following is also often introduced as an important point of Random Forest.
- Handling of missing values and the **proximity matrix** (proximities)
This is often not mentioned in descriptions of Random Forest (I didn't know about it at first either), but I think it is a big part of what makes Random Forest an effective algorithm.
A major feature of Random Forest is that it can handle missing values without problems. If the training data contains missing values, they are estimated by the following steps.
① For the time being, fill each missing value with the average value of the other data.
↓
② Build a random forest model.
↓
③ **Re-estimate the missing values using the proximity matrix.**
The proximity matrix is a bit confusing, but it is constructed as follows. After the random forest has been built, all of the data (both the data used for training and the OOB data) is run through the forest. In each decision tree, every data point is passed down the branches until it reaches a terminal (leaf) node, and for every pair of data points that end up in the same leaf, their proximity is increased by 1. These counts are accumulated in a matrix of size (number of data points) × (number of data points), which becomes the proximity matrix.
The proximities are accumulated over every decision tree in the forest and, at the end, divided by the number of trees to normalize them; the result is the proximity matrix used to estimate the missing values.
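scikit-learn does not return proximities directly, but they can be computed from the leaf indices returned by the `apply()` method; here is a minimal sketch (dataset and settings are just for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# apply() returns, for every sample, the index of the leaf it ends up in
# for each tree: shape (n_samples, n_trees)
leaves = rf.apply(X)

n = len(X)
proximity = np.zeros((n, n))
for t in range(leaves.shape[1]):
    # +1 for every pair of samples that share a leaf in tree t
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    proximity += same_leaf

proximity /= leaves.shape[1]  # divide by the number of trees to normalize
print(proximity.shape)        # (150, 150); the diagonal is always 1
```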
Random forest uses this proximity matrix to estimate missing values. First, a random forest is built on the data in which the missing values have been filled in with averages, and the missing values are then re-estimated using the proximity matrix obtained from that forest.
Each missing value is re-estimated using the proximities as weights: for a numeric variable it is replaced by a proximity-weighted average of the other data points' values (for a categorical variable, by a proximity-weighted vote).
By repeating these steps, the missing values are estimated more and more accurately.
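Below is a rough sketch of one pass of this re-estimation for numeric variables, assuming a proximity matrix computed as in the sketch above; the function name and interface are my own illustration, not code from the original algorithm:

```python
import numpy as np

def reestimate_missing(X_filled, missing_mask, proximity):
    """One proximity-weighted re-estimation pass for numeric variables.
    X_filled: data whose missing entries were initially filled (e.g. column means)
    missing_mask: boolean array marking the originally missing entries
    proximity: (n_samples, n_samples) proximity matrix"""
    X_new = X_filled.copy()
    for i, j in zip(*np.where(missing_mask)):
        w = proximity[i].copy()
        w[i] = 0.0  # do not use the sample's own (filled-in) value
        if w.sum() > 0:
            # proximity-weighted average of the other samples' values in column j
            X_new[i, j] = np.average(X_filled[:, j], weights=w)
    return X_new
```

In practice you would alternate building the forest, recomputing the proximities, and applying a function like this a few times, which corresponds to the repetition described above.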
**This proximity matrix can be used not only to estimate missing values but also to represent the relationships between data points.** Since (1 - proximity) can be interpreted as a distance between data points, you can visualize these relationships by turning the matrix into a heat map or a two-dimensional plot. (Note that the random forest in the sklearn library does not provide a function to output the proximity matrix.)
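For example, continuing from the proximity sketch above (it reuses the `proximity` matrix and the labels `y` computed there), the (1 - proximity) distances can be embedded into two dimensions with MDS and plotted; this is just one possible way to visualize the relationships:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# Treat (1 - proximity) as a distance matrix and embed it in two dimensions
distance = 1.0 - proximity
embedding = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = embedding.fit_transform(distance)

plt.scatter(coords[:, 0], coords[:, 1], c=y)
plt.title("2-D embedding based on the random forest proximity matrix")
plt.show()
```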
Below, we will try running a random forest with sklearn. Detailed parameter explanations are not given here, so please check the Reference if you are interested.
Let's build a random forest using the iris dataset. We split the data into training data and test data, build a random forest on the training data, and use the test data to check its accuracy. The accuracy measured on the OOB data is also printed, so we can check whether, as described in the characteristics above, it is close to the accuracy on unknown data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the iris dataset
iris = load_iris()
data = iris['data']
target = iris['target']

# Split into training and test data (70% of the data is held out as test data here)
X_train, X_test, Y_train, Y_test = train_test_split(data, target, test_size=0.7, shuffle=True, random_state=42)

# oob_score=True makes the classifier also compute the accuracy on the OOB data
rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train, Y_train)

# Accuracy on the held-out test data
print('test_data_accuracy:' + str(rf.score(X_test, Y_test)))
# Accuracy estimated from the OOB data
print('oob_data_accuracy:' + str(rf.oob_score_))
The output is as follows.
test_data_accuracy:0.9428571428571428
oob_data_accuracy:0.9555555555555556
The accuracy measured on the OOB data turned out to be higher, but since the test-data figure is not the result of cross-validation, the difference may well be within the margin of error. Also, when the total amount of data is small, the accuracy estimated from the OOB data appears to be less reliable.
Next, I would like to publish an article summarizing gradient boosting.