In this article, I will summarize the random forest algorithm.
Random forests are a combination of many decision trees, so you need to understand the decision tree algorithm first.
Please refer to the article here for the decision tree.
Random forest is a type of ensemble learning, so let's first talk about ensemble learning.
Ensemble learning is a technique that attempts to obtain better predictions by combining multiple learners.
In many cases, you will get better results than using a single model.
As for how the learners are combined: in classification, the majority vote of the learners is taken, and in regression, the average of their predictions is taken.
Commonly used techniques in ensemble learning include bagging, boosting, stacking, and bumping.
Random forest can be described as ensemble learning that uses decision trees as base learners and combines them with a technique called bagging.
That introduced a lot of terms at once, so let me explain each technique in turn.
I referred to the article here.
Bagging is an abbreviation for bootstrap aggregating.
Using a technique called bootstrap, several datasets are created from a single dataset, one learner is trained on each resampled dataset, and the final prediction is made by a majority vote of the learners created in this way.
Bootstrap is a method of sampling n data points from a dataset with replacement, so duplicates are allowed.
Let the dataset be $S_0 = (d_1, d_2, d_3, d_4, d_5)$. When sampling n = 5 data points, you might create datasets such as $S_1 = (d_1, d_1, d_3, d_4, d_5)$ or $S_2 = (d_2, d_2, d_3, d_4, d_5)$.
As you can see, bootstrap lets you create many different datasets from a single dataset.
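As a minimal sketch of bootstrap sampling (using numpy; the dataset here is just the five dummy points from the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
S0 = np.array(['d1', 'd2', 'd3', 'd4', 'd5'])

# Sample n = 5 points with replacement -> duplicates can appear
S1 = rng.choice(S0, size=5, replace=True)
S2 = rng.choice(S0, size=5, replace=True)
print(S1, S2)
```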
Let's consider the predicted value with a concrete example.
Generate N bootstrap datasets of size n from the training dataset.
Create N prediction models from those datasets, and let each model's prediction be $y_n(X)$.
The average of these N predictions is used as the final prediction, so the final prediction of a bagging model is as follows.
$$
y(X) = \frac{1}{N}\sum_{n=1}^{N}y_n(X)
$$
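A minimal sketch of this bagging average for regression (assuming scikit-learn's DecisionTreeRegressor as the base learner; the data and the value of N are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=100)

N = 10
models = []
for _ in range(N):
    # Bootstrap: draw indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
    models.append(tree)

# Final prediction = average of the N models' predictions
X_new = np.array([[0.5]])
y_pred = np.mean([m.predict(X_new) for m in models], axis=0)
print(y_pred)
```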
This is the end of the explanation of bagging. Next, let's look at boosting.
In boosting, weak learners are not created independently as in bagging; instead they are constructed one at a time, and the (k+1)-th weak learner is built based on the k-th weak learner (to compensate for its weaknesses).
Unlike bagging, where weak learners can be generated independently, boosting has to generate them one by one, so it takes more time. In return, boosting tends to be more accurate than bagging.
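As a rough sketch of this sequential idea (a simplified gradient-boosting-style loop for regression, where each new tree is fit to the residuals left by the learners built so far; the learning rate, depth, and data are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=100)

learning_rate = 0.3
pred = np.zeros_like(y)
trees = []
for k in range(20):
    residual = y - pred                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)  # the (k+1)-th learner corrects the k-th
    trees.append(tree)

# Prediction for new data = sum of all the weak learners' contributions
X_new = np.array([[0.5]])
y_new = sum(learning_rate * t.predict(X_new) for t in trees)
print(y_new)
```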
For bagging, we considered a simple average of N predicted values.
This algorithm evaluates individual predictions equally and does not take into account the importance of each model.
Stacking weights each individual prediction according to its importance and combines them into the final prediction.
It is expressed by the following formula.
$$
y(X) = \sum_{n=1}^{N}W_ny_n(X)
$$
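A minimal sketch of this weighted combination (here the weights $W_n$ are obtained by fitting a simple linear model on the base learners' predictions for held-out data; this is just one possible way to determine them, chosen for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=200)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Base learners with different depths so their predictions differ
base_models = [DecisionTreeRegressor(max_depth=d).fit(X_train, y_train) for d in (1, 3, 5)]

# Column n holds y_n(X) for the validation data
P = np.column_stack([m.predict(X_valid) for m in base_models])

# The meta-model's coefficients play the role of the weights W_n
meta = LinearRegression(fit_intercept=False).fit(P, y_valid)
print(meta.coef_)
```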
Bumping is a technique for finding the best-fitting model among multiple learners.
Generate N models using bootstrap datasets, apply each learner to the original data, and select the one with the smallest prediction error as the best model.
This may seem like a less beneficial method, but it helps avoid models that were trained on poor-quality samples.
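A minimal sketch of bumping (train N models on bootstrap samples and keep the one with the smallest error on the original data; the base learner, N, and the data are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=100)

best_model, best_error = None, np.inf
for _ in range(10):                                      # N = 10 bootstrap datasets
    idx = rng.integers(0, len(X), size=len(X))
    model = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
    error = mean_squared_error(y, model.predict(X))      # evaluate on the original data
    if error < best_error:
        best_model, best_error = model, error

print(best_error)
```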
So far we have dealt with ensemble learning.
Random forest is a method that uses bagging from ensemble learning, with decision trees as the base learners.
The algorithm is as follows.
1. Create N bootstrap datasets from the training data.
2. Use these datasets to generate N decision trees. At this time, m features are randomly selected from the p features.
3. In classification, the majority vote of the N decision trees is the final prediction; in regression, the average of the N trees' predictions is the final prediction.
There is a reason why only some of the features are used in step 2.
This is because in ensemble learning, the lower the correlation between models, the more accurate the predictions.
The intuition is that it is better to have people with different ideas than to have many similar people.
The bootstrap already trains each tree on different data, but by also varying the features, the trees are trained on even more different information, which lowers the correlation between models.
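Before moving to the library implementation, here is a minimal from-scratch sketch of these three steps for classification (scikit-learn's DecisionTreeClassifier with max_features handles the random feature selection at each split; N and m are arbitrary illustrative values):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
rng = np.random.default_rng(0)

N = 50          # number of trees
m = 1           # features considered at each split (out of p = 2)
trees = []
for _ in range(N):
    idx = rng.integers(0, len(X), size=len(X))                          # 1. bootstrap dataset
    tree = DecisionTreeClassifier(max_features=m).fit(X[idx], y[idx])   # 2. tree using m random features per split
    trees.append(tree)

# 3. majority vote over the N trees
votes = np.array([t.predict(X) for t in trees])          # shape (N, n_samples)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((y_pred == y).mean())
```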
Now let's implement it.
This time, let's classify the data generated by make_moons in sklearn.
Let's draw the data with the following code.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from matplotlib.colors import ListedColormap
import mglearn
moons = make_moons(n_samples=200, noise=0.2, random_state=0)
X = moons[0]
Y = moons[1]
mglearn.discrete_scatter(X[:, 0], X[:, 1], Y)
plt.show()
mglearn.discrete_scatter can draw the data by taking (x coordinate, y coordinate, correct label) as arguments.
Let's draw using normal ax.plot instead of mglearn. I created the function as follows.
def plot_datasets(x, y):
    figure = plt.figure(figsize=(12, 8))
    ax = figure.add_subplot(111)
    ax.plot(x[:, 0][y == 0], x[:, 1][y == 0], 'bo', ms=15)
    ax.plot(x[:, 0][y == 1], x[:, 1][y == 1], 'r^', ms=15)
    ax.set_xlabel('$x_0$', fontsize=15)
    ax.set_ylabel('$x_1$', fontsize=15)

plot_datasets(X, Y)
plt.show()
'bo' means a blue circle, and 'r^' means a red triangle.
Let me summarize this format. The first character specifies the color, using the initial letter of color names such as 'red', 'blue', 'green', and 'cyan'.
The second character specifies the marker shape: 's', 'x', 'o', '^', and 'v' are squares, crosses, circles, upward triangles, and downward triangles, in that order.
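For example (the points here are dummy values just to show the format strings):

```python
import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4], 'gs')   # green squares
plt.plot([0, 1, 2], [0, 2, 3], 'c^')   # cyan upward triangles
plt.plot([0, 1, 2], [1, 0, 2], 'rx')   # red crosses
plt.show()
```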
We will classify the above data using a random forest.
Below is the code.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
def plot_decision_boundary(model, x, y, ax, margin=0.3):
    # Build a grid that covers the data range with some margin
    _x = np.linspace(x[:, 0].min() - margin, x[:, 0].max() + margin, 100)
    _y = np.linspace(x[:, 1].min() - margin, x[:, 1].max() + margin, 100)
    xx, yy = np.meshgrid(_x, _y)
    X = np.hstack((xx.ravel().reshape(-1, 1), yy.ravel().reshape(-1, 1)))
    # Predict the class for every grid point and color the regions
    y_pred = model.predict(X).reshape(yy.shape)
    custom_cmap = ListedColormap(['green', 'cyan'])
    ax.contourf(xx, yy, y_pred, alpha=0.3, cmap=custom_cmap)

def plot_datasets(x, y, ax):
    ax.plot(x[:, 0][y == 0], x[:, 1][y == 0], 'gs', ms=15)
    ax.plot(x[:, 0][y == 1], x[:, 1][y == 1], 'c^', ms=15)
    ax.set_xlabel('$x_0$', fontsize=15)
    ax.set_ylabel('$x_1$', fontsize=15)
moons = make_moons(n_samples=200, noise=0.2, random_state=0)
X = moons[0]
Y = moons[1]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
random_clf = RandomForestClassifier()
random_clf.fit(X_train, Y_train)
figure = plt.figure(figsize=(12, 8))
ax = figure.add_subplot(111)
plot_datasets(X, Y, ax)
plot_decision_boundary(random_clf, X, Y, ax)
plt.show()
You can see that the data is classified quite well.
I will explain the code.
_x = np.linspace(x[:, 0].min() - margin, x[:, 0].max() + margin, 100)
_y = np.linspace(x[:, 1].min() - margin, x[:, 1].max() + margin, 100)
xx, yy = np.meshgrid(_x, _y)
This code creates the grid points. Please refer to the article here for grid points.
The grid points cover the data range, extended by a margin beyond the minimum and maximum values.
X = np.hstack((xx.ravel().reshape(-1, 1), yy.ravel().reshape(-1, 1)))
y_pred = model.predict(X).reshape(yy.shape)
The 100 × 100 grid data is flattened to a one-dimensional array with ravel(), reshaped into a 10000 × 1 column vector with reshape(-1, 1), and the two columns are concatenated horizontally with np.hstack.
`y_pred = model.predict(X).reshape(yy.shape)` runs the model's prediction on the 10000 × 2 data. The model returns 0 on one side of the decision boundary and 1 on the other, so the result is reshaped back into 100 × 100 data.
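As a tiny worked example of this reshaping (a 2 × 2 grid instead of 100 × 100, so the intermediate shapes are easy to follow):

```python
import numpy as np

xx, yy = np.meshgrid(np.array([0, 1]), np.array([0, 1]))  # each is 2 x 2
X = np.hstack((xx.ravel().reshape(-1, 1), yy.ravel().reshape(-1, 1)))
print(X.shape)  # (4, 2): one (x0, x1) pair per grid point
print(X)
```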
custom_cmap = ListedColormap(['green', 'cyan'])
ax.contourf(xx, yy, y_pred, alpha=0.3, cmap=custom_cmap)
The colors used for the filled contours are specified by custom_cmap, and the decision regions are drawn by `ax.contourf(xx, yy, y_pred, alpha=0.3, cmap=custom_cmap)`.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)
random_clf = RandomForestClassifier()
random_clf.fit(X_train, Y_train)
This code splits the data into training and test sets, creates a random forest model, and trains it. Now let's evaluate the model with the code below.
print(random_clf.score(X_test, Y_test))
0.96
That's all for this article.
Thank you for reading.