This time, I will summarize the theory behind AdaBoost in ensemble learning.
There are many terms in this area, so I will explain the terms used in ensemble learning first.
AdaBoost's theory is summarized later, so skip ahead if you want to get to it right away.
The following articles were very helpful in writing this article.
Machine learning ⑤ AdaBoost summary
Detailed explanation of boosting and AdaBoost
First pattern recognition Chapter 11 boosting
I researched AdaBoost, so I will summarize it
Let's do our best.
Ensemble learning is a technique that attempts to obtain better predictions by combining multiple learners.
In many cases, you will get better results than using a single model.
As for how to combine multiple learners, in classification the majority vote of the learners is taken, and in regression the average of the learners' predictions is taken.
Commonly used techniques in ensemble learning include bagging, boosting, stacking, and bumping.
Random forest can be said to be ensemble learning that uses decision trees as learners with a technique called bagging.
A lot of terms have come up and it is getting hard to follow, so I will explain each technique one by one.
Bagging is an abbreviation for bootstrap aggregating.
Using a technique called the bootstrap, several datasets are created from a single dataset, one learner is generated for each resampled dataset, and the final prediction is made by taking a majority vote of the learners created in this way.
The bootstrap is a method of sampling n pieces of data from a dataset with replacement, i.e. allowing duplicates.
Let the dataset be $ S_0 = (d_1, d_2, d_3, d_4, d_5) $. When sampling n = 5 pieces of data, you might create datasets such as $ S_1 = (d_1, d_1, d_3, d_4, d_5) $ or $ S_2 = (d_2, d_2, d_3, d_4, d_5) $.
As you can see, the bootstrap lets you create many different datasets from a single dataset.
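As a minimal sketch of this sampling (the tiny dataset and n = 5 are just illustrative), the bootstrap can be written with NumPy like this:

```python
import numpy as np

rng = np.random.default_rng(0)
S0 = np.array(["d1", "d2", "d3", "d4", "d5"])  # original dataset (illustrative)

# Sample n = 5 elements *with replacement* to form one bootstrap dataset
S1 = rng.choice(S0, size=5, replace=True)
print(S1)  # some elements may appear more than once
```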
Let's consider the predicted value with a concrete example.
Generate N bootstrap datasets of size n from the training dataset.
Create N prediction models using those datasets, and let each model's prediction be $ y_n (X) $.
The average of these N predictions is taken as the final prediction, so the final prediction of a model that uses bagging is as follows.
y(X) = \frac{1}{N}\sum_{n=1}^{N}y_n(X)
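As a rough sketch of this averaging (the choice of decision trees as the individual models and the dummy regression data are my own assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

N = 10  # number of bootstrap datasets / models
models = []
for _ in range(N):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample of size n = len(X)
    models.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))

# final prediction y(X): the average of the N individual predictions y_n(X)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = np.mean([m.predict(X_test) for m in models], axis=0)
print(y_pred)
```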
This is the end of the explanation of bagging. Next, let's look at boosting.
In boosting, weak learners are not created independently as in bagging; instead, they are constructed one by one, and the (k+1)-th weak learner is built based on the k-th weak learner so as to compensate for its weaknesses.
Unlike bagging, which generates weak learners independently, boosting has to generate them one after another, so it takes more time. In exchange, boosting tends to be more accurate than bagging.
For bagging, we considered a simple average of N predicted values.
That algorithm treats the individual predictions equally and does not take the importance of each model into account.
Stacking instead weights the individual predictions according to their importance and combines them into the final prediction.
It is expressed by the following formula.
y(X) = \sum_{n=1}^{N}W_ny_n(X)
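As a minimal sketch of this weighted combination (the base models are my own choice, and I obtain the weights $ W_n $ by fitting a simple linear model on the base models' predictions; in practice out-of-fold predictions are usually used to avoid leakage):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# base learners y_n(X) (the choice of models is illustrative)
base_models = [DecisionTreeRegressor(max_depth=3).fit(X, y),
               KNeighborsRegressor(n_neighbors=5).fit(X, y)]

# learn the weights W_n from the stacked predictions
P = np.column_stack([m.predict(X) for m in base_models])
meta = LinearRegression().fit(P, y)
print(meta.coef_)  # these coefficients play the role of the weights W_n
```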
Bumping is a technique for finding the best model among multiple learners.
Generate N models using bootstrap datasets, apply each of those learners to the original data, and select the one with the smallest prediction error as the best model.
This may seem like a method with little benefit, but it helps avoid a model that was fit to poor-quality data.
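A minimal sketch of bumping under the same assumptions as above (decision trees on dummy regression data, mean squared error as the prediction error):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

N = 10
best_model, best_err = None, np.inf
for _ in range(N):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample
    model = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
    err = mean_squared_error(y, model.predict(X))    # error on the *original* data
    if err < best_err:
        best_model, best_err = model, err            # keep only the single best model

print(best_err)
```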
Now let's talk about AdaBoost's theory.
Since both AdaBoost and gradient boosting are types of boosting, they are algorithms that sequentially generate weak learners.
AdaBoost roughly has the following four features.
Uses weighted datasets when training the weak learners
Gives a large weight to data points misclassified by the previous learner
Initially gives equal weight to all data points
Outputs the final prediction by combining the weak learners generated and trained in this way
In this way, AdaBoost can be said to be an algorithm that tries to strengthen the ensemble by emphasizing misclassified data.
The image is as follows.
Image extracted from Alexander Ihler's YouTube video
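Before going into the math, here is roughly what using AdaBoost looks like in practice with scikit-learn's AdaBoostClassifier (the dataset and parameters are just illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# By default the weak learners are depth-1 decision trees (decision stumps)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```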
Now let's look at how to actually weight the misclassified data with mathematical formulas.
I will explain it as simply as possible, so please bear with me.
Consider the case of generating M weak learners to classify N data.
The final output of the strong learner is as follows.
H(x) = sign(\sum_{m=1}^{M}\alpha_my_m(x))
sign is called the sign function; it returns -1 if its argument is less than 0 and 1 if its argument is greater than 0. $ y_m (x) $ is the m-th weak learner, and $ \alpha_m $ is the weight of that weak learner.
In other words, the weak learners' outputs are weighted and summed; if the sum is greater than 0 the final output is 1, and if it is less than 0 the final output is -1. Now let's see how $ y_m (x) $ and $ \alpha_m $ are determined.
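Just to make this final output concrete, here is the weighted vote written out with made-up numbers for $ \alpha_m $ and the weak learners' outputs:

```python
import numpy as np

alphas = np.array([0.8, 0.4, 1.1])          # alpha_m: weight of each weak learner (made-up)
weak_outputs = np.array([1, -1, 1])         # y_m(x): each weak learner's prediction in {-1, +1}

H = np.sign(np.sum(alphas * weak_outputs))  # final output H(x)
print(H)                                    # 1.0, since the weighted sum 0.8 - 0.4 + 1.1 > 0
```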
Since equal weights are given to the data at first, the weight coefficients {$ w_n $} of the data $ n = 1, 2, 3, ..., N $ are initialized to $ w_n^{(1)} = \frac{1}{N} $.
The subscript of $ w $ is the data index, and the superscript is the index of the weak learner. In other words, for the first weak learner, the weight of every data point is $ \frac{1}{N} $.
Repeat the following operations for every learner $ m = 1, 2, 3, ..., M $.
J_m = \sum_{n=1}^{N}w_n^{(m)}I(y_m(x_n)\neq t_n)
$ w_n^{(m)} $ is the weight of the n-th data point in the m-th weak learner.
$ I $ is the indicator function: when $ y_m(x_n) \neq t_n $, that is, when the prediction and the correct label for a data point do not match, it returns 1, and when $ y_m(x_n) = t_n $, that is, when the prediction and the correct label match, it returns 0.
In other words, $ J_m $ is the sum of the weights of the data points that the weak learner $ m $ predicted incorrectly.
\epsilon_m = \frac{\sum_{n=1}^{N}w_n^{(m)}I(y_m(x_n)\neq t_n)}{\sum_{n=1}^{N}w_n^{(m)}}
\alpha_m = \ln \bigl(\frac{1-\epsilon_m}{\epsilon_m} \bigr)
The denominator of $ \epsilon_m $ is the sum of the weights of all the data for the weak learner $ m $, and the numerator is the sum of the weights of the data points that the weak learner $ m $ mispredicted.
In other words, $ \epsilon_m $ can be seen as the error rate expressed in terms of the weights.
It is illustrated below.
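As a quick numerical check of the relationship between $ \epsilon_m $ and $ \alpha_m $ (the numbers are made up):

\epsilon_m = 0.25 \Rightarrow \alpha_m = \ln 3 \approx 1.10, \qquad \epsilon_m = 0.5 \Rightarrow \alpha_m = \ln 1 = 0, \qquad \epsilon_m = 0.75 \Rightarrow \alpha_m = \ln \frac{1}{3} \approx -1.10

So the more accurate the learner, the larger its positive weight in the final vote.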
The following formula was the final output.
H(x) = sign(\sum_{m=1}^{M}\alpha_my_m(x))
In other words, a learner with a high error rate ($ \epsilon_m $ greater than 0.5) gets a negative weight $ \alpha_m $, while a learner with a low error rate ($ \epsilon_m $ less than 0.5) gets a positive weight, and their outputs are added together with these weights. If the sum is positive, the final output is $ H(x) = 1 $, and if the sum is negative, the final output is $ H(x) = -1 $.
Update the weights of the data points with the following formula.
w_n^{(m+1)} = w_n^{(m)} \exp \bigl(\alpha_m I(y_m(x_n) \neq t_n) \bigr)
If the prediction of the weak learner $ m $ for the $ n $-th data point matches the correct label, the weight is not updated; only when the prediction and the correct label differ is the weight multiplied by $ \exp(\alpha_m) $ to update it.
The relationship between $ \alpha_m $ and $ \exp(\alpha_m) $ is illustrated below.
For a learner with a high error rate $ \epsilon_m $, $ \alpha_m $ is small (or even negative), so the factor $ \exp(\alpha_m) $ is small and the updated weight $ w_n^{(m+1)} $ of a misclassified point barely grows.
For a learner with a low error rate $ \epsilon_m $, that is, an accurate learner, $ \alpha_m $ is large, so $ \exp(\alpha_m) $ is large and the updated weight $ w_n^{(m+1)} $ of a misclassified point grows a lot.
In summary, an accurate learner passes the data on to the next learner with the weights of the points it got wrong increased sharply, while an inaccurate learner passes them on with those weights barely increased.
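Putting the steps above together, a minimal from-scratch sketch might look like the following. Note my assumptions: depth-1 decision trees (stumps) as the weak learners $ y_m(x) $, labels converted to {-1, +1}, and the weights renormalized each round (a common practice that the derivation above does not require).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
t = np.where(y == 1, 1, -1)             # correct labels t_n in {-1, +1}

M = 20                                  # number of weak learners
N = len(X)
w = np.full(N, 1.0 / N)                 # initial weights w_n^(1) = 1/N
learners, alphas = [], []

for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, t, sample_weight=w)    # train on the weighted dataset
    pred = stump.predict(X)

    miss = (pred != t)                                  # I(y_m(x_n) != t_n)
    eps = np.sum(w * miss) / np.sum(w)                  # epsilon_m: weighted error rate
    eps = np.clip(eps, 1e-10, 1 - 1e-10)                # guard against log(0)
    alpha = np.log((1 - eps) / eps)                     # alpha_m

    w = w * np.exp(alpha * miss)        # increase the weights of misclassified points
    w = w / np.sum(w)                   # renormalize (assumption, see lead-in)

    learners.append(stump)
    alphas.append(alpha)

# final output H(x) = sign(sum_m alpha_m * y_m(x))
H = np.sign(sum(a * l.predict(X) for a, l in zip(alphas, learners)))
print((H == t).mean())                  # training accuracy of the combined learner
```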
The above is the theory of AdaBoost.
Thank you for reading this far.
If I feel up to it, I would also like to summarize the implementation.