[Machine learning] I tried to summarize the theory of AdaBoost

Introduction

This time, I will summarize the theory behind AdaBoost, one of the ensemble learning methods.

There are many terms in this area, so I will explain the terms used for ensemble learning first.

The theory of AdaBoost itself comes later in the article, so skip ahead if that is what you want to read first.

The following articles were very helpful in writing this article.

Machine learning ⑤ AdaBoost summary

Detailed explanation of boosting and AdaBoost

First pattern recognition Chapter 11 boosting

I researched AdaBoost, so I will summarize it

Let's do our best.

What is ensemble learning?

Ensemble learning is a technique that attempts to obtain better predictions by combining multiple learning machines.

In many cases, you will get better results than using a single model.

As for how to combine multiple learners: for classification, the majority vote of the learners' outputs is taken, and for regression, their average is taken.

Commonly used techniques in ensemble learning include bagging, boosting, stacking, and bumping.

Random forest can be seen as ensemble learning that uses a technique called bagging with decision trees as the learners.

That was a lot of terminology at once, so let me explain each technique in turn.

About bagging

Bagging is an abbreviation for bootstrap aggregating.

Using a technique called bootstrap sampling, you create several datasets from a single dataset, train one learner on each resampled dataset, and take a majority vote of the learners created in this way to make the final prediction.

Bootstrap sampling is a method of drawing n data points from a dataset with replacement, so duplicates are allowed.

For example, let the dataset be $S_0 = (d_1, d_2, d_3, d_4, d_5)$. Sampling n = 5 data points may produce datasets such as $S_1 = (d_1, d_1, d_3, d_4, d_5)$ or $S_2 = (d_2, d_2, d_3, d_4, d_5)$.

As you can see, bootstrap sampling lets you create many different datasets from a single dataset.
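For reference, here is a minimal NumPy sketch of bootstrap sampling (the function name and the tiny example dataset are just for illustration):

```python
import numpy as np

def bootstrap_sample(data, n, seed=None):
    """Draw n points from `data` with replacement (bootstrap sampling)."""
    rng = np.random.default_rng(seed)
    indices = rng.integers(0, len(data), size=n)  # duplicates are allowed
    return data[indices]

S0 = np.array(["d1", "d2", "d3", "d4", "d5"])
print(bootstrap_sample(S0, n=5))  # e.g. something like ['d1' 'd1' 'd3' 'd4' 'd5']
```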

Let's consider the predicted value with a concrete example.

Generate N bootstrap datasets of size n from the training dataset.

Train N prediction models on those datasets, and let the prediction of the nth model be $y_n(X)$.

Taking the average of these N predictions as the final prediction, the output of a bagging model is as follows.

y(X) = \frac{1}{N}\sum_{n=1}^{N}y_n(X)
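As a concrete sketch of this procedure (assuming, purely for illustration, scikit-learn decision trees as the N learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X_train, t_train, X_test, N=10, seed=0):
    """Fit N models on bootstrap resamples and average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(N):
        idx = rng.integers(0, n, size=n)                  # bootstrap resample
        model = DecisionTreeRegressor().fit(X_train[idx], t_train[idx])
        preds.append(model.predict(X_test))               # y_n(X)
    return np.mean(preds, axis=0)                         # y(X) = (1/N) sum_n y_n(X)
```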

This is the end of the explanation of bagging. Next, let's look at boosting.

About boosting

In boosting, weak learners are not created independently as in bagging; they are constructed one at a time. The (k+1)th weak learner is built based on the kth weak learner, so as to compensate for its weaknesses.

Unlike bagging, which generates its weak learners independently, boosting has to generate them one by one, so it takes more time. In exchange, boosting tends to be more accurate than bagging.

Stacking

For bagging, we considered a simple average of N predicted values.

This algorithm evaluates individual predictions equally and does not take into account the importance of each model.

Stacking weights the individual predictions according to their importance to form the final prediction.

It is expressed by the following formula.

y(X) = \sum_{n=1}^{N}W_ny_n(X)
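A minimal sketch of this weighted combination. Here the weights $W_n$ are learned with a simple linear meta-model, which is just one common choice, and `base_preds` is assumed to be an array of shape (number of samples, N) holding each base model's predictions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def stacking_weights(base_preds, t_valid):
    """Learn the weights W_n from base model predictions on a validation set."""
    meta = LinearRegression(fit_intercept=False).fit(base_preds, t_valid)
    return meta.coef_                         # W_1, ..., W_N

def stacking_predict(base_preds, weights):
    """y(X) = sum_n W_n * y_n(X)."""
    return base_preds @ weights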

Bumping

Bumping is a technique for finding the best model among multiple learners.

Generate N models using bootstrap datasets, apply each trained learner to the original data, and select the one with the smallest prediction error as the best model.

This may seem like a less beneficial method, but it avoids learning with poor quality data.
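A minimal sketch of bumping, again assuming decision trees and squared error purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bumping(X, t, N=10, seed=0):
    """Fit N models on bootstrap resamples and keep the one whose
    prediction error on the ORIGINAL data is smallest."""
    rng = np.random.default_rng(seed)
    n = len(X)
    best_model, best_err = None, np.inf
    for _ in range(N):
        idx = rng.integers(0, n, size=n)                    # bootstrap resample
        model = DecisionTreeRegressor(max_depth=3).fit(X[idx], t[idx])
        err = np.mean((model.predict(X) - t) ** 2)          # error on the original data
        if err < best_err:
            best_model, best_err = model, err
    return best_model
```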

About AdaBoost's theory

Now let's talk about AdaBoost's theory.

Since both AdaBoost and gradient boosting are types of boosting, they are algorithms that sequentially generate weak learners.

Roughly speaking, it has the following four features: weights are assigned to the training data; data misclassified by a weak learner is given a larger weight before the next weak learner is trained; each weak learner is itself weighted according to its accuracy; and the final output is the weighted majority vote of all the weak learners.

In this way, AdaBoost can be described as an algorithm that tries to improve performance by emphasizing misclassified data.

The image is as follows.

image.png

Extracted from Alexander Ihler's YouTube video.

Now let's look at how to actually weight the misclassified data with mathematical formulas.

I will explain it as simply as possible, so please bear with me.

Consider the case of generating M weak learners to classify N data points.

The final output of the strong learner is as follows.

H(x) = sign(\sum_{m=1}^{M}\alpha_m y_m(x))

sign is the sign function: it returns -1 if its argument is less than 0 and 1 if its argument is greater than 0. $y_m(x)$ is the mth weak learner, and $\alpha_m$ is the weight given to that weak learner.

In other words, the weak learners' outputs are weighted and summed; if the sum is greater than 0 the final output is 1, and if it is less than 0 the final output is -1. Now let's see how $y_m(x)$ and $\alpha_m$ are determined.
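As a small sketch of this final output in code (the array names are illustrative; `weak_outputs` is assumed to hold each weak learner's ±1 predictions):

```python
import numpy as np

def strong_classifier(weak_outputs, alphas):
    """H(x) = sign(sum_m alpha_m * y_m(x)).
    weak_outputs: shape (M, n_samples), values in {-1, +1}; alphas: shape (M,)."""
    weighted_sum = alphas @ weak_outputs
    return np.where(weighted_sum >= 0, 1, -1)   # treating a sum of exactly 0 as +1
```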

Determination of $y_m(x)$ and $\alpha_m$

First, since equal weights are given to all the data, the weight coefficients $\{w_n\}$ of the data $n = 1, 2, 3, ..., N$ are initialized to $w_n^{(1)} = \frac{1}{N}$.

The subscript of $w$ is the index of the data point, and the superscript is the index of the weak learner. In other words, for the first weak learner, every data point has weight $\frac{1}{N}$.

Repeat the following operations for each weak learner $m = 1, 2, 3, ..., M$. First, fit a weak learner $y_m(x)$ to the training data by minimizing the weighted error function $J_m$:

J_m = \sum_{n=1}^{N}w_n^{(m)}I(y_m(x_n)\neq t_n)

$w_n^{(m)}$ is the weight of the nth data point for the mth weak learner.

$I$ is the indicator function: it returns 1 when $y_m(x_n) \neq t_n$, that is, when the predicted value and the correct label of a data point do not match, and returns 0 when $y_m(x_n) = t_n$, that is, when they do match.

In other words, $J_m$ is the sum of the weights of the data points that the weak learner $m$ misclassifies.
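As a tiny illustrative sketch, $J_m$ in code is just the total weight of the misclassified points (array names are my own):

```python
import numpy as np

def weighted_error_J(w, y_pred, t):
    """J_m = sum_n w_n * I(y_m(x_n) != t_n)."""
    misclassified = (y_pred != t)        # the indicator I(...) as a boolean array
    return np.sum(w[misclassified])
```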

\epsilon_m = \frac{\sum_{n=1}^{N}w_n^{(m)}I(y_m(x_n)\neq t_n)}{\sum_{n=1}^{N}w_n^{(m)}}
\alpha_m = \ln \bigl(\frac{1-\epsilon_m}{\epsilon_m} \bigr)

The denominator of $\epsilon_m$ is the sum of the weights of all data points for the weak learner $m$, and the numerator is the sum of the weights of the data points that this weak learner mispredicts.

In other words, $\epsilon_m$ is the error rate expressed in terms of the weights (a weighted error rate).
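Translated into a small sketch (the clipping of $\epsilon_m$ is only a numerical safeguard I added, not part of the formula):

```python
import numpy as np

def epsilon_and_alpha(w, y_pred, t):
    """epsilon_m = weighted error rate, alpha_m = ln((1 - epsilon_m) / epsilon_m)."""
    misclassified = (y_pred != t)
    eps = np.sum(w[misclassified]) / np.sum(w)   # weighted fraction of mistakes
    eps = np.clip(eps, 1e-10, 1 - 1e-10)         # avoid division by zero / log of zero
    alpha = np.log((1 - eps) / eps)              # > 0 if eps < 0.5, < 0 if eps > 0.5
    return eps, alpha
```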

It is illustrated below.

image.png

Recall that the final output was given by the following formula.

H(x) = sign(\sum_{m=1}^{M}\alpha_m y_m(x))

In other words, a learner with a high error rate ($\epsilon_m$ greater than 0.5) gets a negative weight $\alpha_m$, a learner with a low error rate (small $\epsilon_m$) gets a positive weight, and the weighted outputs are summed. If the sum is positive, the final output is $H(x) = 1$; if the sum is negative, the final output is $H(x) = -1$.

Update the weights of the data points with the following formula.

w_n^{(m + 1)} = w_n^{(m)}\exp \bigl(\alpha_m I(y_m(x_n) \neq t_n) \bigr)

If the prediction of weak learner $m$ for the $n$th data point matches the correct label, the weight is left unchanged; only when the prediction and the correct label differ is the weight multiplied by $\exp(\alpha_m)$.
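A one-line sketch of this update: since $\exp(0) = 1$, correctly classified points keep their weight, and misclassified points are multiplied by $\exp(\alpha_m)$.

```python
import numpy as np

def update_weights(w, alpha, y_pred, t):
    """w_n^{(m+1)} = w_n^{(m)} * exp(alpha_m * I(y_m(x_n) != t_n))."""
    misclassified = (y_pred != t)
    return w * np.exp(alpha * misclassified)   # exp(0) = 1 leaves correct points unchanged
```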

The relationship between $\alpha_m$ and $\exp(\alpha_m)$ is illustrated below.

image.png

For a learner with a high error rate $\epsilon_m$, $\alpha_m$ is small (or even negative), so the updated weights $w^{(m+1)}$ of the points it misclassifies stay small.

For a learner with a low error rate $\epsilon_m$, that is, an accurate learner, $\alpha_m$ is large, so the updated weights $w^{(m+1)}$ of the points it misclassifies become large.

In summary, an accurate learner increases the weights of the data points it gets wrong before passing them on to the next learner, while an inaccurate learner decreases the weights of the data points it gets wrong before passing them on.

The above is the theory of AdaBoost.
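Putting the pieces above together, here is a minimal training sketch. It assumes labels in {-1, +1} and uses scikit-learn decision stumps as the weak learners; the per-round normalization of the weights is a common convention that is not shown in the formulas above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, t, M=50):
    """Minimal AdaBoost sketch. t is assumed to take values in {-1, +1}."""
    N = len(X)
    w = np.ones(N) / N                               # w_n^{(1)} = 1/N
    learners, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, t, sample_weight=w)             # weak learner y_m trained on weighted data
        pred = stump.predict(X)
        misclassified = (pred != t)
        eps = np.clip(np.sum(w[misclassified]) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - eps) / eps)              # alpha_m
        w = w * np.exp(alpha * misclassified)        # emphasize the misclassified points
        w = w / np.sum(w)                            # normalization (a common convention)
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_predict(X, learners, alphas):
    """H(x) = sign(sum_m alpha_m * y_m(x))."""
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.where(scores >= 0, 1, -1)
```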

At the end

Thank you for reading this far.

If I feel like it, I would also like to summarize the implementation.
