Understand the AdaBoost formula and try it with scikit-learn.
It is assumed that you have already learned calculus.
AdaBoost is a boosting algorithm, a type of ensemble learning that learns multiple weak classifiers step by step.
In boosting, each weak classifier is trained based on the results of the previous one, so that the next weak classifier can focus on the data that the previous weak classifier handled poorly.
Whereas bagging trains multiple weak classifiers independently in parallel, boosting trains multiple weak classifiers in series.
A typical boosting algorithm is Adaptive Boosting (AdaBoost). AdaBoost assigns small weights to correctly classified training samples and large weights to misclassified ones, so that the later weak classifiers concentrate on the samples that are hard to classify.
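As a quick preview of trying this with scikit-learn before going into the formulas, the sketch below uses `AdaBoostClassifier` on a toy dataset; the dataset and settings are arbitrary and are not the experiment described later in this article.

```python
# Quick preview: AdaBoost with scikit-learn on a toy dataset (arbitrary settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 weak classifiers (depth-1 decision trees by default) are trained sequentially.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```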
Here, let the number of training samples be $N$, the number of weak classifiers be $M$, the training data be $x = (x_1, ..., x_N)$, and the target data be $t = (t_1, ..., t_N)$ with $t_i \in \{-1, 1\}$. Let the weight of the $i$-th training sample be $w^m_i \ (i = 1, ..., N,\ m = 1, ..., M)$, and let the weak classifiers be $y_m(x) \in \{-1, 1\}$.
AdaBoost sequentially minimizes the exponential error function $E$ in the formula below, from $m = 1$ to $M$.
E = \sum^N_{i=1} \exp \left( -t_i f_m(x_i) \right)
Here $f_m(x)$ is defined as a linear combination of the weak classifiers $y_j(x)$ with coefficients $\alpha_j$, as in the formula below.
f_m(x) = \sum^m_{j=1} \alpha_j y_j(x)
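As a concrete illustration with made-up numbers (not data from this article), the snippet below evaluates $f_m(x)$ as a weighted sum of weak classifier outputs and then the exponential error $E$; the samples with $t_i f_m(x_i) < 0$ are the ones that dominate the error.

```python
# Made-up illustration of the exponential error E = sum_i exp(-t_i * f_m(x_i)),
# where f_m is a weighted sum of weak classifier outputs.
import numpy as np

t = np.array([1, -1, 1, -1])           # true labels t_i in {-1, +1}
alphas = np.array([0.7, 0.4])          # assumed coefficients alpha_1, alpha_2
# outputs of the two weak classifiers for the four samples (rows: classifiers)
y = np.array([[1, -1, -1, -1],
              [1, -1,  1,  1]])

f = alphas @ y                         # f_2(x_i) = sum_j alpha_j y_j(x_i)
E = np.sum(np.exp(-t * f))
print(f, E)                            # samples with t_i * f(x_i) < 0 contribute most
```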
The goal of AdaBoost training is to find the coefficients $\alpha_m$ and the training-data weights $w^m_i$ that minimize the exponential error function $E$.
Suppose the coefficients $\alpha_1, ..., \alpha_{m-1}$ and the weak classifiers $y_1(x), ..., y_{m-1}(x)$ up to the $(m-1)$-th step have already been determined, and consider the $m$-th coefficient and weak classifier. Then $E$ can be rewritten as
\begin{align}
E &= \sum^N_{i=1} \exp \left( -t_i f_m(x_i) \right) \\
&= \sum^N_{i=1} \exp \left( -t_i f_{m-1}(x_i) - t_i \alpha_m y_m(x_i) \right) \\
&= \sum^N_{i=1} w^m_i \exp \left( - t_i \alpha_m y_m(x_i) \right)
\end{align}
Here $w^m_i = \exp \left( -t_i f_{m-1}(x_i) \right)$, which can be treated as a constant during the sequential optimization. If the weak classifier $y_m(x)$ classifies a sample correctly, then $y_m(x_i) = t_i$ and $t_i y_m(x_i) = 1$; if it misclassifies it, then $y_m(x_i) \neq t_i$ and $t_i y_m(x_i) = -1$. Therefore,
\begin{align}
E &= \sum_{y_m(x_i) = t_i} w^m_i \exp(-\alpha_m) + \sum_{y_m(x_i) \neq t_i} w^m_i \exp(\alpha_m) \\
&= \sum^N_{i=1} w^m_i \exp (-\alpha_m) + \sum_{y_m(x_i) \neq t_i} w^m_i \left\{ \exp(\alpha_m) - \exp(-\alpha_m) \right\}
\end{align}
Putting $A = \sum_{y_m(x_i) \neq t_i} w^m_i$ and $B = \sum^N_{i=1} w^m_i$, this becomes
E = \left\{ \exp(\alpha_m) - \exp(-\alpha_m) \right\} A + \exp(-\alpha_m) B
To find the coefficient $\alpha_m$ that minimizes $E$, differentiate with respect to $\alpha_m$ and set the derivative to 0:
\frac{\partial E}{\partial \alpha_m} = \left\{ \exp(\alpha_m) + \exp(-\alpha_m) \right\} A - \exp(-\alpha_m) B = 0
so, dividing both sides by $\exp(-\alpha_m) A$ and rearranging,
\exp(2 \alpha_m) = \frac{B}{A} - 1
Therefore,
\begin{align}
\alpha_m &= \frac{1}{2} \ln \left( \frac{B}{A} - 1 \right) = \frac{1}{2} \ln \frac{1 - A/B}{A/B} \\
&= \frac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m}
\end{align}
where
\epsilon_m = \frac{A}{B} = \frac{\sum_{y_m(x_i) \neq t_i} w^m_i}{\sum^N_{i=1} w^m_i}
and $\epsilon_m$ is called the weighted error rate of the weak classifier.
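For example, with a hypothetical weighted error rate of $\epsilon_m = 0.2$, the coefficient is $\alpha_m = \frac{1}{2} \ln \frac{0.8}{0.2} = \frac{1}{2} \ln 4 \approx 0.69$, whereas a weak classifier that is barely better than random guessing ($\epsilon_m$ close to $0.5$) receives a coefficient close to $0$.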
On the other hand, the following formula is used to update the weight of the $i$-th training sample.
w^{m+1}_i = w^m_i \exp \left( - t_i \alpha_m y_m(x_i) \right)
Here, let $I(y_m(x_i) \neq t_i)$ be the indicator function that takes the value 0 when the sample is classified correctly and 1 when it is misclassified. Then
t_i y_m(x_i) = 1 - 2 I(y_m(x_i) \neq t_i)
So
\begin{align}
w^{m+1}_i &= w^m_i \exp \left( - \alpha_m + 2 \alpha_m I(y_m(x_i) \neq t_i) \right) \\
&= w^m_i \exp(-\alpha_m) \exp \left( 2 \alpha_m I( y_m(x_i) \neq t_i) \right)
\end{align}
Since $\exp(-\alpha_m)$ is common to all samples, it can be absorbed into the normalization and ignored, so the update rule for the training-data weights $w^m_i$ becomes
w^{m+1}_i = w^m_i \exp \left( 2 \alpha_m I( y_m(x_i) \neq t_i) \right)
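As a small numerical illustration (made-up labels, predictions, and an assumed $\alpha_m$), this update leaves the weights of correctly classified samples unchanged before normalization and enlarges those of misclassified samples.

```python
# Made-up illustration of the weight update
# w_i^{m+1} = w_i^m * exp(2 * alpha_m * I(y_m(x_i) != t_i)).
import numpy as np

t = np.array([1, 1, -1, -1, 1])        # true labels
y_m = np.array([1, -1, -1, 1, 1])      # predictions of the m-th weak classifier
w = np.full(5, 1 / 5)                  # current weights (uniform here)
alpha_m = 0.69                         # assumed coefficient (e.g. from eps_m = 0.2)

w_next = w * np.exp(2 * alpha_m * (y_m != t))
w_next = w_next / w_next.sum()         # normalization absorbs the dropped exp(-alpha_m)
print(w_next)                          # misclassified samples (indices 1 and 3) now carry more weight
```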
The learning algorithm of AdaBoost is as follows.
First, initialize the training-data weights $w^1_i$ as follows.
w^1_i = \frac{1}{N}
Then repeat the following steps for $m = 1$ to $M$ (the number of weak classifiers):

1. Train a weak classifier $y_m(x)$ so as to minimize the weighted error rate $\epsilon_m$.
2. Compute the coefficient $\alpha_m = \frac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m}$.
3. Update the training-data weights with $w^{m+1}_i = w^m_i \exp \left( 2 \alpha_m I(y_m(x_i) \neq t_i) \right)$.
Finally, when classifying new data $x$, use the sign function ${\rm sign}$ and classify with the formula below.
\hat{t} = {\rm sign} \left( \sum^M_{m=1} \alpha_m y_m (x) \right)
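A minimal NumPy sketch of this whole procedure is shown below, using simple threshold classifiers (decision stumps) as the weak classifiers. All names and settings here are illustrative; this is not the `adaboost.py` published in the repository.

```python
# Minimal AdaBoost training loop following the formulas above.
# Weak classifiers are decision stumps: a threshold on a single feature.
import numpy as np

def fit_stump(X, t, w):
    """Exhaustive search for the stump (feature, threshold, sign) with the
    smallest weighted error under the current sample weights w."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w * (pred != t)) / np.sum(w)  # weighted error rate eps_m
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best, best_err

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] > thr, sign, -sign)

def adaboost_fit(X, t, M=20):
    N = len(t)
    w = np.full(N, 1.0 / N)                      # initialize weights w^1_i = 1/N
    stumps, alphas = [], []
    for m in range(M):
        stump, eps = fit_stump(X, t, w)          # 1. weak classifier minimizing eps_m
        eps = np.clip(eps, 1e-10, 1 - 1e-10)     # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)    # 2. coefficient alpha_m
        pred = stump_predict(stump, X)
        w = w * np.exp(2 * alpha * (pred != t))  # 3. up-weight misclassified samples
        w = w / w.sum()                          # normalization
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Final classifier: sign of the weighted vote of all weak classifiers.
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

# Tiny usage example on synthetic data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost_fit(X, t, M=20)
print("training accuracy:", (adaboost_predict(stumps, alphas, X) == t).mean())
```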
Since AdaBoost is originally a binary classification model, $K-1$ binary classifiers must be prepared for $K$-class multiclass classification.
The execution environment is as follows.

- CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
- Windows 10 Pro 1909
- Python 3.6.6
- Matplotlib 3.3.1
- Numpy 1.19.2
- Scikit-learn 0.23.2
The implemented program is published on GitHub.
adaboost.py
Here is the result of applying AdaBoost to the breast cancer dataset as before.
| Metric | Value |
| --- | --- |
| Accuracy | 97.37% |
| Precision, Positive predictive value (PPV) | 97.06% |
| Recall, Sensitivity, True positive rate (TPR) | 98.51% |
| Specificity, True negative rate (TNR) | 95.74% |
| Negative predictive value (NPV) | 97.83% |
| F-Score | 97.78% |
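These numbers were produced by the script in the repository. As a rough sketch of what an equivalent experiment might look like with scikit-learn (not the repository's `adaboost.py`; the exact numbers depend on the train/test split and settings):

```python
# Rough sketch of an equivalent experiment with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy ", accuracy_score(y_test, y_pred))
print("Precision", precision_score(y_test, y_pred))
print("Recall   ", recall_score(y_test, y_pred))
print("F-Score  ", f1_score(y_test, y_pred))
```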
The figure below shows the decision boundaries obtained when performing multiclass classification on the iris dataset.
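The figure itself is produced by the repository code; a sketch of how a similar decision-boundary plot could be drawn with scikit-learn is shown below (illustrative only, using just the first two iris features so the boundary can be drawn in two dimensions).

```python
# Illustrative sketch: decision boundaries of AdaBoost on the first two iris
# features, drawn by evaluating the trained model on a grid of points.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target             # sepal length and sepal width only

clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)               # filled regions show the boundaries
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()
```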
For the regression problem, the data is a sine wave with added random noise.
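A sketch of how such regression data could be generated and fitted with scikit-learn's `AdaBoostRegressor` is shown below; this is an assumed setup, not the repository's script.

```python
# Illustrative sketch: AdaBoost regression on a sine wave plus random noise.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((200, 1)), axis=0)        # inputs in [0, 5)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)   # sine wave plus random noise

reg = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X, y)

X_plot = np.linspace(0, 5, 500).reshape(-1, 1)
plt.scatter(X, y, s=10, label="training data")
plt.plot(X_plot, reg.predict(X_plot), color="red", label="AdaBoost prediction")
plt.legend()
plt.show()
```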