Understand the AdaBoost formula and try it with scikit-learn.
It is assumed that you have already learned calculus.
AdaBoost is a boosting algorithm, a type of ensemble learning that learns multiple weak classifiers step by step.
In boosting, each weak classifier is trained based on the results of the previous one, so that the next weak classifier can focus on the data that the previous weak classifier handled poorly.
Whereas bagging trains multiple weak classifiers independently in parallel, boosting trains multiple weak classifiers in series.
A typical boosting algorithm is Adaptive Boosting (AdaBoost). AdaBoost assigns small weights to correctly classified training samples and large weights to misclassified ones, so that the later weak classifiers concentrate on the samples that are hard to classify.
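As a quick preview of trying this with scikit-learn before going into the formulas, the sketch below uses `AdaBoostClassifier` on a toy dataset; the dataset and settings are arbitrary and are not the experiment described later in this article.

```python
# Quick preview: AdaBoost with scikit-learn on a toy dataset (arbitrary settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 weak classifiers (depth-1 decision trees by default) are trained sequentially.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```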
Here, let the number of training samples be $N$, the number of weak classifiers be $M$, the training data be $x = (x_1, ..., x_N)$, and the target data be $t = (t_1, ..., t_N)$ with $t_i \in \{-1, 1\}$. Let the weight of the $i$-th training sample be $w^m_i \ (i = 1, ..., N,\ m = 1, ..., M)$, and let the weak classifiers be $y_m(x) \in \{-1, 1\}$.
AdaBoost sequentially minimizes the exponential error function $E$ in the formula below, from $m = 1$ to $M$.
E = \sum^N_{i=1} \exp \left( -t_i f_m(x_i) \right)
Here $f_m(x)$ is defined as a linear combination of the weak classifiers $y_j(x)$ with coefficients $\alpha_j$, as in the formula below.
f_m(x) = \sum^m_{j=1} \alpha_j y_j(x)
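As a concrete illustration with made-up numbers (not data from this article), the snippet below evaluates $f_m(x)$ as a weighted sum of weak classifier outputs and then the exponential error $E$; the samples with $t_i f_m(x_i) < 0$ are the ones that dominate the error.

```python
# Made-up illustration of the exponential error E = sum_i exp(-t_i * f_m(x_i)),
# where f_m is a weighted sum of weak classifier outputs.
import numpy as np

t = np.array([1, -1, 1, -1])           # true labels t_i in {-1, +1}
alphas = np.array([0.7, 0.4])          # assumed coefficients alpha_1, alpha_2
# outputs of the two weak classifiers for the four samples (rows: classifiers)
y = np.array([[1, -1, -1, -1],
              [1, -1,  1,  1]])

f = alphas @ y                         # f_2(x_i) = sum_j alpha_j y_j(x_i)
E = np.sum(np.exp(-t * f))
print(f, E)                            # samples with t_i * f(x_i) < 0 contribute most
```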
The goal of AdaBoost training is to find the coefficients $\alpha_m$ and the training-data weights $w^m_i$ that minimize the exponential error function $E$.
Suppose the coefficients $\alpha_1, ..., \alpha_{m-1}$ and the weak classifiers $y_1(x), ..., y_{m-1}(x)$ up to the $(m-1)$-th step have already been determined, and consider the $m$-th coefficient and weak classifier. Then $E$ can be rewritten as
\begin{align}
E &= \sum^N_{i=1} \exp \left( -t_i f_m(x_i) \right) \\
&= \sum^N_{i=1} \exp \left( -t_i f_{m-1}(x_i) - t_i \alpha_m y_m(x_i) \right) \\
&= \sum^N_{i=1} w^m_i \exp \left( - t_i \alpha_m y_m(x_i) \right)
\end{align}
Here $w^m_i = \exp \left( -t_i f_{m-1}(x_i) \right)$, which can be treated as a constant during the sequential optimization. If the weak classifier $y_m(x)$ classifies a sample correctly, then $y_m(x_i) = t_i$ and $t_i y_m(x_i) = 1$; if it misclassifies it, then $y_m(x_i) \neq t_i$ and $t_i y_m(x_i) = -1$. Therefore,
\begin{align}
E &= \sum_{y_m(x_i) = t_i} w^m_i \exp(-\alpha_m) + \sum_{y_m(x_i) \neq t_i} w^m_i \exp(\alpha_m) \\
&= \sum^N_{i=1} w^m_i \exp (-\alpha_m) + \sum_{y_m(x_i) \neq t_i} w^m_i \left\{ \exp(\alpha_m) - \exp(-\alpha_m) \right\}
\end{align}
Putting $A = \sum_{y_m(x_i) \neq t_i} w^m_i$ and $B = \sum^N_{i=1} w^m_i$, this becomes
E = \left\{ \exp(\alpha_m) - \exp(-\alpha_m) \right\} A + \exp(-\alpha_m) B
To find the coefficient $\alpha_m$ that minimizes $E$, differentiate with respect to $\alpha_m$ and set the derivative to 0:
\frac{\partial E}{\partial \alpha_m} = \left\{ \exp(\alpha_m) + \exp(-\alpha_m) \right\} A - \exp(-\alpha_m) B = 0
so, dividing both sides by $\exp(-\alpha_m) A$ and rearranging,
\exp(2 \alpha_m) = \frac{B}{A} - 1
Therefore,
\begin{align}
\alpha_m &= \frac{1}{2} \ln \left( \frac{B}{A} - 1 \right) = \frac{1}{2} \ln \frac{1 - A/B}{A/B} \\
&= \frac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m}
\end{align}
where
\epsilon_m = \frac{A}{B} = \frac{\sum_{y_m(x_i) \neq t_i} w^m_i}{\sum^N_{i=1} w^m_i}
and $\epsilon_m$ is called the weighted error rate of the weak classifier.
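For example, with a hypothetical weighted error rate of $\epsilon_m = 0.2$, the coefficient is $\alpha_m = \frac{1}{2} \ln \frac{0.8}{0.2} = \frac{1}{2} \ln 4 \approx 0.69$, whereas a weak classifier that is barely better than random guessing ($\epsilon_m$ close to $0.5$) receives a coefficient close to $0$.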
On the other hand, the following formula is used to update the weight of the $i$-th training sample.
w^{m+1}_i = w^m_i \exp \left( - t_i \alpha_m y_m(x_i) \right)
Here, let $I(y_m(x_i) \neq t_i)$ be the indicator function that takes the value 0 when the sample is classified correctly and 1 when it is misclassified. Then
t_i y_m(x_i) = 1 - 2 I(y_m(x_i) \neq t_i)
So
\begin{align}
w^{m+1}_i &= w^m_i \exp \left( - \alpha_m + 2 \alpha_m I(y_m(x_i) \neq t_i) \right) \\
&= w^m_i \exp(-\alpha_m) \exp \left( 2 \alpha_m I( y_m(x_i) \neq t_i) \right)
\end{align}
Since $\exp(-\alpha_m)$ is common to all samples, it can be absorbed into the normalization and ignored, so the update rule for the training-data weights $w^m_i$ becomes
w^{m+1}_i = w^m_i \exp \left( 2 \alpha_m I( y_m(x_i) \neq t_i) \right)
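As a small numerical illustration (made-up labels, predictions, and an assumed $\alpha_m$), this update leaves the weights of correctly classified samples unchanged before normalization and enlarges those of misclassified samples.

```python
# Made-up illustration of the weight update
# w_i^{m+1} = w_i^m * exp(2 * alpha_m * I(y_m(x_i) != t_i)).
import numpy as np

t = np.array([1, 1, -1, -1, 1])        # true labels
y_m = np.array([1, -1, -1, 1, 1])      # predictions of the m-th weak classifier
w = np.full(5, 1 / 5)                  # current weights (uniform here)
alpha_m = 0.69                         # assumed coefficient (e.g. from eps_m = 0.2)

w_next = w * np.exp(2 * alpha_m * (y_m != t))
w_next = w_next / w_next.sum()         # normalization absorbs the dropped exp(-alpha_m)
print(w_next)                          # misclassified samples (indices 1 and 3) now carry more weight
```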
The learning algorithm of AdaBoost is as follows.
First, initialize the training-data weights $w^1_i$ as follows.
w^1_i = \frac{1}{N}
Then repeat the following steps for $m = 1$ to $M$ (the number of weak classifiers):

1. Train a weak classifier $y_m(x)$ so as to minimize the weighted error rate $\epsilon_m$.
2. Compute the coefficient $\alpha_m = \frac{1}{2} \ln \frac{1 - \epsilon_m}{\epsilon_m}$.
3. Update the training-data weights with $w^{m+1}_i = w^m_i \exp \left( 2 \alpha_m I(y_m(x_i) \neq t_i) \right)$.
Finally, when classifying new data $x$, use the sign function ${\rm sign}$ and classify with the formula below.
\hat{t} = {\rm sign} \left( \sum^M_{m=1} \alpha_m y_m (x) \right)
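A minimal NumPy sketch of this whole procedure is shown below, using simple threshold classifiers (decision stumps) as the weak classifiers. All names and settings here are illustrative; this is not the `adaboost.py` published in the repository.

```python
# Minimal AdaBoost training loop following the formulas above.
# Weak classifiers are decision stumps: a threshold on a single feature.
import numpy as np

def fit_stump(X, t, w):
    """Exhaustive search for the stump (feature, threshold, sign) with the
    smallest weighted error under the current sample weights w."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w * (pred != t)) / np.sum(w)  # weighted error rate eps_m
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best, best_err

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] > thr, sign, -sign)

def adaboost_fit(X, t, M=20):
    N = len(t)
    w = np.full(N, 1.0 / N)                      # initialize weights w^1_i = 1/N
    stumps, alphas = [], []
    for m in range(M):
        stump, eps = fit_stump(X, t, w)          # 1. weak classifier minimizing eps_m
        eps = np.clip(eps, 1e-10, 1 - 1e-10)     # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)    # 2. coefficient alpha_m
        pred = stump_predict(stump, X)
        w = w * np.exp(2 * alpha * (pred != t))  # 3. up-weight misclassified samples
        w = w / w.sum()                          # normalization
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Final classifier: sign of the weighted vote of all weak classifiers.
    votes = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

# Tiny usage example on synthetic data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost_fit(X, t, M=20)
print("training accuracy:", (adaboost_predict(stumps, alphas, X) == t).mean())
```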
Since AdaBoost is originally a binary classification model, $K-1$ binary classifiers must be prepared for $K$-class multiclass classification.
The execution environment is as follows.

- CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
- Windows 10 Pro 1909
- Python 3.6.6
- Matplotlib 3.3.1
- Numpy 1.19.2
- Scikit-learn 0.23.2
The implemented program is published on GitHub.
adaboost.py
Here is the result of applying AdaBoost to the breast cancer dataset as before.
| Metric | Value |
| --- | --- |
| Accuracy | 97.37% |
| Precision, Positive predictive value (PPV) | 97.06% |
| Recall, Sensitivity, True positive rate (TPR) | 98.51% |
| Specificity, True negative rate (TNR) | 95.74% |
| Negative predictive value (NPV) | 97.83% |
| F-Score | 97.78% |
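These numbers were produced by the script in the repository. As a rough sketch of what an equivalent experiment might look like with scikit-learn (not the repository's `adaboost.py`; the exact numbers depend on the train/test split and settings):

```python
# Rough sketch of an equivalent experiment with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy ", accuracy_score(y_test, y_pred))
print("Precision", precision_score(y_test, y_pred))
print("Recall   ", recall_score(y_test, y_pred))
print("F-Score  ", f1_score(y_test, y_pred))
```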
The figure below shows the decision boundaries obtained when performing multiclass classification on the iris dataset.
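The figure itself is produced by the repository code; a sketch of how a similar decision-boundary plot could be drawn with scikit-learn is shown below (illustrative only, using just the first two iris features so the boundary can be drawn in two dimensions).

```python
# Illustrative sketch: decision boundaries of AdaBoost on the first two iris
# features, drawn by evaluating the trained model on a grid of points.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target             # sepal length and sepal width only

clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)               # filled regions show the boundaries
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.show()
```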
For the regression problem, the data is a sine wave with added random noise.
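A sketch of how such regression data could be generated and fitted with scikit-learn's `AdaBoostRegressor` is shown below; this is an assumed setup, not the repository's script.

```python
# Illustrative sketch: AdaBoost regression on a sine wave plus random noise.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
X = np.sort(5 * rng.random((200, 1)), axis=0)        # inputs in [0, 5)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)   # sine wave plus random noise

reg = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X, y)

X_plot = np.linspace(0, 5, 500).reshape(-1, 1)
plt.scatter(X, y, s=10, label="training data")
plt.plot(X_plot, reg.predict(X_plot), color="red", label="AdaBoost prediction")
plt.legend()
plt.show()
```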