Boosting algorithms such as XGBoost come up all the time on Kaggle. Here I go back to the original boosting idea and try to understand AdaBoost by implementing it. My implementation is based on the following article by @amber_kshz.
Implemented PRML Chapter 14 Bagging and AdaBoost in Python https://qiita.com/amber_kshz/items/8869d8ef7f9743b16d8d
**In particular, it taught me how to carry out the computation efficiently and quickly with arrays.** I would like to summarize how that is done.
The main points are as follows.
Combining simple learners, called weak learners, is known as **ensemble learning**. Typical ensemble methods can roughly be divided into four types.
Among ensemble methods, an algorithm that trains weak learners one after another, each round correcting the previous one, is called **boosting**. The aim is to improve accuracy by paying extra attention to the misclassified samples when training the next learner.
The figure below shows the flow. $M$ is the number of rounds, which increases from the upper left to the lower right. The goal is to separate the blue and red points; the misclassified samples are drawn as **larger circles**. You can see how the green decision line moves to try to separate the misclassified samples.
A related method is **bagging**, in which the results of independently trained learners are averaged. The difference is that in bagging the learners are independent of one another, whereas in boosting each learner makes use of the results of the earlier ones.
Now, let's understand the AdaBoost algorithm. AdaBoost is short for Adaptive Boosting and was published in 1997. Its idea is that each new predictor pays extra attention to the samples that the previous predictor misclassified.
A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting https://www.sciencedirect.com/science/article/pii/S002200009791504X
The specific algorithm is as follows.
**Premise**

**Algorithm**
⇒ Here, $I\left[y(x_n, \theta_m) \neq t_n\right]$ is **$0$ if the prediction matches the target (= correct) and $1$ if it is wrong. Therefore, the more correct answers there are, the smaller $J_m$ becomes** (i.e., it plays the role of an error function).
(b) Calculate $\epsilon_m$ and $\alpha_m$ from the error of the base classifier. $\epsilon_m$ is the weighted error rate, and $\alpha_m$ is the coefficient that will be reflected in the weight update.
(c) Use the $\alpha_m$ found in the previous step to update the weights $w$.
⇒ Again, $I$ is $0$ when the prediction matches the target and $1$ when it is wrong, so only the weights of the misclassified samples are increased according to $\alpha_m$ (see the sketch below).
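To make steps (a) to (c) concrete, here is a minimal NumPy sketch of the training loop (my own illustration, not the class from the referenced repository; it assumes labels in $\{-1, +1\}$ and a weak learner with `fit(X, y, sample_weight)` / `predict(X)` methods):

```python
import numpy as np

def adaboost_fit(X, y, make_weak_learner, M):
    """Minimal AdaBoost sketch; y must take values in {-1, +1}."""
    N = len(X)
    w = np.ones(N) / N                                # (a) start from uniform weights
    learners, alphas = [], []
    for m in range(M):
        clf = make_weak_learner()
        clf.fit(X, y, sample_weight=w)                # train base classifier on weighted data
        miss = (clf.predict(X) != y).astype(int)      # I[y_m(x_n) != t_n]
        eps = np.sum(w * miss) / np.sum(w)            # (b) weighted error rate epsilon_m
        alpha = np.log((1.0 - eps) / eps)             # (b) alpha_m = ln((1 - eps_m) / eps_m), assumes 0 < eps < 1
        w = w * np.exp(alpha * miss)                  # (c) boost weights of misclassified samples only
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # Final output: sign of the alpha-weighted vote of all base classifiers
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(votes)
```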
Now, I would like to actually move on to implementation.
This time, a decision stump is used as the weak learner. A decision stump **picks one input variable of the data, sets a threshold on its value, and classifies according to that threshold.** In short, it is a decision tree of depth 1. The program below defines the decision stump class.
adaboost.ipynb
def fit_onedim(self, X, y, sample_weight, axis):
    N = len(X)

    # Here we sort everything according to the axis-th coordinate of X
    sort_ind = np.argsort(X[:, axis])
    sorted_label = y[sort_ind]
    sorted_input = X[sort_ind]
    sorted_sample_weight = sample_weight[sort_ind]

    # Row i of pred is a candidate classifier that predicts -1 for the
    # (i+1) smallest samples along this axis and +1 for the rest
    pred = -2*np.tri(N-1, N, k=0, dtype='int') + 1
    mistakes = (pred != sorted_label).astype('int')

    # The (weighted) error is calculated for each candidate classifier,
    # for both orientations (row 0: pred as-is, row 1: pred flipped)
    errs = np.zeros((2, N-1))
    errs[0] = mistakes @ sorted_sample_weight
    errs[1] = (1 - mistakes) @ sorted_sample_weight

    # Here, we select the best threshold and sign (minimum weighted error)
    ind = np.unravel_index(np.argmin(errs, axis=None), errs.shape)
    sign = -2*ind[0] + 1
    threshold = (sorted_input[ind[1], axis] + sorted_input[ind[1] + 1, axis]) / 2
    err = errs[ind]
    return sign, threshold, err
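For reference, a prediction along one axis can then be made from the returned sign and threshold roughly as follows (my own sketch; the actual predict method in the repository may be written differently):

```python
import numpy as np

def predict_onedim(X, sign, threshold, axis):
    # Samples with X[:, axis] > threshold get label +sign, the rest get -sign,
    # matching the orientation chosen in fit_onedim.
    return np.where(X[:, axis] > threshold, sign, -sign)
```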
This time, as shown below, we use the make_moons dataset. Consider the case with $100$ samples as in the image below: $50$ points with label $1$ and $50$ points with label $-1$.
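For reference, such a dataset can be generated with scikit-learn like this (a minimal sketch; the noise level and random seed are my assumptions, not necessarily the values used in the original notebook):

```python
import numpy as np
from sklearn.datasets import make_moons

# 100 samples; make_moons returns 0/1 labels, which we map to -1/+1
X, y01 = make_moons(n_samples=100, noise=0.2, random_state=0)
y = 2*y01 - 1   # 50 samples with label +1 and 50 with label -1
```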
This dataset is stored in $X = (x_1, x_2)$. For convenience, we first obtain the indices that sort the data in ascending order with np.argsort.
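np.argsort returns the indices that would sort an array, so the labels and sample weights can be reordered consistently with the inputs:

```python
import numpy as np

x = np.array([0.3, -1.2, 0.7, 0.0])
sort_ind = np.argsort(x)     # indices that sort x in ascending order
print(sort_ind)              # [1 3 0 2]
print(x[sort_ind])           # [-1.2  0.   0.3  0.7]
```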
Next, np.tri (which generates a triangular matrix of ones) is used to build the matrix pred holding the tentative predictions. Comparing pred with the sorted labels then gives the mistakes matrix. As a result, 99 candidate classifiers, one for each threshold between adjacent sorted points, are created together with their error patterns.
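To see how this works, here is a small toy example with $N = 5$ (the labels are made up purely for illustration):

```python
import numpy as np

N = 5
# Row i predicts -1 for the (i+1) smallest samples and +1 for the rest
pred = -2*np.tri(N-1, N, k=0, dtype='int') + 1
print(pred)
# [[-1  1  1  1  1]
#  [-1 -1  1  1  1]
#  [-1 -1 -1  1  1]
#  [-1 -1 -1 -1  1]]

sorted_label = np.array([-1, -1, 1, 1, 1])      # example labels after sorting
mistakes = (pred != sorted_label).astype('int') # 1 where the prediction is wrong
print(mistakes)
# [[0 1 0 0 0]
#  [0 0 0 0 0]
#  [0 0 1 0 0]
#  [0 0 1 1 0]]
```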
Finally, this mistakes matrix is multiplied by the sample weights to obtain the loss (the weighted error) of each candidate classifier.

The decision boundary is the midpoint of $x_1^{(m)}$ and $x_1^{(m+1)}$, the pair of adjacent sorted values at which this loss is minimized. In the case of AdaBoost, the weights of the misclassified samples are also updated at this point.
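Continuing the same $N = 5$ toy example, the best sign and the midpoint threshold are then picked out as follows (the feature values and the uniform weights are again made up for illustration):

```python
import numpy as np

N = 5
pred = -2*np.tri(N-1, N, k=0, dtype='int') + 1
sorted_label = np.array([-1, -1, 1, 1, 1])
mistakes = (pred != sorted_label).astype('int')

sorted_sample_weight = np.full(N, 1.0/N)          # uniform weights, for illustration
errs = np.zeros((2, N-1))
errs[0] = mistakes @ sorted_sample_weight         # pred used as-is (sign = +1)
errs[1] = (1 - mistakes) @ sorted_sample_weight   # pred flipped (sign = -1)

ind = np.unravel_index(np.argmin(errs, axis=None), errs.shape)
sign = -2*ind[0] + 1                              # +1 if pred is used as-is, -1 if flipped
sorted_input = np.array([-1.2, -0.5, 0.1, 0.8, 1.5])   # example sorted feature values
threshold = (sorted_input[ind[1]] + sorted_input[ind[1] + 1]) / 2
print(sign, threshold)   # best orientation and the midpoint between two adjacent sorted values
```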
Here is one of the tricks I found particularly nice. Whether a prediction is correct or not comes out as a truth value (bool type). In the error function above, a correct prediction should count as $0$ and a wrong one as $1$, and converting the bool comparison to integers gives exactly that.
sample.ipynb
import numpy as np

a = np.array([[1,1,1,1],[1,1,1,1],[1,1,0,1]])
b = np.array([1,1,1,0])
c1 = (a != b)
c2 = (a != b).astype('int')
print(c1)
print(c2)
[[False False False True]
[False False False True]
[False False True True]]
[[0 0 0 1]
[0 0 0 1]
[0 0 1 1]]
This is a simple example: let $a$ be the predicted values and $b$ the correct labels. You can see that the program returns $0$ for False (a correct prediction) and $1$ for True (a wrong one).
Now let's actually run it, increasing the number of rounds from $1$ to $9$.
adaboost.ipynb
num_clf_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
clfs = [AdaBoost(DecisionStump, num_clfs=M) for M in num_clf_list]

fig, axes = plt.subplots(3, 3, figsize=(15, 15))
for i in range(len(clfs)):
    clfs[i].fit(X, y)
    plot_result(ax=np.ravel(axes)[i], clf=clfs[i], xx=xx, yy=yy, X=X, y=y)
    np.ravel(axes)[i].set_title(f'M = {num_clf_list[i]}')
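The variables xx and yy here are the coordinates of the grid on which the decision boundary is drawn; they are not defined in this snippet. A typical way to build such a grid (an assumption on my part, not necessarily the exact code in the notebook) is:

```python
import numpy as np

# Evaluation grid covering the data range, used when drawing the decision boundary
x1_min, x1_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
x2_min, x2_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                     np.arange(x2_min, x2_max, 0.02))
```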
You can see that the decision boundary gradually separates the two classes and the model becomes a more accurate classifier. A single decision stump, as is clear from its definition, splits the data with a straight (axis-aligned) line, so on its own it cannot produce a non-linear boundary.
This time, I deepened my understanding of AdaBoost by implementing it. Although the algorithm itself is easy to understand, I found that implementing it requires a solid grasp of array computations with numpy.
From here, I would like to keep improving my analytical skills by studying the algorithms that are frequently used on Kaggle, such as XGBoost.
The full program is here. https://github.com/Fumio-eisan/adaboost_20200427