Boosting algorithms such as XGBoost come up all the time on Kaggle. Here I go back to the original boosting idea and try to understand AdaBoost by implementing it. My implementation is based on the following article by @amber_kshz.
Implemented PRML Chapter 14 Bagging and AdaBoost in Python https://qiita.com/amber_kshz/items/8869d8ef7f9743b16d8d
**In particular, it taught me how to carry out the computation efficiently and quickly with arrays.** I would like to summarize how that is done.
The main points are as follows.
Combining simple learners, called weak learners, is known as **ensemble learning**. Typical ensemble methods can roughly be divided into four types.
Among ensemble methods, an algorithm that trains weak learners one after another, each round correcting the previous one, is called **boosting**. The aim is to improve accuracy by paying extra attention to the misclassified samples when training the next learner.
The figure below shows the flow. $M$ is the number of rounds, which increases from the upper left to the lower right. The goal is to separate the blue and red points; the misclassified samples are drawn as **larger circles**. You can see how the green decision line moves to try to separate the misclassified samples.
A related method is **bagging**, in which the results of independently trained learners are averaged. The difference is that in bagging the learners are independent of one another, whereas in boosting each learner makes use of the results of the earlier ones.
Now, let's understand the AdaBoost algorithm. AdaBoost is short for Adaptive Boosting and was published in 1997. Its idea is that each new predictor pays extra attention to the samples that the previous predictor misclassified.
A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting https://www.sciencedirect.com/science/article/pii/S002200009791504X
The specific algorithm is as follows.
**Premise**

**Algorithm**
⇒ Here, $I\left[y(x_n, \theta_m) \neq t_n\right]$ is **$0$ if the prediction matches the target (= correct) and $1$ if it is wrong. Therefore, the more correct answers there are, the smaller $J_m$ becomes** (i.e., it plays the role of an error function).
(b) Calculate $\epsilon_m$ and $\alpha_m$ from the error of the base classifier. $\epsilon_m$ is the weighted error rate, and $\alpha_m$ is the coefficient that will be reflected in the weight update.
(c) Use the $\alpha_m$ found in the previous step to update the weights $w$.
⇒ Again, $I$ is $0$ when the prediction matches the target and $1$ when it is wrong, so only the weights of the misclassified samples are increased according to $\alpha_m$ (see the sketch below).
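To make steps (a) to (c) concrete, here is a minimal NumPy sketch of the training loop (my own illustration, not the class from the referenced repository; it assumes labels in $\{-1, +1\}$ and a weak learner with `fit(X, y, sample_weight)` / `predict(X)` methods):

```python
import numpy as np

def adaboost_fit(X, y, make_weak_learner, M):
    """Minimal AdaBoost sketch; y must take values in {-1, +1}."""
    N = len(X)
    w = np.ones(N) / N                                # (a) start from uniform weights
    learners, alphas = [], []
    for m in range(M):
        clf = make_weak_learner()
        clf.fit(X, y, sample_weight=w)                # train base classifier on weighted data
        miss = (clf.predict(X) != y).astype(int)      # I[y_m(x_n) != t_n]
        eps = np.sum(w * miss) / np.sum(w)            # (b) weighted error rate epsilon_m
        alpha = np.log((1.0 - eps) / eps)             # (b) alpha_m = ln((1 - eps_m) / eps_m), assumes 0 < eps < 1
        w = w * np.exp(alpha * miss)                  # (c) boost weights of misclassified samples only
        learners.append(clf)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # Final output: sign of the alpha-weighted vote of all base classifiers
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(votes)
```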
Now, I would like to actually move on to implementation.
This time, a decision stump is used as the weak learner. A decision stump **picks one input variable of the data, sets a threshold on its value, and classifies according to that threshold.** In short, it is a decision tree of depth 1. The program below defines the decision stump class.
adaboost.ipynb
def fit_onedim(self, X, y, sample_weight, axis):
    N = len(X)

    # Here we sort everything according to the axis-th coordinate of X
    sort_ind = np.argsort(X[:, axis])
    sorted_label = y[sort_ind]
    sorted_input = X[sort_ind]
    sorted_sample_weight = sample_weight[sort_ind]

    # Row i of pred is a candidate classifier that predicts -1 for the
    # (i+1) smallest samples along this axis and +1 for the rest
    pred = -2*np.tri(N-1, N, k=0, dtype='int') + 1
    mistakes = (pred != sorted_label).astype('int')

    # The (weighted) error is calculated for each candidate classifier,
    # for both orientations (row 0: pred as-is, row 1: pred flipped)
    errs = np.zeros((2, N-1))
    errs[0] = mistakes @ sorted_sample_weight
    errs[1] = (1 - mistakes) @ sorted_sample_weight

    # Here, we select the best threshold and sign (minimum weighted error)
    ind = np.unravel_index(np.argmin(errs, axis=None), errs.shape)
    sign = -2*ind[0] + 1
    threshold = (sorted_input[ind[1], axis] + sorted_input[ind[1] + 1, axis]) / 2
    err = errs[ind]
    return sign, threshold, err
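For reference, a prediction along one axis can then be made from the returned sign and threshold roughly as follows (my own sketch; the actual predict method in the repository may be written differently):

```python
import numpy as np

def predict_onedim(X, sign, threshold, axis):
    # Samples with X[:, axis] > threshold get label +sign, the rest get -sign,
    # matching the orientation chosen in fit_onedim.
    return np.where(X[:, axis] > threshold, sign, -sign)
```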
This time, as shown below, we use the make_moons dataset. Consider the case with $100$ samples as in the image below: $50$ points with label $1$ and $50$ points with label $-1$.
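For reference, such a dataset can be generated with scikit-learn like this (a minimal sketch; the noise level and random seed are my assumptions, not necessarily the values used in the original notebook):

```python
import numpy as np
from sklearn.datasets import make_moons

# 100 samples; make_moons returns 0/1 labels, which we map to -1/+1
X, y01 = make_moons(n_samples=100, noise=0.2, random_state=0)
y = 2*y01 - 1   # 50 samples with label +1 and 50 with label -1
```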
This dataset is stored in $X = (x_1, x_2)$. For convenience, we first obtain the indices that sort the data in ascending order with np.argsort.
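np.argsort returns the indices that would sort an array, so the labels and sample weights can be reordered consistently with the inputs:

```python
import numpy as np

x = np.array([0.3, -1.2, 0.7, 0.0])
sort_ind = np.argsort(x)     # indices that sort x in ascending order
print(sort_ind)              # [1 3 0 2]
print(x[sort_ind])           # [-1.2  0.   0.3  0.7]
```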
Next, np.tri (which generates a triangular matrix of ones) is used to build the matrix pred holding the tentative predictions. Comparing pred with the sorted labels then gives the mistakes matrix. As a result, 99 candidate classifiers, one for each threshold between adjacent sorted points, are created together with their error patterns.
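To see how this works, here is a small toy example with $N = 5$ (the labels are made up purely for illustration):

```python
import numpy as np

N = 5
# Row i predicts -1 for the (i+1) smallest samples and +1 for the rest
pred = -2*np.tri(N-1, N, k=0, dtype='int') + 1
print(pred)
# [[-1  1  1  1  1]
#  [-1 -1  1  1  1]
#  [-1 -1 -1  1  1]
#  [-1 -1 -1 -1  1]]

sorted_label = np.array([-1, -1, 1, 1, 1])      # example labels after sorting
mistakes = (pred != sorted_label).astype('int') # 1 where the prediction is wrong
print(mistakes)
# [[0 1 0 0 0]
#  [0 0 0 0 0]
#  [0 0 1 0 0]
#  [0 0 1 1 0]]
```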
Finally, this mistakes matrix is multiplied by the sample weights to obtain the loss (the weighted error) of each candidate classifier.

The decision boundary is the midpoint of $x_1^{(m)}$ and $x_1^{(m+1)}$, the pair of adjacent sorted values at which this loss is minimized. In the case of AdaBoost, the weights of the misclassified samples are also updated at this point.
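Continuing the same $N = 5$ toy example, the best sign and the midpoint threshold are then picked out as follows (the feature values and the uniform weights are again made up for illustration):

```python
import numpy as np

N = 5
pred = -2*np.tri(N-1, N, k=0, dtype='int') + 1
sorted_label = np.array([-1, -1, 1, 1, 1])
mistakes = (pred != sorted_label).astype('int')

sorted_sample_weight = np.full(N, 1.0/N)          # uniform weights, for illustration
errs = np.zeros((2, N-1))
errs[0] = mistakes @ sorted_sample_weight         # pred used as-is (sign = +1)
errs[1] = (1 - mistakes) @ sorted_sample_weight   # pred flipped (sign = -1)

ind = np.unravel_index(np.argmin(errs, axis=None), errs.shape)
sign = -2*ind[0] + 1                              # +1 if pred is used as-is, -1 if flipped
sorted_input = np.array([-1.2, -0.5, 0.1, 0.8, 1.5])   # example sorted feature values
threshold = (sorted_input[ind[1]] + sorted_input[ind[1] + 1]) / 2
print(sign, threshold)   # best orientation and the midpoint between two adjacent sorted values
```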
Here is one of the tricks I found particularly nice. Whether a prediction is correct or not comes out as a truth value (bool type). In the error function above, a correct prediction should count as $0$ and a wrong one as $1$, and converting the bool comparison to integers gives exactly that.
sample.ipynb
import numpy as np

a = np.array([[1,1,1,1],[1,1,1,1],[1,1,0,1]])
b = np.array([1,1,1,0])
c1 = (a != b)
c2 = (a != b).astype('int')
print(c1)
print(c2)
[[False False False True]
[False False False True]
[False False True True]]
[[0 0 0 1]
[0 0 0 1]
[0 0 1 1]]
This is a simple example: let $a$ be the predicted values and $b$ the correct labels. You can see that the program returns $0$ for False (a correct prediction) and $1$ for True (a wrong one).
Now let's actually run it, increasing the number of rounds from $1$ to $9$.
adaboost.ipynb
num_clf_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
clfs = [AdaBoost(DecisionStump, num_clfs=M) for M in num_clf_list]

fig, axes = plt.subplots(3, 3, figsize=(15, 15))
for i in range(len(clfs)):
    clfs[i].fit(X, y)
    plot_result(ax=np.ravel(axes)[i], clf=clfs[i], xx=xx, yy=yy, X=X, y=y)
    np.ravel(axes)[i].set_title(f'M = {num_clf_list[i]}')
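The variables xx and yy here are the coordinates of the grid on which the decision boundary is drawn; they are not defined in this snippet. A typical way to build such a grid (an assumption on my part, not necessarily the exact code in the notebook) is:

```python
import numpy as np

# Evaluation grid covering the data range, used when drawing the decision boundary
x1_min, x1_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
x2_min, x2_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x1_min, x1_max, 0.02),
                     np.arange(x2_min, x2_max, 0.02))
```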
You can see that the decision boundary gradually separates the two classes and the model becomes a more accurate classifier. A single decision stump, as is clear from its definition, splits the data with a straight (axis-aligned) line, so on its own it cannot produce a non-linear boundary.
This time, I deepened my understanding of AdaBoost by implementing it. Although the algorithm itself is easy to understand, I found that implementing it requires a solid grasp of array computations with numpy.
From here, I would like to keep improving my analytical skills by studying the algorithms that are frequently used on Kaggle, such as XGBoost.
The full program is here. https://github.com/Fumio-eisan/adaboost_20200427