Ensemble learning is an important technique in machine learning. In this article, I explain it in a way that makes it easy to understand and review.
Ensemble learning is a method of creating a single learning model by combining multiple models (learners). It was born from the idea that combining multiple models should give better prediction accuracy than training just one model.
The goal is to minimize the error between the predicted values and the actual values.
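As a toy sketch of why combining helps (the numbers below are made up purely for illustration), notice that when two models' errors point in opposite directions, averaging cancels them out:
python.py
import numpy as np

y_true = np.array([100.0, 150.0, 200.0])   # made-up actual values
pred_a = np.array([ 90.0, 160.0, 210.0])   # hypothetical model A
pred_b = np.array([110.0, 140.0, 190.0])   # hypothetical model B

ensemble = (pred_a + pred_b) / 2           # simple averaging

print(np.mean((y_true - pred_a) ** 2))     # MSE of model A: 100.0
print(np.mean((y_true - pred_b) ** 2))     # MSE of model B: 100.0
print(np.mean((y_true - ensemble) ** 2))   # MSE of the average: 0.0
Real models' errors rarely cancel this perfectly, but the less correlated the errors, the more averaging helps.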
The important keywords for actually checking prediction accuracy are "bias" and "variance"!
○ Bias: the average error between the actual values and the predicted values. ・ If the value is small, the prediction accuracy is high. (But a very small bias may be a sign of overfitting.) ・ If the value is large, the prediction accuracy is low.
○ Variance: a value that indicates how scattered the predicted values are. ・ If the value is small, the predictions are stable. ・ If the value is large, the predictions are scattered.
Bias and variance are in a trade-off relationship!
・ If you push for very high prediction accuracy (low bias), overfitting is likely to occur and the variance grows. ・ If the predicted values are scattered (high variance), the prediction accuracy is low.
It is important to adjust the balance between these two!
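As a rough illustration (the synthetic data and model settings below are my own assumptions, not part of this article's dataset), you can observe both quantities by training the same model on many bootstrap resamples and inspecting its predictions at a single point:
python.py
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)  # noisy target

x0 = np.array([[5.0]])    # the point where we inspect predictions
true_y0 = np.sin(5.0)     # the noise-free answer at that point

preds = []
for _ in range(100):      # 100 bootstrap resamples
    idx = rng.integers(0, len(X), len(X))
    model = DecisionTreeRegressor(max_depth=2)
    model.fit(X[idx], y[idx])
    preds.append(model.predict(x0)[0])

preds = np.array(preds)
print("bias^2  :", (preds.mean() - true_y0) ** 2)  # systematic error
print("variance:", preds.var())                    # scatter across resamples
Increasing max_depth typically drives the bias term down and the variance term up, which is the trade-off in action.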
Bagging extracts different datasets (the bootstrap method) and uses them to build multiple different models (weak learners). The average of these models' predictions is then used as the final model (a minimal sketch follows after this block).
○ Features ・ Variance can be reduced. ・ Training time is short thanks to parallel processing. (The bootstrap method extracts multiple datasets, which can all be trained on at the same time.)
○ Representative model: Random forest
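Here is the hand-rolled sketch promised above (the function name and parameters are mine, and it assumes NumPy arrays; in practice, sklearn's RandomForestRegressor or BaggingRegressor does this for you):
python.py
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X_train, y_train, X_test, n_models=10, seed=0):
    rng = np.random.default_rng(seed)
    all_preds = []
    for _ in range(n_models):
        # Bootstrap: resample the training data with replacement
        idx = rng.integers(0, len(X_train), len(X_train))
        model = DecisionTreeRegressor()
        model.fit(X_train[idx], y_train[idx])
        all_preds.append(model.predict(X_test))
    # Final model = average of the weak learners' predictions
    return np.mean(all_preds, axis=0)
Because each of the n_models trees depends only on its own bootstrap sample, the loop body could run in parallel, which is why bagging trains quickly.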
Boosting trains on the same data repeatedly, with each new model built to correct the errors of the previous ones, producing a more accurate model (a minimal sketch follows after this block).
○ Features ・ Bias can be reduced. (Better accuracy than bagging can be expected.) ・ Training time is long because the processing is serial. (Models are built repeatedly, each one improving on the previous model's results.)
○ Representative models: XGBoost / LightGBM
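Here is the sketch promised above, for the squared-error case (again, the function name and parameters are mine for illustration; XGBoost and LightGBM implement a far more sophisticated version of this idea):
python.py
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_predict(X_train, y_train, X_test, n_rounds=50, lr=0.1):
    pred_train = np.full(len(y_train), y_train.mean())  # start from the mean
    pred_test = np.full(len(X_test), y_train.mean())
    for _ in range(n_rounds):
        residual = y_train - pred_train      # the errors left so far
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X_train, residual)          # learn to correct those errors
        pred_train += lr * tree.predict(X_train)
        pred_test += lr * tree.predict(X_test)
    return pred_test
Each round depends on the previous round's output, which is why boosting cannot be parallelized the way bagging can.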
Stacking creates a model by combining multiple models.
Specifically, the flow here is that the values predicted by multiple regression analysis, random forest, and LightGBM are used as features for another multiple regression analysis.
In other words, the three predicted values from the three models become the input values for the multiple regression analysis.
You may wonder which models to combine, but it is common to combine decision-tree-based models (random forest, XGBoost, etc.) with regression-based models (multiple regression analysis). (By combining models from different families, each may pick up patterns the others cannot discover on their own.)
○ Features ・ Prediction accuracy improves. (Basically better accuracy than any single model.) ・ The results become harder to interpret and analyze. ・ Training time becomes longer.
Data preprocessing is omitted here. (Since the preprocessing is kept simple, your results may differ.)
The flow is: the predicted values of multiple regression analysis, random forest, and LightGBM are fed into another multiple regression analysis, which produces the final predicted values. (This second model is the metamodel.)
python.py
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Data reading
df = pd.read_csv('train.tsv', delimiter='\t')
df = df.drop(['id', 'dteday', 'yr', 'atemp'], axis=1)
### Preprocess the data here if necessary.

# Explanatory variables
X = df.drop('cnt', axis=1)
# Objective variable
y = df['cnt']

# Split the data three ways: train / validation / test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
print(X_train.shape)
print(X_valid.shape)
print(X_test.shape)
print(y_train.shape)
print(y_valid.shape)
print(y_test.shape)
(5532, 8)
(1384, 8)
(1729, 8)
(5532,)
(1384,)
(1729,)
python.py
from sklearn.linear_model import LinearRegression  # Multiple regression analysis
from sklearn.ensemble import RandomForestRegressor  # Random forest
import lightgbm as lgb  # LightGBM
# Evaluation metric (mean squared error)
from sklearn.metrics import mean_squared_error

# Model instances
model_1 = LinearRegression()
model_2 = RandomForestRegressor()
model_3 = lgb.LGBMRegressor()
# Model training
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)

# Create predicted values on the test data
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)
# Check each model's prediction accuracy with the mean squared error
print("Multiple regression analysis mean squared error: {:.4f}".format(mean_squared_error(y_test, pred_1)))
print("Random forest mean squared error: {:.4f}".format(mean_squared_error(y_test, pred_2)))
print("LightGBM mean squared error: {:.4f}".format(mean_squared_error(y_test, pred_3)))
Multiple regression analysis mean squared error: 6825.7104
Random forest mean squared error: 4419.4774
LightGBM mean squared error: 4043.2921
python.py
# First-stage predictions (on the validation data)
first_pred_1 = model_1.predict(X_valid)
first_pred_2 = model_2.predict(X_valid)
first_pred_3 = model_3.predict(X_valid)

# Stack the first-stage predictions side by side (the metamodel's features)
stack_pred = np.column_stack((first_pred_1, first_pred_2, first_pred_3))

# Create the metamodel
meta_model = LinearRegression()
# The answers for the first-stage predictions are y_valid
meta_model.fit(stack_pred, y_valid)

# Check the stacking accuracy using the test-set predictions made earlier
stack_test_pred = np.column_stack((pred_1, pred_2, pred_3))
meta_test_pred = meta_model.predict(stack_test_pred)
print("Metamodel mean squared error: {:.4f}".format(mean_squared_error(y_test, meta_test_pred)))
Metamodel mean squared error: 4030.9495
The prediction accuracy is slightly better than any of the models alone.
○ Points for improvement ・ Tune each model's hyperparameters. ・ Change or add models in the first stage. ・ Keep the same model types but add versions with different parameters, e.g. Random Forest with n_estimators = 50, n_estimators = 100, n_estimators = 1000, etc. (a sketch follows below)
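The third idea might look something like this (the n_estimators values are just examples, and whether they actually help depends on the data):
python.py
# First-stage variants: same model types, different parameters
models = [
    LinearRegression(),
    RandomForestRegressor(n_estimators=50),
    RandomForestRegressor(n_estimators=100),
    RandomForestRegressor(n_estimators=1000),
    lgb.LGBMRegressor(),
]
# Train every variant and predict on the validation data
first_stage = [m.fit(X_train, y_train).predict(X_valid) for m in models]
stack_pred = np.column_stack(first_stage)
# The metamodel is then trained on stack_pred exactly as before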