- **Ensemble learning** is a technique that **combines multiple learners** to obtain better predictions. In most cases it gives better results than any single model alone.
- Specifically, the predicted values of multiple predictors are combined by, for example, **taking the average** or **taking a majority vote**.
- **Boosting** and **random forest**, which have attracted attention in data analysis in recent years, are also types of ensemble learning.
Simply saying "combine multiple learners (predictors)" is not enough: combining models trained on **the same data** with **the same algorithm** is meaningless.
That said, there is usually only one training dataset available.
This is where a technique called the **bootstrap** comes in.
- **Bootstrap**: sample n data points at random, **with replacement**, from the training data.
Generate N bootstrap datasets of size n from the training data in this way.
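A minimal sketch of bootstrap sampling with NumPy (the toy data and the sizes n, N here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))           # toy training data
y = rng.integers(0, 2, size=100)

n, N = len(X), 10                       # bootstrap sample size and number of datasets
bootstrap_sets = []
for _ in range(N):
    idx = rng.integers(0, n, size=n)    # n indices drawn with replacement
    bootstrap_sets.append((X[idx], y[idx]))
```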
When N prediction models are trained on these datasets, let the prediction of the n-th model be $y_n(x)$.
The final prediction of the bagging model is then the simple average $y(x) = \frac{1}{N}\sum_{n=1}^{N} y_n(x)$.
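As a minimal sketch of this averaging (regression case; the base learner and toy data are illustrative, and in practice something like scikit-learn's `BaggingRegressor` would be used):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

N = 20
models = []
for _ in range(N):
    idx = rng.integers(0, len(X), size=len(X))            # bootstrap dataset
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# bagging: the final prediction is the simple average of the N models' predictions
y_pred = np.mean([m.predict(X) for m in models], axis=0)
print(np.mean((y - y_pred) ** 2))                          # training mean squared error
```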
- In the bagging above, we took a simple average of the N predictions. In other words, all predictions are treated equally, and **the importance of each model cannot be taken into account**.
- In stacking, a **weighted average** of the predictions is used as the final prediction, so the importance of each model is taken into account.
The final prediction is therefore $y(x) = \sum_{n=1}^{N} w_n y_n(x)$, where the weight $w_n$ expresses the importance of the n-th model.
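One simple way to obtain the weights $w_n$ is to regress the target on the base models' predictions. A minimal sketch, where the base learners, hold-out split, and toy data are all illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
models = []
for _ in range(5):                                    # N = 5 base learners on bootstrap datasets
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    models.append(DecisionTreeRegressor(max_depth=3).fit(X_tr[idx], y_tr[idx]))

# each column holds one base model's predictions on held-out data
P_val = np.column_stack([m.predict(X_val) for m in models])
stacker = LinearRegression().fit(P_val, y_val)        # its coefficients play the role of the weights w_n
print(stacker.coef_)
```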
- **Bumping** is a method for **searching for the single best model among multiple predictors**.
- Generate N models from the bootstrap datasets, apply each of them to the original training data, and select the one with the smallest prediction error as the best model.
- At first glance this seems to offer no advantage over bagging or stacking, but **when poor-quality data points lead to an undesirable solution, a better solution may be obtained from a bootstrap dataset that happens to exclude those points**.
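A minimal sketch of this procedure (the classifier, dataset, and number of bootstrap rounds are illustrative): fit one model per bootstrap dataset and keep the one that scores best on the original training data.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
rng = np.random.default_rng(0)

best_model, best_score = None, -np.inf
for _ in range(20):                                   # N = 20 bootstrap datasets
    idx = rng.integers(0, len(X), size=len(X))
    model = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
    score = model.score(X, y)                         # evaluate on the *original* training data
    if score > best_score:
        best_model, best_score = model, score
print(best_score)
```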
- Random forest is a method that uses a **decision tree** as the base learner for the bagging described above. The specific algorithm is as follows.
  1. Extract N bootstrap datasets from the training data.
  2. Use these datasets to generate N trees $T_n$; when growing each tree, only m features are randomly selected from the p features as split candidates.
  3. Take the **average** in the case of regression and the **majority vote** in the case of classification, and use it as the final prediction.
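A from-scratch sketch of these three steps for classification (the dataset and the choices of N and m are illustrative; scikit-learn's `RandomForestClassifier`, used later in this article, does all of this internally):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
rng = np.random.default_rng(0)
N, m = 50, 1                                          # number of trees; features tried per split (p = 2 here)

trees = []
for _ in range(N):
    idx = rng.integers(0, len(X), size=len(X))        # (1) bootstrap dataset
    tree = DecisionTreeClassifier(max_features=m, random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))            # (2) tree grown with m random split candidates

votes = np.array([t.predict(X) for t in trees])       # (3) majority vote over the N trees
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print((y_pred == y).mean())
```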
- The basic idea of bagging is to reduce the error by combining multiple models with **large variance and small bias**.
  - (1) Large variance / small bias → complex models (decision tree, nearest neighbor method)
  - (2) Small variance / large bias → simple models (linear regression)
- As a base learner for bagging, the decision tree is ideal because it has large variance and small bias **(its tendency to overfit can be corrected by averaging over multiple models)**.
- Decision trees have other merits as well: they are **fast**, they work **regardless of the variable's data type**, and they are **invariant to feature scaling**.
- In ensemble learning, **the lower the correlation between the models, the higher the accuracy of the final prediction**.
  → There is no point in collecting many similar models; performance is higher when models trained on different data are combined.
- In random forests, in addition to bootstrapping, the correlation between models is lowered by changing the features used for training in each model.
-"Boosting" is one of the ensemble learning methods.
--Train the base learner ** sequentially **. (Generate the next learner based on the previous learner) Techniques such as bagging and stacking ultimately combine multiple base learners to produce predictive values. (There is no relationship between the learners before and after)
-** Two methods called "AdaBoost" ** and ** "gradient boosting" ** are typical.
-The algorithm realized by the library ** "Xgboost" **, which is quite popular in ** Kaggle **.
- **AdaBoost** uses a **weighted dataset** when training each learner.
- At first, all data points are given equal weight.
- Data points that the previous learner **misclassified** are then given a **larger weight**.
- The predictions of the base learners are finally combined to give the final prediction.
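A minimal sketch using scikit-learn's `AdaBoostClassifier` on the moons dataset used later in this article (the hyperparameters are illustrative; by default the base learner is a depth-1 decision tree):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each boosting round reweights the training points that earlier learners misclassified
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(ada.score(X_test, y_test))
```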
- **Gradient boosting** fits each new learner to the **residuals** of the previous learner.
- A **decision tree** is often used as the base learner.
- This is the algorithm implemented by **XGBoost**.
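To illustrate the residual-fitting idea, here is a hand-rolled sketch for regression with three boosting rounds (in practice `xgboost` or scikit-learn's `GradientBoostingRegressor` would be used; the toy data and tree depth are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

trees, residual = [], y.copy()
for _ in range(3):                                    # three boosting rounds
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    residual = residual - tree.predict(X)             # the next tree fits what is still unexplained
    trees.append(tree)

y_pred = np.sum([t.predict(X) for t in trees], axis=0)   # final prediction = sum of all the trees
print(np.mean((y - y_pred) ** 2))                         # training mean squared error
```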
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
moons=make_moons(n_samples=200,noise=0.2,random_state=0)
X=moons[0]
y=moons[1]
from matplotlib.colors import ListedColormap
def plot_decision_boundary(model,X,y):
    # evaluate the model on a 100x100 grid covering the data range and shade the predicted class
    _x1 = np.linspace(X[:,0].min()-0.5,X[:,0].max()+0.5,100)
    _x2 = np.linspace(X[:,1].min()-0.5,X[:,1].max()+0.5,100)
    x1,x2 = np.meshgrid(_x1,_x2)
    X_new=np.c_[x1.ravel(),x2.ravel()]
    y_pred=model.predict(X_new).reshape(x1.shape)
    custom_cmap=ListedColormap(["mediumblue","orangered"])
    plt.contourf(x1,x2,y_pred,cmap=custom_cmap,alpha=0.3)
def plot_dataset(X,y):
    # plot class 0 as blue circles and class 1 as red triangles
    plt.plot(X[:,0][y==0],X[:,1][y==0],"bo",ms=15)
    plt.plot(X[:,0][y==1],X[:,1][y==1],"r^",ms=15)
    plt.xlabel("$x_1$",fontsize=30)
    plt.ylabel("$x_2$",fontsize=30,rotation=0)
plt.figure(figsize=(12,8))
plot_dataset(X,y)
plt.show()
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier # decision tree (CART) from scikit-learn
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)
tree_clf=DecisionTreeClassifier().fit(X_train,y_train) # by default the tree depth has no upper limit
plt.figure(figsize=(12,8))
plot_decision_boundary(tree_clf,X,y)
plot_dataset(X,y)
plt.show()
from sklearn.ensemble import RandomForestClassifier
random_forest=RandomForestClassifier(n_estimators=100,random_state=0).fit(X_train,y_train)
# n_estimators specifies the number of decision trees used for bagging (the default was 10 in older scikit-learn versions)
plt.figure(figsize=(12,8))
plot_decision_boundary(random_forest,X,y)
plot_dataset(X,y)
plt.show()
from sklearn.datasets import load_iris
iris=load_iris()
X_iris=iris.data
y_iris=iris.target
random_forest_iris=RandomForestClassifier(random_state=0).fit(X_iris,y_iris)
# feature_importances_ shows how important each feature is
print(random_forest_iris.feature_importances_)
plt.figure(figsize=(12,8))
plt.barh(range(iris.data.shape[1]),random_forest_iris.feature_importances_,height=0.5)
plt.yticks(range(iris.data.shape[1]),iris.feature_names,fontsize=20)
plt.xlabel("Feature importance",fontsize=30)
plt.show()
The dataset used was Kaggle's Titanic. https://www.kaggle.com/c/titanic
import pandas as pd
df=pd.read_csv("train.csv")
df["Age"]=df["Age"].fillna(df["Age"].mean()) # fill missing ages with the mean
df["Embarked"]=df["Embarked"].fillna(df["Embarked"].mode()[0]) # fill missing embarkation ports with the mode
from sklearn.preprocessing import LabelEncoder
cat_features=["Sex","Embarked"]
for col in cat_features:
    lbl = LabelEncoder()
    df[col]=lbl.fit_transform(list(df[col].values))
X=df.drop(columns=["PassengerId","Survived","Name","Ticket","Cabin"])
y=df["Survived"]
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=0)
tree=DecisionTreeClassifier().fit(X_train,y_train)
print(tree.score(X_test,y_test))
rnd_forest=RandomForestClassifier(n_estimators=500,max_depth=5,random_state=0).fit(X_train,y_train)
print(rnd_forest.score(X_test,y_test))
#Submission form
test_df=pd.read_csv("test.csv")
test_df["Age"]=test_df["Age"].fillna(test_df["Age"].mean())
test_df["Fare"]=test_df["Fare"].fillna(test_df["Fare"].mean())
test_df["Embarked"]=test_df["Embarked"].fillna(test_df["Embarked"].mode()[0])#Mode
for col in cat_features:
    # note: refitting LabelEncoder on the test data assumes the same categories appear in train.csv and test.csv
    lbl = LabelEncoder()
    test_df[col]=lbl.fit_transform(list(test_df[col].values))
X_pred=test_df.drop(columns=["PassengerId","Name","Ticket","Cabin"])
ID=test_df["PassengerId"]
prediction=rnd_forest.predict(X_pred)
submission=pd.DataFrame({
    "PassengerId":ID,
    "Survived":prediction
})
submission.to_csv("submission.csv",index=False)