Hello! My name is Nike! This is sudden, but I've finally started Twitter! I can't share anything useful yet, but I hope to learn from the knowledge and experience of more experienced people!
Well then, thank you again this time!
Intermediate Machine Learning digs deeper into machine learning.
~ Flow of Intermediate Machine Learning ~
This time we're covering part 6!
XGBoost stands for eXtreme Gradient Boosting; the underlying technique is gradient boosting. XGBoost is an implementation of gradient boosting with greater accuracy and speed. Scikit-learn has its own gradient boosting implementation, but XGBoost seems to have technical advantages. We'll dig a little deeper into XGBoost.
- Python: Try using XGBoost (reference site)
- What is a gradient boosting decision tree (reference site)
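As a quick aside (a sketch of my own, not part of the lesson): scikit-learn's counterpart is GradientBoostingRegressor, and the snippet below fits it and XGBRegressor on synthetic data just to show that both expose the same fit/predict interface; the data and numbers here are made up purely for illustration.

# Side-by-side sketch (illustration only): scikit-learn's gradient boosting vs. XGBoost
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Synthetic regression data, used only for this comparison
X_demo, y_demo = make_regression(n_samples=1000, n_features=10, noise=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

for model in (GradientBoostingRegressor(random_state=0), XGBRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, mean_absolute_error(y_va, model.predict(X_va)))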
I've already learned a technique called Random Forest, which is categorized as an "ensemble method". An ensemble method combines the predictions of multiple models. Random forest is an ensemble technique because it integrates the predictions of multiple decision trees. (* There is also the term "ensemble learning", but I couldn't find a clear difference between the two. *)
There are three types of ensemble learning:
- Bagging
- Boosting
- Stacking
And gradient boosting is also one of ensemble learning.
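As a rough mapping of my own (not from the Kaggle material), scikit-learn ships a representative estimator for each of the three families, which helps keep the jargon straight:

# One representative scikit-learn estimator per ensemble family (illustration only)
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

bagging = BaggingRegressor(n_estimators=50)             # bagging: average many trees fit on bootstrap samples
boosting = GradientBoostingRegressor(n_estimators=100)  # boosting: add small models one after another
stacking = StackingRegressor(                           # stacking: a meta-model combines base-model predictions
    estimators=[("tree", DecisionTreeRegressor()), ("ridge", Ridge())],
    final_estimator=Ridge(),
)
# All three follow the usual fit(X, y) / predict(X) interface.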
First, boosting is a method of building up a whole (an ensemble) by repeatedly adding models according to a fixed procedure. Initially there is only a single immature model, but it is improved more and more by the models added later.
And gradient boosting is a technique that "reframes learning as the problem of minimizing a loss function and uses gradient information to find the direction that reduces the loss." About Gradient Boosting --Preparation-- (reference site)
(Image quoted from Kaggle)
First, initialize the ensemble with a single naive model. Then enter the cycle below.
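To make the cycle concrete, here is a rough hand-rolled sketch of my own (not from the Kaggle material). It assumes squared-error loss, where the gradient information is simply the residual, so each pass of the cycle fits a small tree to the current residuals and adds it to the ensemble scaled by a learning rate; the data is synthetic and only for illustration.

# Hand-rolled gradient boosting sketch (illustration only, squared-error loss)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

learning_rate = 0.1
prediction = np.full(len(y_demo), y_demo.mean())  # initialize with a naive model (just the mean)
trees = []

for _ in range(100):                              # the cycle, repeated n_estimators times
    residual = y_demo - prediction                # gradient information: the direction that reduces the loss
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, residual)
    prediction += learning_rate * tree.predict(X_demo)  # add the new model's contribution to the ensemble
    trees.append(tree)

print("training MSE:", np.mean((y_demo - prediction) ** 2))

XGBoost follows this same basic cycle, just with a more general loss function, regularization, and a much faster implementation.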
The data we use is the same as before and is available here. It will be split into three parts: training, validation, and test data.
First of all, preparation.
import pandas as pd
from sklearn.model_selection import train_test_split
# Data reading
X = pd.read_csv('train.csv', index_col='Id')
X_test_full = pd.read_csv('test.csv', index_col='Id')
# Exclude rows where the objective variable is missing and separate the objective variable from the data
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)
# Separate verification data and learning data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
random_state=0)
# "Cardinality" represents the number of unique values in the column
# Extract columns of category data with low cardinality
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and
X_train_full[cname].dtype == "object"]
# Extract numerical data
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
# Combine the extracted numeric columns and categorical columns
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
# Perform One-Hot encoding (pandas allows you to write shorter code than before)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
We are doing the same thing here as with Random Forest. The flow is: define the model → fit the model → predict → evaluate.
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
# Define model
my_model_1 = XGBRegressor(random_state=0)
# Fit model
my_model_1.fit(X_train, y_train)
# Predict
predictions_1 = my_model_1.predict(X_valid)
# Verification: Calculate MAE
mae_1 = mean_absolute_error(predictions_1, y_valid)
print("Mean Absolute Error:" , mae_1)
Execution result
Mean Absolute Error: 17662.736729452055
Here is the real thrill of XGBoost. XGBoost has a terrifying number of parameters, and tuning them improves performance. (* Of course, other models can also be improved by tuning their parameters. *) Below are the typical parameters that affect the model's performance.
- n_estimators ... The number of times the above cycle is repeated. If it is too small the model underfits, and if it is too large it overfits. I usually pass something around 100-1000, but the right value is closely tied to the learning_rate that appears below.
- learning_rate ... Determines the weight given to each model added during learning. The default is 0.1.
- eval_metric ... Lets you choose the evaluation (loss) metric.
- early_stopping_rounds ... Automatically determines the effective number of n_estimators. For example, if you pass 5, the cycle stops once the validation score has deteriorated for 5 rounds in a row. Therefore, set n_estimators to a large value. If you set this parameter, you must also pass validation data via eval_set.
- eval_set ... Used to pass the validation data.
- n_jobs ... Sets the number of parallel processes. Set it to the number of cores on your PC. It doesn't improve the score, but on very large datasets it reduces the execution time.
- verbose ... If set to False, the intermediate progress is not displayed. If you pass a number, progress is printed once every that many rounds.
# Define model
my_model_2 = XGBRegressor(n_estimators=1000,
learning_rate=0.05,
eval_metric='mae')
# Fit model
my_model_2.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=1)
# Predict
predictions_2 = my_model_2.predict(X_valid)
# Calculate MAE
mae_2 = mean_absolute_error(predictions_2, y_valid)
print("Mean Absolute Error:" , mae_2)
Execution result
If you set verbose to False, the intermediate progress below is not displayed. You can see that the MAE steadily decreases as training proceeds.
[0] validation_0-mae:172457.42188
[1] validation_0-mae:163972.64062
[2] validation_0-mae:155982.82812
......
[154] validation_0-mae:16951.49609
[155] validation_0-mae:16948.06641 # ←← This is the minimum value, but it's slightly different from the value shown in the final result... I don't know why... (crying)
[156] validation_0-mae:16954.53516
[157] validation_0-mae:16962.16211
[158] validation_0-mae:16956.42383
[159] validation_0-mae:16956.51172
[160] validation_0-mae:16952.38086
Mean Absolute Error: 16948.067128638697
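(The tiny gap between the logged minimum and the final MAE is most likely just a floating-point precision difference between the metric XGBoost logs internally and the value scikit-learn computes, not a real discrepancy.) If you want to check which round early stopping actually kept, the scikit-learn wrapper exposes it after fitting; the attributes below exist in recent xgboost versions, so treat this as a version-dependent sketch.

# Sketch: inspect what early stopping kept (attribute availability depends on the xgboost version)
print(my_model_2.best_iteration)  # index of the best boosting round on the eval_set
print(my_model_2.best_score)      # best validation MAE observed during training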
Let's set n_estimators = 1. I'm looking forward to the result.
# Define the model
my_model_3 = XGBRegressor(n_estimators=1,
learning_rate=0.05,
eval_metric='mae')
# Fit the model
my_model_3.fit(X_train, y_train)
# Get predictions
predictions_3 = my_model_3.predict(X_valid)
# Calculate MAE
mae_3 = mean_absolute_error(predictions_3, y_valid)
print("Mean Absolute Error:" , mae_3)
Execution result
Mean Absolute Error: 172457.41701141777
With only a single boosting round, the MAE is essentially the same as round [0] in the earlier log: the ensemble has barely learned anything yet.
As a beginner, I found this part quite difficult. Ensemble methods, boosting, and other new jargon suddenly piled up... I was really helped by the owners of the various reference sites. Truly, thank you very much.
Next time, Intermediate Machine Learning will be complete! With that, I can finally say that I'm really studying machine learning. I'll finish it by the end of the year!
Thank you for reading until the end!