Reaching for a gradient boosting tree as the first move in machine learning on tabular data is about as standard as the classic opening pawn move to 7f in shogi.
Gradient boosting is the first thing to try, and it is even more fun if you can also see which features are likely to matter, so let's display the feature importances of a gradient boosting model.
I tried it in Google Colaboratory, but it should run just as well in your own Jupyter environment or as a plain Python script.
The notebook is below. https://colab.research.google.com/drive/1N1gtzTHFRKsbm88NyuEKqBr9wNS3tU7K?usp=sharing
The data is the "Speed Dating" sample dataset from the page below. https://knowledge-ja.domo.com/Training/Self-Service_Training/Onboarding_Resources/Fun_Sample_Datasets
It comes from an experiment in which each participant went on 4-minute dates with all the other participants and then rated each date and said whether they wanted to see that person again. The data looks very interesting, but with nearly 200 mostly undocumented columns it was quite hard to dig into, so this time I will focus on how to run the analysis rather than on the data itself.
First, get the data.
! wget https://knowledge-ja.domo.com/@api/deki/files/5950/Speed_Dating_Data.csv
Install the package for encoding text columns.
! pip install category_encoders
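What this package does, in a nutshell, is map string categories to integer codes. A tiny illustrative example (the demo DataFrame below is made up, not part of the speed-dating data):

import pandas as pd
import category_encoders as encoders

demo = pd.DataFrame({"career": ["lawyer", "doctor", "lawyer", "artist"]})
print(encoders.OrdinalEncoder(cols=["career"]).fit_transform(demo))
# each distinct string is replaced by its own integer code (1, 2, 3, ...)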
Load CSV data into a data frame.
import pandas as pd
speed_date = pd.read_csv("Speed_Dating_Data.csv", encoding='cp932')
speed_date
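Since there are nearly 200 columns, a quick optional look at the shape and column types helps get oriented (this inspection step is not in the original notebook):

print(speed_date.shape)                    # (number of rows, number of columns)
print(speed_date.dtypes.value_counts())    # how many numeric vs. object (text) columns
print(speed_date['match'].value_counts())  # distribution of the match target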
Preprocess the data. The data itself is a little messy.
I dropped the dec_o column because it seemed too directly tied to the match answer, i.e. whether the pair agreed to date again (in fact, its association with the answer is by far the strongest); a quick cross-tabulation confirming this is sketched after the preprocessing code below.

import category_encoders as encoders
label_cols = ['field', 'from', 'career', 'undergra']
# Build the feature matrix, excluding the id columns etc., with match as the target
exist_match_df = speed_date
object_col = exist_match_df['match']
object_col = object_col.values.astype(int)
feature_cols = exist_match_df.iloc[:,13:97]
feature_cols = feature_cols.drop('dec_o', axis=1)
col_names = feature_cols.columns.values
# Strip commas from these numeric-looking columns so they can be cast to float
feature_cols['zipcode'] = feature_cols['zipcode'].str.replace(',', '')
feature_cols['income'] = feature_cols['income'].str.replace(',', '')
feature_cols['tuition'] = feature_cols['tuition'].str.replace(',', '')
feature_cols['mn_sat'] = feature_cols['mn_sat'].str.replace(',', '')
# Encode the free-text columns as integer labels
ordinal_encoder = encoders.OrdinalEncoder(cols=label_cols, handle_unknown='impute')
feature_cols = ordinal_encoder.fit_transform(feature_cols)
feature_cols = feature_cols.values.astype(float)
feature_cols
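As promised above, here is a quick sanity check (not in the original notebook) of why dec_o was dropped: cross-tabulating it against match on the raw data shows that a match essentially requires the partner's decision to be yes.

pd.crosstab(speed_date['dec_o'], speed_date['match'])
# match == 1 should only occur where dec_o == 1 (the partner said yes),
# so keeping dec_o would make the prediction problem almost trivial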
Now train on the processed data.
import xgboost as xgb
from sklearn.model_selection import cross_validate, cross_val_predict, KFold
kfold = KFold(n_splits=5)
score_func = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
# note: early_stopping_rounds only takes effect when an eval_set is passed to fit();
# without one, xgboost will ignore it or complain depending on the version
clf = xgb.XGBClassifier(objective="binary:logistic", max_depth=10, n_estimators=10000, early_stopping_rounds=20)
score = cross_validate(clf, feature_cols, object_col, cv=kfold, scoring=score_func, return_estimator=True)
print('acc: ' + str(score["test_accuracy"].mean()))
print('precision: ' + str(score["test_precision_macro"].mean()))
print('recall: ' + str(score["test_recall_macro"].mean()))
print('F1: ' + str(score["test_f1_macro"].mean()))
acc: 0.8350436362341039
precision: 0.6755307380632243
recall: 0.5681596439505251
F1: 0.5779607716750095
Looking at the recall and so on, I can't say the model learned all that well, but the main subject comes after this.
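As an aside, one likely reason for the low recall is that actual matches make up only a small fraction of the rows. A common XGBoost knob for this, not tried in the original run, is scale_pos_weight; a minimal sketch (clf_weighted and the smaller n_estimators value are just illustrative):

import numpy as np

# weight the minority (match) class more heavily;
# scale_pos_weight ~ #negatives / #positives is a common starting point
neg, pos = np.bincount(object_col)
clf_weighted = xgb.XGBClassifier(objective="binary:logistic", max_depth=10,
                                 n_estimators=1000, scale_pos_weight=neg / pos)

Re-running cross_validate with clf_weighted would show whether trading a little precision for recall helps here.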
Because return_estimator=True was passed to cross_validate() earlier, the fitted estimators can be retrieved from the result. Each estimator exposes how strongly each explanatory variable contributed as feature_importances_, so let's output that.
import numpy as np
estimators = score["estimator"]
# Average the feature importances over the 5 cross-validation folds
sum_score = np.zeros(len(col_names))
for i in range(5):
    sum_score += estimators[i].feature_importances_
df_score = pd.DataFrame(sum_score / 5, index=col_names, columns=["score"])
df_score.sort_values("score", ascending=False)
In this way, you can see which variables the model treated as important during training.
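If a chart is easier to read than the table, the averaged scores can also be plotted; this is just an optional sketch on top of the df_score computed above (matplotlib comes preinstalled in Colab):

import matplotlib.pyplot as plt

# bar chart of the 20 most important features, averaged over the 5 folds
top20 = df_score.sort_values("score", ascending=False).head(20)
top20.plot.barh(legend=False)
plt.gca().invert_yaxis()   # most important feature at the top
plt.xlabel("average feature importance")
plt.tight_layout()
plt.show()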