Reaching for a gradient boosting tree as the first move in machine learning on tabular data is about as standard as the classic opening pawn move to 7f in shogi.
Gradient boosting is the first thing to try, and it is even more fun if you can also see which features are likely to matter, so let's display the feature importances of a gradient boosting model.
I tried it in Google Colaboratory, but it should run just as well in your own Jupyter environment or as a plain Python script.
The notebook is below. https://colab.research.google.com/drive/1N1gtzTHFRKsbm88NyuEKqBr9wNS3tU7K?usp=sharing
The data is the "Speed Dating" sample dataset from the page below. https://knowledge-ja.domo.com/Training/Self-Service_Training/Onboarding_Resources/Fun_Sample_Datasets
It comes from an experiment in which each participant went on 4-minute dates with all the other participants and then rated each date and said whether they wanted to see that person again. The data looks very interesting, but with nearly 200 mostly undocumented columns it was quite hard to dig into, so this time I will focus on how to run the analysis rather than on the data itself.
First, get the data.
! wget https://knowledge-ja.domo.com/@api/deki/files/5950/Speed_Dating_Data.csv
Install the package for encoding text columns.
! pip install category_encoders
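What this package does, in a nutshell, is map string categories to integer codes. A tiny illustrative example (the demo DataFrame below is made up, not part of the speed-dating data):

import pandas as pd
import category_encoders as encoders

demo = pd.DataFrame({"career": ["lawyer", "doctor", "lawyer", "artist"]})
print(encoders.OrdinalEncoder(cols=["career"]).fit_transform(demo))
# each distinct string is replaced by its own integer code (1, 2, 3, ...)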
Load CSV data into a data frame.
import pandas as pd
speed_date = pd.read_csv("Speed_Dating_Data.csv", encoding='cp932')
speed_date
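Since there are nearly 200 columns, a quick optional look at the shape and column types helps get oriented (this inspection step is not in the original notebook):

print(speed_date.shape)                    # (number of rows, number of columns)
print(speed_date.dtypes.value_counts())    # how many numeric vs. object (text) columns
print(speed_date['match'].value_counts())  # distribution of the match target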
Preprocess the data. The data itself is a little messy.
I dropped the dec_o column because it seemed too directly tied to the match answer, i.e. whether the pair agreed to date again (in fact, its association with the answer is by far the strongest); a quick cross-tabulation confirming this is sketched after the preprocessing code below.

import category_encoders as encoders
label_cols = ['field', 'from', 'career', 'undergra']
# Build the feature matrix, excluding the id columns etc., with match as the target
exist_match_df = speed_date
object_col = exist_match_df['match']
object_col = object_col.values.astype(int)
feature_cols = exist_match_df.iloc[:,13:97]
feature_cols = feature_cols.drop('dec_o', axis=1)
col_names = feature_cols.columns.values
# Strip commas from these numeric-looking columns so they can be cast to float
feature_cols['zipcode'] = feature_cols['zipcode'].str.replace(',', '')
feature_cols['income'] = feature_cols['income'].str.replace(',', '')
feature_cols['tuition'] = feature_cols['tuition'].str.replace(',', '')
feature_cols['mn_sat'] = feature_cols['mn_sat'].str.replace(',', '')
# Encode the free-text columns as integer labels
ordinal_encoder = encoders.OrdinalEncoder(cols=label_cols, handle_unknown='impute')
feature_cols = ordinal_encoder.fit_transform(feature_cols)
feature_cols = feature_cols.values.astype(float)
feature_cols
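As promised above, here is a quick sanity check (not in the original notebook) of why dec_o was dropped: cross-tabulating it against match on the raw data shows that a match essentially requires the partner's decision to be yes.

pd.crosstab(speed_date['dec_o'], speed_date['match'])
# match == 1 should only occur where dec_o == 1 (the partner said yes),
# so keeping dec_o would make the prediction problem almost trivial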
Now train on the processed data.
import xgboost as xgb
from sklearn.model_selection import cross_validate, cross_val_predict, KFold
kfold = KFold(n_splits=5)
score_func = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
# note: early_stopping_rounds only takes effect when an eval_set is passed to fit();
# without one, xgboost will ignore it or complain depending on the version
clf = xgb.XGBClassifier(objective="binary:logistic", max_depth=10, n_estimators=10000, early_stopping_rounds=20)
score = cross_validate(clf, feature_cols, object_col, cv=kfold, scoring=score_func, return_estimator=True)
print('acc: ' + str(score["test_accuracy"].mean()))
print('precision: ' + str(score["test_precision_macro"].mean()))
print('recall: ' + str(score["test_recall_macro"].mean()))
print('F1: ' + str(score["test_f1_macro"].mean()))
acc: 0.8350436362341039
precision: 0.6755307380632243
recall: 0.5681596439505251
F1: 0.5779607716750095
Looking at the recall and so on, I can't say the model learned all that well, but the main subject comes after this.
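As an aside, one likely reason for the low recall is that actual matches make up only a small fraction of the rows. A common XGBoost knob for this, not tried in the original run, is scale_pos_weight; a minimal sketch (clf_weighted and the smaller n_estimators value are just illustrative):

import numpy as np

# weight the minority (match) class more heavily;
# scale_pos_weight ~ #negatives / #positives is a common starting point
neg, pos = np.bincount(object_col)
clf_weighted = xgb.XGBClassifier(objective="binary:logistic", max_depth=10,
                                 n_estimators=1000, scale_pos_weight=neg / pos)

Re-running cross_validate with clf_weighted would show whether trading a little precision for recall helps here.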
Because return_estimator=True was passed to cross_validate() earlier, the fitted estimators can be retrieved from the result. Each estimator exposes how strongly each explanatory variable contributed as feature_importances_, so let's output that.
import numpy as np
estimators = score["estimator"]
# Average the feature importances over the 5 cross-validation folds
sum_score = np.zeros(len(col_names))
for i in range(5):
    sum_score += estimators[i].feature_importances_
df_score = pd.DataFrame(sum_score / 5, index=col_names, columns=["score"])
df_score.sort_values("score", ascending=False)
In this way, you can see which variables the model treated as important during training.
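If a chart is easier to read than the table, the averaged scores can also be plotted; this is just an optional sketch on top of the df_score computed above (matplotlib comes preinstalled in Colab):

import matplotlib.pyplot as plt

# bar chart of the 20 most important features, averaged over the 5 folds
top20 = df_score.sort_values("score", ascending=False).head(20)
top20.plot.barh(legend=False)
plt.gca().invert_yaxis()   # most important feature at the top
plt.xlabel("average feature importance")
plt.tight_layout()
plt.show()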