This article was posted as the second entry in the IQ1 Advent Calendar 2019.
Data preprocessing and hyperparameter search are unavoidable in machine learning. However, since my IQ is 1, I would rather skip both if I could. For preprocessing, the gradient boosting tree algorithms that are popular these days (LightGBM and the like) can handle missing values as they are and need no preprocessing of categorical variables, so the world has become fairly IQ1-friendly. Hyperparameter search, on the other hand, is too hard for IQ1: you have to know which hyperparameters each model has and roughly what range to search.
lightgbm_tuner

Recently, optuna released a module called `optuna.integration.lightgbm_tuner` that automates LightGBM's hyperparameter search. This module, too, is IQ1-friendly for a variety of reasons.
Please install the latest version of optuna (0.19.0 at the time of writing): `pip install optuna --upgrade`.
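To confirm which version is actually installed, a quick check like the following works (`optuna.__version__` is a standard attribute; the specific version is only what this article assumes):

import optuna
# the article assumes 0.19.0, where optuna.integration.lightgbm_tuner is available
print(optuna.__version__)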
The dataset used this time is Kaggle's House Prices (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview), which asks you to predict the sale price of a house from information about the land and the building.
First, download the data using the Kaggle API:
$ kaggle competitions download -c house-prices-advanced-regression-techniques
Unzipping the archive gives you train.csv and test.csv. Submitting is a hassle, so this time I will only use train.csv. Open Python (any Jupyter notebook will do) and read it with pandas.
import pandas as pd
df=pd.read_csv("train.csv")
print(df.shape)
.out
(1460, 81)
We can see there are 1460 rows and 81 columns. Since these include the Id and the SalePrice we want to predict, the usable features are 79 dimensions.
Let's drop the columns that we don't need for the time being and set the objective variable separately.
y=df.SalePrice
X=df.drop(["Id","SalePrice"],axis=1)
Next, check for missing values. Since we will feed the data straight into LightGBM, there is no need to handle them; this is just for confirmation.
X.loc[:,pd.isnull(X).any(axis=0)].columns
.out
Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
'MiscFeature'],
dtype='object')
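If you also want to know how many values are missing in each of these columns, a plain pandas sketch like this (not part of the original article) does the job:

# count the missing values per column and show only the columns that actually have any
missing_counts = X.isnull().sum()
print(missing_counts[missing_counts > 0].sort_values(ascending=False))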
Next, label-encode the columns whose elements are strings so that LightGBM can treat them as categorical variables, and set their dtype to category. In a nutshell, label encoding just replaces each string with an integer so that distinct values do not collide.
from sklearn.preprocessing import LabelEncoder

for name in X.columns:
    if X[name].dtype == "object":
        # LabelEncoder cannot take NaN as input, so replace it with the string "NAN"
        X[name] = X[name].fillna("NAN")
        le = LabelEncoder()
        le.fit(X[name])
        encoded = le.transform(X[name])
        X[name] = pd.Series(encoded).astype('category')
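To confirm that the string columns really became category dtype, a quick check (again just pandas, not from the original article) looks like this:

# every former object column should now be reported as "category"
print(X.dtypes.value_counts())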
Now that the preprocessing is done, let's train LightGBM. The following code trains a plain LightGBM model.
import lightgbm as lgb
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)
train_dataset=lgb.Dataset(X_train,y_train)
valid_dataset=lgb.Dataset(X_test,y_test,reference=train_dataset)
# %%time at the top of the cell was used to measure execution time
params={"objective":"regression",
"learning_rate":0.05}
model=lgb.train(params,
train_set=train_dataset,
valid_sets=[valid_dataset],
num_boost_round=300,
early_stopping_rounds=50)
.out
...(abridgement)...
Early stopping, best iteration is:
[113] valid_0's l2: 6.65263e+08
CPU times: user 3.11 s, sys: 537 ms, total: 3.65 s
Wall time: 4.47 s
Training finished, and it took 4.47 seconds. By the way, I plotted the prediction results: the horizontal axis is the predicted value and the vertical axis is the true value.
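The original plot is not reproduced here, but a sketch of how such a plot can be drawn, assuming matplotlib (the article does not show its plotting code), would be:

import matplotlib.pyplot as plt

# predict on the validation split with the trained booster
y_pred = model.predict(X_test, num_iteration=model.best_iteration)

plt.scatter(y_pred, y_test, s=5)  # horizontal axis: predicted, vertical axis: true
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="gray")  # y = x reference
plt.xlabel("predicted SalePrice")
plt.ylabel("true SalePrice")
plt.show()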
Now let's rewrite the above code to do the IQ1 hyperparameter search.
import lightgbm as lgb
import optuna.integration.lightgbm_tuner as lgb_tuner
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)
train_dataset=lgb.Dataset(X_train,y_train)
valid_dataset=lgb.Dataset(X_test,y_test,reference=train_dataset)
params={"objective":"regression",
"learning_rate":0.05,
"metric":"l2"}
model=lgb_tuner.train(params,
train_set=train_dataset,
valid_sets=[valid_dataset],
num_boost_round=300,
early_stopping_rounds=50)
Can you tell where it was rewritten? It is a spot-the-difference puzzle for IQ1.
There are three changes:

- `optuna.integration.lightgbm_tuner` is imported as `lgb_tuner`
- `lgb.train` is replaced by `lgb_tuner.train`
- the `metric` to optimize is added to `params`
By the way, the training time was as follows. It was slower than I expected...
.out
CPU times: user 3min 24s, sys: 33.8 s, total: 3min 58s
Wall time: 3min 48s
The score on the validation data is as follows.
model.best_score
.out
defaultdict(dict, {'valid_0': {'l2': 521150494.1730755}})
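If you want to see which parameter values the tuner settled on, the returned booster keeps the parameters it was trained with; the exact API for retrieving the best parameters may differ by optuna version, so treat this as a sketch:

# parameters of the booster returned by lgb_tuner.train
print(model.params)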
Comparing the two experiments gives the table below. After tuning, the performance has clearly improved.
|  | lightgbm | lightgbm_tuner |
|---|---|---|
| Training time | 4.47 s | 228 s |
| Validation accuracy (MSE) | 6.65263e+08 | 5.21150e+08 |
Looking at the plot as well, the predictions for the more expensive houses (toward the right) look better after tuning.
With this, it seems that even IQ1 can do machine learning!! By the way, when I submitted the model built this way, it placed around 2000th. (There are roughly 4900 participants at or above the sample_submission.csv baseline, so there were plenty of people with IQ1 or less.)
When I tried a hyperparameter search with plain optuna, tuned by feel, the validation score improved but the submission score got slightly worse. (For the submission I retrained with learning_rate=0.05, num_boost_round=1000, early_stopping_rounds=50.) Did I overfit to the validation data? Hyperparameter search really is difficult for IQ1.
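For reference, the retraining for submission looked roughly like this. This is a reconstruction from the settings mentioned above, not the article's actual code; `submit_model` is a hypothetical name, and test.csv would need the same label encoding as train.csv before predicting:

# retrain with the submission settings: learning_rate=0.05, num_boost_round=1000, early_stopping_rounds=50
submit_params = {"objective": "regression",
                 "learning_rate": 0.05}  # plus whatever tuned parameters you want to carry over

submit_model = lgb.train(submit_params,
                         train_set=train_dataset,
                         valid_sets=[valid_dataset],
                         num_boost_round=1000,
                         early_stopping_rounds=50)

# preds = submit_model.predict(preprocessed_test_df)  # test.csv after the same preprocessing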
**If you have any practical hyperparameter search advice, please leave a comment!!**
For reference, the submission scores were as follows (the lower the score, the better):

- default-parameter lightgbm: 0.13852
- lightgbm_tuner: 0.13174
- hand-tuned optuna: 0.13401

The tuning strategy this time was:

- line up the tunable parameters over a fairly wide range
- set learning_rate to a coarse (large) value to fit in more trials

The search code is shown below.
import optuna
def objective(trial):
    '''
    trial: a set of hyperparameters suggested by optuna
    '''
    # hyperparameters to search
    bagging_fraction = trial.suggest_uniform("bagging_fraction", 0, 1)
    bagging_freq = trial.suggest_int("bagging_freq", 0, 10)
    feature_fraction = trial.suggest_uniform("feature_fraction", 0, 1)
    lambda_l1 = trial.suggest_uniform("lambda_l1", 0, 50)
    lambda_l2 = trial.suggest_uniform("lambda_l2", 0, 50)
    min_child_samples = trial.suggest_int("min_child_samples", 1, 50)
    num_leaves = trial.suggest_int("num_leaves", 2, 50)
    max_depth = trial.suggest_int("max_depth", 0, 8)

    params = {"learning_rate": 0.5,
              "objective": "regression",
              "bagging_fraction": bagging_fraction,
              "bagging_freq": bagging_freq,
              "feature_fraction": feature_fraction,
              "lambda_l1": lambda_l1,
              "lambda_l2": lambda_l2,
              "min_child_samples": min_child_samples,
              "num_leaves": num_leaves,
              "max_depth": max_depth}

    model_opt = lgb.train(params, train_set=train_dataset, valid_sets=[valid_dataset],
                          num_boost_round=70, early_stopping_rounds=10)
    return model_opt.best_score["valid_0"]["l2"]
study = optuna.create_study()
study.optimize(objective, n_trials=500)
...(abridgement)...
[I 2019-12-01 15:02:35,075] Finished trial#499 resulted in value: 537618254.528029. Current best value is 461466711.4731979 with parameters: {'bagging_fraction': 0.9973929186258068, 'bagging_freq': 2, 'feature_fraction': 0.9469601028256658, 'lambda_l1': 10.1589501379876, 'lambda_l2': 0.0306013767707684, 'min_child_samples': 2, 'num_leaves': 35, 'max_depth': 2}.
The validation score was 4.61467e+08.
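To actually reuse this result, optuna's study object exposes the best trial via `study.best_value` and `study.best_params` (standard optuna attributes); a retraining sketch, with the submission settings mentioned earlier assumed rather than taken from the article, could look like this:

print(study.best_value)   # best validation l2 found during the search
print(study.best_params)  # the hyperparameter set that achieved it

# retrain with the best parameters; the learning_rate and round counts below are assumptions
best_params = {"objective": "regression", "learning_rate": 0.05, **study.best_params}
best_model = lgb.train(best_params,
                       train_set=train_dataset,
                       valid_sets=[valid_dataset],
                       num_boost_round=1000,
                       early_stopping_rounds=50)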