This article was posted as the second entry in the IQ1 Advent Calendar 2019.
Data preprocessing and hyperparameter search are unavoidable in machine learning. However, since my IQ is 1, I would rather skip both if I could. For preprocessing, the gradient boosting tree algorithms that are popular these days (LightGBM and the like) can handle missing values as they are and need no preprocessing of categorical variables, so the world has become fairly IQ1-friendly. Hyperparameter search, on the other hand, is too hard for IQ1: you have to know which hyperparameters each model has and roughly what range to search.
lightgbm_tuner

Recently, optuna released a module called `optuna.integration.lightgbm_tuner` that automates LightGBM's hyperparameter search. This module, too, is IQ1-friendly for a variety of reasons.
Please install the latest version of optuna (0.19.0 at the time of writing): `pip install optuna --upgrade`.
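To confirm which version is actually installed, a quick check like the following works (`optuna.__version__` is a standard attribute; the specific version is only what this article assumes):

import optuna
# the article assumes 0.19.0, where optuna.integration.lightgbm_tuner is available
print(optuna.__version__)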
The dataset used this time is Kaggle's House Prices (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview), which asks you to predict the sale price of a house from information about the land and the building.
First, download the data using the Kaggle API:
$ kaggle competitions download -c house-prices-advanced-regression-techniques
Unzipping the archive gives you train.csv and test.csv. Submitting is a hassle, so this time I will only use train.csv. Open Python (any Jupyter notebook will do) and read it with pandas.
import pandas as pd
df=pd.read_csv("train.csv")
print(df.shape)
.out
(1460, 81)
We can see there are 1460 rows and 81 columns. Since these include the Id and the SalePrice we want to predict, the usable features are 79 dimensions.
Let's drop the columns that we don't need for the time being and set the objective variable separately.
y=df.SalePrice
X=df.drop(["Id","SalePrice"],axis=1)
Next, check for missing values. Since we will feed the data straight into LightGBM, there is no need to handle them; this is just for confirmation.
X.loc[:,pd.isnull(X).any(axis=0)].columns
.out
Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
'MiscFeature'],
dtype='object')
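If you also want to know how many values are missing in each of these columns, a plain pandas sketch like this (not part of the original article) does the job:

# count the missing values per column and show only the columns that actually have any
missing_counts = X.isnull().sum()
print(missing_counts[missing_counts > 0].sort_values(ascending=False))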
Next, label-encode the columns whose elements are strings so that LightGBM can treat them as categorical variables, and set their dtype to category. In a nutshell, label encoding just replaces each string with an integer so that distinct values do not collide.
from sklearn.preprocessing import LabelEncoder

for name in X.columns:
    if X[name].dtype == "object":
        # LabelEncoder cannot take NaN as input, so replace it with the string "NAN"
        X[name] = X[name].fillna("NAN")
        le = LabelEncoder()
        le.fit(X[name])
        encoded = le.transform(X[name])
        X[name] = pd.Series(encoded).astype('category')
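To confirm that the string columns really became category dtype, a quick check (again just pandas, not from the original article) looks like this:

# every former object column should now be reported as "category"
print(X.dtypes.value_counts())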
Now that the preprocessing is done, let's train LightGBM. The following code trains a plain LightGBM model.
import lightgbm as lgb
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)
train_dataset=lgb.Dataset(X_train,y_train)
valid_dataset=lgb.Dataset(X_test,y_test,reference=train_dataset)
# %%time at the top of the cell was used to measure execution time
params={"objective":"regression",
"learning_rate":0.05}
model=lgb.train(params,
train_set=train_dataset,
valid_sets=[valid_dataset],
num_boost_round=300,
early_stopping_rounds=50)
.out
...(abridgement)...
Early stopping, best iteration is:
[113] valid_0's l2: 6.65263e+08
CPU times: user 3.11 s, sys: 537 ms, total: 3.65 s
Wall time: 4.47 s
Training finished, and it took 4.47 seconds. By the way, I plotted the prediction results: the horizontal axis is the predicted value and the vertical axis is the true value.
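The original plot is not reproduced here, but a sketch of how such a plot can be drawn, assuming matplotlib (the article does not show its plotting code), would be:

import matplotlib.pyplot as plt

# predict on the validation split with the trained booster
y_pred = model.predict(X_test, num_iteration=model.best_iteration)

plt.scatter(y_pred, y_test, s=5)  # horizontal axis: predicted, vertical axis: true
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="gray")  # y = x reference
plt.xlabel("predicted SalePrice")
plt.ylabel("true SalePrice")
plt.show()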
Now let's rewrite the above code to do the IQ1 hyperparameter search.
import lightgbm as lgb
import optuna.integration.lightgbm_tuner as lgb_tuner
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)
train_dataset=lgb.Dataset(X_train,y_train)
valid_dataset=lgb.Dataset(X_test,y_test,reference=train_dataset)
params={"objective":"regression",
"learning_rate":0.05,
"metric":"l2"}
model=lgb_tuner.train(params,
train_set=train_dataset,
valid_sets=[valid_dataset],
num_boost_round=300,
early_stopping_rounds=50)
Can you tell where it was rewritten? It is a spot-the-difference puzzle for IQ1.
There are three changes:

- `optuna.integration.lightgbm_tuner` is imported as `lgb_tuner`
- `lgb.train` is replaced by `lgb_tuner.train`
- the `metric` to optimize is added to `params`
By the way, the training time was as follows. It was slower than I expected...
.out
CPU times: user 3min 24s, sys: 33.8 s, total: 3min 58s
Wall time: 3min 48s
The score on the validation data is as follows.
model.best_score
.out
defaultdict(dict, {'valid_0': {'l2': 521150494.1730755}})
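If you want to see which parameter values the tuner settled on, the returned booster keeps the parameters it was trained with; the exact API for retrieving the best parameters may differ by optuna version, so treat this as a sketch:

# parameters of the booster returned by lgb_tuner.train
print(model.params)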
Comparing the two experiments gives the table below. After tuning, the performance has clearly improved.
|  | lightgbm | lightgbm_tuner |
|---|---|---|
| Training time | 4.47 s | 228 s |
| Validation accuracy (MSE) | 6.65263e+08 | 5.21150e+08 |
Looking at the plot as well, the predictions for the more expensive houses (toward the right) look better after tuning.
With this, it seems that even IQ1 can do machine learning!! By the way, when I submitted the model built this way, it placed around 2000th. (There are roughly 4900 participants at or above the sample_submission.csv baseline, so there were plenty of people with IQ1 or less.)
When I tried a hyperparameter search with plain optuna, tuned by feel, the validation score improved but the submission score got slightly worse. (For the submission I retrained with learning_rate=0.05, num_boost_round=1000, early_stopping_rounds=50.) Did I overfit to the validation data? Hyperparameter search really is difficult for IQ1.
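For reference, the retraining for submission looked roughly like this. This is a reconstruction from the settings mentioned above, not the article's actual code; `submit_model` is a hypothetical name, and test.csv would need the same label encoding as train.csv before predicting:

# retrain with the submission settings: learning_rate=0.05, num_boost_round=1000, early_stopping_rounds=50
submit_params = {"objective": "regression",
                 "learning_rate": 0.05}  # plus whatever tuned parameters you want to carry over

submit_model = lgb.train(submit_params,
                         train_set=train_dataset,
                         valid_sets=[valid_dataset],
                         num_boost_round=1000,
                         early_stopping_rounds=50)

# preds = submit_model.predict(preprocessed_test_df)  # test.csv after the same preprocessing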
**If you have any practical hyperparameter search advice, please leave a comment!!**
For reference, the submission scores were as follows (the lower the score, the better):

- default-parameter lightgbm: 0.13852
- lightgbm_tuner: 0.13174
- hand-tuned optuna: 0.13401

The tuning strategy this time was:

- line up the tunable parameters over a fairly wide range
- set learning_rate to a coarse (large) value to fit in more trials

The search code is shown below.
import optuna
def objective(trial):
    '''
    trial: a set of hyperparameters suggested by optuna
    '''
    # hyperparameters to search
    bagging_fraction = trial.suggest_uniform("bagging_fraction", 0, 1)
    bagging_freq = trial.suggest_int("bagging_freq", 0, 10)
    feature_fraction = trial.suggest_uniform("feature_fraction", 0, 1)
    lambda_l1 = trial.suggest_uniform("lambda_l1", 0, 50)
    lambda_l2 = trial.suggest_uniform("lambda_l2", 0, 50)
    min_child_samples = trial.suggest_int("min_child_samples", 1, 50)
    num_leaves = trial.suggest_int("num_leaves", 2, 50)
    max_depth = trial.suggest_int("max_depth", 0, 8)

    params = {"learning_rate": 0.5,
              "objective": "regression",
              "bagging_fraction": bagging_fraction,
              "bagging_freq": bagging_freq,
              "feature_fraction": feature_fraction,
              "lambda_l1": lambda_l1,
              "lambda_l2": lambda_l2,
              "min_child_samples": min_child_samples,
              "num_leaves": num_leaves,
              "max_depth": max_depth}

    model_opt = lgb.train(params, train_set=train_dataset, valid_sets=[valid_dataset],
                          num_boost_round=70, early_stopping_rounds=10)
    return model_opt.best_score["valid_0"]["l2"]
study = optuna.create_study()
study.optimize(objective, n_trials=500)
...(abridgement)...
[I 2019-12-01 15:02:35,075] Finished trial#499 resulted in value: 537618254.528029. Current best value is 461466711.4731979 with parameters: {'bagging_fraction': 0.9973929186258068, 'bagging_freq': 2, 'feature_fraction': 0.9469601028256658, 'lambda_l1': 10.1589501379876, 'lambda_l2': 0.0306013767707684, 'min_child_samples': 2, 'num_leaves': 35, 'max_depth': 2}.
The validation score was 4.61467e+08.
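To actually reuse this result, optuna's study object exposes the best trial via `study.best_value` and `study.best_params` (standard optuna attributes); a retraining sketch, with the submission settings mentioned earlier assumed rather than taken from the article, could look like this:

print(study.best_value)   # best validation l2 found during the search
print(study.best_params)  # the hyperparameter set that achieved it

# retrain with the best parameters; the learning_rate and round counts below are assumptions
best_params = {"objective": "regression", "learning_rate": 0.05, **study.best_params}
best_model = lgb.train(best_params,
                       train_set=train_dataset,
                       valid_sets=[valid_dataset],
                       num_boost_round=1000,
                       early_stopping_rounds=50)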