[Kaggle] Try using LGBM

1. Purpose

Gradient boosting libraries such as XGBoost and LGBM are often used in competitions like Kaggle. However, I felt there were few articles and sites to use as a reference, and I had a lot of trouble implementing it myself, so this time I would like to describe what I tried with LGBM and what each parameter means.

2. Benefits of gradient boosting

・No need to impute missing values
・No problem even if there are redundant features (explanatory variables with high correlation can be used as they are)
・The difference from Random Forest is that the trees are built in series: each new tree corrects the errors of the ones before it

With these features, gradient boosting seems to be used often. As a quick check that no imputation is needed, see the sketch below.
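The following is a minimal sketch (my own, not from the original post) showing LightGBM training directly on data that contains NaNs, with no imputation step:

import numpy as np
import lightgbm as lgb

# 100 rows of toy data; the missing values are left as NaN on purpose
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 1.0], [4.0, 2.0]] * 25)
y = 2 * np.nan_to_num(X[:, 0]) + np.nan_to_num(X[:, 1])  # hypothetical target

# No imputation: lgb.Dataset accepts the NaNs as-is
train_data = lgb.Dataset(X, label=y)
booster = lgb.train({'objective': 'regression', 'verbose': -1}, train_data, num_boost_round=10)
print(booster.predict(X[:2]))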

3. Try using LGBM

I will implement it, again using Kaggle's House Prices competition.

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

(1) Preprocessing

(i) Imports

import numpy as np
import pandas as pd

# For data splitting
from sklearn.model_selection import train_test_split

# LightGBM
import lightgbm as lgb

(ii) Reading and combining the data

# Read the data
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

# Combine the data (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_train["TrainFlag"] = True
df_test["TrainFlag"] = False

df_all = pd.concat([df_train, df_test])
df_all.index = df_all["Id"]
df_all.drop("Id", axis=1, inplace=True)

(iii) Dummy variables

df_all = pd.get_dummies(df_all, drop_first=True)

(iv) Splitting the data

# Split df_all back into training data and test data
df_train = df_all[df_all["TrainFlag"] == True]
df_train = df_train.drop(["TrainFlag"], axis=1)

df_test = df_all[df_all["TrainFlag"] == False]
df_test = df_test.drop(["TrainFlag"], axis=1)
df_test = df_test.drop(["SalePrice"], axis=1)

# Split the training data again for validation
y = df_train["SalePrice"].values
X = df_train.drop("SalePrice", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

(2) Try using LGBM

(i) Creating the LGBM datasets

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test)

※ Important points

・To use LGBM, the data must be wrapped with lgb.Dataset.
・In xgboost, df_test (the original test data) also had to be converted (with xgb.DMatrix); in LGBM this is not necessary, and the raw data can be passed straight to predict. Note that the handling of data differs slightly between xgboost and LGBM, as the sketch below illustrates.
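A minimal sketch of this difference (my own, assuming both libraries are installed):

import numpy as np
import xgboost as xgb
import lightgbm as lgb

X = np.random.rand(100, 5)
y = np.random.rand(100)

# xgboost: training AND prediction both go through DMatrix
xgb_model = xgb.train({'objective': 'reg:squarederror'}, xgb.DMatrix(X, label=y), num_boost_round=10)
xgb_pred = xgb_model.predict(xgb.DMatrix(X))  # passing raw X here would raise an error

# LightGBM: only training uses lgb.Dataset; predict takes the raw array directly
lgb_model = lgb.train({'objective': 'regression', 'verbose': -1}, lgb.Dataset(X, label=y), num_boost_round=10)
lgb_pred = lgb_model.predict(X)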

(ii) Setting the parameters

params = {
        # Regression problem
        'objective': 'regression',
        'random_state': 1234, 'verbose': 0,
        # Evaluation metric during training (RMSE)
        'metric': 'rmse',
    }
num_round = 100

* Brief explanation of the parameters

See the following for details: https://lightgbm.readthedocs.io/en/latest/Parameters.html

・verbose: how much information is displayed during training. The default is 1.
・metric: how the error is measured during training (here, RMSE).
・num_round: the number of boosting rounds, i.e. the maximum number of trees built.

(iii) Training the model

model = lgb.train(params, lgb_train, num_boost_round = num_round)
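Note that lgb_eval is created above but never used. As a variation (assuming LightGBM 3.3 or later, where early stopping is passed as a callback), you could monitor the validation split and stop when the RMSE stops improving:

model = lgb.train(
    params,
    lgb_train,
    num_boost_round=num_round,
    valid_sets=[lgb_eval],  # watch the held-out split during training
    callbacks=[
        lgb.early_stopping(stopping_rounds=10),  # stop if RMSE does not improve for 10 rounds
        lgb.log_evaluation(period=20),           # print the metric every 20 rounds
    ],
)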

(iv) Prediction

# Predict on the test data
prediction_LG = model.predict(df_test)

# Round the decimals
prediction_LG = np.round(prediction_LG)

(v) Creating the submission file

submission = pd.DataFrame({"Id": df_test.index, "SalePrice": prediction_LG})
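To actually submit, the DataFrame still has to be written out as a CSV; a one-line sketch (the file name is my choice, not from the original post):

# Write the submission file; Kaggle expects only the Id and SalePrice columns
submission.to_csv("submission.csv", index=False)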

That is all!

4. Conclusion

What did you think? LGBM is famous, but it seems that it takes beginners some time to implement.

I have introduced simple code so that you can understand how to implement it as easily as possible. You can get it working just by copying the code, but I feel it is very important to understand what each line means, at least roughly.

I hope this helps you deepen your understanding.
