This article covers how to handle categorical variables in LightGBM, a standard library for Gradient Boosting Decision Trees, which are themselves a standard technique in machine learning. The LightGBM version at the time of writing is 2.3.0.
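There are at least three ways to specify a categorical variable: (1) pass categorical_feature to lgb.Dataset, (2) pass categorical_feature to lgb.train(), or (3) convert the column to dtype='category'. At the time of writing, (3) seems to be the best choice. (1) and (2) are also popular, but they trigger a UserWarning; have they been discouraged recently?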
lgb_train = lgb.Dataset(X_train, y_train, categorical_feature=['A'])
This is method (1): X_train is a pandas.DataFrame and 'A' is the column name of the categorical variable. A UserWarning appears:
python3.7/site-packages/lightgbm/basic.py:1243: UserWarning: Using categorical_feature in Dataset.
warnings.warn('Using categorical_feature in Dataset.')
Yes, I did specify categorical_feature in the Dataset. So what? Method (2) passes it to train() instead:
gbm = lgb.train(params,
                lgb_train,
                categorical_feature=['A'],
                )
This also triggers a UserWarning:
python3.7/site-packages/lightgbm/basic.py:1247: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['A']
So is setting categorical_feature here discouraged too? If you set it in both Dataset as in (1) and train() as in (2), the UserWarning no longer appears, but that feels like pointless duplication.
X_train['A'] = X_train['A'].astype('category')
With method (3), the UserWarning does not appear. If you convert the column to the category type up front, you do not have to specify categorical_feature twice, once for the training Dataset and once for the validation Dataset, as you would with (1). The category type also stores its values internally as a reasonably small integer type, so it is RAM friendly. This looks good.
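The RAM claim can be checked with pandas alone. The following is a minimal sketch, not a benchmark: the toy data, the column 'B', and the label values are made up for illustration, and the exact byte counts will depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'A' holds a few distinct string labels.
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "A": rng.choice(["red", "green", "blue"], size=10_000),
    "B": rng.normal(size=10_000),
})

# Memory used by 'A' while it is still stored as strings.
bytes_before = X_train["A"].memory_usage(deep=True)

# Method (3): convert the column to the category dtype.
X_train["A"] = X_train["A"].astype("category")

# Memory used by 'A' after the conversion.
bytes_after = X_train["A"].memory_usage(deep=True)

# The distinct labels are stored once; each row holds only an integer code.
codes_dtype = X_train["A"].cat.codes.dtype

print(bytes_before, bytes_after, codes_dtype)
```

With only three distinct labels, pandas picks int8 for the codes, so the converted column should be much smaller than the string version. The converted X_train can then be passed to lgb.Dataset without categorical_feature, as in (3).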
I do not know when this UserWarning began to appear or whether it will stay. It seems to be a recent change, and I wrote this article because I could not find any information about it online.
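Since the behavior may differ between versions, one way to check it on your own installation is to capture the warnings programmatically. The following is a minimal sketch using only the standard warnings module; collect_user_warnings, noisy_call, and quiet_call are hypothetical names, and the two stand-in functions only imitate a LightGBM call so that the sketch runs without the library.

```python
import warnings


def collect_user_warnings(fn):
    """Call fn() and return the messages of any UserWarning it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones already shown once in this process.
        warnings.simplefilter("always")
        fn()
    return [str(w.message) for w in caught if issubclass(w.category, UserWarning)]


# Stand-in for a call that warns (warnings.warn defaults to UserWarning).
def noisy_call():
    warnings.warn("Using categorical_feature in Dataset.")


# Stand-in for a call that stays silent.
def quiet_call():
    pass


print(collect_user_warnings(noisy_call))  # ['Using categorical_feature in Dataset.']
print(collect_user_warnings(quiet_call))  # []
```

In practice you would pass something like `lambda: lgb.train(params, lgb_train)` instead of the stand-ins, once for each of the three methods, and compare the returned lists.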