Machine learning tooling moves so fast that new features often land without me noticing for months at a time. Before I knew it, CatBoost had apparently become able to handle text columns inside the model, so I tried out its text column specification feature.
When I looked it up, it seems this was added in v0.22, released on 2020-03-03.
I had no idea... lol
I was curious what kind of processing it actually performs, so I investigated it right away.
See below for details; the relevant parts are excerpted here as appropriate.
As in the tutorial, I use rotten_tomatoes as the dataset.
It mixes numeric, categorical, and text columns, with rating_10 (taking values from 0 to 10) as the target variable.
from catboost import Pool, CatBoostClassifier
from catboost.datasets import rotten_tomatoes
from sklearn.metrics import accuracy_score

# List of categorical columns
list_cat_features = ['rating_MPAA', 'studio', 'fresh', 'critic', 'top_critic', 'publisher']
# List of text columns
list_text_features = ['synopsis', 'genre', 'director', 'writer', 'review']

def get_processed_rotten_tomatoes():
    train, test = rotten_tomatoes()

    def fill_na(df, features):
        for feature in features:
            df[feature].fillna('', inplace=True)

    def preprocess_data_part(data_part):
        # Since this is just a quick functional check, drop columns that would take time to turn into features
        data_part = data_part.drop(['id', 'theater_date', 'dvd_date', 'rating', 'date'], axis=1)
        fill_na(data_part, list_cat_features)
        fill_na(data_part, list_text_features)

        X = data_part.drop(['rating_10'], axis=1)
        y = data_part['rating_10']
        return X, y

    X_train, y_train = preprocess_data_part(train)
    X_test, y_test = preprocess_data_part(test)

    return X_train, X_test, y_train, y_test

# Split into train and test
X_train, X_test, y_train, y_test = get_processed_rotten_tomatoes()

# Show only the text columns
X_train[list_text_features].head()
Just pass the list of column names to text_features, the same way you would to cat_features.
text_features is accepted as an argument by several classes and methods; here I set it on Pool().
# train dataset
train_pool = Pool(
    X_train,
    y_train,
    cat_features=list_cat_features,
    text_features=list_text_features,
    feature_names=list(X_train)
)

# test dataset
test_pool = Pool(
    X_test,
    y_test,
    cat_features=list_cat_features,
    text_features=list_text_features,
    feature_names=list(X_test)
)
catboost_default_params = {
    'iterations': 1000,
    'learning_rate': 0.03,
    'eval_metric': 'Accuracy',
    'task_type': 'GPU',  # 'CPU' is not supported for text features; it raises CatBoostError
    'random_seed': 0,
    'verbose': 100
}
# Multi-class classification
clf = CatBoostClassifier(**catboost_default_params)
clf.fit(train_pool)

y_pred = clf.predict(X_test)
print(f"accuracy = {accuracy_score(y_test, y_pred):.4f}")
accuracy = 0.4699
Since random_seed is not set in the tutorial, the accuracy does not exactly match the tutorial's, but the result is almost the same.
According to the tutorial's own comparison, the score drops to 0.4562 when the text features are removed, so the text columns improve accuracy by roughly 1.4 points.
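As a quick sanity check, that no-text baseline can be reproduced by simply dropping the text columns before building the Pool. The following is my own minimal sketch under that assumption, not code from the tutorial:

# Sketch (not from the tutorial): train the same model with the text columns dropped
train_pool_no_text = Pool(
    X_train.drop(columns=list_text_features),
    y_train,
    cat_features=list_cat_features
)
test_pool_no_text = Pool(
    X_test.drop(columns=list_text_features),
    y_test,
    cat_features=list_cat_features
)

clf_no_text = CatBoostClassifier(**catboost_default_params)
clf_no_text.fit(train_pool_no_text)
y_pred_no_text = clf_no_text.predict(test_pool_no_text)
print(f"accuracy without text = {accuracy_score(y_test, y_pred_no_text):.4f}")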
When the columns specified in text_features are converted from text into numeric features, three elements can be configured:

tokenizers
dictionaries
feature_calcers

Before tokenizers
['cats so cute :)',
'mouse skare ...',
'cat defeated mouse',
'cute : mice gather army !',
'army mice defeated cat :(',
'cat offers peace',
'cat skared :(',
'cat mouse live peace :)']
After tokenizers (split using a half-width space as the delimiter)
[['cat', 'so', 'cute', ':)'],
['mouse', 'skare', '...'],
['cat', 'defeat', 'mouse'],
['cute', ':', 'mice', 'gather', 'army', '!'],
['army', 'mice', 'defeat', 'cat', ':('],
['cat', 'offer', 'peace'],
['cat', 'skare', ':('],
['cat', 'mouse', 'live', 'peace', ':)']]
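Note that the example above also reflects some word-form normalization (for instance, 'cats' becomes 'cat'), which a plain delimiter split alone would not produce. As a rough illustration of just the space-delimited split, here is a small sketch using the Tokenizer class from catboost.text_processing; treat the exact constructor arguments as my assumption based on the option names in the docs:

from catboost.text_processing import Tokenizer

# Sketch: a tokenizer that splits on a half-width space, mirroring the "Space" tokenizer above
space_tokenizer = Tokenizer(separator_type='ByDelimiter', delimiter=' ')
print(space_tokenizer.tokenize('cat mouse live peace :)'))
# expected: ['cat', 'mouse', 'live', 'peace', ':)']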
The combinations above are specified with feature_processing.
Those settings are passed as a dict among the CatBoost parameters.
Default settings:

{
    "tokenizers" : [{
        "tokenizer_id" : "Space",          # define a tokenizer named "Space"
        "separator_type" : "ByDelimiter",  # split by delimiter
        "delimiter" : " "                  # the delimiter is a half-width space
    }],

    "dictionaries" : [{
        "dictionary_id" : "BiGram",        # define a dictionary named "BiGram"
        "max_dictionary_size" : "50000",
        "occurrence_lower_bound" : "3",
        "gram_order" : "2"                 # n-gram with n=2
    }, {
        "dictionary_id" : "Word",          # define a dictionary named "Word"
        "max_dictionary_size" : "50000",
        "occurrence_lower_bound" : "3",
        "gram_order" : "1"                 # n-gram with n=1
    }],

    "feature_processing" : {
        "default" : [{                     # define combinations of tokenizers, dictionaries, and feature_calcers
            "dictionaries_names" : ["BiGram", "Word"],
            "feature_calcers" : ["BoW"],
            "tokenizers_names" : ["Space"]
        }, {
            "dictionaries_names" : ["Word"],
            "feature_calcers" : ["NaiveBayes"],
            "tokenizers_names" : ["Space"]
        }],
    }
}
https://catboost.ai/docs/references/text-processing__test-processing__default-value.html
Since you can define your own combinations, this is highly flexible, but there are many possible settings, so refer to the reference for details.
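For example, my understanding is that a custom combination can be passed via the text_processing parameter when constructing the model; the parameter name and schema below follow the reference above, but consider this an untested sketch:

# Sketch (untested): a custom text processing configuration following the schema above
custom_text_processing = {
    "tokenizers": [{
        "tokenizer_id": "Space",
        "separator_type": "ByDelimiter",
        "delimiter": " "
    }],
    "dictionaries": [{
        "dictionary_id": "Word",
        "max_dictionary_size": "50000",
        "occurrence_lower_bound": "3",
        "gram_order": "1"
    }],
    "feature_processing": {
        "default": [{
            "tokenizers_names": ["Space"],
            "dictionaries_names": ["Word"],
            "feature_calcers": ["BoW", "NaiveBayes"]
        }]
    }
}

clf_custom = CatBoostClassifier(
    text_processing=custom_text_processing,
    **catboost_default_params
)
clf_custom.fit(train_pool)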
Thanks to CatBoost's internal handling of categorical columns, one-hot encoding has been unnecessary for a while, and now that text columns are handled as well, you can build a baseline model on tabular data without doing any feature engineering yourself. LightGBM is the most widely used baseline model on Kaggle these days, but CatBoost seems likely to gain ground, and I intend to adopt it myself.
Incidentally, the major packages have all seen big updates over the past few months. The ones that personally caught my interest are as follows.
Besides high-impact new features, bug fixes are also steadily landing. Keeping up is hard work, but that is a nice problem to have.