Machine learning tooling moves so fast that new features often land without me noticing for months at a time. Before I knew it, CatBoost had apparently become able to handle text columns inside the model, so I tried out its text column specification feature.
When I looked it up, it seems this was added in v0.22, released on 2020-03-03.
I had no idea... lol
I was curious what kind of processing it actually performs, so I investigated it right away.
See below for details; the relevant parts are excerpted here as appropriate.
As in the tutorial, I use rotten_tomatoes as the dataset.
It mixes numeric, categorical, and text columns, with rating_10 (taking values from 0 to 10) as the target variable.
from catboost import Pool, CatBoostClassifier
from catboost.datasets import rotten_tomatoes
from sklearn.metrics import accuracy_score

# List of categorical columns
list_cat_features = ['rating_MPAA', 'studio', 'fresh', 'critic', 'top_critic', 'publisher']
# List of text columns
list_text_features = ['synopsis', 'genre', 'director', 'writer', 'review']

def get_processed_rotten_tomatoes():
    train, test = rotten_tomatoes()

    def fill_na(df, features):
        for feature in features:
            df[feature].fillna('', inplace=True)

    def preprocess_data_part(data_part):
        # Since this is just a quick functional check, drop columns that would take time to turn into features
        data_part = data_part.drop(['id', 'theater_date', 'dvd_date', 'rating', 'date'], axis=1)
        fill_na(data_part, list_cat_features)
        fill_na(data_part, list_text_features)

        X = data_part.drop(['rating_10'], axis=1)
        y = data_part['rating_10']
        return X, y

    X_train, y_train = preprocess_data_part(train)
    X_test, y_test = preprocess_data_part(test)

    return X_train, X_test, y_train, y_test

# Split into train and test
X_train, X_test, y_train, y_test = get_processed_rotten_tomatoes()

# Show only the text columns
X_train[list_text_features].head()
Just pass the list of column names to text_features, the same way you would to cat_features.
text_features is accepted as an argument by several classes and methods; here I set it on Pool().
# train dataset
train_pool = Pool(
    X_train,
    y_train,
    cat_features=list_cat_features,
    text_features=list_text_features,
    feature_names=list(X_train)
)

# test dataset
test_pool = Pool(
    X_test,
    y_test,
    cat_features=list_cat_features,
    text_features=list_text_features,
    feature_names=list(X_test)
)
catboost_default_params = {
    'iterations': 1000,
    'learning_rate': 0.03,
    'eval_metric': 'Accuracy',
    'task_type': 'GPU',  # 'CPU' is not supported for text features; it raises CatBoostError
    'random_seed': 0,
    'verbose': 100
}
# Multi-class classification
clf = CatBoostClassifier(**catboost_default_params)
clf.fit(train_pool)

y_pred = clf.predict(X_test)
print(f"accuracy = {accuracy_score(y_test, y_pred):.4f}")
accuracy = 0.4699
Since random_seed is not set in the tutorial, the accuracy does not exactly match the tutorial's, but the result is almost the same.
According to the tutorial's own comparison, the score drops to 0.4562 when the text features are removed, so the text columns improve accuracy by roughly 1.4 points.
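As a quick sanity check, that no-text baseline can be reproduced by simply dropping the text columns before building the Pool. The following is my own minimal sketch under that assumption, not code from the tutorial:

# Sketch (not from the tutorial): train the same model with the text columns dropped
train_pool_no_text = Pool(
    X_train.drop(columns=list_text_features),
    y_train,
    cat_features=list_cat_features
)
test_pool_no_text = Pool(
    X_test.drop(columns=list_text_features),
    y_test,
    cat_features=list_cat_features
)

clf_no_text = CatBoostClassifier(**catboost_default_params)
clf_no_text.fit(train_pool_no_text)
y_pred_no_text = clf_no_text.predict(test_pool_no_text)
print(f"accuracy without text = {accuracy_score(y_test, y_pred_no_text):.4f}")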
When the columns specified in text_features are converted from text into numeric features, three elements can be configured:

tokenizers
dictionaries
feature_calcers

Before tokenizers
['cats so cute :)',
'mouse skare ...',
'cat defeated mouse',
'cute : mice gather army !',
'army mice defeated cat :(',
'cat offers peace',
'cat skared :(',
'cat mouse live peace :)']
After tokenizers (split using a half-width space as the delimiter)
[['cat', 'so', 'cute', ':)'],
['mouse', 'skare', '...'],
['cat', 'defeat', 'mouse'],
['cute', ':', 'mice', 'gather', 'army', '!'],
['army', 'mice', 'defeat', 'cat', ':('],
['cat', 'offer', 'peace'],
['cat', 'skare', ':('],
['cat', 'mouse', 'live', 'peace', ':)']]
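Note that the example above also reflects some word-form normalization (for instance, 'cats' becomes 'cat'), which a plain delimiter split alone would not produce. As a rough illustration of just the space-delimited split, here is a small sketch using the Tokenizer class from catboost.text_processing; treat the exact constructor arguments as my assumption based on the option names in the docs:

from catboost.text_processing import Tokenizer

# Sketch: a tokenizer that splits on a half-width space, mirroring the "Space" tokenizer above
space_tokenizer = Tokenizer(separator_type='ByDelimiter', delimiter=' ')
print(space_tokenizer.tokenize('cat mouse live peace :)'))
# expected: ['cat', 'mouse', 'live', 'peace', ':)']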
The combinations above are specified with feature_processing.
Those settings are passed as a dict among the CatBoost parameters.
Default settings:

{
    "tokenizers" : [{
        "tokenizer_id" : "Space",          # define a tokenizer named "Space"
        "separator_type" : "ByDelimiter",  # split by delimiter
        "delimiter" : " "                  # the delimiter is a half-width space
    }],

    "dictionaries" : [{
        "dictionary_id" : "BiGram",        # define a dictionary named "BiGram"
        "max_dictionary_size" : "50000",
        "occurrence_lower_bound" : "3",
        "gram_order" : "2"                 # n-gram with n=2
    }, {
        "dictionary_id" : "Word",          # define a dictionary named "Word"
        "max_dictionary_size" : "50000",
        "occurrence_lower_bound" : "3",
        "gram_order" : "1"                 # n-gram with n=1
    }],

    "feature_processing" : {
        "default" : [{                     # define combinations of tokenizers, dictionaries, and feature_calcers
            "dictionaries_names" : ["BiGram", "Word"],
            "feature_calcers" : ["BoW"],
            "tokenizers_names" : ["Space"]
        }, {
            "dictionaries_names" : ["Word"],
            "feature_calcers" : ["NaiveBayes"],
            "tokenizers_names" : ["Space"]
        }],
    }
}
https://catboost.ai/docs/references/text-processing__test-processing__default-value.html
Since you can define your own combinations, this is highly flexible, but there are many possible settings, so refer to the reference for details.
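For example, my understanding is that a custom combination can be passed via the text_processing parameter when constructing the model; the parameter name and schema below follow the reference above, but consider this an untested sketch:

# Sketch (untested): a custom text processing configuration following the schema above
custom_text_processing = {
    "tokenizers": [{
        "tokenizer_id": "Space",
        "separator_type": "ByDelimiter",
        "delimiter": " "
    }],
    "dictionaries": [{
        "dictionary_id": "Word",
        "max_dictionary_size": "50000",
        "occurrence_lower_bound": "3",
        "gram_order": "1"
    }],
    "feature_processing": {
        "default": [{
            "tokenizers_names": ["Space"],
            "dictionaries_names": ["Word"],
            "feature_calcers": ["BoW", "NaiveBayes"]
        }]
    }
}

clf_custom = CatBoostClassifier(
    text_processing=custom_text_processing,
    **catboost_default_params
)
clf_custom.fit(train_pool)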
Thanks to CatBoost's internal handling of categorical columns, one-hot encoding has been unnecessary for a while, and now that text columns are handled as well, you can build a baseline model on tabular data without doing any feature engineering yourself. LightGBM is the most widely used baseline model on Kaggle these days, but CatBoost seems likely to gain ground, and I intend to adopt it myself.
Incidentally, the major packages have all seen big updates over the past few months. The ones that personally caught my interest are as follows.
Besides high-impact new features, bug fixes are also steadily landing. Keeping up is hard work, but that is a nice problem to have.