I made [`colt`](https://github.com/altescy/colt), a convenient tool for writing application settings, such as machine learning configurations.
In short, `colt` is a tool for writing settings in the style of `AllenNLP`.
Although the title says "machine learning", I think it can be used to configure many kinds of applications, not just machine learning.

When experimenting with machine learning models, `argparse` and [`Hydra`](https://hydra.cc/) are often used to manage hyperparameters.
The problem I see with many of these existing parameter management tools is that when the model changes significantly, the parameter loading code has to change along with it.
For example (a rather contrived one), suppose you use [`SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn-svm-svc) from scikit-learn and write `argparse` code that reads parameters such as `C`, `kernel`, and `class_weight`. If you later switch to [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn-ensemble-randomforestclassifier), do you have to rewrite that part as well? And if you want to configure an ensemble model like [`StackingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html?highlight=stacking#sklearn-ensemble-stackingclassifier), where both base classifiers and a meta classifier need to be set up, it is not obvious how to write the settings at all.
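To make the problem concrete, here is a minimal, hypothetical sketch of such `argparse`-based loading (not taken from any real project); every model swap forces edits to this block:

```python
import argparse

from sklearn.svm import SVC

# Hypothetical sketch: the parser is tied to SVC's parameters, so switching
# to RandomForestClassifier means replacing these arguments (n_estimators,
# max_depth, ...) and the construction code below.
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0)
parser.add_argument("--kernel", type=str, default="rbf")
parser.add_argument("--class-weight", type=str, default=None)
args = parser.parse_args()

model = SVC(C=args.C, kernel=args.kernel, class_weight=args.class_weight)
```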
## AllenNLP
One way to solve these problems is the **Register function** adopted by [`AllenNLP`](https://allennlp.org/), a deep learning framework for natural language processing.
Let me briefly explain this **Register function**; if you already know it, feel free to skip ahead.
`AllenNLP` describes its settings in JSON format. Below is part of the settings for a sentence classification model:
"model": {
"type": "basic_classifier",
"text_field_embedder": {
"token_embedders": {
"tokens": {
"type": "embedding",
"embedding_dim": 10,
"trainable": true
}
}
},
"seq2vec_encoder": {
"type": "cnn",
"num_filters": 8,
"embedding_dim": 10,
"output_dim": 16
}
},
You specify the class you want to use with `type` and set its parameters in the fields at the same level.
Let's also look at the code for `basic_classifier` and `cnn`. The setting items correspond to the arguments of the `__init__` method:
```python
@Model.register("basic_classifier")
class BasicClassifier(Model):
    def __init__(
        self,
        ...,
        text_field_embedder: TextFieldEmbedder,
        seq2vec_encoder: Seq2VecEncoder,
        ...,
    ) -> None:
        ...
```

```python
@Seq2VecEncoder.register("cnn")
class CnnEncoder(Seq2VecEncoder):
    def __init__(self,
                 embedding_dim: int,
                 num_filters: int,
                 ngram_filter_sizes: Tuple[int, ...] = (2, 3, 4, 5),
                 conv_layer_activation: Activation = None,
                 output_dim: Optional[int] = None) -> None:
```
Once you register classes with the `register` decorator, you can refer to them from the settings.
With `AllenNLP`, making a class configurable is as simple as writing the class and `register`ing it.
In this article, I call this feature the **Register function**.
Because the Register function ties each class directly to its settings, there is no need to change the settings-loading code when the model changes.
You can easily swap out various components of the model from the settings alone.
To change the `type` of `seq2vec_encoder` from `cnn` to `lstm`, simply rewrite the settings as follows (`lstm` is already provided in `AllenNLP`):
"seq2vec_encoder": {
"type": "lstm",
"num_layers": 1,
"input_size": 10,
"hidden_size": 16
}
## colt

`colt` is a tool that provides the same functionality as `AllenNLP`'s **Register function**. With `colt`, you can easily write settings that are flexible and robust to code changes, just like in `AllenNLP`.
It also implements some features not found in `AllenNLP` to make it usable in more situations.
Here is an example of using `colt`:
```python
import typing as tp

import colt


@colt.register("foo")
class Foo:
    def __init__(self, message: str) -> None:
        self.message = message


@colt.register("bar")
class Bar:
    def __init__(self, foos: tp.List[Foo]) -> None:  # ---- (*)
        self.foos = foos


config = {
    "@type": "bar",  # specify the class with `@type`
    "foos": [
        {"message": "hello"},  # the type here is inferred from the type hint at (*)
        {"message": "world"},
    ]
}

bar = colt.build(config)  # build an object from the config

assert isinstance(bar, Bar)
print(" ".join(foo.message for foo in bar.foos))  # => "hello world"
```
- Register a class with `colt.register("<class identifier>")`.
- On the settings side, write `{"@type": "<class identifier>", (arguments)...}`.
- To build an object from a setting, call `colt.build(<settings dict>)`.
If a setting has no `@type` field but the corresponding argument has a type hint, the object is created based on the type hint.
In the example above, the argument `foos` of `Bar` has the type hint `tp.List[Foo]`, so the contents of `foos` in `config` are converted into `Foo` objects.
Type hints are not required to use `colt`, though.
If you do not use type hints, write `@type` explicitly everywhere:
```python
config = {
    "@type": "bar",
    "foos": [
        {"@type": "foo", "message": "hello"},
        {"@type": "foo", "message": "world"},
    ]
}
```
If there is neither `@type` nor a type hint, the value is simply treated as a `dict`.
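If I understand this behavior correctly, such values pass through unchanged; a minimal sketch:

```python
config = {"params": {"x": 1, "y": 2}}  # no "@type" and no type hint involved
obj = colt.build(config)
assert obj == {"params": {"x": 1, "y": 2}}  # returned as a plain dict
```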
You can also use `colt` with existing models such as those in scikit-learn.
If the name specified by `@type` is not registered, it will be imported automatically.
The following is an example of using `StackingClassifier` from scikit-learn:
```python
import colt

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

config = {
    "@type": "sklearn.ensemble.StackingClassifier",
    "estimators": [
        ("rfc", {"@type": "sklearn.ensemble.RandomForestClassifier",
                 "n_estimators": 10}),
        ("svc", {"@type": "sklearn.svm.SVC",
                 "gamma": "scale"}),
    ],
    "final_estimator": {
        "@type": "sklearn.linear_model.LogisticRegression",
        "C": 5.0,
    },
    "cv": 5,
}

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

model = colt.build(config)
model.fit(X_train, y_train)

valid_accuracy = model.score(X_valid, y_valid)
print(f"valid_accuracy: {valid_accuracy}")
```
In the example above, the model described in `config` can be swapped for anything that follows the scikit-learn API.
For example, to grid-search an `LGBMClassifier` with `GridSearchCV`:
```python
config = {
    "@type": "sklearn.model_selection.GridSearchCV",
    "estimator": {
        "@type": "lightgbm.LGBMClassifier",
        "boosting_type": "gbdt",
        "objective": "multiclass",
    },
    "param_grid": {
        "n_estimators": [10, 50, 100],
        "num_leaves": [16, 32, 64],
        "max_depth": [-1, 4, 16],
    }
}
```
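Building and fitting this configuration works exactly as before; a short sketch, assuming `lightgbm` is installed and reusing the iris split from the previous example:

```python
search = colt.build(config)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_valid, y_valid))
```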
`colt` does not provide functionality for reading settings from a file.
If you want to load settings from a file, convert your preferred format, such as JSON/Jsonnet or YAML, into a `dict` and pass it to `colt`.
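For example, loading JSON is just a few lines (a sketch, assuming your settings live in a hypothetical `config.json`):

```python
import json

import colt

# Read the file yourself in whatever format you like,
# then hand the resulting dict to colt.
with open("config.json") as f:
    config = json.load(f)

model = colt.build(config)
```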
If you register classes across multiple files, every class you use must have been imported by the time you call `colt.build`.
`colt` provides `colt.import_modules` to import multiple modules recursively.
For example, consider the following file structure:
```
.
|-- main.py
`-- models
    |-- __init__.py
    |-- foo.py
    `-- bar.py
```
Suppose `models/foo.py` and `models/bar.py` `register` the `Foo` and `Bar` classes respectively, and `main.py` calls `colt.build`.
In `main.py`, use `colt.import_modules(["<module name>", ...])` as follows:
```python
# main.py
import colt

colt.import_modules(["models"])
colt.build(config)
```
When you pass a list of module names to `colt.import_modules`, each module and everything beneath it is imported recursively.
In the example above, we passed `["models"]`, so all the modules under `models` are imported and `Foo` and `Bar` become available.
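For reference, `models/foo.py` might look something like this (a hypothetical sketch; `models/bar.py` would be analogous):

```python
# models/foo.py
import colt


@colt.register("foo")
class Foo:
    def __init__(self, message: str) -> None:
        self.message = message
```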
To describe positional arguments in the settings, use `*` as the key and pass a list (or tuple) of positional arguments as the value:
@colt.register("foo")
class Foo:
def __init__(self, x, y):
...
config = {"@type": "foo", "*": ["x", "y"]}
By default, `colt` builds an object by passing the arguments to the class's `__init__`.
If you want to create the object through a method other than `__init__`, you can specify the constructor like this:
@colt.register("foo", constructor="build")
class FooWrapper:
@classmethod
def build(cls, *args, **kwargs) -> Foo:
...
This is convenient when you want to use one class as a wrapper for another.
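Under this sketch, building a config with `"@type": "foo"` would then construct the object via `FooWrapper.build` instead of `__init__` (the argument here is hypothetical):

```python
foo = colt.build({"@type": "foo", "message": "hello"})  # calls FooWrapper.build(message="hello")
```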
The special keys used by `colt`, such as `@type` and `*`, can be changed.
For example, to change `@type` to `@` and `*` to `+`, pass them as arguments to `colt.build`:
```python
colt.build(config, typekey="@", argskey="+")
```
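With these keys, the positional-argument config from earlier would be written like this (a sketch reusing the `foo` class registered above):

```python
config = {"@": "foo", "+": ["x", "y"]}
foo = colt.build(config, typekey="@", argskey="+")
```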
If you want to share these key settings across multiple builds, use `ColtBuilder`:
```python
builder = colt.ColtBuilder(typekey="@", argskey="+")
builder(config_one)
builder(config_two)
```
As a worked example, I tried Kaggle's Titanic competition using `colt`:
https://github.com/altescy/colt/tree/master/examples/titanic
Using pdpipe and scikit-learn, most of the pipeline, from feature creation and modeling to training and evaluation, is driven by the settings.
All of the settings are written as Jsonnet under `configs`. I hope it serves as a useful reference when working with `colt`.
This article introduced the features of `colt` along with usage examples.
I hope it helps you when writing settings.
The functionality of `colt` is based on a great framework called [`AllenNLP`](https://allennlp.org/). `AllenNLP` is packed with ideas that are useful for many machine learning tasks, not just natural language processing, so if you are interested, please give it a try.