I made a handy tool called `colt` for writing application settings, for example for machine learning.
In short, `colt` is a tool for writing settings in the style of `AllenNLP`.
Although I wrote "machine learning" in the title, I think it can be used to configure many kinds of applications, not just machine learning.
When experimenting with machine learning models, `argparse` and [`Hydra`](https://hydra.cc/) are often used to manage hyperparameters.
The problem I see with many of these existing parameter management tools is that when the model changes significantly, the parameter loading code has to change as well.
For example (a deliberately extreme one), suppose you use [`SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn-svm-svc) from scikit-learn and write `argparse` code to read parameters such as `C`, `kernel`, and `class_weight`. If you then switch to [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn-ensemble-randomforestclassifier), do you have to rewrite even the argument-parsing code? And if you want to configure an ensemble model like [`StackingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html?highlight=stacking#sklearn-ensemble-stackingclassifier), where both the base classifiers and the meta classifier must be set, it is not even obvious how to write the settings.
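To see the brittleness concretely, here is a minimal `argparse` sketch of the situation described above (the flag choices are just for illustration): every option is tied to `SVC`, so switching to `RandomForestClassifier` would mean rewriting both the parser and the model-construction code.

```python
import argparse

# Every flag below mirrors an SVC-specific parameter; switching to
# RandomForestClassifier would require different flags (n_estimators,
# max_depth, ...) and different construction code.
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0)
parser.add_argument("--kernel", type=str, default="rbf")
parser.add_argument("--class-weight", type=str, default=None)
args = parser.parse_args(["--C", "0.5", "--kernel", "linear"])

# model = SVC(C=args.C, kernel=args.kernel, class_weight=args.class_weight)
print(args.C, args.kernel)  # => 0.5 linear
```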
## AllenNLP

One way to solve these problems is the **register function** adopted by `AllenNLP`, a deep learning framework for natural language processing.
Let me briefly explain this **register function** here; if you already know it, feel free to skip ahead.
`AllenNLP` describes its settings in JSON format. The following is part of the settings for a text classification model:
```json
"model": {
    "type": "basic_classifier",
    "text_field_embedder": {
        "token_embedders": {
            "tokens": {
                "type": "embedding",
                "embedding_dim": 10,
                "trainable": true
            }
        }
    },
    "seq2vec_encoder": {
        "type": "cnn",
        "num_filters": 8,
        "embedding_dim": 10,
        "output_dim": 16
    }
},
```
You specify the class you want to use with `type`, and set its parameters in the fields at the same level.
Let's also look at the code for `basic_classifier` and `cnn`. The setting items correspond to the arguments of each class's `__init__` method:
```python
@Model.register("basic_classifier")
class BasicClassifier(Model):
    def __init__(
        self,
        ...,
        text_field_embedder: TextFieldEmbedder,
        seq2vec_encoder: Seq2VecEncoder,
        ...,
    ) -> None:
        ...
```
```python
@Seq2VecEncoder.register("cnn")
class CnnEncoder(Seq2VecEncoder):
    def __init__(self,
                 embedding_dim: int,
                 num_filters: int,
                 ngram_filter_sizes: Tuple[int, ...] = (2, 3, 4, 5),
                 conv_layer_activation: Activation = None,
                 output_dim: Optional[int] = None) -> None:
```
Once you register classes with the `register` decorator, you can refer to them from the settings.
With `AllenNLP`, making a class configurable takes nothing more than defining the class and calling `register`.
I will call this feature the **register function** here.
Because the register function ties each class directly to its settings, there is no need to change the setting-loading code when the model changes.
You can easily swap out various components of the model from the settings alone.
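To make the idea concrete, here is a toy sketch of how such a registry can be implemented in plain Python. This is only an illustration of the mechanism, not AllenNLP's actual code; the names `REGISTRY`, `register`, and `build` are made up for this example.

```python
from typing import Any, Callable, Dict, Type

REGISTRY: Dict[str, Type] = {}

def register(name: str) -> Callable[[Type], Type]:
    """Decorator that maps a settings name to a class."""
    def decorator(cls: Type) -> Type:
        REGISTRY[name] = cls
        return cls
    return decorator

def build(config: Dict[str, Any]) -> Any:
    """Look up the class named by 'type' and call it with the remaining keys."""
    params = dict(config)
    cls = REGISTRY[params.pop("type")]
    return cls(**params)

@register("cnn")
class CnnEncoder:
    def __init__(self, embedding_dim: int, num_filters: int) -> None:
        self.embedding_dim = embedding_dim
        self.num_filters = num_filters

encoder = build({"type": "cnn", "embedding_dim": 10, "num_filters": 8})
print(type(encoder).__name__)  # => CnnEncoder
```

Because the settings name an entry in the registry, adding a new component is just another `@register(...)` class; the `build` function never changes.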
For example, to change the `type` of `seq2vec_encoder` from `cnn` to `lstm`, you simply rewrite the settings as follows (`lstm` is already provided by `AllenNLP`):
```json
"seq2vec_encoder": {
    "type": "lstm",
    "num_layers": 1,
    "input_size": 10,
    "hidden_size": 16
}
```
## colt

`colt` is a tool that provides the same functionality as `AllenNLP`'s **register function**.
With `colt`, you can easily write settings that are flexible and resilient to code changes, just like in `AllenNLP`.
It also implements some features not found in `AllenNLP`, to make it usable in more situations.
Here is an example of using `colt`:
```python
import typing as tp

import colt

@colt.register("foo")
class Foo:
    def __init__(self, message: str) -> None:
        self.message = message

@colt.register("bar")
class Bar:
    def __init__(self, foos: tp.List[Foo]) -> None:  # ---- (*)
        self.foos = foos

config = {
    "@type": "bar",  # specify the class with `@type`
    "foos": [
        {"message": "hello"},  # the type here is inferred from the type hint at (*)
        {"message": "world"},
    ]
}

bar = colt.build(config)  # build an object from the config

assert isinstance(bar, Bar)
print(" ".join(foo.message for foo in bar.foos))  # => "hello world"
```
You register a class with `colt.register("<class identifier>")`.
On the settings side, you write it in the format `{"@type": "<class identifier>", (arguments)...}`.
To build an object from a setting, call `colt.build(<settings dict>)`.
If a setting has no `@type` field but the corresponding argument has a type hint, the object is created based on that type hint.
In the example above, the argument `foos` of `Bar` has the type hint `List[Foo]`, so the contents of `foos` in `config` are converted into objects of the `Foo` class.
Type hints are not required by `colt`, though.
If you do not use type hints, write `@type` explicitly everywhere:
```python
config = {
    "@type": "bar",
    "foos": [
        {"@type": "foo", "message": "hello"},
        {"@type": "foo", "message": "world"},
    ]
}
```
If there is neither an `@type` field nor a type hint, the value is simply treated as a `dict`.
You can also use `colt` with existing models such as those in scikit-learn.
If the name given in `@type` has not been `register`ed, it is imported automatically.
The following is an example that uses `StackingClassifier` from scikit-learn:
```python
import colt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

config = {
    "@type": "sklearn.ensemble.StackingClassifier",
    "estimators": [
        ("rfc", {"@type": "sklearn.ensemble.RandomForestClassifier",
                 "n_estimators": 10}),
        ("svc", {"@type": "sklearn.svm.SVC",
                 "gamma": "scale"}),
    ],
    "final_estimator": {
        "@type": "sklearn.linear_model.LogisticRegression",
        "C": 5.0,
    },
    "cv": 5,
}

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

model = colt.build(config)
model.fit(X_train, y_train)

valid_accuracy = model.score(X_valid, y_valid)
print(f"valid_accuracy: {valid_accuracy}")
```
In the example above, the model described in `config` can be swapped for anything that follows the scikit-learn API.
For example, to grid-search an `LGBMClassifier` with `GridSearchCV`:
```python
config = {
    "@type": "sklearn.model_selection.GridSearchCV",
    "estimator": {
        "@type": "lightgbm.LGBMClassifier",
        "boosting_type": "gbdt",
        "objective": "multiclass",
    },
    "param_grid": {
        "n_estimators": [10, 50, 100],
        "num_leaves": [16, 32, 64],
        "max_depth": [-1, 4, 16],
    }
}
```
`colt` does not provide a function for reading settings from a file.
If you want to load settings from a file, convert your preferred format, such as JSON / Jsonnet or YAML, into a `dict` and pass it to `colt`.
If you `register` classes across several different files, all of the classes to be used must have been imported by the time you call `colt.build`.
With `colt.import_modules` you can import multiple modules recursively.
For example, consider the following file structure:
```
.
|-- main.py
`-- models
    |-- __init__.py
    |-- foo.py
    `-- bar.py
```
Suppose `models/foo.py` and `models/bar.py` `register` the classes `Foo` and `Bar` respectively, and `main.py` calls `colt.build`.
Use `colt.import_modules(["<module name>", ...])` in `main.py` as follows:
`main.py`:

```python
colt.import_modules(["models"])
colt.build(config)
```
If you pass a list of module names to `colt.import_modules`, each module and everything below it is imported recursively.
In the example above we passed `["models"]`, so all the modules under the `models` package are imported and `Foo` and `Bar` become available.
To pass positional arguments in a setting, use `*` as the key and give a list (or tuple) of the positional arguments as its value:
```python
@colt.register("foo")
class Foo:
    def __init__(self, x, y):
        ...

config = {"@type": "foo", "*": ["x", "y"]}
```
By default, `colt` builds an object by passing the class's arguments to `__init__`.
If you want to create the object from a method other than `__init__`, you can specify the constructor like this:
```python
@colt.register("foo", constructor="build")
class FooWrapper:
    @classmethod
    def build(cls, *args, **kwargs) -> Foo:
        ...
```
This is convenient when you want a class to act as a wrapper that constructs another class.
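The mechanism behind such an alternate constructor can be sketched in plain Python: look up the named method on the class and call it instead of the class itself. The helper name `build_with_constructor` and the wrapper behavior below are made up for illustration.

```python
from typing import Any, Dict, Type

def build_with_constructor(cls: Type, constructor: str, params: Dict[str, Any]) -> Any:
    """Call the named classmethod instead of cls(...) to construct the object."""
    factory = getattr(cls, constructor) if constructor else cls
    return factory(**params)

class Foo:
    def __init__(self, message: str) -> None:
        self.message = message

class FooWrapper:
    @classmethod
    def build(cls, message: str) -> Foo:
        # Wrap construction of Foo, e.g. to normalize arguments first.
        return Foo(message.upper())

foo = build_with_constructor(FooWrapper, "build", {"message": "hello"})
print(foo.message)  # => HELLO
```

Note that the wrapper's factory method returns an instance of a different class, which is exactly what makes the wrapper pattern useful.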
The special keys `@type` and `*` used by `colt` can be changed.
For example, to change `@type` to `@` and `*` to `+`, pass them as arguments to `colt.build`:

```python
colt.build(config, typekey="@", argskey="+")
```
If you want to reuse such common settings, use `ColtBuilder`:

```python
builder = colt.ColtBuilder(typekey="@", argskey="+")
builder(config_one)
builder(config_two)
```
As an example, I tried Kaggle's Titanic competition using `colt`:
https://github.com/altescy/colt/tree/master/examples/titanic
Using `pdpipe` and scikit-learn, most of the pipeline, from feature engineering and modeling through training and evaluation, is driven by settings.
All the settings are written as Jsonnet files under `configs`. I hope it serves as a reference when using `colt`.
This article introduced the features of `colt` along with usage examples.
I hope it helps you when writing settings.
The functionality of `colt` is based on a great framework called [`AllenNLP`](https://allennlp.org/). `AllenNLP` is packed with ideas that are useful for many machine learning tasks, not just natural language processing, so if you are interested, please give it a try.