(Revised on 12/28, since there were unusually many omissions.)
Recently I have been using Kedro to manage the workflow of ML model development for verification.
Kedro is one of the workflow management tools. It composes a `Pipeline` by wrapping Python functions in a class called `node` and connecting them. All inputs and outputs of a `node` are managed through the Data Catalog, a layer that abstracts memory and storage behind a file-system-like interface; the I/O itself is supported by a separate module called `DataSet`.
When using Kedro, I wanted to save the features generated for training and the features generated for inference separately, each as a whole including intermediate products, but I could not find a straightforward way to save them separately, so I am leaving the method I tried here.
Assume the file structure follows the project template that Kedro generates automatically. The Kedro version is 0.17.0.
There are two ways to prepare the Data Catalog: define it in Python code, or describe it in YAML and have Kedro generate it from that YAML via hooks at execution time. This post is based on the latter, so only that approach is described.
At runtime, Kedro automatically finds YAML files matching `conf/*/catalog*` under the project root and reads them, creating the Data Catalog from the settings written there.
```yaml
titanic_train:
  type: pandas.CSVDataSet
  filepath: s3://competitions/titanic/train.csv
  credentials: minio

outputs:
  type: pandas.CSVDataSet
  filepath: s3://competitions/titanic/outputs.csv
  credentials: minio
  versioned: true
```
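As a sanity check, the catalog file is plain YAML, so you can parse it yourself to confirm the structure Kedro will see. A minimal sketch using PyYAML, with the dataset names and paths from the example above:

```python
import yaml  # PyYAML

catalog_yml = """
titanic_train:
  type: pandas.CSVDataSet
  filepath: s3://competitions/titanic/train.csv
  credentials: minio

outputs:
  type: pandas.CSVDataSet
  filepath: s3://competitions/titanic/outputs.csv
  credentials: minio
  versioned: true
"""

catalog = yaml.safe_load(catalog_yml)
# Each top-level key is a dataset name; its mapping holds the DataSet settings
print(sorted(catalog))                  # ['outputs', 'titanic_train']
print(catalog["outputs"]["versioned"])  # True
```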
The DataSet provides a versioning mechanism. Versioning is enabled by implementing the DataSet class so that it inherits from `AbstractVersionedDataSet`, and setting `versioned: true` in `catalog.yml`.
Unfortunately, this method apparently only allows timestamps as version names. This is not the case when the `DataCatalog` is constructed in code, but at least from the YAML description the format of the version cannot be changed. Furthermore, there seems to be no way to inject a code-generated `DataCatalog` when running from the CLI.
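For reference, the timestamp-style versioning works by nesting the file under a version-named directory. The sketch below only imitates that layout: `make_versioned_path` is a hypothetical helper, not a Kedro API, and the timestamp format is an assumption based on what Kedro 0.17.0 produces.

```python
from datetime import datetime
from posixpath import basename, join


def make_versioned_path(filepath: str, version: str) -> str:
    """Imitate how a versioned DataSet nests the file: <filepath>/<version>/<basename>."""
    return join(filepath, version, basename(filepath))


# A save version is a timestamp string, roughly in this shape
version = datetime.now().strftime("%Y-%m-%dT%H.%M.%S.%fZ")
print(make_versioned_path("data/outputs.csv", version))
# e.g. data/outputs.csv/2021-01-05T12.34.56.789012Z/outputs.csv
```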
I want to version the generated features by a name other than a timestamp (I want to separate them into things like train and test). To do this naively, you would have to write a new dataset for test in `catalog.yml`, and also generate a separate pipeline for the test data.
When defining a pipeline you must define its nodes, and each `node` would then have to read the version and switch the referenced dataset. That is the beginning of unreadable code. On the other hand, the feature-generation logic must of course be the same at training time and at inference time, and I want to keep it shared in the code.
To avoid this, I want to version (or rather, tag) the data with some value other than a timestamp.
`TemplatedConfigLoader` is a feature implemented as of 0.17.0. You can set placeholders in YAML and inject values from code; you can also inject values by preparing an external YAML that maps them to the placeholders. Using this, you can change the `filepath` in `catalog.yml` directly to achieve versioning.
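Conceptually, the substitution works like string templating: every `${...}` placeholder in the YAML text is replaced by a value supplied from code or from an external globals file. A minimal sketch of just the idea, using Python's `string.Template` to stand in for what `TemplatedConfigLoader` does; this is an illustration, not Kedro's actual implementation:

```python
from string import Template

raw = "filepath: s3://competitions/titanic/${mode}/outputs.csv"

# Inject the value for the placeholder, as globals_dict does for TemplatedConfigLoader
resolved = Template(raw).substitute(mode="train")
print(resolved)  # filepath: s3://competitions/titanic/train/outputs.csv
```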
The point is to realize versioning by changing the save path for each version. The built-in versioning also forces a uniform version name across all datasets, whereas with this approach you can vary the version name per dataset.
`TemplatedConfigLoader` is, roughly speaking, a simplified version of hydra's functionality. hydra is a configuration management tool for YAML: an OSS that lets you inject values from the command line and structure YAML files. hydra requires its own notation, and its writing style is a little incompatible with Kedro, so it is convenient that Kedro provides a templating function of its own.
Let me describe how the `DataCatalog` is prepared when the project is generated from the project template. This helps to understand where to apply `TemplatedConfigLoader`.
If you generate the CLI etc. from the project template, the Kedro execution session is created by a class called `KedroSession`. I will skip the details since I cannot explain them accurately, but at this point it refers to the `ProjectHooks` instance in `hooks.py` under `<project_name>/src/` and executes each of the hooks described there.
Of these, two hooks relate to the Data Catalog: `register_config_loader` and `register_catalog`. The former prepares the `ConfigLoader` and the latter prepares the `DataCatalog`. In `register_catalog`, the `DataCatalog` is generated based on the `catalog.yml` loaded with the loader from `register_config_loader`.
From the above, you can see that if you replace the `ConfigLoader` in `register_config_loader` with `TemplatedConfigLoader`, you can dynamically rewrite `catalog.yml` and thus change the save path.
In practice I want to control the version name from the CLI, so I accept it as input. Concretely:

- Set a variable corresponding to the version in the `ProjectHooks` class
- In `register_config_loader`, load the settings with `TemplatedConfigLoader`, injecting the value into `catalog.yml`
- Receive the version name from the command line in `cli.py` and pass it to `ProjectHooks` before the session is created
- Prepare a placeholder in `catalog.yml` to receive the value from `TemplatedConfigLoader`
`ProjectHooks` class

```python
class ProjectHooks:
    _mode: str = ''

    @classmethod
    def set_mode(cls, mode: str):
        cls._mode = mode
```
The `Session` is specified to call the hooks instance generated here. Since the singleton pattern is not adopted, it is not guaranteed that the same instance is used on every call, so I hold the version name in a class variable. (The variable name is `mode` because, in fact, I just want to switch the data used according to the purpose of the run, rather than do proper versioning.)
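The effect of using a class variable can be checked in plain Python: a value set via the classmethod is visible from any instance, so it does not matter whether the session reuses the hooks instance. A minimal sketch of just this mechanism, outside Kedro:

```python
class ProjectHooks:
    _mode: str = ''

    @classmethod
    def set_mode(cls, mode: str):
        cls._mode = mode


ProjectHooks.set_mode('train')

# Even a freshly created instance sees the value, because it lives on the class
print(ProjectHooks()._mode)  # train
```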
`register_config_loader`

Implement the following instance method in `ProjectHooks` (the imports shown are what the snippet needs in `hooks.py`):

```python
from typing import Iterable

from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


@hook_impl
def register_config_loader(self, conf_paths: Iterable[str]) -> TemplatedConfigLoader:
    return TemplatedConfigLoader(conf_paths, globals_dict=dict(mode=self._mode))
```
With this, wherever `catalog.yml` contains the placeholder `mode`, the value of `self._mode` is substituted in.
Receive the version name in `cli.py` and pass it to `ProjectHooks` before creating the session

```python
@click.option(
    "--run-mode", type=click.Choice(['train', 'inference'], case_sensitive=True), default="train"
)
def run(
    ...,
    run_mode,
):
    ...
    from .hooks import project_hooks
    project_hooks.set_mode(run_mode)
```
(The import is placed where it is easy to see for the article; put it wherever you like.)
Prepare a placeholder in `catalog.yml` to receive the value from `TemplatedConfigLoader`

```yaml
outputs:
  type: pandas.CSVDataSet
  filepath: s3://competitions/titanic/${mode}/outputs.csv
  credentials: minio
```
With the above changes, running `kedro run --run-mode=train` saves the outputs under the `train` directory.
The built-in `versioned` flag can no longer be used for control, but since the policy of this implementation is not to use it, I think that is fine.
I explained how to implement versioning by templating `catalog.yml` directly.
If you want to change only the model, for example, you can prepare a separate placeholder for the model and give it some value.
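For example, such a catalog entry could look like the fragment below. The dataset name, the `pickle.PickleDataSet` type, and the `model_version` placeholder are hypothetical, not from the original project:

```yaml
model:
  type: pickle.PickleDataSet
  filepath: s3://competitions/titanic/${mode}/models/${model_version}/model.pkl
  credentials: minio
```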
As another approach, I considered implementing a `VersionedDataSet` myself so that the version could be changed from YAML, but in the end I settled on this method, because the value cannot be changed dynamically without preparing a placeholder anyway.
It may stop working due to future changes, but I will go with it for now.
~~Having used it, I feel the versioning function is not quite mature yet, so I think more features will be added in the future. (Personal impression)~~
Looking at the issues and PRs, similar topics are in progress as WIP, so it seems they will be added soon.
I only checked this roughly, so please let me know if something does not work.