Build a data analysis environment with Kedro + MLflow + GitHub Actions
# Introduction
- I built a data analysis environment with Kedro + MLflow + GitHub Actions, and this post summarizes my impressions.
# Background
** = "Issues when creating a notebook that is all in one file in a local environment for each experiment (lightgbm_02_YYYYMMDD.ipynb, etc.)" **
- You end up with a huge and complex notebook
- Preprocessing, model learning, model evaluation ...
- Difficult to divide in charge (although in many cases all will be done alone)
- Maintenance is spicy
- → If you divide by processing, you will not understand the dependency well this time
- Code review is painful
- Notebooks are difficult to diff
- Notebooks can't be code formatter or checkered
- Experiment management is difficult
- I want to list them (it's hard to open and remember each notebook)
- → It is troublesome to maintain the list manually (the more trials there are)
- It doesn't work or the result changes in another person's environment (when recreating from a clean environment)
- If you take over the matter from a person and clone the master, it will not work
- Results depend on local uncommitted data
# What I did
## Introduced Kedro as a pipeline tool
### What is Kedro? / How to introduce it
* (Reference) [Introduction to Machine Learning Pipeline with Kedro](https://qiita.com/noko_qii/items/2395d3a3dbcd9410e5e7)
### Good things
- Defining the nodes (and their data inputs/outputs) and the pipeline up front made it easy to split the work among people and to keep the code maintainable (a minimal sketch is shown after this list)
  - It was helpful to agree on the inputs/outputs of each processing step first
  - Naming conventions were decided up front and shared with everyone
- Works well together with notebooks
```bash
$ kedro jupyter notebook --allow-root --port=8888 --ip=0.0.0.0 &
```

```python
from kedro.framework.context import load_context

# Load the Kedro project context from inside the notebook
proj_path = '../../../'
context = load_context(proj_path)
catalog = context.catalog

# df = catalog.load("XXX")
parameters = context.params
```
- I was able to visualize the pipeline with kedro-viz
- I was able to manage credentials in credentials.yml
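To illustrate the point about declaring nodes and their inputs/outputs up front, here is a minimal sketch of what a node/pipeline definition looks like in Kedro. The function bodies, the `target` column, the `lgb_params` key, and the dataset names (`raw_data`, `model_input`, `trained_model`) are hypothetical examples, not our actual code; in a real project the dataset names would be registered in `conf/base/catalog.yml` and the hyperparameters in `parameters.yml`.

```python
import lightgbm as lgb
from kedro.pipeline import Pipeline, node


def preprocess(raw_data):
    # Clean the raw table into a model-ready feature table (placeholder logic).
    return raw_data.dropna()


def train_model(model_input, parameters):
    # "target" column and "lgb_params" key are hypothetical examples.
    train_set = lgb.Dataset(
        model_input.drop(columns=["target"]), label=model_input["target"]
    )
    return lgb.train(parameters["lgb_params"], train_set)


# Dataset names ("raw_data", "model_input", "trained_model") would be declared
# in conf/base/catalog.yml; "parameters" is injected by Kedro from parameters.yml.
pipeline = Pipeline(
    [
        node(preprocess, inputs="raw_data", outputs="model_input", name="preprocess"),
        node(
            train_model,
            inputs=["model_input", "parameters"],
            outputs="trained_model",
            name="train_model",
        ),
    ]
)
```

Because every node declares its inputs and outputs by name, the division of work and the dependencies between steps are visible before anyone writes the actual processing logic.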
### Remaining challenges
- I'm unsure about when notebook code should be turned into a pipeline (scripts)
  - Data scientists kept iterating in notebooks → data engineers periodically turned that work into pipelines, but the amount of rework was large and became a burden
- I want to re-run the pipeline from a midpoint (this seems possible, but I haven't investigated it yet)
- kedro-viz does not refresh automatically (it only reloads after a restart)
- I want to extract shared library code somewhere (assuming it will be imported from the notebook side as well as from src)
- I want to run jobs in parallel
- For hyperparameter search, I'd like to integrate with Optuna
## Introduced MLflow as an experiment management tool
### What is MLflow? / How to introduce it
* (Reference) [Introduction to experiment management with MLflow](https://future-architect.github.io/articles/20200626/)
### Good things
- See the link above (a minimal logging sketch is also shown after this list)
- Even when the code was written to log results to MLflow, the whole notebook sometimes ended up being scrapped anyway
  - → In the early, chaotic phase it may be better to manage experiments in a spreadsheet (or Excel) and move to MLflow once things have solidified to some extent
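For reference, logging to MLflow takes only a few lines. The experiment name, parameter values, and metric below are hypothetical placeholders, not values from our project.

```python
import mlflow

# Hypothetical experiment name and values, just to show the logging calls.
mlflow.set_experiment("lightgbm-experiments")

with mlflow.start_run():
    mlflow.log_params({"num_leaves": 31, "learning_rate": 0.1})
    mlflow.log_metric("val_auc", 0.78)
    # Artifacts (plots, models, etc.) can also be attached to the run:
    # mlflow.log_artifact("path/to/feature_importance.png")
```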
### Remaining challenges
- Integration with Kedro
  - I want the information in Kedro's parameters.yml and pipeline.py to be sent to MLflow automatically (one possible approach is sketched after this list)
- Should we be using journal versioning?
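One way the Kedro/MLflow link might be wired up (we have not actually implemented this) is a Kedro hook that opens an MLflow run and logs run information at the start of every pipeline run. The class below is a rough sketch under that assumption; it would still need to be registered as a project hook, and exactly where that happens depends on the Kedro version.

```python
import mlflow
from kedro.framework.hooks import hook_impl


class MLflowHooks:
    """Rough sketch: push Kedro run information to MLflow automatically."""

    @hook_impl
    def before_pipeline_run(self, run_params):
        mlflow.start_run()
        # run_params is a dict of run metadata; the available keys depend on
        # the Kedro version, so just log whatever is present.
        mlflow.log_params({k: str(v) for k, v in run_params.items() if v})

    @hook_impl
    def after_pipeline_run(self, run_params):
        mlflow.end_run()
```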
## Introduced GitHub Actions as a CI tool
### What is GitHub Actions? / How to introduce it
* (Reference) [CI / CD to try with the new function "GitHub Actions" on GitHub](https://knowledge.sakura.ad.jp/23478/)
### Good things
- The master (main) branch is guaranteed to work
- I was able to build reproducible models (every run executes in a clean environment at a well-defined snapshot, tied to a commit id)
### Remaining challenges
- Every build is slow; should I make better use of caching?
- Need to consider the setup for heavy training and for cases where a GPU is required
# Other
- Introduced a code formatter and checker
  - (Reference) Run the formatter as a pre-commit hook
  - Checks run at commit time, so it is important to introduce them from the very beginning
  - If you develop inside a container, install it in the container along with Git (Python is required)
  - Need to work out how this should coexist with the checks on the CI side
- Turned the model into a web application with AWS Elastic Beanstalk
  - I wanted to go serverless with S3 + API Gateway + Lambda, but gave up because of the file size limit
    - EFS could work around that, but then you have to set up a VPC environment, which feels heavyweight
  - For a serverless container (or just to have the app scale to zero when there are no requests), maybe Cloud Run or GAE on GCP?
# Conclusion
- There are still many challenges, but I would like to keep trying different things so that we can run the machine learning cycle quickly and reliably while working closely with the data scientists.
- Incidentally, everything in this article came out of a hobby project with friends to build a model for predicting horse racing results. The code will probably be published once it is a little cleaner. (The directory structure differs slightly from Kedro's default.)