A story about developing a machine learning model while managing experiments and models with Azure Machine Learning + MLflow

This article is the day-17 entry in the "Request! Tips for developing on Azure using Python! [PR] Microsoft Japan Advent Calendar 2020". I filled one of the empty slots after the fact.

There is a service called Azure Machine Learning that covers a wide range of tasks around machine learning. It can handle almost every step of machine learning model development, from dataset management to model development, experiment management, model management, and deployment. In this article, however, I'll show how to combine just two of its features, experiment management and model management, with any Python development environment that can connect to the Internet.

That said, we won't be using Azure Machine Learning's features in their native form. We will use Azure Machine Learning as the backend for MLflow, the most widely used experiment management tool.

Once you start developing machine learning models in earnest, managing experiments and models tends to become a headache; I hope this article can serve as a reference when that time comes.

(Added 2020/12/29) Added steps for deploying the model as an API. That part does depend on Azure Machine Learning.

Environment

So far

I had set up a JupyterLab environment on a GPU-equipped Linux machine I built for machine learning, and used it by connecting from my main PC.

I use GitHub to some extent for code management, but I'm ashamed to say I never bothered with experiment management: I tracked experiments with hyperparameter comments, copied notebooks, and hyperparameters embedded in file names. Model management was just as sloppy, with absurd names like "model_adam_lr_0001_lstm_3_layer_e_512_h_1024.model"....

image.png

This is what my experiment records looked like just before submitting my master's thesis, when I had completely run out of steam. For some reason I decided it was fine to push all the files to GitHub and then delete every one of them.

All I can tell from this is that the final layer uses a softmax function... It's terrible... Suddenly "model_adam_lr_0001_lstm_3_layer_e_512_h_1024.model" is starting to look pretty good...

Working this way, of course, led to things like "wait, what were the hyperparameters back then?", "oops, I overwrote an existing file", and "I lost the output dictionary".

From now on

By adding a tool that can manage experiments and models to the existing environment, I want to replace this sloppy way of managing experiments. The verification itself is done in a Jupyter environment, but the approach in this article should work the same way in any Python environment connected to the Internet.

The most widely used experiment management tool is MLflow. https://mlflow.org/

MLflow consists of a library used on the development-environment side and a server that records experiments and models. The server exposes a REST API, and the library talks to that API. Running the MLflow server locally is the easiest way to start, but you don't want your experiments and models to die together with your development environment. Running a separate server avoids that, but even with containers, setting up authentication, managing the database behind the MLflow server, and operating the whole thing is a lot of work.

So I looked for MLflow as a managed service. It turns out that Databricks offers a fully managed MLflow as part of its platform, and that Azure Machine Learning has a compatibility feature that accepts MLflow API calls. https://databricks.com/jp/product/managed-mlflow https://docs.microsoft.com/ja-jp/azure/machine-learning/how-to-use-mlflow

What I need first is experiment and model management, but since I'd eventually like to deploy models as APIs and connect this to MLOps, I chose the setup that uses Azure Machine Learning as the server for MLflow, because it integrates tightly with container services and can deploy models at a production level.

image.png

The overall picture looks like this.

Azure Machine Learning has its own Python SDK that can handle experiment and model management, but I deliberately use MLflow. Compared with the Azure Machine Learning SDK and other experiment management tools, MLflow is more popular and there is far more accumulated know-how. That abundance of know-how raises the odds of finding a solution when something breaks or when you're unsure how to use the tool, and when introducing an experiment management tool for the first time, the knowledge already accumulated on the net and by those who came before you is a big attraction.

Understanding the services

Azure Machine Learning

azure-machine-learning-taxonomy.png

The image above, from the documentation [^1], is easy to understand, so let's start there.

Workspace exists as the top-level resource, and under it sit various components such as Experiments for experiment management, Registered models for model management, and Deployment endpoints for deployment.

Of these, an Experiment is the unit that manages one experiment, and a Run under an Experiment is the unit that manages a single execution of that experiment. A Run collects the various files, metrics, logs, and so on produced by that execution.

Pipelines is a component that orchestrates processing in cooperation with external services; it won't be used this time, so I'll skip the details.

Datasets is a component that manages datasets created from data sources. I won't use it this time either, but from the viewpoint of reproducibility it's better to record datasets too, so I'd like to incorporate it someday.

Registered models are literally model management components.

There is a GUI environment called Azure Machine Learning Studio, which allows you to browse managed experiments and models in the GUI.

MLflow

According to the documentation [^2], MLflow consists of four components: Tracking, Projects, Models, and Model Registry.

Tracking is responsible for recording parameters and metrics and for managing execution history, so the "experiment management" we want here is Tracking's territory.

Experiments managed by Tracking are organized into units called experiments, and each experiment has runs underneath it. The files, logs, metrics, and so on generated by each execution are recorded in a run; the files are called artifacts. Besides the model itself, I often keep backups of logs and exported graph images, so being able to manage those as well is welcome.

The component that used to be called simply the MLflow server is properly called the MLflow Tracking Server; it provides the API as well as GUI-based browsing and management of experiments and models.

Projects records environment information about how the training process is executed. In short, it bundles everything needed to reproduce an experiment so that the experiment can be run anywhere; you could call it a feature that abstracts a "machine learning experiment". It seems to be the key to ensuring reproducibility.

Models are responsible for managing the trained models, and this component is responsible for "model management".

Relationship between Azure Machine Learning and MLflow

The Azure Machine Learning workspace provides an MLflow-compatible REST API. Strictly speaking, Azure Machine Learning exposes an endpoint that accepts MLflow Tracking Server-compatible API calls, and when the MLflow library records to this endpoint, the calls are mapped onto Azure Machine Learning's own features. The way records are stored is not exactly the same as in MLflow, but from the point of view of someone writing Python and developing machine learning models, the operations are basically the same whether the backend is genuine MLflow or Azure Machine Learning.

It does not provide the exact same GUI as MLflow; if you want to browse and manage experiments in a GUI, you use Azure Machine Learning Studio.

In Azure Machine Learning, Experiment is the component responsible for experiment management, and it corresponds to MLflow Tracking. In MLflow, the unit called an experiment, which represents the whole series of experiments for a model, contains one run per execution; Azure Machine Learning has exactly the same structure, so the mapping works cleanly.

The environment information handled by MLflow Projects also seems to fall under Azure Machine Learning's Experiment, but unlike with Tracking, the files and settings that make up a project as defined by MLflow Projects don't appear to be stored anywhere in particular. Azure Machine Learning can manage multiple compute resources and decide which one an experiment runs on, so to run an experiment in an arbitrary environment the scripts, dependencies, and so on have to be bundled together; this "bundled experiment" is what corresponds to an MLflow Projects project.

Managing the project itself on GitHub also seems perfectly fine.

Model management for MLflow Models is handled by Registered models, which support MLflow-compatible model definitions.

Trying it out

What to do in advance

If you want to use a GPU instance, you must request a quota increase (the limit on available resources) through a support request in advance. Note that separate quotas apply to ordinary virtual machines and to compute instances under Azure Machine Learning; we won't use the latter in this article, but if you want GPU instances inside Azure Machine Learning, you need to request quota there as well.

(Deploying a Data Science VM)

My GPU-equipped machine hasn't been doing well, so this is a good opportunity to deploy a GPU VM on Azure and move over to it for a while. An on-premise GPU machine has its own romance, so I plan to come back to it after rebuilding, but losing data would really hurt, so this time I'll use a JupyterLab environment in the cloud.

There is a VM image called the Data Science Virtual Machine that gives you JupyterLab right after deployment, so I'll use that. A typical Python environment in just a couple of clicks...

image.png

I'm using an NCv3 instance with a Tesla V100. What a time to be alive, renting a GPU that costs 1.4 million yen for a few hundred yen an hour.

Note that the availability option defaulted to zone 1, but the Data Science Virtual Machine doesn't seem to support availability zones and I got an error no matter which zone I specified, so I set it to "No infrastructure redundancy required". The code goes to GitHub and the experiment records and models go to Azure Machine Learning, so at worst the compute environment can die and it's no big loss.

Also, since JupyterHub is used for login, a username and password are more convenient for the account, so I changed the authentication type to "Password".

The disk and network settings will remain at their defaults. Disk capacity may need to be expanded depending on the data handled and the size of the model.

I've enabled automatic shutdown because accidentally leaving the VM on can be very painful for my wallet.

https://<VM IP>:8000

You can connect to JupyterHub at the address above. Once you click past the warning screen caused by the self-signed certificate and log in with the administrator account you set up, the familiar Jupyter Notebook environment appears.

https://<VM IP>:8000/user/<username>/tree?

You'll land on a URL like the above; if you change it to

https://<VM IP>:8000/user/<username>/lab

you'll be taken to the JupyterLab environment instead. I remember that making JupyterLab the default requires editing the JupyterHub settings, but that's a hassle, so I'll leave it as is this time.

image.png

Deploy Azure Machine Learning

Prepare a workspace for Azure Machine Learning.

image.png

There's no particular need to change any settings; once you've decided on a resource group and workspace name, you can go ahead and create it.

After the deployment completes, go to the resource screen and click "Download config.json". This file contains the information needed to connect, through the SDK, to the Azure Machine Learning workspace you just created: the subscription ID, resource group name, and workspace name. You could just as well type these in manually, but this time we'll use the file.
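If you'd rather not download the file, the connection information can also be set up by hand. A minimal sketch below, assuming the Azure Machine Learning SDK v1 is installed; the placeholder values are hypothetical and need to be replaced with your own subscription ID, resource group, and workspace name:

from azureml.core import Workspace

# Placeholders: replace with your own values from the Azure portal
ws = Workspace.get(name="<workspace name>",
                   subscription_id="<subscription id>",
                   resource_group="<resource group>")
ws.write_config()  # writes the equivalent of config.json under .azureml/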

image.png

You can use Azure Machine Learning Studio, which is a GUI, by clicking "Launch Studio".

image.png

It looks like you can do all sorts of things from here. (This time, apart from deployment, almost everything is done in code, so we'll mostly just look at it.)

A clean UI keeps me motivated, so this is a plus.

Connect to Azure Machine Learning Workspace

Go to the JupyterLab environment on the Data Science Virtual Machine and create a notebook using the kernel called azureml_py36_pytorch. (If you create your own conda virtual environment instead, it should be fine as long as you install the azureml SDK and MLflow.)

First, connect to the workspace. You need to upload the config.json you downloaded earlier to the same directory as your notebook.

Since the workspace will act as an MLflow Tracking-compatible server, we also point MLflow at the MLflow-compatible endpoint the workspace provides.

import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

You will be asked to authenticate, so authenticate from the URL provided.

image.png

When I returned after completing the authentication in another window, the cell execution was completed and it was authenticated.

image.png

From here on we basically use the MLflow library, so Azure Machine Learning-specific parts should barely appear except where we check each step's results in the GUI. If you already know MLflow, feel free to skip ahead to the deployment section.

Preparing for the experiment

Using the Boston home price dataset, I created a model for predicting prices by building a neural network with PyTorch.

This time, I'm using AdaBelief as an optimization method. I just wanted to use it.

conda activate azureml_py36_pytorch
pip install adabelief-pytorch==0.2.0

Install the package as shown above; if installing it is too much trouble, replace the optimizer part with

optimizer = torch.optim.Adam(nn_model.parameters(), lr=0.001)

and it will work fine with plain Adam.

import torch
from torch import nn
from torch.nn import functional as F
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import torch.utils.data as Data
from adabelief_pytorch import AdaBelief
import tqdm
import tqdm.notebook
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

#Hyperparameters

hidden_1 = 64
hidden_2 = 16
batch_size = 16
n_epochs = 20

#data set

boston = load_boston()
# split features and targets in one call so they stay aligned
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target)

class BostonData(Data.Dataset):
    def __init__(self, X, y):
        self.targets = X.astype(np.float32)
        self.labels = y.astype(np.float32)
    
    def __getitem__(self, i):
        return self.targets[i, :], self.labels[i]

    def __len__(self):
        return len(self.targets)

train_dataset = BostonData(X_train, y_train)
test_dataset = BostonData(X_test, y_test)

train_loaded = Data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loaded = Data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

#model

class Model(nn.Module):
    def __init__(self, n_features, hidden_1, hidden_2):
        super(Model, self).__init__()
        self.linear_1 = nn.Linear(n_features, hidden_1)
        self.linear_2 = nn.Linear(hidden_1, hidden_2)
        self.linear_3 = nn.Linear(hidden_2, 1)

    def forward(self, x):
        y = F.relu(self.linear_1(x))
        y = F.relu(self.linear_2(y))
        y = self.linear_3(y)
        return y

n_features = X_train.shape[1]
nn_model = Model(n_features, hidden_1, hidden_2).to(device)

#Optimization method and loss function

criterion = nn.MSELoss(reduction='sum')
optimizer = AdaBelief(
    nn_model.parameters(),
    lr=1e-3,
    eps=1e-16,
    betas=(0.9,0.999),
    weight_decouple = True,
    rectify = False,
    print_change_log = False
)

#Learning

losses = []
for epoch in range(n_epochs):
    progress_bar = tqdm.notebook.tqdm(train_loaded, leave=False)
    batch_losses = []
    for inputs, target in progress_bar:
        inputs = inputs.to(device)
        target = target.to(device)
        optimizer.zero_grad()

        y_pred = nn_model(inputs)
        loss = criterion(y_pred, torch.unsqueeze(target, dim=1))

        loss.backward()

        optimizer.step()

        batch_losses.append(loss.item())

    # average loss over the epoch
    epoch_loss = sum(batch_losses) / len(batch_losses)
    losses.append(epoch_loss)

    mess = f"Epoch #{epoch+1} Loss: {losses[-1]}"
    tqdm.tqdm.write(mess)

plt.plot(losses)

Experiment management with MLflow Tracking + Azure Machine Learning

We'll use Azure Machine Learning's Experiment component as the server for MLflow Tracking, but as mentioned repeatedly, what the developer writes is MLflow.

Define one experiment.

experiment_name = 'boston_nn_experiment'
mlflow.set_experiment(experiment_name)

To track a metric, record it inside a with mlflow.start_run(): block using mlflow.log_metric, as shown below.

with mlflow.start_run():
    mlflow.log_metric('Loss', 0.03)

You can also use the mlflow.log_metrics function to record several metrics at once from a dictionary.
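For example, a minimal sketch with made-up values (not from the actual run):

metrics = {"Loss": 0.03, "RMSE": 4.2}  # example values only
with mlflow.start_run():
    mlflow.log_metrics(metrics)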

Use mlflow.log_param or mlflow.log_params to record hyperparameters. As with metrics, the singular form records one value at a time and the plural form records a whole dictionary in one go.

params = {
    "hidden_1":hidden_1,
    "hidden_2":hidden_2,
    "batch_size":batch_size,
    "n_epochs":n_epochs
        }
with mlflow.start_run():
    mlflow.log_params(params)

When actually running and recording an experiment, you either place the training code inside the with block after the setup above, or sandwich it between mlflow.start_run() and mlflow.end_run().
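For reference, a minimal sketch of the explicit start/end pattern, with the training code elided and an example metric value:

mlflow.start_run()
mlflow.log_params(params)

# ... training loop goes here ...

mlflow.log_metric("Loss", 0.03)  # example value only
mlflow.end_run()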

At this level of effort, existing code looks easy to adapt, which is nice.

Generated files are recorded with mlflow.log_artifact. Until now I wrote them straight into a local directory, but from here on MLflow and Azure Machine Learning are responsible for managing files, and leftover local files only get in the way, so I write each file into a temporary directory and let it disappear once it has been recorded.

import tempfile
import pathlib

fig = plt.figure()
plt.plot(losses)
with tempfile.TemporaryDirectory() as d:
    filename = 'plot.png'
    artifact_path = pathlib.Path(d) / filename
    print(artifact_path)
    fig.savefig(str(artifact_path))
    mlflow.log_artifact(str(artifact_path))

Use mlflow.pytorch.log_model to register the trained model. The artifact_path argument is the directory path under which the model and its surrounding files are stored on the Azure Machine Learning side. You pass the model object as the first argument, and serialization is handled for you without calling pickle or torch.save yourself.

mlflow.pytorch.log_model(nn_model,artifact_path="model")

As the function name suggests, there are counterparts for many frameworks besides mlflow.pytorch.log_model. [^3]

Files recorded with log_model and log_artifact are managed together. Even if MLflow doesn't support your library, you can at least record the model by registering it as an artifact.
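As a side note, a model logged this way can also be loaded back later through MLflow. A minimal sketch, where the run ID is a placeholder you'd look up from the experiment:

model_uri = "runs:/<run id>/model"  # placeholder run ID; "model" matches artifact_path above
loaded_model = mlflow.pytorch.load_model(model_uri)
loaded_model.eval()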

Putting all of the MLflow pieces above together, the training code ends up rewritten as follows.

with mlflow.start_run():
    mlflow.log_params(params)

    losses = []
    for epoch in range(n_epochs):
        progress_bar = tqdm.notebook.tqdm(train_loaded, leave=False)
        batch_losses = []
        for inputs, target in progress_bar:
            inputs = inputs.to(device)
            target = target.to(device)
            optimizer.zero_grad()

            y_pred = nn_model(inputs)
            loss = criterion(y_pred, torch.unsqueeze(target, dim=1))

            loss.backward()

            optimizer.step()

            batch_losses.append(loss.item())

        # average loss over the epoch
        epoch_loss = sum(batch_losses) / len(batch_losses)
        losses.append(epoch_loss)

        mess = f"Epoch #{epoch+1} Loss: {losses[-1]}"
        tqdm.tqdm.write(mess)

    mlflow.log_metric("Loss",losses[-1])
    mlflow.pytorch.log_model(nn_model,artifact_path="model")
    
    fig = plt.figure()
    plt.plot(losses)
    with tempfile.TemporaryDirectory() as d:
        filename = 'plot.png'
        artifact_path = pathlib.Path(d) / filename
        print(artifact_path)
        fig.savefig(str(artifact_path))
        mlflow.log_artifact(str(artifact_path))

When I open the experiment from Azure Machine Learning Studio, the experiment is certainly recorded.

image.png

Each run is also recorded.

image.png

Metrics are also recorded.

image.png

The image is also recorded properly.

image.png

The parameters appeared in the Experiment's list view but not on the Run page; opening the raw JSON showed they were in fact recorded. In this state they are hard to see in the GUI, but that's fine, since they can still be retrieved through MLflow's Python API.
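For instance, a minimal sketch of pulling the logged parameters and metrics back out through MLflow's client API (using the experiment name defined above):

from mlflow.tracking import MlflowClient

client = MlflowClient()
exp = client.get_experiment_by_name("boston_nn_experiment")
for r in client.search_runs([exp.experiment_id]):
    print(r.info.run_id, r.data.params, r.data.metrics)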

For the time being, we now have experiment management and, to some extent, model management in place.

Experiment management that ensures reproducibility by MLflow Projects + MLflow Tracking + Azure Machine Learning

The experiment-management steps above all assumed working in a notebook. After debugging, the code is usually consolidated into a script like train.py, and this is where MLflow Projects comes in.

With MLflow Projects, you can run not only locally but also remotely (for example on more powerful resources in the cloud). Personally this looks useful for when I go back to an on-premise GPU machine, but if you develop as a team, being able to reproduce other people's experiments seems even more important.

The mood is similar to building with source code and build config files.

Referring to the code used earlier, the conda.yaml that was output as an artifact when the experiment was recorded, and the official sample [^4], create the three files a Project needs.

In the build analogy above, train.py is the source code, conda.yaml describes the dependencies, and MLproject holds the detailed build settings.

train.py


import torch
from torch import nn
from torch.nn import functional as F
import numpy as np
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import torch.utils.data as Data
from adabelief_pytorch import AdaBelief
import tqdm
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error
import tempfile
import pathlib
import sys
import mlflow

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

#Hyperparameters

hidden_1 = int(sys.argv[1]) if len(sys.argv) > 1 else 64
hidden_2 = int(sys.argv[2]) if len(sys.argv) > 2 else 16
batch_size = int(sys.argv[3]) if len(sys.argv) > 3 else 16
n_epochs = int(sys.argv[4]) if len(sys.argv) > 4 else 20

params = {
    "hidden_1":hidden_1,
    "hidden_2":hidden_2,
    "batch_size":batch_size,
    "n_epochs":n_epochs
}

#data set

boston = load_boston()
# split features and targets in one call so they stay aligned
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target)

class BostonData(Data.Dataset):
    def __init__(self, X, y):
        self.targets = X.astype(np.float32)
        self.labels = y.astype(np.float32)
    
    def __getitem__(self, i):
        return self.targets[i, :], self.labels[i]

    def __len__(self):
        return len(self.targets)

train_dataset = BostonData(X_train, y_train)
test_dataset = BostonData(X_test, y_test)

train_loaded = Data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loaded = Data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

#model

class Model(nn.Module):
    def __init__(self, n_features, hidden_1, hidden_2):
        super(Model, self).__init__()
        self.linear_1 = nn.Linear(n_features, hidden_1)
        self.linear_2 = nn.Linear(hidden_1, hidden_2)
        self.linear_3 = nn.Linear(hidden_2, 1)

    def forward(self, x):
        y = F.relu(self.linear_1(x))
        y = F.relu(self.linear_2(y))
        y = self.linear_3(y)
        return y

n_features = X_train.shape[1]
nn_model = Model(n_features, hidden_1, hidden_2).to(device)

#Optimization method and loss function

criterion = nn.MSELoss(reduction='sum')
optimizer = AdaBelief(
    nn_model.parameters(),
    lr=1e-3,
    eps=1e-16,
    betas=(0.9,0.999),
    weight_decouple = True,
    rectify = False,
    print_change_log = False
)

#Learning

with mlflow.start_run():
    mlflow.log_params(params)

    losses = []
    for epoch in tqdm.tqdm(range(n_epochs)):
        batch_losses = []
        for inputs, target in train_loaded:
            inputs = inputs.to(device)
            target = target.to(device)
            optimizer.zero_grad()

            y_pred = nn_model(inputs)
            loss = criterion(y_pred, torch.unsqueeze(target, dim=1))

            loss.backward()

            optimizer.step()

            batch_losses.append(loss.item())

        # average loss over the epoch
        epoch_loss = sum(batch_losses) / len(batch_losses)
        losses.append(epoch_loss)
        mess = f"Epoch #{epoch+1} Loss: {losses[-1]}"
        tqdm.tqdm.write(mess)

    mlflow.log_metric("Loss",losses[-1])
    mlflow.pytorch.log_model(nn_model,artifact_path="model")
    
    fig = plt.figure()
    plt.plot(losses)
    with tempfile.TemporaryDirectory() as d:
        filename = 'plot.png'
        artifact_path = pathlib.Path(d) / filename
        print(artifact_path)
        fig.savefig(str(artifact_path))
        mlflow.log_artifact(str(artifact_path))

The difference from running in a notebook is that hyperparameters can now be given from outside the script.

conda.yaml


channels:
- defaults
- conda-forge
- pytorch
dependencies:
- python=3.6.9
- pytorch=1.4.0
- torchvision=0.5.0
- pip
- pip:
  - mlflow
  - cloudpickle==1.6.0
name: mlflow-env

This is the automatically generated conda.yaml taken from the Azure Machine Learning experiment's artifacts. This time I ran in a local environment that already had the dependencies installed, so there were no errors, but for remote execution I would probably need to add adabelief-pytorch==0.2.0. The automatic generation of conda.yaml doesn't seem to be perfect.

MLproject


name: mlflow-env

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      hidden_1: {type: int, default: 64}
      hidden_2: {type: int, default: 16}
      batch_size: {type: int, default: 16}
      n_epochs: {type: int, default: 20}
    command: "python train.py {hidden_1} {hidden_2} {batch_size} {n_epochs}"

I've added a block to the MLproject to pass parameters.

entry_points is used when execution is split into multiple stages, such as preprocessing, training 1, and training 2; this time there is no multi-stage execution, so there is only main.

Put the above files in the same directory.

To run an experiment, you just connect to the Azure Machine Learning workspace, set the experiment name, define the parameters, and write a few settings as shown below.

import mlflow
from azureml.core import Workspace
import os

ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

experiment_name = "pytorch-boston-project"
mlflow.set_experiment(experiment_name)

backend_config={"USE_CONDA": False}
params = {
    "hidden_1":32,
    "hidden_2":8,
    "batch_size":16,
    "n_epochs": 10
         }

local_env_run = mlflow.projects.run(uri=os.getcwd(), 
                                    parameters=params,
                                    backend = "azureml",
                                    use_conda=False,
                                    backend_config = backend_config)

Change backend_config when running the experiment on a remote compute resource.

For example, to run on a cluster named cpu-cluster under Azure Machine Learning, change it as follows. [^5]

backend_config = {"COMPUTE": "cpu-cluster", "USE_CONDA": False}

Model management

At this point the model has already been recorded, but only as one artifact among others. Models can also be registered in Registered models, separately from Experiment (the experiment-management component of Azure Machine Learning), and registering a model there makes it easy to containerize and deploy. You could register a model on every run, but since Registered models also provides model versioning, it seems reasonable to register only when an experiment succeeds and you have a model worth putting into production.

You can record models in Registered models with mlflow.register_model.

with mlflow.start_run() as run:
    mlflow.log_params(params)

##Omission

    mlflow.log_metric("Loss",losses[-1])
    mlflow.pytorch.log_model(nn_model,artifact_path="model")
    
    model_uri = "runs:/{}/model".format(run.info.run_id)
    mlflow.register_model(model_uri, "PyTorchModel")

However, this method isn't very convenient on Azure Machine Learning, because the registered model isn't associated with the experiment record. So instead, use register_model from azureml.mlflow.

import azureml.mlflow

with mlflow.start_run() as run:
    mlflow.log_params(params)

##Omission

    mlflow.log_metric("Loss",losses[-1])
    mlflow.pytorch.log_model(nn_model,artifact_path="model")
    
    #model_uri = "runs:/{}/model".format(run.info.run_id)
    #mlflow.register_model(model_uri, "PyTorchModel")

    azureml.mlflow.register_model(run, name="PyTorchModel", path="model")

You can see that the model is registered with its association to the experiment record (experiment name and run ID) intact.

image.png

Deploy

Since we've come this far, let's also try deploying the model as an API.

Deploying the trained machine learning model takes one more step beyond model registration: you need to write a so-called entry script that loads the model and returns predictions. [^6]

API deployment can be started from "Deploy" on a model registered in Registered models, and the model is deployed as a container to Azure Kubernetes Service or Azure Container Instances.

In addition to the entry script, all the artifacts of the registered model are placed in the container, and the path to them is exposed through an environment variable called AZUREML_MODEL_DIR.

image.png

Problems that occurred

Regarding the deployment part, I didn't make it in time for the first draft (2020/12/27); the reason is that debugging the entry script took an enormous amount of time. (I didn't even know how to look at the logs...)

I'll skip the details, but I was trying to load the model with mlflow.pytorch.load_model inside the entry script, and that didn't work.

When I reproduced a similar environment locally, I got an error about some missing file (I forgot to take a screenshot); it looked like a pain, so I gave up and switched to managing the model with PyTorch's own save mechanism.

However, it turns out that this method also causes problems when used for model loading in an inference environment.

torch.save(model, "model/model.pth")

You can save the model like this, and if you record it with mlflow.log_artifact, that's all there is to it... or so I thought for a while...

At load time, simply doing

model = torch.load("model/model.pth")

causes problems when the environment where the model was trained and the inference environment differ. This is because the model is stored in a form that includes environment-dependent information.

So instead you write

model = ModelClass(<model_parameters>)
model.load_state_dict(torch.load("model/model.pth"))

that is, you first create an instance from the class that defines the model, and then overwrite its parameters with the trained weights.
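The save side of this approach then looks like this (a minimal sketch using the Model class and variable names from the code above):

# Save only the learned weights, not the whole (environment-dependent) object
torch.save(nn_model.state_dict(), "model/model.pth")

# At load time, rebuild the architecture first, then restore the weights
model = Model(n_features, hidden_1, hidden_2)
model.load_state_dict(torch.load("model/model.pth"))
model.eval()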

There was a good commentary on this issue. If you want to know more, please click here. https://qiita.com/jyori112/items/aad5703c1537c0139edb

With this approach, in order to create an instance from the model class, a file describing the hyperparameters would also have to be registered in Registered models, which frankly feels wrong. If I end up dumping hyperparameters into a text file, there's no point in recording them with MLflow.

ONNX

That's why I decided to use ONNX.

ONNX is a format for storing machine learning models. Because the format includes the model structure, it solves the problem above; on top of that, the model becomes easier to use in all sorts of places beyond the API deployed here, and the same procedure works for XGBoost or scikit-learn models, not just PyTorch. Three birds with one stone, so I adopted it.
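As an aside on the "same procedure for scikit-learn or XGBoost" point: a rough sketch of exporting a scikit-learn model to ONNX, assuming the skl2onnx package (which is not used elsewhere in this article); it reuses the dataset split from the earlier code, and the file name sk_model.onnx is arbitrary:

from sklearn.linear_model import LinearRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# A simple scikit-learn model trained on the same Boston data
sk_model = LinearRegression().fit(X_train, y_train)
onnx_model = convert_sklearn(sk_model,
                             initial_types=[("input", FloatTensorType([None, 13]))])
with open("sk_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())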

As an example of "easier to use in various places": I personally found it interesting that there is a feature that loads an ONNX-format model and runs inference inside SQL statements on the analytics platform. https://docs.microsoft.com/ja-jp/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-predict

First, rewrite the part to be trained as follows and register the model output in ONNX format.

#Learning

with mlflow.start_run() as run:
    mlflow.log_params(params)

    #Omission

    mlflow.log_metric("Loss",losses[-1])
    
    with tempfile.TemporaryDirectory() as d:
        filename = 'model.onnx'
        artifact_path = pathlib.Path(d) / filename
        #Input sample
        valid_input = torch.randn(1, 13, requires_grad=True).to(device)  # dummy input on the same device as the model
        # ONNX
        torch.onnx.export(model=nn_model,
                          args=valid_input,
                          f=str(artifact_path),
                          export_params=True, 
                          opset_version=11,  
                          input_names = ['input'],
                          output_names = ['output'],
                          dynamic_axes={'input' : {0 : 'batch_size'}, 'output' : {0 : 'batch_size'}})
        mlflow.log_artifact(str(artifact_path),artifact_path="model/"+filename)
        
    mlflow.pytorch.log_model(nn_model,artifact_path="model")
    azureml.mlflow.register_model(run, name="PyTorchModel", path="model/model.onnx")
    
    fig = plt.figure()
    plt.plot(losses)
    with tempfile.TemporaryDirectory() as d:
        filename = 'plot.png'
        artifact_path = pathlib.Path(d) / filename
        print(artifact_path)
        fig.savefig(str(artifact_path))
        mlflow.log_artifact(str(artifact_path))

image.png

When I run it, the onnx model is certainly registered.

Next, prepare an entry script.

The entry script needs an init function that loads the model and a run function that describes what to do when a request arrives at the API. See the documentation for details. https://docs.microsoft.com/ja-jp/azure/machine-learning/how-to-deploy-advanced-entry-script

Use onnxruntime to run inference with the ONNX-format model.
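Before wiring it into the entry script, a quick local sanity check is possible. A minimal sketch, assuming the exported model.onnx sits in the current directory and using the input/output names set at export time:

import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession("model.onnx")
dummy = np.random.randn(1, 13).astype(np.float32)  # 13 features, batch of 1
print(sess.run(["output"], {"input": dummy}))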

As an aside, onnxruntime seems to have been developed by Microsoft and made into OSS. I didn't know. https://japan.zdnet.com/article/35129632/ https://github.com/microsoft/onnxruntime

entry.py


import json
import os
import numpy as np
import onnxruntime

def init():
    global model
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'model.onnx')
    model = onnxruntime.InferenceSession(model_path)
    
def run(data):
    try:
        data= json.loads(data)
        data_array = [[data['CRIM'], data['ZN'], data['INDUS'], data['CHAS'], data['NOX'], data['RM'], data['AGE'], data['DIS'], data['RAD'], data['TAX'], data['PTRATIO'], data['B'], data['LSTAT']]]
        input_array = np.array(data_array, dtype=np.float32)  # the exported model expects float32 input
        pred = model.run(output_names=["output"], input_feed={"input": input_array})
        
        return {"prediction": float(pred[0])}
    except Exception as e:
        error = str(e)
        return {"error": error}

The ONNX side defines the I/O so that a whole batch of inputs can be handled at once, but this entry script only accepts one set of inputs per request. (I was worn out from debugging and it became a hassle.)

The model is loaded with onnxruntime.InferenceSession(model_path), and inference is performed with pred = model.run(output_names=["output"], input_feed={"input": input_array}).

Prepare conda.yaml to resolve the dependency of this entry script.

conda.yaml


channels:
- defaults
- conda-forge
- pytorch
dependencies:
- python=3.6.9
- pip
- numpy
- pip:
  - onnxruntime
  - azureml-defaults
  - azureml-sdk
name: boston-api

Now we can finally deploy.

image.png
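The screenshot above is the deployment from the Studio GUI; roughly the same thing should be possible from the SDK. A sketch, assuming the Azure Machine Learning SDK v1; the environment name, the service name "boston-api", and the ACI sizing are hypothetical:

from azureml.core import Environment, Model
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

# Build the inference environment from the conda.yaml prepared above
env = Environment.from_conda_specification(name="boston-api-env", file_path="conda.yaml")
inference_config = InferenceConfig(entry_script="entry.py", environment=env)
deploy_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

# Deploy the latest version of the registered model to an ACI container
model = Model(ws, name="PyTorchModel")
service = Model.deploy(ws, "boston-api", [model], inference_config, deploy_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)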

If you want to do it from the command, you can use mlflow.azureml.deploy.

It's also mentioned in the official documentation, though I'm not sure what happens to the entry script in that case... https://docs.microsoft.com/ja-jp/azure/machine-learning/how-to-use-mlflow#deploy-and-register-mlflow-models https://www.mlflow.org/docs/latest/python_api/mlflow.azureml.html

Doing it from code is probably more convenient for use cases such as regularly updating the dataset to refresh the model and API. The combination of a scheduled GitHub Actions run and the ability to use remote compute resources for continuous model updates just came to mind; I'll try it out eventually.

When the deployment is complete, it will be displayed on the endpoint page.

image.png

Let's actually call it.

import requests
import json

data = {
    "CRIM":1.00245,
    "ZN":0,
    "INDUS":8.12,
    "CHAS":0,
    "NOX":0.538,
    "RM":6.674,
    "AGE":87.3,
    "DIS":4.239,
    "RAD":4,
    "TAX":307,
    "PTRATIO":21,
    "B":380.23,
    "LSTAT":11.98
}
data = json.dumps(data)

res = requests.post(url='http://1c055c7b-8f78-4dc4-a2ba-543e094a37b6.japaneast.azurecontainer.io/score', data=data, headers={'Content-Type': 'application/json'})
res.json()

image.png

It's coming back properly. Great victory.

I don't want to be charged, so the API has already been removed.

Up to this point, Azure Machine Learning has mostly stayed hidden behind the MLflow interface, but when it comes to deployment its strengths really come to the fore.

In closing

By using Azure Machine Learning as the backend for MLflow, we were able to add experiment management and model management to a Python machine learning development environment. And along the way, we deployed the model as an API.

This is goodbye to "model_adam_lr_0001_lstm_3_layer_e_512_h_1024.model". No more accidentally overwriting a file and retraining from scratch, and no more going blank-faced because I forgot to record the parameter settings.

I wish I had known about this back in graduate school, when I was doing machine learning day in and day out.

References

This is an approachable document about how Azure Machine Learning works with MLflow. https://docs.microsoft.com/ja-jp/azure/machine-learning/how-to-use-mlflow, viewed 2020/12/27

I referred to the following official documentation for the implementation and concepts. https://www.mlflow.org/docs/latest/python_api/index.html, viewed 2020/12/27; https://pytorch.org/docs/stable/index.html, viewed 2020/12/27; https://docs.microsoft.com/ja-jp/python/api/overview/azure/ml/?view=azure-ml-py, viewed 2020/12/27

For registering the model in ONNX, I referred to the official ONNX content, the PyTorch tutorial, and the article by Mr. Saito of BASE. https://onnx.ai/get-started.html, viewed 2020/12/29; https://pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html, viewed 2020/12/29; https://devblog.thebase.in/entry/2019/12/15/110000, viewed 2020/12/29

For creating artifacts via a temporary directory, I referred to the following blog article by momijiame. https://blog.amedama.jp/entry/mlflow-tracking, viewed 2020/12/27

For the AdaBelief part, I referred to the official implementation repository by Juntang Zhuang et al. https://github.com/juntang-zhuang/Adabelief-Optimizer, viewed 2020/12/27

If you're wondering what AdaBelief is, read the paper below. It was selected as a Spotlight at NeurIPS 2020. It's seriously impressive. https://arxiv.org/abs/2010.07468, viewed 2020/12/27

In addition, I referred to the following documents.

[^1]: https://docs.microsoft.com/ja-jp/azure/machine-learning/concept-workspace, viewed 2020/12/27
[^2]: https://mlflow.org/docs/latest/concepts.html#mlflow-components, viewed 2020/12/27
[^3]: https://www.mlflow.org/docs/latest/python_api/index.html, viewed 2020/12/27
[^4]: https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/track-and-monitor-experiments/using-mlflow/train-projects-local, viewed 2020/12/27
[^5]: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/track-and-monitor-experiments/using-mlflow/train-projects-remote/train-projects-remote.ipynb, viewed 2020/12/27
[^6]: https://docs.microsoft.com/ja-jp/azure/machine-learning/how-to-deploy-and-where?tabs=azcli#define-an-entry-script, viewed 2020/12/27
[^7]: https://docs.microsoft.com/ja-jp/azure/machine-learning/how-to-deploy-advanced-entry-script, viewed 2020/12/29
