Using MLflow with Databricks ①: Experiment tracking in a notebook

Introduction

In this article, I will write about how to track machine learning experiments with MLflow, an open source platform for managing the machine learning lifecycle, in a Databricks environment. (Python 3 is assumed.)

Databricks managed MLflow

The open source MLflow is designed to work with any ML library, language, or deployment tool, but you'll need to provide your own server for tracking your experiments.

Under the Databricks environment, MLflow can be used as a managed service, so there is no need to prepare a separate tracking server. You can also integrate and manage experiment tracking information in your notebook.

This time, I will write about how to tie experiment tracking information to a notebook and track runs there.

How to use MLflow

If the cluster you are using on Databricks runs the ML Runtime, MLflow is included out of the box; otherwise you need to install it yourself.

dbutils.library.installPyPI("mlflow")
dbutils.library.restartPython()
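
On newer Databricks runtimes, the %pip magic is an alternative (dbutils.library.installPyPI has been deprecated on recent runtimes):

%pip install mlflow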

Either of the above installs MLflow. Then import it.

import mlflow

In MLflow, you start tracking by calling mlflow.start_run(), record experiment parameters and metrics with logging functions such as mlflow.log_param() and mlflow.log_metric(), and end the run with mlflow.end_run().

Using a with block is a good way to avoid forgetting to end the run. The implementation looks roughly like this.

with mlflow.start_run():    # Start tracking a run

    # Run the experiment itself

    # Log parameters and metrics
    mlflow.log_param("a", a)
    mlflow.log_metric("rmse", rmse)

    # Save the model (here b is the trained model object)
    mlflow.sklearn.save_model(b, modelpath)

    # Save images produced during the experiment
    mlflow.log_artifact("sample.png")

In addition to parameters, trained models and images produced during an experiment can also be saved to the tracking destination. To record a model, the corresponding library must be installed and its flavor module imported, e.g. mlflow.sklearn for a scikit-learn model.
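
As a minimal sketch of how a flavor module is used (the toy data and values below are illustrative, not part of this article's experiment):

import numpy as np
import mlflow
import mlflow.sklearn
from sklearn.linear_model import ElasticNet

# Illustrative toy data
X_demo = np.random.rand(20, 3)
y_demo = np.random.rand(20)

with mlflow.start_run() as run:
    model = ElasticNet(alpha=0.05, l1_ratio=0.05)
    model.fit(X_demo, y_demo)
    # Log the fitted model under the run's artifacts
    mlflow.sklearn.log_model(model, "model")

# Load the model back from the tracking store by run ID
loaded = mlflow.sklearn.load_model("runs:/%s/model" % run.info.run_id)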

Implementation

Let's actually implement it and track it on a notebook.

Sample dataset and model used

Use the scikit-learn diabetes dataset. The columns are explained here: https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset

We will train an ElasticNet linear regression model on it. Its tuning parameters are `alpha` and `l1_ratio`.

This explanation of Elastic Net (in Japanese) was easy to understand: https://aizine.ai/ridge-lasso-elasticnet/
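
For reference, the objective that scikit-learn's ElasticNet minimizes is the following (notation as in the scikit-learn documentation); l1_ratio = 1 corresponds to pure Lasso and l1_ratio = 0 to pure Ridge:

1 / (2 * n_samples) * ||y - Xw||^2_2
    + alpha * l1_ratio * ||w||_1
    + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2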

Setup

Import the required libraries, load the sample dataset, and create a dataframe.

# Load the required libraries
import os
import warnings
import sys

import pandas as pd
import numpy as np
from itertools import cycle
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets

# Import MLflow
import mlflow
import mlflow.sklearn

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target

# Create the dataframe (features plus the target column)
Y = np.array([y]).transpose()
d = np.concatenate((X, Y), axis=1)
cols = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'progression']
data = pd.DataFrame(d, columns=cols)
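
As an optional sanity check (display() is the Databricks notebook helper; the dataset should have 442 rows and 11 columns):

# Peek at the assembled dataframe
print(data.shape)
display(data.head())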

Implementation of the result-processing part

The function below plots the coefficient of each explanatory variable along the ElasticNet regularization path and saves the plot as an image on the driver node.

def plot_enet_descent_path(X, y, l1_ratio):
    # eps sets the path length (the ratio alpha_min / alpha_max)
    eps = 5e-3

    # Declare image as a global so the figure is accessible outside the function
    global image

    print("Computing regularization path using ElasticNet.")
    alphas_enet, coefs_enet, _ = enet_path(X, y, eps=eps, l1_ratio=l1_ratio, fit_intercept=False)

    # Plot the results
    fig = plt.figure(1)
    ax = plt.gca()

    colors = cycle(['b', 'r', 'g', 'c', 'k'])
    neg_log_alphas_enet = -np.log10(alphas_enet)
    for coef_e, c in zip(coefs_enet, colors):
        l1 = plt.plot(neg_log_alphas_enet, coef_e, linestyle='--', c=c)

    plt.xlabel('-Log(alpha)')
    plt.ylabel('coefficients')
    title = 'ElasticNet Path by alpha for l1_ratio = ' + str(l1_ratio)
    plt.title(title)
    plt.axis('tight')

    image = fig

    # Save the image on the driver node
    fig.savefig("ElasticNet-paths.png")

    # Close the plot
    plt.close(fig)

    # Return the figure
    return image
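
A standalone call looks like this (the l1_ratio value is just illustrative); it draws the path and writes ElasticNet-paths.png to the driver's current working directory:

# Quick standalone check of the plotting function
fig = plot_enet_descent_path(X, y, l1_ratio=0.5)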

Implementation of the experiment-processing part

Train the model with the specified `alpha` and `l1_ratio`. The function calls the `plot_enet_descent_path` defined above and saves the logs and the image to the tracking destination.

def train_diabetes(data, in_alpha, in_l1_ratio):
  # Evaluate metrics
  def eval_metrics(actual, pred):
      rmse = np.sqrt(mean_squared_error(actual, pred))
      mae = mean_absolute_error(actual, pred)
      r2 = r2_score(actual, pred)
      return rmse, mae, r2

  warnings.filterwarnings("ignore")
  np.random.seed(40)

  # Split the dataset into training and test sets
  train, test = train_test_split(data)

  # Separate the target label
  train_x = train.drop(["progression"], axis=1)
  test_x = test.drop(["progression"], axis=1)
  train_y = train[["progression"]]
  test_y = test[["progression"]]

  # Fall back to defaults when no parameter is given
  if in_alpha is None:
    alpha = 0.05
  else:
    alpha = float(in_alpha)

  if in_l1_ratio is None:
    l1_ratio = 0.05
  else:
    l1_ratio = float(in_l1_ratio)

  # MLflow tracking part
  # Calling mlflow.start_run() with no arguments ties the run to this notebook's experiment
  with mlflow.start_run():
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)

    predicted_qualities = lr.predict(test_x)

    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    # Print the ElasticNet model metrics
    print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
    print("  RMSE: %s" % rmse)
    print("  MAE: %s" % mae)
    print("  R2: %s" % r2)

    # Log parameters, metrics, and the model
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(lr, "model")
    modelpath = "/dbfs/mlflow/test_diabetes/model-%f-%f" % (alpha, l1_ratio)
    mlflow.sklearn.save_model(lr, modelpath)

    # Call plot_enet_descent_path
    image = plot_enet_descent_path(X, y, l1_ratio)

    # Log the output image as an artifact
    mlflow.log_artifact("ElasticNet-paths.png")

Experiment

Run experiments while varying the tuning parameters.

# Run an experiment with alpha = 0.01, l1_ratio = 0.01
train_diabetes(data, 0.01, 0.01)

The output result is as follows. [Screenshot: the metrics printed by the run]

The image is also produced. [Screenshot: the ElasticNet path plot]

Let's experiment with different parameters (I tried 4 combinations in total), for example with the loop sketched below.
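
One way to run several experiments in a row is a simple loop (the parameter values below are just examples; they give 4 runs in total):

# Illustrative parameter grid
for a in [0.01, 0.05]:
    for l1 in [0.01, 0.5]:
        train_diabetes(data, a, l1)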

When the experiments are done, click the part labeled [Runs] near the top right of the notebook. [Screenshot: the Runs panel in the notebook]

You can see that the parameter values and output metrics are recorded for each run. [Screenshot: the list of tracked runs with their parameters and metrics]
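
Besides the UI, the same tracking data can be pulled into a pandas DataFrame programmatically; a minimal sketch, assuming a reasonably recent MLflow version (mlflow.search_runs() with no arguments queries the active experiment):

runs = mlflow.search_runs()
# Parameter and metric columns are prefixed with "params." / "metrics."
print(runs[["params.alpha", "params.l1_ratio", "metrics.rmse", "metrics.r2"]])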

In conclusion

This time I wrote about how to track the results of model training in a notebook. Databricks lets you compare this experiment data in the UI. As a sequel, I would like to write about model management in the UI.
