In this article, I explain how to track experiments with MLflow, an open source platform for managing the lifecycle of machine learning models, on Databricks. (Python 3 is assumed.)
The open source MLflow is designed to work with any ML library, language, or deployment tool, but you'll need to provide your own server for tracking your experiments.
Under the Databricks environment, MLflow can be used as a managed service, so there is no need to prepare a separate tracking server. You can also integrate and manage experiment tracking information in your notebook.
This time, I will focus on how to integrate experiment information into a notebook and track it.
If your Databricks cluster uses Databricks Runtime ML, MLflow is preinstalled; otherwise you need to install it yourself.
dbutils.library.installPyPI("mlflow")
dbutils.library.restartPython()
You can install it with the commands above. Then import it.
import mlflow
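If you are not sure whether the cluster already provides MLflow, a quick check like the following can tell you before you run the install step. This is only a minimal sketch using the Python standard library; the helper name `is_installed` is my own and is not part of MLflow or Databricks.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Decide whether the install step is needed on this cluster.
if is_installed("mlflow"):
    print("mlflow is already available; no install needed")
else:
    print("mlflow was not found; install it first")
```

The check only looks the module up and does not import it, so it is cheap to run at the top of a notebook.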
In MLflow, you call one function to start a run, other functions to log the experiment's parameters and metrics, and a final function to end the run.
I recommend using a `with` statement so that you cannot forget to end the run.
The implementation looks roughly like this.
with mlflow.start_run():  # Start experiment tracking
    # Experiment processing
    # Log parameters, metrics, etc.
    mlflow.log_param("a", a)
    mlflow.log_metric("rmse", rmse)
    # Save the model
    mlflow.sklearn.save_model(b, modelpath)
    # Save an image produced during the experiment
    mlflow.log_artifact("sample.png")
In addition to parameters, model records and images output during experiments can also be saved in the tracking destination.
Recording a model requires the MLflow module that corresponds to the model's library, which may need to be installed or imported separately.
Example: scikit-learn model → `mlflow.sklearn`
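As a side note on the `with` recommendation above, the following sketch shows why a context manager is safer than calling start and end functions by hand. It does not use MLflow at all: `tracked_run` is a hypothetical stand-in for `mlflow.start_run()`, and the `events` list only exists to make the order of operations visible.

```python
from contextlib import contextmanager

events = []  # records what happened, in order


@contextmanager
def tracked_run():
    """Hypothetical stand-in for mlflow.start_run(); not the MLflow API."""
    events.append("start")
    try:
        yield
    finally:
        # Runs on normal exit and also when the body raises.
        events.append("end")


try:
    with tracked_run():
        events.append("log_param")
        raise ValueError("training failed")
except ValueError:
    pass

print(events)  # ['start', 'log_param', 'end']
```

Even though the body raised an exception, the "end" step still ran. The same idea applies to `with mlflow.start_run():`, where the run is closed when the block exits.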
Let's actually implement it and track it on a notebook.
Use the scikit-learn diabetes dataset. You can find the explanation of the columns here. https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset
We will build a linear regression model with ElasticNet.
It has two tuning parameters, `alpha` and `l1_ratio`.
The explanation here about Elastic Net was easy to understand. https://aizine.ai/ridge-lasso-elasticnet/
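To give a feel for what the two parameters do, here is a small sketch of the penalty term that scikit-learn documents for ElasticNet: `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The function name and the sample coefficients are my own, chosen only for illustration; this is not how scikit-learn implements the solver.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty added to the least-squares loss for a coefficient vector."""
    l1 = sum(abs(w) for w in coefs)         # L1 norm (Lasso part)
    l2_squared = sum(w * w for w in coefs)  # squared L2 norm (Ridge part)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_squared


coefs = [1.0, -2.0]  # hypothetical coefficients

# Mostly Ridge-like penalty, as in the experiment later in this article.
print(elastic_net_penalty(coefs, alpha=0.01, l1_ratio=0.01))
# l1_ratio = 1.0 keeps only the L1 (Lasso) term.
print(elastic_net_penalty(coefs, alpha=0.01, l1_ratio=1.0))
# l1_ratio = 0.0 keeps only the L2 (Ridge) term.
print(elastic_net_penalty(coefs, alpha=0.01, l1_ratio=0.0))
```

In short, `alpha` scales the overall strength of the regularization, and `l1_ratio` decides how it is split between the L1 and L2 terms.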
Import various libraries, load sample datasets, and create data frames.
#Loading the required libraries
import os
import warnings
import sys
import pandas as pd
import numpy as np
from itertools import cycle
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets
#import mlflow
import mlflow
import mlflow.sklearn
#Loading Diabetes Dataset
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
#Creating a data frame
Y = np.array([y]).transpose()
d = np.concatenate((X, Y), axis=1)
cols = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'progression']
data = pd.DataFrame(d, columns=cols)
Next, define a function that plots the coefficient of each explanatory variable along the ElasticNet regularization path and saves the plot as an image on the driver node.
def plot_enet_descent_path(X, y, l1_ratio):
    # Path length (alpha_min / alpha_max)
    eps = 5e-3

    # Declare the image as a global
    global image

    print("Computing regularization path using ElasticNet.")
    alphas_enet, coefs_enet, _ = enet_path(X, y, eps=eps, l1_ratio=l1_ratio, fit_intercept=False)

    # Plot the results
    fig = plt.figure(1)
    ax = plt.gca()
    colors = cycle(['b', 'r', 'g', 'c', 'k'])
    neg_log_alphas_enet = -np.log10(alphas_enet)
    for coef_e, c in zip(coefs_enet, colors):
        l1 = plt.plot(neg_log_alphas_enet, coef_e, linestyle='--', c=c)

    plt.xlabel('-Log(alpha)')
    plt.ylabel('coefficients')
    title = 'ElasticNet Path by alpha for l1_ratio = ' + str(l1_ratio)
    plt.title(title)
    plt.axis('tight')
    image = fig

    # Save the image
    fig.savefig("ElasticNet-paths.png")

    # Close the plot
    plt.close(fig)

    # Return the image
    return image
Train the model by specifying `alpha` and `l1_ratio`. The function also calls the `plot_enet_descent_path` defined above and saves the logs and the image to the tracking destination.
def train_diabetes(data, in_alpha, in_l1_ratio):
    # Evaluate metrics
    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mae, r2

    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # Split the dataset into training and test sets
    train, test = train_test_split(data)

    # Separate the target column from the features
    train_x = train.drop(["progression"], axis=1)
    test_x = test.drop(["progression"], axis=1)
    train_y = train[["progression"]]
    test_y = test[["progression"]]

    # Fall back to defaults when a parameter is not given
    # (check for None before converting; float(None) would raise an error)
    if in_alpha is None:
        alpha = 0.05
    else:
        alpha = float(in_alpha)
    if in_l1_ratio is None:
        l1_ratio = 0.05
    else:
        l1_ratio = float(in_l1_ratio)

    # MLflow part
    # Calling mlflow.start_run() with no arguments integrates the run into the notebook
    with mlflow.start_run():
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)
        predicted_qualities = lr.predict(test_x)
        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        # Print the ElasticNet model metrics
        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print(" RMSE: %s" % rmse)
        print(" MAE: %s" % mae)
        print(" R2: %s" % r2)

        # Log parameters, metrics, and the model
        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)
        mlflow.sklearn.log_model(lr, "model")
        modelpath = "/dbfs/mlflow/test_diabetes/model-%f-%f" % (alpha, l1_ratio)
        mlflow.sklearn.save_model(lr, modelpath)

        # Call plot_enet_descent_path
        image = plot_enet_descent_path(X, y, l1_ratio)

        # Log the saved image as an artifact
        mlflow.log_artifact("ElasticNet-paths.png")
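To make the three metrics returned by `eval_metrics` more concrete, here is a worked example in plain Python on a tiny made-up dataset. The function name `eval_metrics_plain` and the numbers are my own assumptions for illustration; in the notebook itself the scikit-learn functions do this work.

```python
import math


def eval_metrics_plain(actual, pred):
    """Compute RMSE, MAE and R2 by hand for two equal-length lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, pred)]

    # Root mean squared error: penalizes large errors more strongly.
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Mean absolute error: average size of the errors.
    mae = sum(abs(e) for e in errors) / n

    # R2: 1 minus (residual sum of squares / total sum of squares).
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Hypothetical values: the errors are 1, 0 and -2.
rmse, mae, r2 = eval_metrics_plain([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(rmse, mae, r2)  # roughly 1.29, 1.0, 0.375
```

Lower RMSE and MAE are better, while R2 is better the closer it gets to 1, which is worth keeping in mind when comparing runs.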
Run an experiment with specific values for the tuning parameters.
# Run an experiment with alpha = 0.01 and l1_ratio = 0.01
train_diabetes(data, 0.01, 0.01)
The output result is as follows.
I will output the image.
Let's experiment with different parameters. (I tried about 4 patterns in total)
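One way to run several patterns without repeating the call by hand is to loop over parameter combinations. The sketch below assumes a helper named `run_grid`, which is my own invention, and the parameter values are hypothetical since they are not the exact patterns I tried. A stand-in function is used so that the snippet runs anywhere; in the notebook you would pass `lambda a, r: train_diabetes(data, a, r)` instead.

```python
from itertools import product


def run_grid(train_fn, alphas, l1_ratios):
    """Call train_fn(alpha, l1_ratio) once per combination and return the combinations."""
    combos = list(product(alphas, l1_ratios))
    for alpha, l1_ratio in combos:
        train_fn(alpha, l1_ratio)
    return combos


# Stand-in trainer that only records its arguments (hypothetical values).
calls = []
combos = run_grid(lambda a, r: calls.append((a, r)), [0.01, 1.0], [0.01, 1.0])
print(len(combos), "runs:", combos)
```

Because `train_diabetes` opens its own `with mlflow.start_run():` block, each call made this way would be recorded as a separate run.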
When the experiments are finished, click [Runs] near the top right of the notebook.
You can see that the values of the set parameters and the output metrics are recorded for each experiment.
This time I wrote about how to integrate and track the results of model training in a notebook. Databricks also lets you compare this experiment data in the UI. As a follow-up, I would like to write about model management in the UI.