Machine learning experiments are hard to manage because you must track not only the code used to train a model but also the dataset, the artifacts generated by preprocessing, the model itself, and so on, all as a single set. Proper experiment management also matters when you take code that worked in the experimental stage to a production environment and need to reproduce similar prediction results.
MLflow is the best-known tool for machine learning experiment management, but I came across another experiment management tool called ClearML (formerly Allegro Trains), so in this article I will walk through the basics of using it.
ClearML: https://github.com/allegroai/clearml (Apache-2.0 License)
Official documentation: https://allegro.ai/clearml/docs/index.html#
The following articles are also very helpful on the concepts behind experiment management: "Thinking about experiment management" and "Re: ML life starting from zero".
ClearML is a tool that provides machine learning experiment management and MLOps functions. It takes over the time-consuming and error-prone tasks of tracking environments and versions across the machine learning life cycle.
ClearML has the following three main functions.
--Experiment management: automatic tracking of experiments, including the environment and training results
--MLOps: orchestration and automation for running machine learning workloads
--Data management: versioning of data and artifacts
In this article, I will mainly cover experiment management among these three functions, and briefly describe the ClearML architecture at the end.
I tried ClearML and confirmed that it can manage the following information for experiment management.
--Code version: the commit ID of the code used for training and the versions of the libraries are captured as logs
--Data version: intermediate outputs and models can be managed
--Hyperparameters: Python argparse parameters are captured automatically as logs
--Metrics: common metrics such as loss, accuracy, and confusion matrices can be recorded
--Environment: the working directory of the machine used for training is captured as a log
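As a minimal sketch of what enabling all of this looks like in code, automatic logging starts with a single Task.init call at the top of the training script (the project and task names below are the ones used later in this article):

from clearml import Task

# Two lines are enough to turn on automatic logging: everything after this point
# (argparse parameters, framework calls, console output) is captured by ClearML.
task = Task.init(project_name='examples', task_name='pytorch mnist train')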
Note that newer versions of ClearML may not behave exactly as described in this article.
This article uses the free, externally hosted ClearML server. The setup follows this document: https://allegro.ai/clearml/docs/docs/getting_started/getting_started_clearml_hosted_service.html
It is also possible to set up your own ClearML server on-premises or on AWS or GCP, so if you have security requirements, you can follow the procedure in the documentation below. https://allegro.ai/clearml/docs/rst/deploying_clearml/index.html
--Sign up at the following site to register an account.
--You can register with a Google, Bitbucket, or GitHub account.
https://app.community.clear.ml/login?redirect=%2F
--Enter your name, email, interests, etc. and click "SIGN UP" to register your account.
--Run the following command to install clearml.
pip install clearml
--Execute the following command to start the ClearML setup wizard.
clearml-init
--A message will ask you to create account credentials. To get them, click User Account > Profile in the upper-right corner of the hosted service's web UI.
--Click Create new credentials> Copy to clipboard.
--When you paste the copied credentials into the terminal, a message confirms they were detected, as shown below.
Detected credentials key="********************" secret="*******"
--Specify the URL of the web server. Press Enter to accept the default.
WEB Host configured to: [https://app.community.clear.ml]
--Next, specify the URL of the API server. Keep the defaults and press Enter.
API Host configured to: [https://api.community.clear.ml]
--The following message will be displayed, and the setup is complete.
CLEARML Hosts configuration:
Web App: https://app.community.clear.ml
API: https://api.community.clear.ml
File Store: https://files.community.clear.ml
Verifying credentials ...
Credentials verified!
New configuration stored in /home/<username>/clearml.conf
CLEARML setup completed successfully.
--The ClearML repository includes tutorial code, so clone it.
cd ~
git clone https://github.com/allegroai/clearml.git
cd ~/clearml/examples/frameworks/pytorch
pip install -r requirements.txt
pip install pandas scikit-learn
--The Reporting Tutorial uses a script called pytorch_mnist.py, so copy it and rename the copy.
cp pytorch_mnist.py pytorch_mnist_tutorial.py
--The output directory for model checkpoints can be set by passing output_uri to Task.init.
--Change the following line:
task = Task.init(project_name='examples', task_name='pytorch mnist train')
--With the following changes, checkpoints will be saved under ./clearml.
model_snapshots_path = './clearml'
if not os.path.exists(model_snapshots_path):
    os.makedirs(model_snapshots_path)

task = Task.init(project_name='examples',
                 task_name='extending automagical ClearML example',
                 output_uri=model_snapshots_path)
--When you run the script, ClearML will create the following directory structure.
+ <output destination name>
  +-- <project name>
       +-- <task name>.<Task ID>
            +-- models
            +-- artifacts
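Nothing ClearML-specific is needed at save time. As a sketch mirroring the save call in the tutorial script, an ordinary torch.save is intercepted by ClearML's auto-logging and the checkpoint is stored under the structure above:

# A plain PyTorch checkpoint save; because output_uri was passed to Task.init,
# ClearML hooks this call and uploads the file automatically.
# (model here is the trained network from the tutorial script.)
torch.save(model.state_dict(), "mnist_cnn.pt")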
In addition to its automatic logging, ClearML also supports explicit reporting of plots, log text, tables, and so on. https://allegro.ai/clearml/docs/docs/tutorials/tutorial_explicit_reporting.html#step-2-logger-class-reporting-methods
--The logger can be obtained from Task as follows.
logger = task.get_logger()
or
logger = Logger.current_logger()
--Use the Logger.report_scalar method to log scalar metrics, as follows:
def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            Logger.current_logger().report_scalar(
                "train", "loss", iteration=(epoch * len(train_loader) + batch_idx), value=loss.item())
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
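As a side note on how the report_scalar arguments map to the UI (my reading, not from the tutorial): the first argument is the plot title and the second is the series name, so scalars that share a title are drawn as separate lines on one plot. A minimal sketch, assuming a task has already been initialized with Task.init:

from clearml import Logger

logger = Logger.current_logger()
for step in range(3):
    # "loss" is the plot title; "train" and "test" become two series (lines)
    # on that same plot in the RESULTS tab.
    logger.report_scalar("loss", "train", iteration=step, value=1.0 / (step + 1))
    logger.report_scalar("loss", "test", iteration=step, value=1.5 / (step + 1))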
--Metrics other than scalar values, such as histograms and confusion matrices, can be implemented as follows.
def test(args, model, device, test_loader, epoch):
    save_test_loss = []
    save_correct = []
    preds = []
    targets = []
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
            preds.append(pred.cpu().detach().numpy())
            targets.append(target.cpu().detach().numpy())
            save_test_loss.append(test_loss)
            save_correct.append(correct)

    test_loss /= len(test_loader.dataset)

    Logger.current_logger().report_scalar(
        "test", "loss", iteration=epoch, value=test_loss)
    Logger.current_logger().report_scalar(
        "test", "accuracy", iteration=epoch, value=(correct / len(test_loader.dataset)))

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

    preds = np.concatenate(preds)
    targets = np.concatenate(targets)
    matrix = confusion_matrix(targets, preds)  # use confusion_matrix from scikit-learn
    Logger.current_logger().report_confusion_matrix(
        title='Confusion matrix example',
        series='Test loss / correct', matrix=matrix, iteration=1,
        xaxis='correct', yaxis='pred', yaxis_reversed=True)
    Logger.current_logger().report_histogram(
        title='Histogram example', series='correct',
        iteration=1, values=save_correct, xaxis='Test', yaxis='Correct')
--You can also use Logger.report_text to output a text message at the severity given by the level argument.
Logger.current_logger().report_text('The default output destination for model snapshots and artifacts is: {}'.format(model_snapshots_path), level=logging.DEBUG)
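For illustration, the level argument accepts the standard Python logging levels. A small sketch of my own, assuming a task is already initialized:

import logging

from clearml import Logger

# Each message is recorded in the task's console log with the given severity.
Logger.current_logger().report_text('Starting evaluation', level=logging.INFO)
Logger.current_logger().report_text('Accuracy below threshold', level=logging.WARNING)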
Artifacts produced while a script runs can also be registered with ClearML and uploaded to the ClearML Server. If a registered artifact changes, the ClearML Server logs the change. Note, however, that as of December 29, 2020, only Pandas DataFrames are supported for registration. https://allegro.ai/clearml/docs/docs/tutorials/tutorial_explicit_reporting.html#step-3-registering-artifacts
--To register an artifact, add the following code to the test method.
# Create the Pandas DataFrame
test_loss_correct = {
    'test lost': save_test_loss,
    'correct': save_correct
}
df = pd.DataFrame(test_loss_correct, columns=['test lost', 'correct'])

# Register the test loss and correct counts as a Pandas DataFrame artifact
Task.current_task().register_artifact(
    'Test_Loss_Correct', df,
    metadata={'metadata string': 'apple', 'metadata int': 100,
              'metadata dict': {'dict string': 'pear', 'dict int': 200}})
--A registered artifact can be accessed from Python code as follows and used in later processing.
# Once the artifact is registered, we can get it and work with it. Here, we sample it.
sample = Task.current_task().get_registered_artifacts()['Test_Loss_Correct'].sample(
    frac=0.5, replace=True, random_state=1)
You can upload artifacts generated by a script to ClearML with the Task.upload_artifact method. Unlike the registration above, however, uploaded artifacts are not tracked for changes.
--Add the following code to the test method to upload the prediction results.
# Upload test loss as an artifact. Here, the artifact is a numpy array
Task.current_task().upload_artifact(
    'Predictions', artifact_object=np.array(save_test_loss),
    metadata={'metadata string': 'banana', 'metadata integer': 300,
              'metadata dictionary': {'dict string': 'orange', 'dict int': 400}})
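As a sketch of the other side of this workflow (my own addition, using the task ID from the run below as a placeholder), an uploaded artifact can later be retrieved from a different process:

from clearml import Task

# Fetch a finished task by its ID (placeholder) and download its artifact.
task = Task.get_task(task_id='13e46b70da274fa085e772ed700df028')
predictions = task.artifacts['Predictions'].get()
print(predictions)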
--Run the script with the following command. As it runs, ClearML logs and the model's training loss are printed.
python3 pytorch_mnist_tutorial.py
--In this case, the model is saved as follows.
ls clearml/examples/extending\ automagical\ ClearML\ example.13e46b70da274fa085e772ed700df028/models/
mnist_cnn.pt test.pt training.pt
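To reuse the saved checkpoint, it can be loaded back as an ordinary state_dict. A minimal sketch, assuming the Net model class from the tutorial script is importable (or its definition copied) and using a shortened placeholder for the path listed above:

import torch

from pytorch_mnist_tutorial import Net  # the CNN class defined in the tutorial script

# Load the checkpoint saved during the run (path shortened for readability).
model = Net()
model.load_state_dict(torch.load('clearml/examples/<task name>.<Task ID>/models/mnist_cnn.pt'))
model.eval()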
--You can check the training results in the web UI. Since project_name='examples' is passed to Task.init, click the examples project in the web UI.
--Since task_name='extending automagical ClearML example' is passed to Task.init, click the experiment entry shown as extending automagical ClearML example, which corresponds to this training run.
--In the experiment's EXECUTION tab, you can check information about the source code used for training. The file name of the executed script and the commit ID are logged so that the experiment can be reproduced.
--Under CONFIGURATION, you can check the logged training hyperparameters. I found this useful because there is no need to add custom logging code for hyperparameters.
--Under ARTIFACTS, you can check the output models and other artifacts.
--Under RESULTS, you can see the scalar logs and plots. The plots of loss and accuracy during training look as follows.
--In addition, the plot of the confusion matrix is as follows.
That's it for the Reporting Tutorial, which covers logging metrics and artifacts.
ClearML consists of the following components.
Quote: https://allegro.ai/clearml/docs/rst/architecture/index.html
The ClearML Server shown above is the free hosted one used in this article. As noted earlier, you can also run your own ClearML Server on-premises or on a cloud such as AWS or GCP.
Another advantage is that the same few lines of code work both in the DATA SCIENTIST ENVIRONMENT and on the GPU MACHINES (on-premises or cloud) shown on the left of the figure above.
--Easy to use: just pip install and a few lines of code
--Easy to get started with the free hosted service
--Can be used both on-premises and in the cloud by setting up your own server
--A rich set of examples: https://github.com/allegroai/clearml/tree/master/examples
--Supports many frameworks such as PyTorch, PyTorch Lightning, TensorFlow, Keras, and AutoKeras: https://allegro.ai/clearml/docs/rst/integrations/index.html
--The web UI looks polished
--Also offers MLOps-style functions, such as iterative hyperparameter tuning
The author pays close attention to the content and functionality described in this article, but does not guarantee that it is accurate or safe. The author and the organization the author belongs to (NS Solutions Corporation) accept no responsibility for any inconvenience or damage caused to the user by using the contents of this article.