How to improve model metric monitoring in Amazon SageMaker

This article is the Day 9 entry of the mixi Group Advent Calendar 2019.

Overview

(Three-line summary)

Challenge

Amazon SageMaker provides CloudWatch Metrics-based charts for monitoring training job metrics (https://aws.amazon.com/jp/blogs/news/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/), and they now also appear in the job details page of the management console. They are easy to set up, but personally I find them hard to use for monitoring algorithm metrics: the granularity of the logged points (output frequency), the scale, and the unit notation are all difficult to control.

Action 1: Use the SageMaker SDK API (Python)

Use the SageMaker SDK's TrainingJobAnalytics (https://sagemaker.readthedocs.io/en/stable/analytics.html#sagemaker.analytics.TrainingJobAnalytics) to fetch the data and control the drawing yourself. The data source is still CloudWatch, so the underlying limitation is not resolved, but **readability can be significantly improved**.

You can draw the graph in a Jupyter Notebook while training is running, or draw it from the code that calls the Estimator, either at regular intervals or when training ends, and save it wherever you like.

analytics.py



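import sagemaker.analytics

# training_job_name is assumed to be defined earlier (the job to inspect)
metric_names = ['train:loss', 'validation:loss']

metrics_dataframe = sagemaker.analytics.TrainingJobAnalytics(
    training_job_name=training_job_name,
    metric_names=metric_names,
    period=60,  # in seconds; 1 minute is the lower limit
).dataframe()

# Format the dataframe
...

# metrics_dataframe_fixed is the result of the formatting step elided above
ax = metrics_dataframe_fixed.plot(
    kind='line',
    figsize=(20, 15),
    fontsize=18,
    x='timestamp',
    y=[metric_names[0], metric_names[1]],
    xlim=[0, 2000],
    ylim=[0.1, 0.5],
    style=['b.-', 'r+-'],
    rot=45,
)
# DataFrame.plot returns a matplotlib Axes, so save and clear through its figure
ax.figure.savefig('metrics_training_job_xxx.png')
ax.figure.clf()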
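
The formatting step elided in the code above depends on how the data arrives. As a rough sketch, assuming `dataframe()` yields one row per sample with `timestamp`, `metric_name`, and `value` columns (an assumption about the SDK's output that is worth checking against your version), the rows have to be pivoted so that each metric becomes its own column before `y=[...]` can select them. The helper name `pivot_metric_records` is hypothetical, and it uses only the standard library to keep the reshaping explicit.

```python
from collections import defaultdict


def pivot_metric_records(records):
    """Turn long-format rows into one row per timestamp with a column per metric.

    `records` is an iterable of dicts with 'timestamp', 'metric_name' and
    'value' keys (assumed layout, not confirmed by the article).
    """
    by_timestamp = defaultdict(dict)
    for record in records:
        by_timestamp[record['timestamp']][record['metric_name']] = record['value']
    # Sort by timestamp so that a plotted line runs left to right.
    return [
        {'timestamp': timestamp, **values}
        for timestamp, values in sorted(by_timestamp.items())
    ]


if __name__ == '__main__':
    long_rows = [
        {'timestamp': 0.0, 'metric_name': 'train:loss', 'value': 0.45},
        {'timestamp': 0.0, 'metric_name': 'validation:loss', 'value': 0.48},
        {'timestamp': 60.0, 'metric_name': 'train:loss', 'value': 0.31},
    ]
    for row in pivot_metric_records(long_rows):
        print(row)
```

With pandas available, something like `pandas.DataFrame(pivot_metric_records(metrics_dataframe.to_dict('records')))` would give a wide dataframe, and `DataFrame.pivot_table` can perform the same reshaping directly.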

What you can do

This method also works with SageMaker built-in algorithms.

What you cannot do

Action 2: Draw a graph in your own entry point or your own algorithm

SageMaker officially supports four ways of running training. When you use an ML framework container provided by Amazon, or [your own container](https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/your-algorithms.html), your own program code can periodically draw a graph of the training progress and send it to S3.

It feels a little wrong to use SageMaker without using the monitoring features it provides, but if you cannot get the output in the format you want, you have to produce it inside the container. (Since you write the entry point script or the ML algorithm yourself, adding graph drawing to code you already know is not much effort.)

What you can do

What you cannot do

Process flow

Deciding where to send the graph drawn in the container, and how to share that destination (the S3 path) between the inside and the outside of the container, is surprisingly tricky. The following flow is one way to do it.

  1. Define an ML model output destination on S3 for each training job
  2. Define a conditions location under the same destination and put a JSON file with information about the model there
  3. Define a metrics location under the same destination and use it for metrics data and drawn graphs
  4. Pass the conditions path to the Estimator as `inputs` and start training
  5. Run the training algorithm inside the SageMaker container
  6. In the training job, read the JSON in conditions and assemble the model output destination path
  7. Run training, output the progress log, and draw a graph
  8. Upload the drawn graph data (a snapshot at that point) to the metrics location on S3
  9. Process the graph uploaded to S3 as you like
  10. Monitor continuously, send notifications, and so on
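
Steps 1 to 4 above can be sketched as a small path helper. This is only an illustration: the function names `training_job_layout` and `conditions_body`, the dictionary keys, and the example bucket and job names are hypothetical, while the `model/{training_job_name}/conditions` and `model/{training_job_name}/metrics` layout follows the paths used in this article.

```python
import json


def training_job_layout(bucket, training_job_name):
    """Return the S3 locations used for one training job (hypothetical helper)."""
    prefix = 'model/{}'.format(training_job_name)
    conditions_key = '{}/conditions/training_job_config.json'.format(prefix)
    return {
        'model_prefix': prefix,
        'conditions_key': conditions_key,
        'metrics_prefix': '{}/metrics'.format(prefix),
        # Estimator inputs take an s3:// URI rather than a bare object key.
        'conditions_uri': 's3://{}/{}'.format(bucket, conditions_key),
    }


def conditions_body(training_job_name):
    """Serialize the JSON document that is placed under conditions."""
    return json.dumps({'training_job_name': training_job_name})


if __name__ == '__main__':
    layout = training_job_layout('my-bucket', 'job-2019-12-09')
    for name, location in layout.items():
        print(name, location)
    print(conditions_body('job-2019-12-09'))
```

Keeping every location derived from the training job name in one place means the caller and the container only need to share that single name.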

Code example

train_task.py



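import json

import boto3

# bucket, training_job_name, s3_train_data_path and estimator are assumed to be defined earlier

# Generate the conditions file and record training_job_name in it
dict_conditions = {'training_job_name': training_job_name}
# No leading slash: it would become part of the object key
s3_conditions_key = 'model/{}/conditions/training_job_config.json'.format(training_job_name)
boto3.resource('s3').Object(bucket, s3_conditions_key).put(Body=json.dumps(dict_conditions))

# Hand the conditions over to the SageMaker training job (inputs take s3:// URIs)
s3_conditions_path = 's3://{}/{}'.format(bucket, s3_conditions_key)
estimator.fit(
    job_name=training_job_name,
    inputs={'train_data': s3_train_data_path, 'conditions': s3_conditions_path},
)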

train_entrypoint.py



import json

import boto3
import matplotlib
matplotlib.use('Agg')  # no display is available inside the training container
import matplotlib.pyplot as plt

# model, metrics, output_path and bucket are assumed to be defined elsewhere in this script

# Get the training job name from the conditions passed by the Estimator.fit caller
# (each key of the inputs dict becomes a directory under /opt/ml/input/data, and the file is placed there)
input_conditions = '/opt/ml/input/data/conditions/training_job_config.json'
with open(input_conditions) as f:
    conditions = json.load(f)
training_job_name = conditions['training_job_name']

# Define the graph paths
graph_name = 'training_history_{}.png'.format(metrics)
graph_outpath = '{}/{}'.format(output_path, graph_name)
s3_graph_outpath = 'model/{}/metrics/{}'.format(training_job_name, graph_name)

# Draw and save the graph (Keras example)
history = model.fit(...)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.legend(['training', 'validation'], loc='upper right')
plt.savefig(graph_outpath)
plt.clf()

# Send the graph to the S3 bucket that stores the training job results (overwritten on each update)
boto3.resource('s3').Bucket(bucket).upload_file(graph_outpath, s3_graph_outpath)

As the code shows, the part that calls the SageMaker Estimator passes training_job_name to the job as JSON, and from that shared information the training job assembles the metric output destination in a fixed format (s3://{bucket}/model/{training_job_name}/metrics/{graph_name}).
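
The entry point above draws the graph once, after `model.fit` returns. To refresh it during training, one option is to collect the per-epoch values and publish a snapshot every few epochs. The sketch below is framework-agnostic and hypothetical: the class name `PeriodicReporter`, its parameters, and the `publish` callable (which would wrap the drawing and S3 upload steps) are assumptions, not part of SageMaker or Keras.

```python
class PeriodicReporter:
    """Collect per-epoch metrics and call `publish` every `every_n_epochs` epochs."""

    def __init__(self, publish, every_n_epochs=5):
        if every_n_epochs < 1:
            raise ValueError('every_n_epochs must be at least 1')
        self.publish = publish
        self.every_n_epochs = every_n_epochs
        self.history = {'loss': [], 'val_loss': []}

    def on_epoch_end(self, epoch, loss, val_loss):
        """Record one epoch (0-based index); return True if a snapshot was published."""
        self.history['loss'].append(loss)
        self.history['val_loss'].append(val_loss)
        if (epoch + 1) % self.every_n_epochs == 0:
            self.publish(self.history)
            return True
        return False


if __name__ == '__main__':
    def print_snapshot(history):
        # Stand-in for "draw the graph and upload it to S3".
        print('publish after', len(history['loss']), 'epochs')

    reporter = PeriodicReporter(print_snapshot, every_n_epochs=2)
    losses = [(0.5, 0.6), (0.4, 0.5), (0.3, 0.45)]
    for epoch, (loss, val_loss) in enumerate(losses):
        reporter.on_epoch_end(epoch, loss, val_loss)
```

In Keras, the same check could live in the `on_epoch_end` method of a custom `keras.callbacks.Callback`, with `publish` overwriting the same S3 object each time so that the graph at the fixed path always shows the latest state.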

Digression: Draw with TensorBoard

In Action 2 you are free to write whatever code you like, so you can also output logs for TensorBoard, sync them to a designated S3 bucket, and draw the graphs by pointing a TensorBoard instance (launched on a Notebook Instance, for example) at the logs on S3.

train_entrypoint_keras.py



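import os

import boto3

# keras, model, output_path, tensorboard_log_name, bucket and
# s3_tensorboard_log_outpath (key prefix in the bucket) are assumed to be defined earlier
tensorboard_log_outpath = '{}/{}'.format(output_path, tensorboard_log_name)

tensorboard_callback = keras.callbacks.TensorBoard(
    log_dir=tensorboard_log_outpath,
    histogram_freq=1)
callbacks = [tensorboard_callback]

model.fit(..., callbacks=callbacks)

# upload_file handles a single file, so walk the log directory and upload each file
s3_bucket = boto3.resource('s3').Bucket(bucket)
for root, _, files in os.walk(tensorboard_log_outpath):
    for name in files:
        local_path = os.path.join(root, name)
        rel_path = os.path.relpath(local_path, tensorboard_log_outpath)
        s3_bucket.upload_file(
            local_path, '{}/{}'.format(s3_tensorboard_log_outpath, rel_path))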

notebook.py


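# Run in a notebook cell; s3_tensorboard_log_outpath is assumed to hold the s3:// URI of the uploaded logs
!tensorboard --logdir={s3_tensorboard_log_outpath}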

You can also draw with any other tool you like, but I think it is best to choose a well-balanced approach that takes the management cost into account.

Summary

As mentioned several times along the way, the options available to you depend on how you use SageMaker.

For both Action 1 and Action 2, I think management becomes easier if you **upload the metric data (dataframes and logs) and the drawn graph images to the same S3 location as the model storage area**.

It is also worth organizing the metrics you want to compare under the same definitions so that they can be judged at a glance.
