Hello, this is Ninomiya of LIFULL CO., LTD.

In a machine learning project, once the analysis and the evaluation of model accuracy have succeeded, the model still has to be integrated into existing systems. At that stage, our team found it difficult to divide roles with the engineers in charge of implementation.

Aiming for a state where "if a data scientist builds it in this format, it can be incorporated easily!", we wrapped Amazon SageMaker and prepared a development format and tools that are general-purpose enough to be reused across projects.

What is Amazon SageMaker?

Amazon SageMaker provides all developers and data scientists with the means to build, train, and deploy machine learning models. It is a fully managed service that covers the entire machine learning workflow: label and prepare your data, select an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action. You can put your model into production with less effort and cost.

If you prepare a Docker image that meets certain specifications, you can use SageMaker's main functions, such as training jobs, inference endpoints, and batch transform jobs.

For the specifications on preparing your own Docker image for SageMaker, read the official docs and the article by @taniyam (who is on the same team as me).

Machine learning project format

First, we asked data scientists to prepare the following directory structure.

.
├── README.md
├── Dockerfile
├── config.yml
├── pyproject.toml (poetry config file)
├── script
│   └── __init__.py
└── tests
    └── __init__.py

The main processing is written in script/__init__.py, which looks like the following. simple_sagemaker_manager is the library we prepared.

import pandas as pd
from typing import List
from pathlib import Path
from sklearn import tree
from simple_sagemaker_manager.image_utils import AbstractModel


def train(training_path: Path) -> AbstractModel:
    """Train the model.

    Args:
        training_path (Path): Directory containing the CSV files

    Returns:
        Model: Model object that inherits from AbstractModel

    """
    train_data = pd.concat([pd.read_csv(fname, header=None) for fname in training_path.iterdir()])
    train_y = train_data.iloc[:, 0]
    train_X = train_data.iloc[:, 1:] 

    # Now use scikit-learn's decision tree classifier to train the model.
    clf = tree.DecisionTreeClassifier(max_leaf_nodes=None)
    clf = clf.fit(train_X, train_y)
    return Model(clf)


class Model(AbstractModel):
    """Serialization is implemented in AbstractModel.
    """

    def predict(self, matrix: List[List[float]]) -> List[List[str]]:
        """Run inference.

        Args:
            matrix (List[List[float]]): Tabular data

        Returns:
            list: Inference result

        """
        # The value returned here becomes the response of the inference API.
        return [[x] for x in self.model.predict(pd.DataFrame(matrix))]

AbstractModel is defined as follows. When the training batch runs, the result of calling the save method (the model serialized with pickle) is stored as the model and saved to S3 by the SageMaker system. The serialization method can be switched by overriding save and load.

import pickle
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class AbstractModel(ABC):
    model: object

    @classmethod
    def load(cls, model_path):
        # Load the model during inference
        with open(model_path / 'model.pkl', 'rb') as f:
            model = pickle.load(f)
        return cls(model)

    def save(self, model_path):
        # Save the model during the training batch
        with open(model_path / 'model.pkl', 'wb') as f:
            pickle.dump(self.model, f)

    @abstractmethod
    def predict(self, json):
        pass
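
To illustrate switching the serialization method, here is a minimal sketch that overrides save and load to store the model as JSON instead of pickle. ThresholdModel, round_trip, and the model.json file name are hypothetical names made up for this example, and the base class is a trimmed copy of the definition above so that the sketch runs on its own.

```python
import json
import pickle
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AbstractModel(ABC):
    # Trimmed copy of the base class above, repeated so this sketch is standalone.
    model: object

    @classmethod
    def load(cls, model_path):
        with open(model_path / 'model.pkl', 'rb') as f:
            return cls(pickle.load(f))

    def save(self, model_path):
        with open(model_path / 'model.pkl', 'wb') as f:
            pickle.dump(self.model, f)

    @abstractmethod
    def predict(self, matrix):
        pass


class ThresholdModel(AbstractModel):
    """Hypothetical model whose state is a plain dict, stored as JSON."""

    @classmethod
    def load(cls, model_path):
        # Overrides the pickle-based load with a JSON-based one.
        with open(model_path / 'model.json') as f:
            return cls(json.load(f))

    def save(self, model_path):
        # Overrides the pickle-based save with a JSON-based one.
        with open(model_path / 'model.json', 'w') as f:
            json.dump(self.model, f)

    def predict(self, matrix):
        threshold = self.model['threshold']
        return [['high' if row[0] > threshold else 'low'] for row in matrix]


def round_trip(model, model_dir):
    """Save the model to model_dir and load it back with the same class."""
    model.save(model_dir)
    return type(model).load(model_dir)


with tempfile.TemporaryDirectory() as tmp:
    restored = round_trip(ThresholdModel({'threshold': 0.5}), Path(tmp))
    print(restored.predict([[0.2], [0.9]]))  # [['low'], ['high']]
```

Because load and save are always called as a pair, only these two methods need to change; the code that calls them stays the same.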

The tooling is operated through a CLI, modeled on projects such as Python's poetry. The development flow for a SageMaker Docker image is as follows.

Create a project template (smcli new <project name>)
Edit the template
Build the image (smcli build)
Push the image to ECR (smcli push)
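
As a rough illustration of what the build and push steps amount to, the sketch below assembles the standard Docker commands for building an image and pushing it to ECR. This is not the actual smcli implementation: build_command, push_commands, and their parameters are hypothetical, and the account ID in the usage example is a placeholder.

```python
from typing import List


def build_command(image_name: str, tag: str = 'latest', context: str = '.') -> List[str]:
    """Return the docker command that builds the image from the project directory."""
    return ['docker', 'build', '-t', f'{image_name}:{tag}', context]


def push_commands(image_name: str, account_id: str, region: str,
                  tag: str = 'latest') -> List[List[str]]:
    """Return the docker commands that tag the local image and push it to ECR.

    Logging in to ECR with docker beforehand is assumed and not shown here.
    """
    remote = f'{account_id}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}'
    return [
        ['docker', 'tag', f'{image_name}:{tag}', remote],
        ['docker', 'push', remote],
    ]


print(build_command('decision-trees-sample'))
for command in push_commands('decision-trees-sample', '123456789012', 'ap-northeast-1'):
    print(command)
```

The functions only return argument lists, so they can be inspected or passed to subprocess.run when you actually want to execute them.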

I also made the Dockerfile editable. Some machine learning libraries can only be installed with Anaconda, and I received requests to replace the base image with something other than the official Python 3 image.

SageMaker execution management

Calling boto3 directly is cumbersome, so I also prepared a wrapper library. SageMaker has many operations, but in most projects there are only three things we want to do: train a model, and then either set up an inference API or run a batch transform job. The interface is designed to make these easy to understand.

from simple_sagemaker_manager.executor import SageMakerExecutor
from simple_sagemaker_manager.executor.classes import TrainInstance, TrainSpotInstance, Image


client = SageMakerExecutor()

# Train on a regular (on-demand) instance
model = client.execute_batch_training(
    instance=TrainInstance(
        instance_type='ml.m4.xlarge',
        instance_count=1,
        volume_size_in_gb=10,
        max_run=100
    ),
    image=Image(
        name="decision-trees-sample",
        uri="xxxxxxxxxx.dkr.ecr.ap-northeast-1.amazonaws.com/decision-trees-sample:latest"
    ),
    input_path="s3://xxxxxxxxxx/DEMO-scikit-byo-iris",
    output_path="s3://xxxxxxxxxx/output",
    role="arn:aws:iam::xxxxxxxxxx"
)


# Train on Spot Instances
model = client.execute_batch_training(
    instance=TrainSpotInstance(
        instance_type='ml.m4.xlarge',
        instance_count=1,
        volume_size_in_gb=10,
        max_run=100,
        max_wait=1000
    ),
    image=Image(
        name="decision-trees-sample",
        uri="xxxxxxxxxx.dkr.ecr.ap-northeast-1.amazonaws.com/decision-trees-sample:latest"
    ),
    input_path="s3://xxxxxxxxxx/DEMO-scikit-byo-iris",
    output_path="s3://xxxxxxxxxxx/output",
    role="arn:aws:iam::xxxxxxxxxxxxx"
)
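
For comparison, here is a minimal sketch of the request that boto3's create_training_job expects for the same settings. It is not the wrapper's actual implementation. build_training_request is a hypothetical helper, and the mapping of max_run and max_wait to MaxRuntimeInSeconds and MaxWaitTimeInSeconds, the channel name 'training', and the 'File' input mode are my assumptions for the example.

```python
from typing import Any, Dict, Optional


def build_training_request(
    job_name: str,
    image_uri: str,
    role: str,
    input_path: str,
    output_path: str,
    instance_type: str = 'ml.m4.xlarge',
    instance_count: int = 1,
    volume_size_in_gb: int = 10,
    max_run: int = 100,
    max_wait: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the SageMaker create_training_job call."""
    stopping = {'MaxRuntimeInSeconds': max_run}
    # Assumption: passing max_wait means "use managed Spot training".
    use_spot = max_wait is not None
    if use_spot:
        # For Spot training, the wait time must be at least the run time.
        stopping['MaxWaitTimeInSeconds'] = max_wait
    return {
        'TrainingJobName': job_name,
        'AlgorithmSpecification': {
            'TrainingImage': image_uri,
            'TrainingInputMode': 'File',
        },
        'RoleArn': role,
        'InputDataConfig': [{
            'ChannelName': 'training',  # assumed channel name
            'DataSource': {
                'S3DataSource': {
                    'S3DataType': 'S3Prefix',
                    'S3Uri': input_path,
                    'S3DataDistributionType': 'FullyReplicated',
                }
            },
        }],
        'OutputDataConfig': {'S3OutputPath': output_path},
        'ResourceConfig': {
            'InstanceType': instance_type,
            'InstanceCount': instance_count,
            'VolumeSizeInGB': volume_size_in_gb,
        },
        'StoppingCondition': stopping,
        'EnableManagedSpotTraining': use_spot,
    }
```

The resulting dict could be passed as boto3.client('sagemaker').create_training_job(**request). Hiding this nesting behind a few small classes is the main thing the wrapper buys us.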

The inference API is created as follows. The points I paid attention to are:

If an endpoint with the specified name does not exist, a new endpoint is created.
If it exists, it is updated. Requests are still accepted while the endpoint is in the Updating state.
Models are passed as a list. If multiple models are specified, a pipeline model is created and then deployed.

from simple_sagemaker_manager.executor import SageMakerExecutor
from simple_sagemaker_manager.executor.classes import EndpointInstance, Model

client = SageMakerExecutor()


# Deploy a specific model.
# If you specify multiple models in models, a pipeline model is created and used.
client.deploy_endpoint(
    instance=EndpointInstance(
        instance_type='ml.m4.xlarge',
        initial_count=1,
        initial_variant_wright=1
    ),
    models=[
        Model(
            name='decision-trees-sample-191028-111309-538454',
            model_arn='arn:aws:sagemaker:ap-northeast-1:xxxxxxxxxx',
            image_uri='xxxxxxxxxx.dkr.ecr.ap-northeast-1.amazonaws.com/decision-trees-sample:latest',
            model_data_url='s3://xxxxxxxxxx/model.tar.gz'
        )
    ],
    name='sample-endpoint',
    role="arn:aws:iam::xxxxxxxxxx"
)

# You can also pass the result of execute_batch_training
model = client.execute_batch_training(
    # Arguments omitted
)

client.deploy_endpoint(
    instance=EndpointInstance(
        instance_type='ml.m4.xlarge',
        initial_count=1,
        initial_variant_wright=1
    ),
    models=[model],
    name='sample-endpoint',
    role="arn:aws:iam::xxxxxxxxxx"
)

For everything other than endpoints (training batch jobs, etc.), the current time is automatically appended to the name as a string to avoid duplicates. Only endpoints behave as "update if one with the same name already exists", which is more convenient.

Although omitted here, the method for batch transform jobs is implemented in the same way.
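
The naming rule and the create-or-update behavior can be sketched as follows. This is a simplified illustration, not the library's actual code. timestamped_name and create_or_update_endpoint are hypothetical helpers, the timestamp format is inferred from the model name in the example above, and client is assumed to be a boto3 SageMaker client (only list_endpoints, create_endpoint, and update_endpoint are used).

```python
from datetime import datetime


def timestamped_name(base: str, now: datetime) -> str:
    """Append a timestamp (down to microseconds) so that names do not collide."""
    return f"{base}-{now.strftime('%y%m%d-%H%M%S-%f')}"


def create_or_update_endpoint(client, name: str, config_name: str) -> str:
    """Create the endpoint if it does not exist yet, otherwise update it."""
    # NameContains is a substring filter, so the names are compared exactly.
    # Pagination of list_endpoints is ignored to keep the sketch short.
    found = client.list_endpoints(NameContains=name)['Endpoints']
    exists = any(e['EndpointName'] == name for e in found)
    if exists:
        client.update_endpoint(EndpointName=name, EndpointConfigName=config_name)
        return 'updated'
    client.create_endpoint(EndpointName=name, EndpointConfigName=config_name)
    return 'created'


print(timestamped_name('decision-trees-sample',
                       datetime(2019, 10, 28, 11, 13, 9, 538454)))
# decision-trees-sample-191028-111309-538454
```

Taking the client as an argument keeps the decision logic testable with a stub object, without touching AWS.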

Future issues

With this implementation in place, we are now using it in several real projects. However, some features are still missing, and other issues remain within the team.

Container image templates for data other than tabular data (images, etc.) are not implemented.
There is no way to save intermediate state to S3 when a training batch on a Spot Instance is interrupted.
There is no function to support API implementation and testing, such as launching the inference API locally.
Metrics are not collected during training. Another team member is investigating this, focusing on MLflow.
Many issues remain regarding data analysis and model accuracy evaluation.

Actual use within the team has also revealed parts that are not easy to use, so I will keep working on those problems to make our machine learning projects more efficient.
