TorchServe is an open source model serving library for PyTorch, developed jointly by AWS and Facebook. You give it a model class and a weight file written in PyTorch, and it hosts the model and exposes API endpoints. Inference goes through the Inference API, and models are managed through the Management API. See the TorchServe documentation for more information.
I read "Deploy a PyTorch model for large-scale inference using TorchServe" on the Amazon Web Services blog and tried running TorchServe on AWS EC2. Below I go through the procedure, some background around it, and how to run it with Docker.
- Deploy PyTorch models for large-scale inference using TorchServe (Amazon Web Services blog)
Enter "Deep Learning AMI" in the AMI search bar to find the AMI to use. This time I used "Deep Learning AMI (Ubuntu 18.04) Version 30.0 - ami-0b1b56cbf0f8fcea3" with the "p2.xlarge" instance type. The security group allows SSH and HTTP connections from the development environment, and all other settings are left at their defaults.
Log in to EC2 and build the environment.
~$ ls
LICENSE README examples tools
Nvidia_Cloud_EULA.pdf anaconda3 src tutorials
TorchServe requires Java 8 or later. Following the tutorial, install Java 11, then switch the default Java to Java 11.
~$ sudo apt-get install openjdk-11-jdk
~$ update-java-alternatives -l
java-1.11.0-openjdk-amd64 1111 /usr/lib/jvm/java-1.11.0-openjdk-amd64
java-1.8.0-openjdk-amd64 1081 /usr/lib/jvm/java-1.8.0-openjdk-amd64
~$ sudo update-alternatives --config java
There are 2 choices for the alternative java (providing /usr/bin/java).
Selection Path Priority Status
0 /usr/lib/jvm/java-11-openjdk-amd64/bin/java 1111 auto mode
* 1 /usr/lib/jvm/java-11-openjdk-amd64/bin/java 1111 manual mode
2 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java 1081 manual mode
Press <enter> to keep the current choice[*], or type selection number: 1
Build a Python environment. Create a virtual environment. Install torchserve related libraries in the virtual environment.
~$ mkdir torchserve-examples
~$ cd torchserve-examples
~/torchserve-examples$ python -m venv venv
~/torchserve-examples$ source venv/bin/activate
(venv):~/torchserve-examples$ pip install torch torchtext torchvision sentencepiece psutil future
(venv):~/torchserve-examples$ pip install torchserve torch-model-archiver
Prepare the model to host. This time we will use the model published in the official repository.
(venv):~/torchserve-examples$ git clone
(venv):~/torchserve-examples$ wget
(venv):~/torchserve-examples$ ls
densenet161-8d451a50.pth serve venv
The model used this time is stored in serve/examples/image_classifier/densenet_161/
from torchvision.models.densenet import DenseNet


class ImageClassifier(DenseNet):
    def __init__(self):
        super(ImageClassifier, self).__init__(48, (6, 12, 36, 24), 96)

    def load_state_dict(self, state_dict, strict=True):
        # '.'s are no longer allowed in module names, but previous _DenseLayer
        # has keys 'norm.1', 'relu.1', 'conv.1', 'norm.2', 'relu.2', 'conv.2'.
        # They are also in the checkpoints in model_urls. This pattern is used
        # to find such keys.
        # Credit - _load_state_dict()
        import re
        pattern = re.compile(r'^(.*denselayer\d+\.(?:norm|relu|conv))\.((?:[12])\.(?:weight|bias|running_mean|running_var))$')

        for key in list(state_dict.keys()):
            res = pattern.match(key)
            if res:
                # Join the two captured groups, dropping the dot between them
                new_key = res.group(1) + res.group(2)
                state_dict[new_key] = state_dict[key]
                del state_dict[key]

        return super(ImageClassifier, self).load_state_dict(state_dict, strict)
In the previous _DenseLayer, the layer names contained dots: 'norm.1', 'relu.1', 'conv.1', 'norm.2', 'relu.2', and 'conv.2'. Dots are no longer allowed in the current _DenseLayer. The class above simply inherits DenseNet and renames the keys so that the old weight file can be loaded into the current model. The parent class, DenseNet, is part of torchvision; its source is reproduced below.
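To make the renaming concrete, here is a minimal standalone sketch that applies the same regular expression to a plain dictionary. The sample keys and the helper name `rename_legacy_keys` are made up for illustration; a real state_dict would hold tensors rather than strings.

```python
import re

# Same pattern as in ImageClassifier.load_state_dict: group 1 is everything up
# to "norm"/"relu"/"conv", group 2 is the layer index plus the parameter name.
LEGACY_KEY = re.compile(
    r'^(.*denselayer\d+\.(?:norm|relu|conv))\.'
    r'((?:[12])\.(?:weight|bias|running_mean|running_var))$'
)


def rename_legacy_keys(state_dict):
    """Return a copy of state_dict with 'norm.1'-style keys renamed to 'norm1'."""
    renamed = {}
    for key, value in state_dict.items():
        match = LEGACY_KEY.match(key)
        if match:
            # Concatenating the groups drops the dot between name and index
            key = match.group(1) + match.group(2)
        renamed[key] = value
    return renamed


# Hypothetical keys; the string values stand in for weight tensors.
old = {
    "features.denseblock1.denselayer1.norm.1.weight": "w0",
    "features.denseblock1.denselayer1.conv.2.weight": "w1",
    "features.conv0.weight": "w2",
}
new = rename_legacy_keys(old)
for key in sorted(new):
    print(key)
```

Keys that do not match the pattern, such as `features.conv0.weight`, pass through unchanged.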
class DenseNet(nn.Module):
    r"""Densenet-BC model class, based on
    `"Densely Connected Convolutional Networks" <>`_

    Args:
        growth_rate (int) - how many filters to add each layer (`k` in paper)
        block_config (list of 4 ints) - how many layers in each pooling block
        num_init_features (int) - the number of filters to learn in the first convolution layer
        bn_size (int) - multiplicative factor for number of bottle neck layers
          (i.e. bn_size * k features in the bottleneck layer)
        drop_rate (float) - dropout rate after each dense layer
        num_classes (int) - number of classification classes
        memory_efficient (bool) - If True, uses checkpointing. Much more memory efficient,
          but slower. Default: *False*. See `"paper" <>`_
    """

    def __init__(self, growth_rate=32, block_config=(6, 12, 24, 16),
                 num_init_features=64, bn_size=4, drop_rate=0, num_classes=1000, memory_efficient=False):

        super(DenseNet, self).__init__()

        # First convolution
        self.features = nn.Sequential(OrderedDict([
            ('conv0', nn.Conv2d(3, num_init_features, kernel_size=7, stride=2,
                                padding=3, bias=False)),
            ('norm0', nn.BatchNorm2d(num_init_features)),
            ('relu0', nn.ReLU(inplace=True)),
            ('pool0', nn.MaxPool2d(kernel_size=3, stride=2, padding=1)),
        ]))

        # Each denseblock
        num_features = num_init_features
        for i, num_layers in enumerate(block_config):
            block = _DenseBlock(
                num_layers=num_layers,
                num_input_features=num_features,
                bn_size=bn_size,
                growth_rate=growth_rate,
                drop_rate=drop_rate,
                memory_efficient=memory_efficient
            )
            self.features.add_module('denseblock%d' % (i + 1), block)
            num_features = num_features + num_layers * growth_rate
            if i != len(block_config) - 1:
                trans = _Transition(num_input_features=num_features,
                                    num_output_features=num_features // 2)
                self.features.add_module('transition%d' % (i + 1), trans)
                num_features = num_features // 2

        # Final batch norm
        self.features.add_module('norm5', nn.BatchNorm2d(num_features))

        # Linear layer
        self.classifier = nn.Linear(num_features, num_classes)

        # Official init from torch repo.
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        features = self.features(x)
        out = F.relu(features, inplace=True)
        out = F.adaptive_avg_pool2d(out, (1, 1))
        out = torch.flatten(out, 1)
        out = self.classifier(out)
        return out

class _DenseBlock(nn.ModuleDict):
    _version = 2

    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate, memory_efficient=False):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(
                num_input_features + i * growth_rate,
                growth_rate=growth_rate,
                bn_size=bn_size,
                drop_rate=drop_rate,
                memory_efficient=memory_efficient,
            )
            self.add_module('denselayer%d' % (i + 1), layer)

    def forward(self, init_features):
        features = [init_features]
        for name, layer in self.items():
            new_features = layer(features)
            features.append(new_features)
        return torch.cat(features, 1)

class _DenseLayer(nn.Module):
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate, memory_efficient=False):
        super(_DenseLayer, self).__init__()
        self.add_module('norm1', nn.BatchNorm2d(num_input_features)),
        self.add_module('relu1', nn.ReLU(inplace=True)),
        self.add_module('conv1', nn.Conv2d(num_input_features, bn_size *
                                           growth_rate, kernel_size=1, stride=1,
                                           bias=False)),
        self.add_module('norm2', nn.BatchNorm2d(bn_size * growth_rate)),
        self.add_module('relu2', nn.ReLU(inplace=True)),
        self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate,
                                           kernel_size=3, stride=1, padding=1,
                                           bias=False)),
        self.drop_rate = float(drop_rate)
        self.memory_efficient = memory_efficient

    def bn_function(self, inputs):
        # type: (List[Tensor]) -> Tensor
        concated_features = torch.cat(inputs, 1)
        bottleneck_output = self.conv1(self.relu1(self.norm1(concated_features)))  # noqa: T484
        return bottleneck_output

    # todo: rewrite when torchscript supports any
    def any_requires_grad(self, input):
        # type: (List[Tensor]) -> bool
        for tensor in input:
            if tensor.requires_grad:
                return True
        return False

    @torch.jit.unused  # noqa: T484
    def call_checkpoint_bottleneck(self, input):
        # type: (List[Tensor]) -> Tensor
        def closure(*inputs):
            return self.bn_function(inputs)

        return cp.checkpoint(closure, *input)

    @torch.jit._overload_method  # noqa: F811
    def forward(self, input):
        # type: (List[Tensor]) -> (Tensor)
        pass

    @torch.jit._overload_method  # noqa: F811
    def forward(self, input):
        # type: (Tensor) -> (Tensor)
        pass

    # torchscript does not yet support *args, so we overload method
    # allowing it to take either a List[Tensor] or single Tensor
    def forward(self, input):  # noqa: F811
        if isinstance(input, Tensor):
            prev_features = [input]
        else:
            prev_features = input

        if self.memory_efficient and self.any_requires_grad(prev_features):
            if torch.jit.is_scripting():
                raise Exception("Memory Efficient not supported in JIT")

            bottleneck_output = self.call_checkpoint_bottleneck(prev_features)
        else:
            bottleneck_output = self.bn_function(prev_features)

        new_features = self.conv2(self.relu2(self.norm2(bottleneck_output)))
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate,
                                     training=self.training)
        return new_features

(The rest is omitted.)
As shown above, the model you want to serve needs a model class that inherits nn.Module. The example above is a bit convoluted; the one in serve/examples/image_classifier/mnist/ is easier to follow.
import torch
from torch import nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
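The `9216` passed to `fc1` follows from the layer shapes. The sketch below does the arithmetic, assuming the usual 28x28 single-channel MNIST input (the input size is not stated in the model file itself); `conv_out` is a helper written for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                   # assumed MNIST input: 1 x 28 x 28
size = conv_out(size, kernel=3)             # conv1: Conv2d(1, 32, 3, 1)  -> 26
size = conv_out(size, kernel=3)             # conv2: Conv2d(32, 64, 3, 1) -> 24
size = conv_out(size, kernel=2, stride=2)   # max_pool2d(x, 2)            -> 12

channels = 64                               # output channels of conv2
flattened = channels * size * size          # what torch.flatten(x, 1) yields per sample
print(flattened)  # 9216, the in_features of fc1
```

If you change the kernel sizes or the input resolution, the first argument of `fc1` has to change accordingly.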
Now it's time to host the model. First, convert it into the deployable archive format with torch-model-archiver.
(venv):~/torchserve-examples$ mkdir model_store
(venv):~/torchserve-examples$ torch-model-archiver --model-name densenet161 \
--version 1.0 --model-file serve/examples/image_classifier/densenet_161/ \
--serialized-file densenet161-8d451a50.pth \
--export-path model_store \
--extra-files serve/examples/image_classifier/index_to_name.json \
--handler image_classifier
(venv):~/torchserve-examples$ ls model_store/
A file called densenet161.mar is created. The available options are listed in the tool's help output:
(venv):~/torchserve-examples$ torch-model-archiver -h
usage: torch-model-archiver [-h] --model-name MODEL_NAME --serialized-file
--handler HANDLER [--source-vocab SOURCE_VOCAB]
[--extra-files EXTRA_FILES]
[--runtime {python,python2,python3}]
[--export-path EXPORT_PATH]
[--archive-format {tgz,no-archive,default}] [-f]
Torch Model Archiver Tool
optional arguments:
-h, --help show this help message and exit
--model-name MODEL_NAME
Exported model name. Exported file will be named as
model-name.mar and saved in current working directory if no --export-path is
specified, else it will be saved under the export path
--serialized-file SERIALIZED_FILE
Path to .pt or .pth file containing state_dict in case of eager mode
or an executable ScriptModule in case of TorchScript.
--model-file MODEL_FILE
Path to python file containing model architecture.
This parameter is mandatory for eager mode models.
The model architecture file must contain only one
class definition extended from torch.nn.modules.
--handler HANDLER TorchServe's default handler name
or handler python file path to handle custom TorchServe inference logic.
--source-vocab SOURCE_VOCAB
Vocab file for source language. Required for text based models.
--extra-files EXTRA_FILES
Comma separated path to extra dependency files.
--runtime {python,python2,python3}
The runtime specifies which language to run your inference code on.
The default runtime is "python".
--export-path EXPORT_PATH
Path where the exported .mar file will be saved. This is an optional
parameter. If --export-path is not specified, the file will be saved in the
current working directory.
--archive-format {tgz,no-archive,default}
The format in which the model artifacts are archived.
"tgz": This creates the model-archive in <model-name>.tar.gz format.
If platform hosting TorchServe requires model-artifacts to be in ".tar.gz"
use this option.
"no-archive": This option creates an non-archived version of model artifacts
at "export-path/{model-name}" location. As a result of this choice,
MANIFEST file will be created at "export-path/{model-name}" location
without archiving these model files
"default": This creates the model-archive in <model-name>.mar format.
This is the default archiving format. Models archived in this format
will be readily hostable on native TorchServe.
-f, --force When the -f or --force flag is specified, an existing .mar file with same
name as that provided in --model-name in the path specified by --export-path
will overwritten
-v VERSION, --version VERSION
Model's version
The options used this time are as follows.

| Option | Description |
|---|---|
| model-name | Name of the model; the converted output file is named after it |
| version | Model version |
| model-file | Path of the Python file that defines the model class |
| serialized-file | Path of the model weight file |
| export-path | Output directory for the converted file |
| extra-files | Extra files to bundle; here, the JSON file that describes how to convert a predicted index to a string |
| handler | Handler that processes input and output. Specify a built-in handler (image_classifier / object_detector / text_classifier / image_segmenter) or your own handler file |
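If you archive models from a script, the same command can be assembled in Python. This is only a sketch: `build_archiver_command` is a helper invented here, it uses just the flags listed in the table, and the model file path is a placeholder because the exact file name is not shown above.

```python
import shlex


def build_archiver_command(model_name, version, model_file, serialized_file,
                           export_path, handler, extra_files=()):
    """Build the torch-model-archiver argument list for the options in the table."""
    cmd = [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--model-file", model_file,
        "--serialized-file", serialized_file,
        "--export-path", export_path,
        "--handler", handler,
    ]
    if extra_files:
        # --extra-files takes a single comma-separated value
        cmd += ["--extra-files", ",".join(extra_files)]
    return cmd


cmd = build_archiver_command(
    model_name="densenet161",
    version="1.0",
    model_file="path/to/model_definition.py",  # placeholder, not the real path
    serialized_file="densenet161-8d451a50.pth",
    export_path="model_store",
    handler="image_classifier",
    extra_files=["serve/examples/image_classifier/index_to_name.json"],
)
print(" ".join(shlex.quote(arg) for arg in cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.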
For extra-files, index_to_name.json is specified; it holds the rule that converts a predicted index to a string. Its contents look like {"0": ["n01440764", "tench"], "1": ["n01443537", "goldfish"], "2": ["n01484850", "great_white_shark"], ...}, but I wasn't sure what the first element of each array represents. Also, if you do not specify extra-files, requests seem to fail with a 503 error even though the model is hosted.
You can also implement the handler yourself. The handle method is the entry point. Its arguments are data and context: data is an array of requests, and the properties of context can be checked in the TorchServe source. See the documentation for more information.
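As an illustration of how the index_to_name.json mapping can be used, the sketch below looks up a label for a predicted index. It uses only the three entries quoted above; `label_for` is a hypothetical helper, not the code that TorchServe's image_classifier handler actually runs.

```python
import json

# The first three entries quoted above, in the same shape as index_to_name.json.
INDEX_TO_NAME = json.loads("""
{
  "0": ["n01440764", "tench"],
  "1": ["n01443537", "goldfish"],
  "2": ["n01484850", "great_white_shark"]
}
""")


def label_for(index, mapping=INDEX_TO_NAME):
    """Return the human-readable name for a predicted class index."""
    entry = mapping.get(str(index))  # JSON object keys are strings
    if entry is None:
        return str(index)            # fall back to the raw index
    return entry[1]                  # second element is the readable name


print(label_for(1))    # goldfish
print(label_for(999))  # 999
```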
An MNIST example can be found in serve/examples/image_classifier/mnist/
import io
import logging
import numpy as np
import os
import torch
from PIL import Image
from torch.autograd import Variable
from torchvision import transforms

logger = logging.getLogger(__name__)


class MNISTDigitClassifier(object):
    """
    MNISTDigitClassifier handler class. This handler takes a greyscale image
    and returns the digit in that image.
    """

    def __init__(self):
        self.model = None
        self.mapping = None
        self.device = None
        self.initialized = False

    def initialize(self, ctx):
        """First try to load torchscript else load eager mode state_dict based model"""

        properties = ctx.system_properties
        self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
        model_dir = properties.get("model_dir")

        # Read model serialize/pt file
        model_pt_path = os.path.join(model_dir, "")
        # Read model definition file
        model_def_path = os.path.join(model_dir, "")
        if not os.path.isfile(model_def_path):
            raise RuntimeError("Missing the model definition file")

        from mnist import Net
        state_dict = torch.load(model_pt_path, map_location=self.device)
        self.model = Net()
        # Load the weights, move the model to the device, and switch to eval mode
        self.model.load_state_dict(state_dict)
        self.model.to(self.device)
        self.model.eval()

        logger.debug('Model file {0} loaded successfully'.format(model_pt_path))
        self.initialized = True

    def preprocess(self, data):
        """
        Scales, crops, and normalizes a PIL image for a MNIST model,
        returns an Numpy array
        """
        image = data[0].get("data")
        if image is None:
            image = data[0].get("body")

        mnist_transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])
        # The request body is raw image bytes
        image = Image.open(io.BytesIO(image))
        image = mnist_transform(image)
        return image

    def inference(self, img, topk=5):
        ''' Predict the class (or classes) of an image using a trained deep learning model.
        '''
        # Convert 2D image to 1D vector
        img = np.expand_dims(img, 0)
        img = torch.from_numpy(img)

        inputs = Variable(img).to(self.device)
        outputs = self.model.forward(inputs)

        _, y_hat = outputs.max(1)
        predicted_idx = str(y_hat.item())
        return [predicted_idx]

    def postprocess(self, inference_output):
        return inference_output


_service = MNISTDigitClassifier()


def handle(data, context):
    if not _service.initialized:
        _service.initialize(context)

    if data is None:
        return None

    data = _service.preprocess(data)
    data = _service.inference(data)
    data = _service.postprocess(data)

    return data
Host the model. Specify the directory where the .mar files are stored with --model-store, and the models to load with --models in the form model name=file path. To load multiple models, separate them with commas.
(venv):~/torchserve-examples$ torchserve --start --model-store model_store --models densenet161=densenet161.mar
Try calling the Inference API from the same host. The endpoint is /predictions/{model name}
$ curl -O
$ curl -X POST -T kitten.jpg
"tiger_cat": 0.4693354070186615
"tabby": 0.46338820457458496
"Egyptian_cat": 0.06456134468317032
"lynx": 0.0012828148901462555
"plastic_bag": 0.00023322994820773602
Because image_classifier is specified as the handler, the top five classes are returned with their probabilities. Both tiger_cat and tabby refer to striped cats (I don't know the difference), but in any case the image is recognized as a cat.
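The same request can be made from Python with the standard library. This sketch assumes the Inference API listens on port 8080 of localhost, which is the port used in the curl examples later in this post; `build_prediction_request` and `predict` are helpers written for this example.

```python
import json
import urllib.request


def build_prediction_request(model_name, image_bytes, host="localhost", port=8080):
    """Build a POST request for /predictions/{model name} with the image as the body."""
    url = "http://{}:{}/predictions/{}".format(host, port, model_name)
    return urllib.request.Request(url, data=image_bytes, method="POST")


def predict(model_name, image_path, **kwargs):
    """Send the image to TorchServe and return the decoded JSON response."""
    with open(image_path, "rb") as f:
        request = build_prediction_request(model_name, f.read(), **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_prediction_request("densenet161", b"fake image bytes")
print(request.get_method(), request.full_url)
# With TorchServe running: print(predict("densenet161", "kitten.jpg"))
```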
The management API is provided on port 8081.
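The calls used in the rest of this section (listing, registering, scaling, and unregistering models) can also be issued from Python. The sketch below only builds the requests. The query parameters `url` and `min_worker` are taken from the TorchServe Management API documentation rather than from the commands shown in this post, so treat them as assumptions and check them against your TorchServe version. The helper names are invented for this example.

```python
import urllib.parse
import urllib.request

MANAGEMENT = "http://localhost:8081"  # Management API on the same host


def list_request():
    """GET /models lists the registered models."""
    return urllib.request.Request("{}/models".format(MANAGEMENT), method="GET")


def register_request(mar_file):
    """POST /models?url=<mar file> registers a model from the model store."""
    query = urllib.parse.urlencode({"url": mar_file})
    return urllib.request.Request(
        "{}/models?{}".format(MANAGEMENT, query), method="POST")


def scale_request(model_name, min_worker):
    """PUT /models/{name}?min_worker=N sets the minimum number of workers."""
    query = urllib.parse.urlencode({"min_worker": min_worker})
    return urllib.request.Request(
        "{}/models/{}?{}".format(MANAGEMENT, model_name, query), method="PUT")


def unregister_request(model_name):
    """DELETE /models/{name} unregisters the model."""
    return urllib.request.Request(
        "{}/models/{}".format(MANAGEMENT, model_name), method="DELETE")


for req in (list_request(),
            register_request("fastrcnn.mar"),
            scale_request("fastrcnn", 2),
            unregister_request("fastrcnn")):
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req)
```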
(venv):~/torchserve-examples$ curl ""
"models": [
"modelName": "densenet161",
"modelUrl": "densenet161.mar"
Suppose you want to add another model. Prepare it with the commands below.
(venv):~/torchserve-examples$ wget
(venv):~/torchserve-examples$ torch-model-archiver --model-name fastrcnn --version 1.0 \
--model-file serve/examples/object_detector/fast-rcnn/ \
--serialized-file fasterrcnn_resnet50_fpn_coco-258fb6c6.pth \
--export-path model_store \
--handler object_detector \
--extra-files serve/examples/object_detector/index_to_name.json
Register the model from the management API.
$ curl -X POST ""
"status": "Model \"fastrcnn\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."
$ curl ""
"models": [
"modelName": "densenet161",
"modelUrl": "densenet161.mar"
"modelName": "fastrcnn",
"modelUrl": "fastrcnn.mar"
No workers are assigned to the new model yet, so the following command sets the minimum number of workers.
$ curl -v -X PUT ""
$ curl "http://localhost:8081/models/fastrcnn"
"modelName": "fastrcnn",
"modelVersion": "1.0",
"modelUrl": "fastrcnn.mar",
"runtime": "python",
"minWorkers": 2,
"maxWorkers": 2,
"batchSize": 1,
"maxBatchDelay": 100,
"loadedAtStartup": false,
"workers": [
"id": "9001",
"startTime": "2020-07-15T13:55:11.813Z",
"status": "READY",
"gpu": true,
"memoryUsage": 0
"id": "9002",
"startTime": "2020-07-15T13:55:11.813Z",
"status": "READY",
"gpu": true,
"memoryUsage": 0
You can also unregister the model.
$ curl -X DELETE http://localhost:8081/models/fastrcnn/
"status": "Model \"fastrcnn\" unregistered"
$ curl ""
"models": [
"modelName": "densenet161",
"modelUrl": "densenet161.mar"
By default, the API can only be accessed from localhost, so make it accessible from outside as well. Create a configuration file.
(venv):~/torchserve-examples$ touch
Write the settings into this file and pass it to --ts-config when starting TorchServe.
(venv):~/torchserve-examples$ torchserve --start --model-store model_store --models densenet161=densenet161.mar --ts-config
You can access the API from the outside.
$ curl -X POST http://<host ip address>:8080/predictions/densenet161 -T kitten.jpg
For SSL and CORS settings, refer to the TorchServe configuration documentation.
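For reference, a configuration that opens both APIs to other hosts could be generated as below. This is a sketch under assumptions: the property names `inference_address` and `management_address` come from the TorchServe configuration documentation, the file name `config.properties` is a choice made for this example, and binding to `0.0.0.0` exposes the APIs on every interface, so restrict access with the security group.

```python
import os
import tempfile

# Assumed settings: listen on all interfaces instead of localhost only.
SETTINGS = {
    "inference_address": "http://0.0.0.0:8080",
    "management_address": "http://0.0.0.0:8081",
}


def write_ts_config(path, settings=SETTINGS):
    """Write key=value lines in Java properties style and return the path."""
    with open(path, "w") as f:
        for key, value in settings.items():
            f.write("{}={}\n".format(key, value))
    return path


# Written to a temporary directory here; in practice, write it in the
# directory from which torchserve is started.
config_path = write_ts_config(
    os.path.join(tempfile.mkdtemp(), "config.properties"))
with open(config_path) as f:
    content = f.read()
print(content)
```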
~$ docker --version
Docker version 19.03.11, build 42e35e61f3
Check the version of CUDA. The version was 10.0.
~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
Check the version of cuDNN. The version was 7.5.1.
~$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 5
#include "driver_types.h"
The only GPU-enabled TorchServe container image is built for CUDA 10.1 with cuDNN 7, so first switch the CUDA version on the host to 10.1 by repointing the /usr/local/cuda symlink.
~$ sudo rm /usr/local/cuda
~$ sudo ln -s /usr/local/cuda-10.1 /usr/local/cuda
~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
CUDA is now at 10.1 as expected.
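If you want to check the CUDA release from a script rather than by eye, the `nvcc -V` output can be parsed. The sketch below runs on the sample output shown above instead of calling `nvcc`; `parse_cuda_release` is a helper written for this example.

```python
import re

SAMPLE_NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
"""


def parse_cuda_release(nvcc_output):
    """Return the release string (e.g. '10.1') from `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


release = parse_cuda_release(SAMPLE_NVCC_OUTPUT)
print(release)  # 10.1
# On the instance, the input could come from:
# subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout
```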
Next, create a Docker image. The Dockerfile below packages the steps performed above. The working directory is ~/torchserve-examples2
├ Dockerfile
FROM alpine/git AS build
COPY . .
RUN git clone && \
wget && \
mkdir model_store
# FROM pytorch/torchserve:0.1.1-cpu
FROM pytorch/torchserve:0.1.1-cuda10.1-cudnn7-runtime
COPY --from=build /work /home/model-server
WORKDIR /home/model-server
RUN torch-model-archiver --model-name densenet161 \
--version 1.0 --model-file serve/examples/image_classifier/densenet_161/ \
--serialized-file densenet161-8d451a50.pth \
--export-path /home/model-server/model-store \
--extra-files serve/examples/image_classifier/index_to_name.json \
--handler image_classifier
CMD ["torchserve", \
"--models", "densenet161=densenet161.mar",\
"--ts-config", ""]
Because git and wget are not included in the torchserve image, the build stage and the runtime stage use separate containers.
Build the image and run the container with the following commands:
~$ docker build -t sample/torchserve:latest .
~$ docker run -d --rm -t -p 8080:8080 -p 8081:8081 sample/torchserve:latest
Make a request to the API from the development machine.
$ curl -X POST http://<<host ip address>>:8080/predictions/densenet161 -T kitten.jpg
"tiger_cat": 0.4693359136581421
"tabby": 0.4633873701095581
"Egyptian_cat": 0.06456154584884644
"lynx": 0.001282821292988956
"plastic_bag": 0.00023323031200561672
- There is no authentication function.
- By default, TorchServe prints log messages to stderr and stdout. TorchServe uses log4j, and logging can be customized by changing the log4j properties.