This article is a continuation of "Speeding up Deep Learning with CPU of Raspberry Pi 4".
To speed up deep learning inference, we first need to know which operations consume the most time, so that the heaviest ones can be targeted. We therefore start by profiling with the built-in profiler of ONNX Runtime.
How to enable the profiling feature is described in the ONNX Runtime official tutorial (https://microsoft.github.io/onnxruntime/python/auto_examples/plot_profiling.html).
sample.py

```python
import onnxruntime

options = onnxruntime.SessionOptions()
options.enable_profiling = True  # <- enable the profiling feature

session = onnxruntime.InferenceSession(path_to_model, options)

# [Profile target: run inference here]

prof_file = session.end_profiling()
print(prof_file)
```
The profile results are saved in JSON format. They can also be visualized with the tracing tool built into Chrome (enter **chrome://tracing/** in Chrome's address bar to launch it).
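The JSON profile can also be aggregated directly in Python. The sketch below assumes the Chrome-tracing event format that ONNX Runtime emits (node events with `cat == "Node"`, a `dur` field in microseconds, and the operator type in `args["op_name"]`); the helper name `summarize_profile` is my own.

```python
import json
from collections import defaultdict


def summarize_profile(prof_file):
    """Aggregate per-operator time (in ms) from an ONNX Runtime profile JSON.

    Assumes the Chrome-tracing event format that ONNX Runtime emits:
    per-node events have cat == "Node", a duration "dur" in microseconds,
    and the operator type in args["op_name"].
    """
    with open(prof_file) as f:
        events = json.load(f)

    totals = defaultdict(float)
    for ev in events:
        if ev.get("cat") == "Node":
            op = ev.get("args", {}).get("op_name", "unknown")
            totals[op] += ev["dur"] / 1000.0  # microseconds -> milliseconds

    # Return operators sorted by time, heaviest first.
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```

Passing the filename returned by `end_profiling()` to this helper yields a per-operator breakdown like the table later in this article.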
This time, let's profile the case of performing image classification with **MobileNetV1 depth 1.0 224x224**. The model has ONNX Runtime graph optimization applied.
Processing proceeds in the order of model loading, session initialization, and model execution. Expanding the model-execution part shows that a Convolution operation fused with Batch Normalization is executed multiple times. (Batch Normalization would normally run as an independent operator, but ONNX Runtime graph optimization folds it into the Convolution.)
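For reference, graph optimization in ONNX Runtime is controlled through `SessionOptions`; a minimal configuration sketch (`path_to_model` is a placeholder, as above):

```python
import onnxruntime

options = onnxruntime.SessionOptions()
# ORT_ENABLE_ALL applies every available graph optimization, including
# the Conv + BatchNormalization fusion seen in the profile.
# ORT_DISABLE_ALL would leave the graph untouched, for comparison.
options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

session = onnxruntime.InferenceSession(path_to_model, options)
```

Profiling the same model at different optimization levels makes the effect of the fusion visible in the trace.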
The table below summarizes the profile results.
| Item | Processing time (ms) | Percentage (%) |
|---|---|---|
| All processing | 157.19 | - |
| Convolution | 148.017 | 94.2 |
| Gemm | 6.053 | 3.9 |
| Other | 3.12 | 1.9 |
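The percentage column follows directly from the measured times; a quick check using the values from the table above:

```python
# Times (ms) taken from the profile table above.
total = 157.19
times = {"Convolution": 148.017, "Gemm": 6.053, "Other": 3.12}

# Each share is the operator's time divided by the total time.
shares = {name: t / total * 100 for name, t in times.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

(The "Other" row works out to about 2.0% when rounded; the table appears to truncate it to 1.9.)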
It turns out that, to reduce the overall processing time, we have to do something about the Convolution processing. In the next article, I would like to consider what kind of approach that requires.