There are many deep learning frameworks, such as PyTorch, TensorFlow, and Keras. This time I will focus on **PyTorch**, which I use often!
Did you know that a **C++ version** of PyTorch has been released alongside the Python version? It makes it easy to incorporate deep learning as part of the processing in your C++ program!
Since C++ is a compiled language, I wondered whether the C++ version of PyTorch **might be faster than the Python version**.
So this time I actually investigated **how much the speed differs between C++ and Python**! I was also curious about accuracy, so I checked that as well.
This time, as the title suggests, we will use the C++ version of PyTorch (LibTorch). You can download it from the following site, so please give it a try!
PyTorch Official: https://pytorch.org/
I downloaded it with the settings above. The "Preview (Nightly)" build always contains the latest files, but it is still under development, so if you want the stable release, select "Stable (1.4)".
The "Run this Command" section at the bottom is also quite important: if your compiler supports the C++11 ABI or later, I recommend selecting the lower option (the cxx11 ABI build). Most environments today use C++17, so the lower one should be fine. If you pick the upper one, you are likely to run into link errors with other libraries, which is a lot of trouble.
This time, a **convolutional autoencoder** is used. It is available on my GitHub → https://github.com/koba-jon/pytorch_cpp
This model maps the **input image (high-dimensional)** into a **latent space (low-dimensional)**, then generates an **image (high-dimensional)** from that **latent variable (low-dimensional)**, with the goal of minimizing the error between the generated image and the input image. After training, the model can reproduce a high-dimensional image from a high-dimensional image by passing through the low-dimensional space, which means we obtain a **latent space that captures the characteristics of the training images**. In other words, it performs dimensionality reduction and can be seen as a kind of non-linear principal component analysis. This is very convenient, with a variety of uses such as mitigating the **curse of dimensionality**, **transfer learning**, and **anomaly detection**.
Now, I will explain the structure of the model to be used.
With these benefits in mind, I built the following network.
| No. | Operation | Kernel Size | Stride | Padding | Bias | Input Maps | Output Maps | BN | Activation |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Convolution | 4 | 2 | 1 | False | 3 | 64 | - | ReLU |
| 2 | Convolution | 4 | 2 | 1 | False | 64 | 128 | True | ReLU |
| 3 | Convolution | 4 | 2 | 1 | False | 128 | 256 | True | ReLU |
| 4 | Convolution | 4 | 2 | 1 | False | 256 | 512 | True | ReLU |
| 5 | Convolution | 4 | 2 | 1 | False | 512 | 512 | True | ReLU |
| 6 | Convolution | 4 | 2 | 1 | False | 512 | 512 | - | - |
| 7 | Transposed Convolution | 4 | 2 | 1 | False | 512 | 512 | True | ReLU |
| 8 | Transposed Convolution | 4 | 2 | 1 | False | 512 | 512 | True | ReLU |
| 9 | Transposed Convolution | 4 | 2 | 1 | False | 512 | 256 | True | ReLU |
| 10 | Transposed Convolution | 4 | 2 | 1 | False | 256 | 128 | True | ReLU |
| 11 | Transposed Convolution | 4 | 2 | 1 | False | 128 | 64 | True | ReLU |
| 12 | Transposed Convolution | 4 | 2 | 1 | False | 64 | 3 | - | tanh |
This time we use the CelebA dataset, which contains 202,599 celebrity face images (color). The image size is 178 × 218 pixels, which is inconvenient for the transposed convolutions, so I resized the images to **64 × 64 pixels**. Of these, **90% (182,340 images) were used for training** and **10% (20,259 images) were used for testing**.
When a 64 × 64 image is fed into the model above, the latent space becomes (C, H, W) = (512, 1, 1). If you input an image of 128 × 128 pixels or larger, the intermediate layer becomes a latent space with spatial extent.
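As a sanity check, here is a minimal LibTorch sketch (my own, not the code from the GitHub repository; the file name is arbitrary) that builds the encoder half exactly as in the table above and confirms that a 64 × 64 input is compressed to a (512, 1, 1) latent variable.

encoder_check.cpp
#include <torch/torch.h>
#include <iostream>

int main(){
    using namespace torch;
    nn::Sequential encoder(
        nn::Conv2d(nn::Conv2dOptions(3, 64, 4).stride(2).padding(1).bias(false)),    // 64 -> 32
        nn::ReLU(nn::ReLUOptions().inplace(true)),
        nn::Conv2d(nn::Conv2dOptions(64, 128, 4).stride(2).padding(1).bias(false)),  // 32 -> 16
        nn::BatchNorm2d(128),
        nn::ReLU(nn::ReLUOptions().inplace(true)),
        nn::Conv2d(nn::Conv2dOptions(128, 256, 4).stride(2).padding(1).bias(false)), // 16 -> 8
        nn::BatchNorm2d(256),
        nn::ReLU(nn::ReLUOptions().inplace(true)),
        nn::Conv2d(nn::Conv2dOptions(256, 512, 4).stride(2).padding(1).bias(false)), // 8 -> 4
        nn::BatchNorm2d(512),
        nn::ReLU(nn::ReLUOptions().inplace(true)),
        nn::Conv2d(nn::Conv2dOptions(512, 512, 4).stride(2).padding(1).bias(false)), // 4 -> 2
        nn::BatchNorm2d(512),
        nn::ReLU(nn::ReLUOptions().inplace(true)),
        nn::Conv2d(nn::Conv2dOptions(512, 512, 4).stride(2).padding(1).bias(false))  // 2 -> 1 (latent)
    );
    torch::Tensor x = torch::randn({1, 3, 64, 64});  // dummy 64x64 RGB image
    torch::Tensor z = encoder->forward(x);
    std::cout << z.sizes() << std::endl;              // prints [1, 512, 1, 1]
    return 0;
}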
This time, the main question is **"how much does the speed differ between C++ and Python"**, and I will compare speed and performance under the following five environments.

- CPU-based execution
  - Python
  - C++
- GPU-based execution
  - Python
    - Non-deterministic
    - Deterministic
  - C++

In deep learning that handles images, the GPU has an overwhelming advantage in terms of computation speed.
CPU.py
device = torch.device('cpu') #Use CPU
model.to(device) #Move model to CPU
image = image.to(device) #Move data to CPU
GPU.py
device = torch.device('cuda') #Use default GPU
device = torch.device('cuda:0') #Use the first GPU
device = torch.device('cuda:1') #Use second GPU
model.to(device) #Move model to GPU
image = image.to(device) #Move data to GPU
CPU.cpp
torch::Device device(torch::kCPU); //Use CPU
model->to(device); //Move model to CPU
image = image.to(device); //Move data to CPU
GPU.cpp
torch::Device device(torch::kCUDA); //Use default GPU
torch::Device device(torch::kCUDA, 0); //Use the first GPU
torch::Device device(torch::kCUDA, 1); //Use second GPU
model->to(device); //Move model to GPU
image = image.to(device); //Move data to GPU
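Incidentally, if you want one binary to work both on machines with and without a GPU, a small helper like the following (my own sketch, not part of the original programs) can pick the device at run time with torch::cuda::is_available(). The returned device can then be passed to model->to(device) and image.to(device) exactly as above.

device_select.cpp
#include <torch/torch.h>

torch::Device select_device(){
    if (torch::cuda::is_available()){
        return torch::Device(torch::kCUDA);  // use the default GPU if one is visible
    }
    return torch::Device(torch::kCPU);       // otherwise fall back to the CPU
}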
In the Python version of PyTorch, cuDNN is used when training on the GPU to **improve training speed**.
However, unlike the C++ version, a faster training run does not mean that running the training again will reproduce exactly the same result.
For this reason, the official PyTorch documentation states that, to guarantee reproducibility, the behavior of cuDNN must be made deterministic, and that this in turn reduces speed.
https://pytorch.org/docs/stable/notes/randomness.html
Deterministic mode can have a performance impact, depending on your model. This means that due to the deterministic nature of the model, the processing speed (i.e. processed batch items per second) can be lower than when the model is non-deterministic.
From an engineering point of view, reproducibility can matter, and the speed changes depending on whether it is enabled, so I included both settings in this speed comparison.
Unlike the "rand" function in C++, Python's random numbers are seeded randomly unless you set an initial value, so to guarantee reproducibility in Python you need to set the random seed **explicitly**. (Setting the seed itself does not affect speed.)
The implementation is as follows.
deterministic.py
import random
import numpy as np
import torch

seed = 0
torch.manual_seed(seed)                    # PyTorch CPU RNG
torch.cuda.manual_seed(seed)               # PyTorch GPU RNG
np.random.seed(seed)                       # NumPy RNG
random.seed(seed)                          # Python built-in RNG
torch.backends.cudnn.deterministic = True  # make cuDNN deterministic (at the cost of speed)
torch.backends.cudnn.benchmark = False     # disable the cuDNN auto-tuner (needed for reproducibility)
non_deterministic.py
torch.backends.cudnn.deterministic = False  # non-deterministic, but faster
torch.backends.cudnn.benchmark = True       # auto-tuner speeds things up when the input size is fixed
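For reference, similar switches exist on the C++ side as well. The sketch below is my own and assumes the at::globalContext() setters that back the Python flags; the exact names may differ between LibTorch versions, so please verify them against your headers.

cudnn_flags.cpp
#include <torch/torch.h>

void make_deterministic(uint64_t seed){
    torch::manual_seed(seed);                         // seed the PyTorch RNGs
    at::globalContext().setDeterministicCuDNN(true);  // deterministic cuDNN (slower)
    at::globalContext().setBenchmarkCuDNN(false);     // disable the cuDNN auto-tuner
}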
Even when the thing you want to implement is the same, changing the programming language can change the **notation and rules** as well as the **required libraries**. Both Python and C++ are object-oriented languages, so the concepts are similar, but Python is interpreted and dynamically typed while C++ is compiled and statically typed, so code that relies on dynamic typing has to be rewritten. Also, PyTorch's C++ API is still under development, so keep in mind that some features are not yet available.
With these points in mind, I will introduce the implementation differences between Python and C++ and the programs I wrote.
The table below lists the libraries commonly used in Python today, the libraries I would recommend for a C++ implementation, and the libraries actually used in my own program.
| | Python (recommended) | C++ (recommended) | C++ (my implementation) |
|---|---|---|---|
| Command-line arguments | argparse | boost::program_options | boost::program_options |
| Model design | torch.nn | torch::nn | torch::nn |
| Preprocessing (transform) | torchvision.transforms | torch::data::transforms (for preprocessing decided before execution) or self-made (for preprocessing applied during execution) | Self-made (using OpenCV) |
| Datasets | torchvision.datasets (using Pillow) | Self-made (using OpenCV) | Self-made (using OpenCV) |
| Dataloader | torch.utils.data.DataLoader | torch::data::make_data_loader (for classification) or self-made (for other tasks) | Self-made (using OpenMP) |
| Loss function (loss) | torch.nn | torch::nn | torch::nn |
| Optimizer | torch.optim | torch::optim | torch::optim |
| Backpropagation (backward) | torch.Tensor.backward() | torch::Tensor::backward() | torch::Tensor::backward() |
| Progress bar | tqdm | boost | Self-made |
**As of this writing (2020/03/24)**, the situation looks like the above.
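Since boost::program_options appears both in the table and in the model constructor later, here is a minimal usage sketch of my own (the option names "epochs" and "batch_size" are placeholders, not the options of the actual program).

args_example.cpp
#include <boost/program_options.hpp>
#include <iostream>
namespace po = boost::program_options;

int main(int argc, char *argv[]){
    po::options_description desc("Options");
    desc.add_options()
        ("epochs", po::value<size_t>()->default_value(1), "number of training epochs")
        ("batch_size", po::value<size_t>()->default_value(16), "mini-batch size");
    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);  // parse argv into the map
    po::notify(vm);
    std::cout << "epochs = " << vm["epochs"].as<size_t>() << std::endl;  // read a parsed value
    return 0;
}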
When using the PyTorch library from C++, most class and function names are the same as in Python. This seems to be a deliberate consideration by the developers toward users, and I am very grateful for it!
Next, I will describe the points you should be especially careful about when writing PyTorch programs in C++.
The following is an excerpt of a part of the program I wrote.
networks.hpp (partial excerpt)
#include <torch/torch.h>
#include <boost/program_options.hpp>

using namespace torch;
namespace po = boost::program_options;

struct ConvolutionalAutoEncoderImpl : nn::Module{
private:
    nn::Sequential encoder, decoder;
public:
    ConvolutionalAutoEncoderImpl(po::variables_map &vm);
    torch::Tensor forward(torch::Tensor x);
};
TORCH_MODULE(ConvolutionalAutoEncoder);
When designing a model, you use the torch::nn classes just as in Python, and the model itself is defined as a struct. (There is also a class-based way, but it seems a bit more complicated.) The point to note here is that the struct must **inherit from nn::Module**, just as in Python.
The next important thing is to name the struct **"[model name]Impl"** and to add **"TORCH_MODULE([model name])"** right below it. If you do not do this, you will not be able to save or load the model. Declaring "TORCH_MODULE(ConvolutionalAutoEncoder)" turns the plain struct "ConvolutionalAutoEncoderImpl" into the model type "ConvolutionalAutoEncoder", probably by wrapping it in a further class internally (my guess). Because of this, note that you have to use the **"->" (arrow operator)** to access members, as in "model->to(device)".
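To make the pattern concrete, here is a tiny self-contained example of my own (a toy model, not the article's ConvAE) showing the Impl/TORCH_MODULE pairing, saving and loading, and the arrow operator.

tiny_module.cpp
#include <torch/torch.h>

struct TinyNetImpl : torch::nn::Module{
    torch::nn::Linear fc{nullptr};
    TinyNetImpl(){
        fc = register_module("fc", torch::nn::Linear(8, 2));  // register a submodule
    }
    torch::Tensor forward(torch::Tensor x){
        return fc->forward(x);
    }
};
TORCH_MODULE(TinyNet);  // generates the holder type "TinyNet" around TinyNetImpl

int main(){
    TinyNet model;                       // the holder, not TinyNetImpl itself
    torch::save(model, "tiny_net.pth");  // saving works because of TORCH_MODULE
    torch::load(model, "tiny_net.pth");
    model->to(torch::Device(torch::kCPU));  // members are reached with "->"
    return 0;
}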
Next, a note about using the nn modules themselves. You can use "nn::Sequential" just as in Python. To add a module to an "nn::Sequential" in C++, you use **"push_back"**, as with a vector. Be careful that the "push_back" function also has to be called through the **"->" (arrow operator)**. An implementation example looks like the following.
networks.cpp (partial excerpt / modification)
nn::Sequential sq;
sq->push_back(nn::Conv2d(nn::Conv2dOptions(3, 64, /*kernel_size=*/4).stride(2).padding(1).bias(false)));
sq->push_back(nn::BatchNorm2d(64));
sq->push_back(nn::ReLU(nn::ReLUOptions().inplace(true)));
When implementing transform, datasets, and dataloader yourself, use **".clone()"** when passing tensor data to another variable. This tripped me up: presumably because tensors share their underlying storage and are tied to the computation graph (my guess), the values inside a tensor can change later unless you copy it like this.
transforms.cpp (partial excerpt)
void transforms::Normalize::forward(torch::Tensor &data_in, torch::Tensor &data_out){
torch::Tensor data_out_src = (data_in - this->mean) / this->std;
data_out = data_out_src.clone();
return;
}
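The following tiny sketch (my own, not part of the repository) illustrates the behavior: plain assignment shares the underlying storage, so an in-place update also shows up in the "copy", while a clone stays independent.

clone_check.cpp
#include <torch/torch.h>
#include <iostream>

int main(){
    torch::Tensor a = torch::zeros({2, 2});
    torch::Tensor b = a;          // shares storage with a
    torch::Tensor c = a.clone();  // independent copy
    a.add_(1.0);                  // update a in place
    std::cout << b << std::endl;  // follows a (all ones)
    std::cout << c << std::endl;  // stays all zeros
    return 0;
}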
The rest of the program is almost the same as the Python version, and there is nothing particularly tricky about it. I also wrote my own classes for the parts I found awkward to use because they differ from the Python version. For the concrete program, please see GitHub: https://github.com/koba-jon/pytorch_cpp/tree/master/ConvAE
I may write a commentary article on the source code at some point. If you think "this part is strange", comments are very welcome.
Basically, you can assume the two versions are almost identical, except for the parts that cannot be helped, such as libraries that exist in Python but not in C++. You can also assume nothing has been changed from the GitHub program.
Specifically, the following conditions were unified between the Python version and the C++ version.
For each configuration being compared, the convolutional autoencoder was trained with mini-batches for 1 epoch on the 182,340 64 × 64 CelebA images so as to minimize the L1 loss, and I measured the **"time per epoch"** and the **"GPU memory usage"**.
Here, the "time per epoch" includes the processing time of tqdm and of the functions I wrote myself. I included them because their effect on the total processing time is small, and because visualization is convenient when actually using PyTorch, so many people use them anyway.
In addition, using the trained model, the 20,259 test images were fed into the model one at a time, and I measured the **"average forward-propagation speed"** and the **"L1 error between the input image and the output image"**.
Training and testing were run with nothing launched other than the executable and "nvidia-smi" (plus whatever had been running since Ubuntu started).
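The forward-propagation speed was measured per image; a helper along the following lines gives the idea (again my own sketch, not the article's code; torch::cuda::synchronize() is assumed to exist in your LibTorch version and keeps GPU timing honest).

forward_timing.cpp
#include <torch/torch.h>
#include <chrono>

template <typename Model>
double forward_seconds(Model &model, torch::Tensor image, torch::Device device){
    torch::NoGradGuard no_grad;   // no autograd bookkeeping during inference
    model->to(device);
    model->eval();
    image = image.to(device);
    if (device.is_cuda()) torch::cuda::synchronize();  // wait for pending GPU work
    auto start = std::chrono::steady_clock::now();
    torch::Tensor output = model->forward(image);
    if (device.is_cuda()) torch::cuda::synchronize();  // make sure the forward pass finished
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();  // seconds per image
}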
(CPU: Core i7-8700, GPU: GeForce GTX 1070)

| | | Python (CPU) | C++ (CPU) | Python (GPU, non-deterministic) | Python (GPU, deterministic) | C++ (GPU) |
|---|---|---|---|---|---|---|
| Training | Time [per epoch] | 1 h 04 min 49 s | 1 h 03 min 00 s | 5 min 53 s | 7 min 42 s | 17 min 36 s |
| | GPU memory [MiB] | 2 | 9 | 933 | 913 | 2941 |
| Test | Speed [s/image] | 0.01189 | 0.01477 | 0.00102 | 0.00101 | 0.00101 |
| | L1 error (MAE) | 0.12621 | 0.12958 | 0.12325 | 0.12104 | 0.13158 |
C++ is a compiled language, so I expected it to beat Python, an interpreted language, but **the two turned out to be an even match**.
In terms of training time, the CPU results are nearly the same, while on the GPU the C++ version is more than twice as slow as Python. (Why?) Since the CPU results are comparable and the large gap appears only on the GPU, the cause most likely lies in how the GPU is being used rather than in the language itself. As the following person has also found experimentally, there seems to be no mistake in the result that **Python is faster** when running mainly on the GPU. https://www.noconote.work/entry/2019/01/11/151624
Also, since inference (test) speed and accuracy are about the same as Python, **Python may be the better choice at present**.
The GPU memory usage is also mysteriously large. (Even though I set inplace to true for the ReLU layers...)
As for the deterministic vs. non-deterministic results of the Python (GPU) version, deterministic mode is indeed slower, just as the official documentation states.
Reproducibility really does cost time here.
Training speed
1st place: Python version (non-deterministic, GPU)
2nd place: Python version (deterministic, GPU)
3rd place: C++ version (GPU)
4th place: CPU execution (Python and C++ about the same)
Inference speed
1st place: GPU execution (Python and C++ about the same)
2nd place: CPU execution (Python and C++ about the same)
Performance
All about the same
This time, I compared the speed and performance of PyTorch between Python and C++.
As a result, Python and C++ are almost identical in performance, so I concluded there is no problem with using the **C++** version of PyTorch in that respect. However, **at this stage, adopting the C++ version of PyTorch purely for speed is probably not recommended**.
The C++ API is still under development, and it may well improve significantly in the future. That is my expectation going forward!