I tried implementing DeepPose with PyTorch PartⅡ

Introduction

In the previous article (I implemented DeepPose with PyTorch), I compared Chainer and PyTorch while implementing DeepPose. PyTorch turned out to be as easy to implement as Chainer; in terms of performance, prediction accuracy was about the same, while training was faster than Chainer. This time, I dig deeper into the performance aspect by carrying out the investigation and verification that was left for later.

Implementation changes from last time

Last time, to explain why PyTorch trains faster than Chainer, I hypothesized that PyTorch executes the automatic differentiation (the backward computation of the loss function) natively in C. This time, I changed the implementation in two ways to verify this:

  1. Fix the random seed
  2. Implement backward explicitly

The random seed is fixed to improve the reproducibility of the verification, and backward is implemented explicitly to see its effect on training speed.

Fixing the random seed

I added code to set the random seed before training starts. Note that Chainer's iterator uses `MultiprocessIterator`, and since it was difficult to fix the random numbers across multiple processes, data augmentation inside the iterator is disabled.

Chainer

     def start(self):
         """ Train pose net. """
+        # set random seed.
+        if self.seed is not None:
+            random.seed(self.seed)
+            np.random.seed(self.seed)
+            if self.gpu >= 0:
+                chainer.cuda.cupy.random.seed(self.seed)
         # initialize model to train.
         model = AlexNet(self.Nj, self.use_visibility)

PyTorch

     def start(self):
         """ Train pose net. """
+        # set random seed.
+        if self.seed is not None:
+            random.seed(self.seed)
+            torch.manual_seed(self.seed)
+            if self.gpu:
+                torch.cuda.manual_seed(self.seed)
         # initialize model to train.
         model = AlexNet(self.Nj)
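
As an aside, even with the seeds above, GPU results are generally not bit-for-bit reproducible. Current PyTorch releases (not the 0.1.x versions used in this article) also expose cuDNN switches that affect run-to-run determinism; a minimal sketch, purely for reference:

    import torch

    # Assumption: current PyTorch API, not the 0.1.x API used in this article.
    torch.backends.cudnn.deterministic = True  # prefer deterministic cuDNN algorithms
    torch.backends.cudnn.benchmark = False     # disable autotuning, which can differ between runs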

Explicit implementation of backward

According to Extending PyTorch, a `Module` is differentiated automatically, while a `Function` requires you to implement the derivative yourself. So this time the thing to do is to implement `Function.backward` (the assembled class is sketched after the diff below). Also note that a `Module` takes `Variable` inputs while a `Function` takes `Tensor` inputs. `Function` provides a convenient method called `save_for_backward` that saves variables for the backward pass, but it does not accept intermediate results of the computation, so here the intermediate results of the `forward` computation are stored in member variables instead.

PyTorch

     def forward(self, *inputs):
         x, t, v = inputs
-        diff = x - t
+        self.diff = x - t
         if self.use_visibility:
-            N = (v.sum()/2).data[0]
-            diff = diff*v
+            self.N = v.sum()/2
+            self.diff = self.diff*v
         else:
-            N = diff.numel()/2
-        diff = diff.view(-1)
-        return diff.dot(diff)/N
+            self.N = self.diff.numel()/2
+        diff = self.diff.view(-1)
+        return torch.Tensor([diff.dot(diff)/self.N])
+
+    def backward(self, *grad_outputs):
+        coeff = grad_outputs[0][0]*2/self.N
+        gx0 = coeff*self.diff
+        return gx0, None, None
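
Putting the pieces of the diff together, the explicit-backward loss ends up looking roughly like the sketch below. It follows the legacy 0.1.x-style `Function` that the article targets (where `forward`/`backward` operate on `Tensor`s); the class name and the `use_visibility` constructor argument are my guesses and need not match the repository exactly:

    import torch
    from torch.autograd import Function

    class MeanSquaredError(Function):
        """ Mean squared error over (optionally visible) joints with an explicit backward. """

        def __init__(self, use_visibility=False):
            super(MeanSquaredError, self).__init__()
            self.use_visibility = use_visibility

        def forward(self, *inputs):
            x, t, v = inputs
            # keep intermediate results in member variables for the backward pass
            self.diff = x - t
            if self.use_visibility:
                self.N = v.sum()/2
                self.diff = self.diff*v
            else:
                self.N = self.diff.numel()/2
            diff = self.diff.view(-1)
            return torch.Tensor([diff.dot(diff)/self.N])

        def backward(self, *grad_outputs):
            # d(loss)/dx = 2*(x - t)/N, scaled by the incoming gradient
            coeff = grad_outputs[0][0]*2/self.N
            gx0 = coeff*self.diff
            return gx0, None, None

With the 0.1.x API, the function object is applied to `Variable`s, e.g. `loss = MeanSquaredError(use_visibility)(x, t, v)`, while `forward` and `backward` receive the underlying `Tensor`s.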

Verification

I ran additional experiments to verify the hypothesis from last time. The dataset and environment used for verification are the same as last time.

Effect of automatic differentiation

To verify how much executing the automatic differentiation natively in C affects training speed, I measured the time required to train for 100 epochs with PyTorch's automatic differentiation and with the explicit backward, in both the CPU and the GPU environments.

CPU environment

In the CPU environment, the training time is almost the same whether PyTorch uses automatic differentiation or the explicit backward.

Library                              Time required [h]
PyTorch (automatic differentiation)  47.6
PyTorch (explicit backward)          47.6

Since the random seed is fixed, the two PyTorch learning curves almost overlap.

[Figure: Training time comparison (CPU)]

GPU environment

In the GPU environment, training with PyTorch's automatic differentiation was slightly slower than with the explicit backward. It is hard to explain why the Python implementation would be faster; it may simply be noise caused by GPU nondeterminism, and the result could change over a few more trials.

Library                              Time required [h]
PyTorch (automatic differentiation)  2.60
PyTorch (explicit backward)          2.49

Even though the random seed is fixed, the PyTorch learning curves can shift along the time axis depending on the implementation, because of nondeterminism introduced by the GPU.

[Figure: Training time comparison (GPU)]

Impact of network layer implementation

Looking at the experimental results above, the hypothesis from last time does not seem to be entirely correct. Re-reading the code to investigate the cause, I found that in Chainer each layer such as Convolution is implemented in Python, whereas in PyTorch it is implemented in C. Since this seemed likely to have the dominant effect on training time, I measured, for both Chainer and PyTorch, the total time required for the forward and backward computation of the loss function (including the network). For each batch size of $2^n$, I took 100 measurements and computed their mean and variance, roughly as sketched below.
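
For reference, here is a minimal sketch of how such a measurement could look. The helper name, the dummy batch, and the `torch.cuda.synchronize` calls are my own illustrative choices, written against the current PyTorch API rather than the 0.1.x API used in the experiments:

    import time
    import numpy as np
    import torch

    def measure_forward_backward(model, loss_fn, x, t, trials=100, use_gpu=False):
        """ Time the forward + backward pass of loss_fn(model(x), t), `trials` times. """
        times = []
        for _ in range(trials):
            if use_gpu:
                torch.cuda.synchronize()  # do not count GPU work still pending from the last trial
            start = time.time()
            loss = loss_fn(model(x), t)   # forward through the network and the loss
            loss.backward()               # backward through the loss and the network
            model.zero_grad()
            if use_gpu:
                torch.cuda.synchronize()  # wait for the backward kernels to finish
            times.append(time.time() - start)
        return np.mean(times), np.var(times)

    # e.g. sweep batch sizes 2^0 .. 2^7 with a dummy batch (x, t) of each size:
    # for n in range(8):
    #     mean, var = measure_forward_backward(model, loss_fn, x[:2**n], t[:2**n])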

CPU environment

In the CPU environment, the processing time is almost the same at $n = 0$, but PyTorch becomes increasingly superior as $n$ grows. Given that the average training time per epoch in the CPU environment is 8.2 [sec] for Chainer and 5.0 [sec] for PyTorch, this hypothesis seems reasonable.

[Figure: Core process time comparison (CPU)]

GPU environment

In the GPU environment, as in the CPU environment, PyTorch becomes superior as $n$ increases. Given that the average training time per epoch in the GPU environment is 0.45 [sec] for Chainer and 0.28 [sec] for PyTorch, this hypothesis again seems reasonable.

[Figure: Core process time comparison (GPU)]

Summary

The hypothesis from last time, that "PyTorch trains faster than Chainer because the backward computation of the loss function is executed natively in C", turned out to be half right and half wrong. The effect of how the backward of the loss function is implemented appears to be insignificant. What dominates the training time is how the layers of the network, such as Convolution, are implemented. In hindsight, that is a natural conclusion. Note, however, that while I used PyTorch 0.1.10 in this experiment, I also tried the latest version at the time of writing, 0.1.12, and with it Chainer was actually faster. My impression is that PyTorch is still very much under development. The code is on GitHub, so I will try again once development settles down.
