In the previous article (I implemented DeepPose with PyTorch), I compared Chainer and PyTorch while implementing DeepPose. PyTorch is as easy to implement as Chainer, and in terms of performance, its predictions are about the same as Chainer's while its training is faster. This time, we dig deeper into the performance aspect by carrying out the investigation and verification left over from last time.

Last time, regarding the fact that PyTorch trains faster than Chainer, I hypothesized that this is because the backward computation of the loss function (the automatic differentiation) is executed natively in C. This time, I changed the implementation in two ways to verify this.
Added a step to set the random seed before training starts. Note that Chainer's iterator uses `MultiprocessIterator`, and because it was difficult to fix the random numbers across multiple processes, data augmentation in the iterator is disabled.
Chainer
```diff
def start(self):
    """ Train pose net. """
+   # set random seed.
+   if self.seed is not None:
+       random.seed(self.seed)
+       np.random.seed(self.seed)
+       if self.gpu >= 0:
+           chainer.cuda.cupy.random.seed(self.seed)
    # initialize model to train.
    model = AlexNet(self.Nj, self.use_visibility)
```
PyTorch
```diff
def start(self):
    """ Train pose net. """
+   # set random seed.
+   if self.seed is not None:
+       random.seed(self.seed)
+       torch.manual_seed(self.seed)
+       if self.gpu:
+           torch.cuda.manual_seed(self.seed)
    # initialize model to train.
    model = AlexNet(self.Nj)
```
According to Extending PyTorch, the backward of a `Module` is handled by automatic differentiation, whereas a `Function` requires you to implement the differentiation yourself. So this time, it seems I should implement `Function.backward`. Also note that the input to a `Module` is a `Variable`, while the input to a `Function` is a `Tensor`.
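For contrast, a loss written as a `Module` needs no hand-written backward at all. Here is a minimal sketch (my own illustration, not code from the article's repository), assuming the inputs are `Variable`s:

```python
from torch import nn


class MSELossModule(nn.Module):
    """Minimal sketch of a mean-squared loss written as a Module.

    Every operation on the Variables below is tracked by autograd,
    so no explicit backward() needs to be written.
    """
    def forward(self, x, t):
        diff = x - t
        # Normalize by numel/2, mirroring the N = numel/2 convention used
        # for the (x, y) joint coordinates in the article's loss.
        return (diff * diff).sum() / (diff.data.numel() / 2)
```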
Note that `Function` has a convenient method called `save_for_backward`, which lets you save variables for the `backward` computation, but it only supports inputs and outputs, not values produced in the middle of the calculation. The intermediate results of the `forward` computation are therefore stored in member variables instead.
PyTorch
```diff
def forward(self, *inputs):
    x, t, v = inputs
-   diff = x - t
+   self.diff = x - t
    if self.use_visibility:
-       N = (v.sum()/2).data[0]
-       diff = diff*v
+       self.N = v.sum()/2
+       self.diff = self.diff*v
    else:
-       N = diff.numel()/2
-   diff = diff.view(-1)
-   return diff.dot(diff)/N
+       self.N = self.diff.numel()/2
+   diff = self.diff.view(-1)
+   return torch.Tensor([diff.dot(diff)/self.N])
+
+def backward(self, *grad_outputs):
+   coeff = grad_outputs[0][0]*2/self.N
+   gx0 = coeff*self.diff
+   return gx0, None, None
```
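For reference, a legacy-style `Function` like the one above is used by instantiating it and calling it with `Variable`s; autograd then routes the gradient through the hand-written `backward`. A rough usage sketch follows (the class name `MeanSquaredError` and its constructor argument are my assumptions, not confirmed against the repository):

```python
# Hypothetical usage; the class and argument names are assumptions.
criterion = MeanSquaredError(use_visibility=True)
loss = criterion(x, t, v)  # x, t, v: Variables; forward() receives their Tensors
loss.backward()            # invokes the explicit backward() defined above
```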
An additional experiment was conducted to verify the hypothesis set up last time. The dataset and environment used for verification are the same as last time.

To verify the effect of PyTorch's native C automatic differentiation on training speed, I measured the time required to train 100 epochs with PyTorch's automatic differentiation and with an explicitly implemented backward, in both the CPU environment and the GPU environment.
In the CPU environment, the training times with PyTorch's automatic differentiation and with explicit differentiation are almost the same.
| Library | Time required [h] |
|---|---|
| PyTorch (automatic differentiation) | 47.6 |
| PyTorch (explicit differentiation) | 47.6 |
Since the random seed is fixed, the learning curves of the two PyTorch runs almost overlap.
In the GPU environment, training with PyTorch's automatic differentiation was slightly slower than with explicit differentiation. It is hard to see why the Python implementation would be faster; this may be due to the nondeterminism introduced by the GPU, and the result might change over a few more trials.
| Library | Time required [h] |
|---|---|
| PyTorch (automatic differentiation) | 2.60 |
| PyTorch (explicit differentiation) | 2.49 |
Although the random seed is fixed, PyTorch's learning curve can shift along the time axis depending on the implementation method, because of the nondeterminism introduced by the GPU.
Looking at the above experimental results, the hypothesis made last time does not seem to be entirely correct. Looking at the code again to investigate the cause, the implementation of each layer such as Convolution is in Python in Chainer and in C in PyTorch. Since this seemed to have a dominant effect on training time, I measured, for both Chainer and PyTorch, the total time required for the forward and backward computations of the loss function (including the network). For each batch size of $2^n$, the measurement was repeated 100 times and the mean and variance were computed.
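The timing loop was roughly like the following sketch (my own reconstruction; `run_step` is assumed to be a callable that performs one forward and backward pass of the loss on a fixed mini-batch, which the article does not show):

```python
import time

import numpy as np


def benchmark(run_step, n_trials=100):
    """Time one forward+backward pass n_trials times; return mean and variance [sec]."""
    elapsed = []
    for _ in range(n_trials):
        start = time.time()
        run_step()
        elapsed.append(time.time() - start)
    elapsed = np.array(elapsed)
    return elapsed.mean(), elapsed.var()
```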
In the CPU environment, the processing time is almost the same when $n = 0$, but PyTorch becomes superior as $n$ increases. Considering that the average training time per epoch in the CPU environment is 8.2 [sec] for Chainer and 5.0 [sec] for PyTorch, the above hypothesis seems appropriate.
In the GPU environment, as in the CPU environment, PyTorch becomes superior as $n$ increases. Considering that the average training time per epoch in the GPU environment is 0.45 [sec] for Chainer and 0.28 [sec] for PyTorch, the above hypothesis seems appropriate.
The hypothesis I made last time, that PyTorch trains faster than Chainer because the backward computation of the loss function is executed natively in C, turned out to be half right and half wrong. The difference in how the backward computation of the loss function is implemented seems to have an insignificant effect on training time. What dominates the training time is how the layers of the network, such as Convolution, are implemented. Once understood, it is a natural conclusion. However, although I used PyTorch 0.1.10 in this experiment, I also tried the experiment with 0.1.12, the latest version at the time of writing, and in that case Chainer was actually faster. My impression is that PyTorch is still very much under development. The code is on GitHub, so I will try again once development has settled down.