The previous article is here
In the previous article, I mentioned that "learning on Keras' MNIST dataset takes several hours". Several hours for a dataset as small as MNIST is unacceptable, so let's speed things up.
When it comes to speeding up deep learning, the usual answer is GPUs and TPUs. So in this article we will do GPU programming to make use of an NVIDIA GPU.
The package we use is CuPy. The reason comes later...
- Speed up deep learning
  - GPU
  - TPU
- GPU programming with CuPy
  - Installation and confirmation of CuPy
  - CuPy programming
  - Confirm effect
- For further speed
  - Time measurement of error calculation
  - Time measurement of learning part
  - Time measurement of back propagation
## Speed up deep learning

Note: feel free to skip this section; it is just a leisurely digression.
Computer architecture has been advancing at an unprecedented pace in recent years. For example, when I was a kid the Game Boy Advance was popular, and the games that ran on it held at most **32 MB** of data. These days, games running on high-spec hardware such as the PS4, PS5, and Switch routinely weigh in at **10 GB**. Since 1 GB is 1024 MB, 10 GB is roughly 300 times 32 MB, which means **we have come to handle a few hundred times more data in only a decade or so**. That alone shows how far storage media such as HDDs and SSDs have come.
Of course, as the amount of data grows, so does the number of instructions the computer has to process. The demand for CPU performance never seems to be satisfied, and CPUs have kept up by doubling in performance roughly every 18 months (more recently, every 24 months), following the empirical rule known as **Moore's Law** [^1]. That means 15 years of progress multiplies a computer's processing performance by about 1024 (15 years ÷ 1.5 years per doubling = 10 doublings, i.e. $2^{10} = 1024$). Amazing, isn't it?
However, as mentioned above, every time CPUs get faster and can do more, the performance demanded of them rises without limit. As a result, people have always been lamenting a lack of performance.
There is no doubt that improved processor performance is what put deep learning in the limelight, and yet there are still plenty of situations where CPU performance falls short. One of them is image recognition with **convolutional neural networks (CNNs)**. Because image data is two-dimensional, even a moderately large image dataset quickly turns into tensors with tens of thousands of elements, and current CPUs are hopelessly underpowered for that. The previous article did not mention it explicitly, but when I trained on Keras' MNIST dataset on Google Colaboratory, it took about **30 minutes per epoch**. Since Google Colaboratory has a 12-hour limit, you can only run 24 epochs (unless you resume from an intermediate save and reload). Well, MNIST can still be learned to sufficient accuracy within that.
Anyway, it's not easy to experiment with this. That's why the GPU attracted attention.
### GPU

The CPU is the Central Processing Unit; the GPU is the Graphics Processing Unit. As the name suggests, it is a **semiconductor processor specialized in computations for rendering the screen**. A **CPU excels at general-purpose computation**, while a **GPU is specialized for image-processing computation**, and for that workload its speed is overwhelming. Because it performs massively parallel computation on thousands of cores or more, screen rendering basically proceeds without lag. And here is the key point: this massive parallelism is a natural fit for matrix computation. (Note that the figure is only a conceptual image; the GPU is not literally doing it this way.) What I want you to take away from it is that **matrix computation can be executed in parallel**. The same thing can be done by parallelizing CPUs, but GPUs do it on a completely different scale.
GPUs came to be used for deep learning through **GPGPU** (General-Purpose computing on Graphics Processing Units), the technology of using graphics processors for general-purpose computation, and they have contributed enormously to its development.
### TPU

Now, GPUs and GPGPU enabled rapid progress in deep learning, but it is human nature to never be satisfied. That is how the **TPU** (Tensor Processing Unit) came about.
While the GPU was designed for graphics, the TPU **was designed from the start to run the tensor computations of deep learning at high speed**. By sacrificing versatility and a little numerical precision, it achieves speeds that overwhelm the GPU; the GPU cannot even come close.
Because it specializes in tensor computation it is less versatile than a GPU, and it gains speed by dropping from 32-bit or 64-bit arithmetic down to 8-bit or 16-bit. Furthermore, to cut down even on cache writes, data is passed directly between arithmetic circuits; everything is geared toward computing tensors as fast as possible.
AlphaGo Zero is a famous example of this overwhelming power: a workload that would take roughly **30,000 years** on CPUs by a naive estimate was finished in **3 days** using multiple TPUs. The number barely even makes sense. lol
As you can see, the benefits of massively parallel computing for deep learning are enormous.
## GPU programming with CuPy

Now, let's get to the main subject. In this article we use CuPy for GPU programming.
CuPy was originally developed for implementing GPU code (CUDA programming) in Chainer. Its biggest advantage is that it follows the numpy API, so most code can be ported just by rewriting `np` (`import numpy as np`) to `cp` (`import cupy as cp`). That is why I decided to use CuPy in this article. Easy is great! lol
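As a rough illustration of what that swap looks like in practice (this snippet is my own toy example, not taken from the article's code base), the same function can be written once with `np` and once with `cp`, with `cp.asarray` moving data onto the GPU and `.get()` copying the result back to the host:

```python
import numpy as np
import cupy as cp

def affine_cpu(x, W, b):
    # plain NumPy version
    return np.dot(x, W) + b

def affine_gpu(x, W, b):
    # identical code, np simply replaced by cp
    return cp.dot(x, W) + b

x = np.random.randn(8, 4).astype(np.float32)
W = np.random.randn(4, 3).astype(np.float32)
b = np.zeros(3, dtype=np.float32)

y_cpu = affine_cpu(x, W, b)
# cp.asarray moves host arrays to the GPU; .get() copies the result back
y_gpu = affine_gpu(cp.asarray(x), cp.asarray(W), cp.asarray(b)).get()
print(np.allclose(y_cpu, y_gpu, atol=1e-5))
```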
I have not given the structure much thought. Or rather, it is messy because I threw it together as a prototype... I will clean it up eventually. Maybe I should use a decorator... Please let me know if you have a better idea. The code is here.
**By the way, to use a GPU on Google Colaboratory you need to select GPU as the runtime type.**
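As a quick optional check (this is just the standard NVIDIA tool, not anything specific to this article), running `nvidia-smi` in a cell should list the GPU assigned to the runtime; if it errors out, the runtime type is probably still set to CPU:

```bash
!nvidia-smi
```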
### Installation and confirmation of CuPy

To install CuPy, run a cell containing the following code.
```bash
!curl https://colab.chainer.org/install | sh -
```
```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1580  100  1580    0     0   6666      0 --:--:-- --:--:-- --:--:--  6666
+ apt -y -q install cuda-libraries-dev-10-0
Reading package lists...
Building dependency tree...
Reading state information...
cuda-libraries-dev-10-0 is already the newest version (10.0.130-1).
0 upgraded, 0 newly installed, 0 to remove and 11 not upgraded.
+ pip install -q cupy-cuda100 chainer
     |████████████████████████████████| 348.0MB 51kB/s
+ set +ex
Installation succeeded!
```
This automatically installs the appropriate version of CuPy. Chainer is installed as well; we won't use it, but that's fine.
You can check if it is installed properly with the following code.
```bash
!python -c 'import chainer; chainer.print_runtime_info()'
```
```
Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Chainer: 7.4.0
ChainerX: Not Available
NumPy: 1.18.5
CuPy: Not Available
iDeep: 2.0.0.post3
```
It is OK if you can confirm the output like this.
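If you want to check CuPy itself directly (independently of Chainer's report), a minimal sanity check along these lines should work, assuming the installation above succeeded:

```python
import cupy as cp

print(cp.cuda.runtime.getDeviceCount())  # number of visible CUDA devices
x = cp.arange(10, dtype=cp.float32)      # array allocated on the GPU
print(cp.sum(x * 2))                     # computed on the GPU, printed on the host
```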
### CuPy programming

Let's take (part of) the activation function code as an example.
```python:activator.py
import numpy as np
import cupy as cp


class Activator():
    def __init__(self, *args, mode="cpu", **kwds):
        self.mode = mode

        # Swap in the CPU or GPU implementation depending on the mode.
        if self.mode == "cpu":
            self.forward = self.cpu_forward
            self.backward = self.cpu_backward
            self.update = self.cpu_update
        elif self.mode == "gpu":
            self.forward = self.gpu_forward
            self.backward = self.gpu_backward
            self.update = self.gpu_update

    def cpu_forward(self, *args, **kwds):
        raise NotImplementedError

    def gpu_forward(self, *args, **kwds):
        raise NotImplementedError

    def cpu_backward(self, *args, **kwds):
        raise NotImplementedError

    def gpu_backward(self, *args, **kwds):
        raise NotImplementedError

    def cpu_update(self, *args, **kwds):
        raise NotImplementedError

    def gpu_update(self, *args, **kwds):
        raise NotImplementedError


class step(Activator):
    def cpu_forward(self, x, *args, **kwds):
        return np.where(x > 0, 1, 0)

    def gpu_forward(self, x, *args, **kwds):
        return cp.where(x > 0, 1, 0)

    def cpu_backward(self, x, *args, **kwds):
        return np.zeros_like(x)

    def gpu_backward(self, x, *args, **kwds):
        return cp.zeros_like(x)
```
I wrote this pretty mindlessly; there must be a smarter way...
What it does is branch between implementations by taking advantage of the fact that Python functions are objects that can be assigned. The function bodies themselves differ only in whether they call `np` or `cp`! That is the beauty of CuPy: GPU programming becomes easy and convenient.
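As for the decorator idea mentioned earlier, here is one possible sketch (just an idea of mine, not part of the article's code): a decorator that injects `np` or `cp` into the method depending on `self.mode`, so each function only has to be written once. The names `with_xp` and `Step` are made up for this example.

```python
import numpy as np
import cupy as cp


def with_xp(method):
    """Call the method with np or cp as `xp`, depending on self.mode."""
    def wrapper(self, *args, **kwds):
        xp = cp if self.mode == "gpu" else np
        return method(self, xp, *args, **kwds)
    return wrapper


class Step:
    def __init__(self, mode="cpu"):
        self.mode = mode

    @with_xp
    def forward(self, xp, x):
        return xp.where(x > 0, 1, 0)

    @with_xp
    def backward(self, xp, x):
        return xp.zeros_like(x)


act = Step(mode="cpu")
print(act.forward(np.array([-1.0, 0.5, 2.0])))  # -> [0 1 1]
```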
### Confirm effect

Let's experiment with the Keras MNIST dataset. To run it, execute all cells up to the experiment code, run the cell that loads the Keras data, and finally run the main CNN experiment cell.
```python:cnn_main.py
%matplotlib inline
# Create convolution layer and output layer
M, F_h, F_w = 10, 3, 3

lm = LayerManager((x_train, x_test), (t_train, t_test), mode="gpu")
#lm.append(name="c", I_shape=(C, I_h, I_w), F_shape=(M, F_h, F_w), pad=1,
#          wb_width=0.1, opt="AdaDelta", opt_dic={"eta": 1e-2})
lm.append(name="c", I_shape=(C, I_h, I_w), F_shape=(M, F_h, F_w), pad=1)
lm.append(name="p", I_shape=lm[-1].O_shape, pool=2)
#lm.append(name="m", n=100, wb_width=0.1,
#          opt="AdaDelta", opt_dic={"eta": 1e-2})
lm.append(name="m", n=100)
#lm.append(name="o", n=n_class, act="softmax", err_func="Cross", wb_width=0.1,
#          opt="AdaDelta", opt_dic={"eta": 1e-2})
lm.append(name="o", n=n_class, act="softmax", err_func="Cross")

# To learn
epoch = 5
threshold = 1e-8
n_batch = 8
lm.training(epoch, threshold=threshold, n_batch=n_batch, show_train_error=True)

# Predict
print("training dataset")
_ = lm.predict(x=lm.x_train, y=lm.y_train)
print("test dataset")
if lm.mode == "cpu":
    y_pred = lm.predict()
elif lm.mode == "gpu":
    y_pred = lm.predict().get()
```
```
progress:[XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX]483s/514s
training dataset
correct: [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7]
predict: [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7]
accuracy rate: 98.58 % (59148/60000)
test dataset
correct: [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5]
predict: [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5]
accuracy rate: 97.58 % (9758/10000)
```
There is no particular reason for it, but the activation function, weight initialization range `wb_width`, optimizer, and so on are left at their defaults: the activation function is ReLU, `wb_width` is 0.05, and the optimizer is Adam. The number of training epochs is set to 5.
The result is about 100 seconds per epoch, an 18x speedup! It's still slow, but it should be usable in practice. Except for MNIST... (distant eyes)
By the way, at the bottom of the test code there is Keras' own MNIST training code, copied from here.
```python:mnist_cnn.py
'''Trains a simple convnet on the MNIST dataset.
Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```
```
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Epoch 1/12
469/469 [==============================] - 4s 9ms/step - loss: 2.2889 - accuracy: 0.1426 - val_loss: 2.2611 - val_accuracy: 0.2889
Epoch 2/12
469/469 [==============================] - 4s 9ms/step - loss: 2.2432 - accuracy: 0.2350 - val_loss: 2.2046 - val_accuracy: 0.4885
Epoch 3/12
469/469 [==============================] - 4s 9ms/step - loss: 2.1837 - accuracy: 0.3312 - val_loss: 2.1279 - val_accuracy: 0.5908
Epoch 4/12
469/469 [==============================] - 4s 9ms/step - loss: 2.1039 - accuracy: 0.4035 - val_loss: 2.0235 - val_accuracy: 0.6492
Epoch 5/12
469/469 [==============================] - 4s 9ms/step - loss: 1.9959 - accuracy: 0.4669 - val_loss: 1.8864 - val_accuracy: 0.6989
Epoch 6/12
469/469 [==============================] - 4s 9ms/step - loss: 1.8604 - accuracy: 0.5193 - val_loss: 1.7149 - val_accuracy: 0.7420
Epoch 7/12
469/469 [==============================] - 4s 9ms/step - loss: 1.6990 - accuracy: 0.5681 - val_loss: 1.5179 - val_accuracy: 0.7688
Epoch 8/12
469/469 [==============================] - 4s 9ms/step - loss: 1.5315 - accuracy: 0.6014 - val_loss: 1.3180 - val_accuracy: 0.7912
Epoch 9/12
469/469 [==============================] - 4s 9ms/step - loss: 1.3717 - accuracy: 0.6327 - val_loss: 1.1394 - val_accuracy: 0.8029
Epoch 10/12
469/469 [==============================] - 4s 9ms/step - loss: 1.2431 - accuracy: 0.6562 - val_loss: 0.9945 - val_accuracy: 0.8171
Epoch 11/12
469/469 [==============================] - 4s 9ms/step - loss: 1.1369 - accuracy: 0.6757 - val_loss: 0.8818 - val_accuracy: 0.8263
Epoch 12/12
469/469 [==============================] - 4s 9ms/step - loss: 1.0520 - accuracy: 0.6957 - val_loss: 0.7949 - val_accuracy: 0.8356
Test loss: 0.7948545217514038
Test accuracy: 0.8356000185012817
```
It's fast... roughly 20 times faster than my code... which means there is still plenty of room for speedup!
## For further speed

Now, let's find out where the bottleneck in my current code is. Processing time is measured with the `timeit` magic, which takes care of the timing for us.
### Time measurement of error calculation

First, let's measure the time required for the error calculation.
```python:search_bottleneck.py
# Calculation of training error
%%timeit
lm.forward(lm.x_train)
error = lm[-1].get_error(lm.y_train)

#----------output----------
# 1 loop, best of 3: 957 ms per loop
#--------------------------

# Calculation of test error
%%timeit
lm.forward(lm.x_test)
error = lm[-1].get_error(lm.y_test)

#----------output----------
# 10 loops, best of 3: 160 ms per loop
#--------------------------
```
The training-error calculation covers 60,000 samples, so it takes about this long. I think it would be fine to use fewer... that is one place we could improve: if we sampled it down to 10,000 like the test data, we would save roughly 0.8 seconds per epoch. Still, that is almost a rounding error. Against the whole (about 100 seconds per epoch), the error calculation accounts for roughly 1%, so it is not the bottleneck. The learning part seems to be the problem.
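For reference, a sketch of what that subsampling could look like with the `lm` object used above (the random 10,000-sample subset is my own suggestion; `lm.forward` and `get_error` are the same calls as in the cells above):

```python
import numpy as np

# Estimate the training error on a random 10,000-sample subset
# instead of all 60,000 samples.
idx = np.arange(lm.x_train.shape[0])
np.random.shuffle(idx)
idx = idx[:10000]

lm.forward(lm.x_train[idx])
train_error_estimate = lm[-1].get_error(lm.y_train[idx])
```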
### Time measurement of learning part

Let's measure the processing time of the learning part. Somewhere in the training loop, something must be eating 99% of the per-epoch time...
```python:search_bottleneck.py
# The data for one mini-batch is the measurement target.
rand_index = np.arange(lm.x_train.get().shape[0])
np.random.shuffle(rand_index)
rand = rand_index[0 : n_batch]

# Forward propagation calculation
%%timeit
lm.forward(lm.x_train[rand])

#----------output----------
# 1000 loops, best of 3: 1.32 ms per loop
#--------------------------

# Backpropagation calculation
%%timeit
lm.backward(lm.y_train[rand])

#----------output----------
# 100 loops, best of 3: 10.3 ms per loop
#--------------------------

# Weight update calculation
%%timeit
lm.update()

#----------output----------
# 1000 loops, best of 3: 1.64 ms per loop
#--------------------------
```
Clearly, only backpropagation takes an unusually long time, about 10 times longer than forward propagation or the weight update.
~~Since the training data has 60,000 samples and the mini-batch size is 8, this set of computations is repeated 7,500 times per epoch, so in total it should cost $(1.32 + 10.3 + 1.64) \times 7500 \times 10^{-3} = 23.92\,\mathrm{s}$. That is less than I expected...? It certainly takes time, but not nearly enough to explain 100 seconds... well, there is a lot of fluctuation, so let's not worry about it for now.~~
It was just a calculation error on my part... the sum is actually $(1.32 + 10.3 + 1.64) \times 7500 \times 10^{-3} \approx 99.5\,\mathrm{s}$, which matches the roughly 100 seconds per epoch observed. I need to learn to use a calculator properly.
### Time measurement of back propagation

So let's break the backpropagation down and measure each part.
```python:search_bottleneck.py
# Advance preparation
err3 = lm[3].backward(lm.y_train[rand])
err2 = lm[2].backward(err3)
err2 = err2.reshape(n_batch, *lm[1].O_shape)
err1 = lm[1].backward(err2)
err0 = lm[0].backward(err1)

# Output layer backpropagation
%%timeit
err3 = lm[3].backward(lm.y_train[rand])

#----------output----------
# 10000 loops, best of 3: 152 µs per loop
#--------------------------

# Backpropagation of the middle layer
%%timeit
err2 = lm[2].backward(err3)
err2 = err2.reshape(n_batch, *lm[1].O_shape)

#----------output----------
# 1000 loops, best of 3: 224 µs per loop
#--------------------------

# Backpropagation of pooling layer
%%timeit
err1 = lm[1].backward(err2)

#----------output----------
# 1000 loops, best of 3: 9.72 ms per loop
#--------------------------

# Backpropagation of convolution layer
%%timeit
err0 = lm[0].backward(err1)

#----------output----------
# 1000 loops, best of 3: 442 µs per loop
#--------------------------
```
It turns out the pooling layer is slower by orders of magnitude: its processing time accounts for about 93.6% of the backpropagation time. (Adding everything up gives roughly 10 ms, which is consistent with the earlier measurement.)
So let's take a closer look at the backpropagation of the pooling layer in question.
```python:search_bottleneck.py
# Advance preparation
B, C, O_h, O_w = n_batch, *lm[1].O_shape
grad = err2.transpose(0, 2, 3, 1).reshape(-1, 1)
grad_x = cp.zeros((grad.size, lm[1].pool*lm[1].pool))
grad_x1 = grad_x.copy()
grad_x1[:, lm[1].max_index] = grad
grad_x2 = grad_x1.reshape(B*O_h*O_w, C*lm[1].pool*lm[1].pool).T

# Dimensional swapping and transformation of error
%%timeit
grad = err2.transpose(0, 2, 3, 1).reshape(-1, 1)

#----------output----------
# 100000 loops, best of 3: 17.1 µs per loop
#--------------------------

# Empty matrix generation
%%timeit
grad_x = cp.zeros((grad.size, lm[1].pool*lm[1].pool))

#----------output----------
# 100000 loops, best of 3: 7.89 µs per loop
#--------------------------

# Value filling
%%timeit
grad_x1[:, lm[1].max_index] = grad

#----------output----------
# 1000 loops, best of 3: 9.5 ms per loop
#--------------------------

# Deformation and transposition
%%timeit
grad_x2 = grad_x1.reshape(B*O_h*O_w, C*lm[1].pool*lm[1].pool).T

#----------output----------
# 1000000 loops, best of 3: 1.86 µs per loop
#--------------------------

# col2im
%%timeit
grad_x3 = lm[1].col2im(grad_x2, (n_batch, *lm[1].I_shape), lm[1].O_shape,
                       stride=lm[1].pool, pad=lm[1].pad_state)

#----------output----------
# 10000 loops, best of 3: 112 µs per loop
#--------------------------
```
Value filling is overwhelmingly slower than everything else... that is the bottleneck here. In fact, value filling accounts for about 98.6% of the pooling layer's backpropagation time.
The GPU is strong at simple computations, but as soon as the processing gets even a little convoluted like this, it slows down sharply and you cannot exploit its performance. So let's think about how to improve it.
I wondered whether there was a better way to do this value filling. My first thought was that the CPU is better suited to this kind of fiddly processing, so this part should run on the CPU instead of the GPU. However, experiments showed that when everything ran on the CPU the bottleneck was in exactly the same place, so that idea was rejected.
The next idea was to rewrite this part in a different form: in other words, "replace this assignment with a kind of computation the GPU is good at". So instead of holding the argmax indices, we hold a sparse 0/1 matrix with the same shape as the input that is thrown to the `im2col` function: 1 at the position of the maximum value, 0 everywhere else. The required memory is $\mathrm{pool}^2$ times that of the index version, but $\mathrm{pool}$ is usually small, so this is acceptable. A small illustration of the idea follows, and the updated layer code comes after it.
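Here is a tiny standalone sketch of the idea (a toy example of mine, written with NumPy; `cp.max` and `cp.where` behave the same way on the GPU):

```python
import numpy as np

# Toy im2col-style input: 4 pooling windows, each with 4 candidates (2x2 pool).
x = np.array([[1., 3., 2., 0.],
              [5., 4., 0., 1.],
              [2., 2., 7., 3.],
              [0., 6., 1., 1.]])

y = np.max(x, axis=1, keepdims=True)           # forward output of max pooling
mask = np.where(x == y, 1, 0)                  # 1 only at the max position(s)

grad = np.array([[0.1], [0.2], [0.3], [0.4]])  # upstream gradient, one per window
grad_x = mask * grad                           # gradient routed by multiplication,
                                               # no fancy indexing needed
print(grad_x)
```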
```python:pool.py
import numpy as np
import cupy as cp


class PoolingLayer(BaseLayer):
    def __init__(self, *, mode="cpu",
                 I_shape=None, pool=1, pad=0,
                 name="", **kwds):
        self.mode = mode

        self.name = name

        if I_shape is None:
            raise KeyError("Input shape is None.")

        if len(I_shape) == 2:
            C, I_h, I_w = 1, *I_shape
        else:
            C, I_h, I_w = I_shape
        self.I_shape = (C, I_h, I_w)

        # Holds im2col and col2im functions
        if self.mode == "cpu":
            self.im2col = cpu_im2col
            self.col2im = cpu_col2im
        elif self.mode == "gpu":
            self.im2col = gpu_im2col
            self.col2im = gpu_col2im

        if self.mode == "cpu":
            _, O_shape, self.pad_state = self.im2col(
                np.zeros((1, *self.I_shape)),
                (pool, pool),
                stride=pool, pad=pad)
        elif self.mode == "gpu":
            _, O_shape, self.pad_state = self.im2col(
                cp.zeros((1, *self.I_shape)),
                (pool, pool),
                stride=pool, pad=pad)
        self.O_shape = (C, *O_shape)

        self.n = np.prod(self.O_shape)

        self.pool = pool
        self.F_shape = (pool, pool)

    def forward(self, x):
        B = x.shape[0]
        C, O_h, O_w = self.O_shape

        self.x, _, self.pad_state = self.im2col(x, self.F_shape,
                                                stride=self.pool,
                                                pad=self.pad_state)
        self.x = self.x.T.reshape(B*O_h*O_w*C, -1)

        if self.mode == "cpu":
            #self.max_index = np.argmax(self.x, axis=1)
            self.y = np.max(self.x, axis=1, keepdims=True)
            self.max_index = np.where(self.y == self.x, 1, 0)
            self.y = self.y.reshape(B, O_h, O_w, C).transpose(0, 3, 1, 2)
        elif self.mode == "gpu":
            #self.max_index = cp.argmax(self.x, axis=1)
            self.y = cp.max(self.x, axis=1, keepdims=True)
            self.max_index = cp.where(self.y == self.x, 1, 0)
            self.y = self.y.reshape(B, O_h, O_w, C).transpose(0, 3, 1, 2)

        return self.y

    def backward(self, grad):
        B = grad.shape[0]
        I_shape = B, *self.I_shape
        C, O_h, O_w = self.O_shape

        grad = grad.transpose(0, 2, 3, 1).reshape(-1, 1)
        if self.mode == "cpu":
            self.grad_x = np.zeros((grad.size, self.pool*self.pool))
        elif self.mode == "gpu":
            self.grad_x = cp.zeros((grad.size, self.pool*self.pool))
        #self.grad_x[:, self.max_index] = grad
        self.grad_x = self.max_index*grad
        self.grad_x = self.grad_x.reshape(B*O_h*O_w, C*self.pool*self.pool).T
        self.grad_x = self.col2im(self.grad_x, I_shape, self.O_shape,
                                  stride=self.pool, pad=self.pad_state)

        return self.grad_x

    def update(self, **kwds):
        pass
```
Let's experiment.
```python:search_bottleneck.py
# Backpropagation of pooling layer
%%timeit
err1 = lm[1].backward(err2)

#----------output----------
# 1000 loops, best of 3: 280 µs per loop
#--------------------------

# Value filling
%%timeit
grad_x1 = lm[1].max_index*grad

#----------output----------
# 100000 loops, best of 3: 16.3 µs per loop
#--------------------------
```
The results above were probably run on a different GPU than the earlier experiments, so a strict comparison is delicate, but there is no doubt that we have achieved a speedup. With this change, the `col2im` function and its surroundings become the next bottleneck, so there is still room for further speedup there.
Looking at the whole run:
```
progress:[XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX]287s/285s
training dataset
correct: [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7]
predict: [5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7]
accuracy rate: 99.21333333333334 % (59528/60000)
test dataset
correct: [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5]
predict: [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5]
accuracy rate: 98.03 % (9803/10000)
```
This brings us down to about 50 seconds per epoch! Since the learning time per mini-batch is now about 6 ms, the learning time per epoch works out to $6 \times 7500 \times 10^{-3} = 45\,\mathrm{s}$. ~~And the earlier mismatch has been resolved... what was that all about? I should have experimented on the same GPU... oh well.~~
This is the general approach: find the bottleneck, improve it, and speed things up. I will keep improving the code from time to time.
P.S. When the mini-batch size was set to 128, the execution time was about the same as in the Keras experiment. Good to know.
- Introduction to Deep Learning ~ Basics ~
- Introduction to Deep Learning ~ Coding Preparation ~
- Introduction to Deep Learning ~ Forward Propagation ~
- Introduction to Deep Learning ~ Backpropagation ~
- Introduction to Deep Learning ~ Learning Rules ~
- Introduction to Deep Learning ~ Localization and Loss Functions ~
- Introduction to Deep Learning ~ Function Approximation ~
- Introduction to Deep Learning ~ Convolution and Pooling ~
- Introduction to Deep Learning ~ CNN Experiment ~
- Deep Learning Gaiden ~ GPU Programming ~
- List of activation functions (2020)
- Gradient descent method list (2020)
- See and understand! Comparison of optimization methods (2020)
- Thorough understanding of im2col
- Thorough understanding of col2im
- Complete understanding of numpy.pad function
[^1]: Strictly speaking, "the degree of semiconductor integration doubles every 18 months (24 months)".