Process on GPU using chainer.cuda.elementwise

Introduction

When reading the code of Chainer functions, you may see calls to cuda.elementwise or cuda.reduce. These are methods for running your own computation on the GPU. They are indispensable for implementing Chainer functions, and understanding them seems necessary to become an intermediate Chainer user, so I investigated them. This article deals with cuda.elementwise and does not discuss cuda.reduce.

A description of cuda.elementwise can be found here: http://docs.chainer.org/en/stable/cupy-reference/kernel.html There is also a commentary by Mr. Okuda of Preferred Networks on SlideShare: http://www.slideshare.net/ryokuta/cupy

Confirmed environment

What does cuda.elementwise do?

cuda.elementwise defines a CUDA kernel. A CUDA kernel is a program that runs on a CUDA-capable GPU. Calling cuda.elementwise returns a kernel invocation function for running the CUDA kernel, and calling that kernel invocation function executes the CUDA kernel on the GPU.

The kernel invocation function is actually an ElementwiseKernel object, which is callable.

First sample code

As a first example, let's increment all the elements of a given array. First, import the required modules as shown below and define xp as a reference to cuda.cupy. The subsequent sample code assumes that the following code has been run:

import numpy as np
import chainer
from chainer import cuda

xp = cuda.cupy

Then use cuda.elementwise to increment the array elements.

x = xp.asarray([[1, 2, 3], [4, 5, 6]], dtype=np.float32)

y = cuda.elementwise(
    'T x',
    'T y',
    'y = x + 1;',
    'sample1_fwd',
)(x)

print(y)

The output looks like this: you can see that all the elements have been incremented. The first run takes some time before printing the result, because the kernel is compiled by nvcc behind the scenes.

[[ 2.  3.  4.]
 [ 5.  6.  7.]]

Explanation of sample code

cuda.elementwise is used in two stages: first call cuda.elementwise to define the kernel and obtain the kernel invocation function, then call the kernel invocation function to run the kernel, as in the sketch below. In actual code, however, the two steps are often written together, as in cuda.elementwise(...)(x).
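For illustration, here is a minimal sketch of the two-stage usage (the kernel body is the same as in the first sample; the names increment and increment_fwd are just illustrative):

increment = cuda.elementwise(
    'T x',
    'T y',
    'y = x + 1;',
    'increment_fwd',
)  # stage 1: define the kernel and get the kernel invocation function

y = increment(x)  # stage 2: call the kernel invocation function to run the kernel on the GPU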

The arguments and return value of cuda.elementwise are documented not under this method itself but in the [documentation of cupy.ElementwiseKernel](http://docs.chainer.org/en/stable/cupy-reference/kernel.html#cupy.ElementwiseKernel). cuda.elementwise internally calls cupy.ElementwiseKernel, and the arguments are the same (except that name is required in cuda.elementwise). The following arguments are required by cuda.elementwise. There are other optional arguments, but I have not fully understood them yet, so I will not explain them here.

in_params

Specifies the string that declares the input arguments. Each argument requires a type and an argument name. If the type string is a single character, it is a **type placeholder**: the concrete type represented by the placeholder is the type of the variable passed when the kernel invocation function is executed. This is useful when you want to declare several variables of the same type, as in the sketch below.
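As a sketch of how the type placeholder works (the kernel name square_fwd and variable names are just illustrative), the same kernel can be called with arrays of different dtypes; T is resolved per call depending on the input:

square = cuda.elementwise(
    'T x',
    'T y',
    'y = x * x;',
    'square_fwd',
)

y_f = square(xp.asarray([1.0, 2.0, 3.0], dtype=np.float32))  # here T becomes float32
y_i = square(xp.asarray([1, 2, 3], dtype=np.int32))          # here T becomes int32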

In the sample code, in_params was as follows.

'T x',

This string declares one input argument named x whose type is the placeholder T.

out_params

Specifies the string that declares the output arguments. Like in_params, the argument requires a type and argument name.

In the sample code, out_params was as follows.

'T y',

This string declares one output argument named y whose type is the placeholder T.

operation

Specifies the string that defines the operation to execute. In the sample code, it assigns x + 1 to y.

'y = x + 1;',

name

The name of the kernel. Looking at the implementations of the modules under chainer.functions, the convention is "function_name_fwd" for the forward computation and "function_name_bwd" for the backward computation.

Execute kernel invocation function

Running the kernel invocation function runs the defined CUDA kernel. Pass the variables corresponding to in_params as arguments. The variables corresponding to out_params can be omitted, or they can be passed explicitly after the in_params variables. The return value is the output specified by out_params; if out_params specifies multiple outputs, the return value is a tuple of them.
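As a sketch of both cases (the kernel names double_fwd and plus_minus_fwd and the variable names are just illustrative): the output can be omitted and allocated automatically, and multiple outputs are returned as a tuple.

# output omitted: an array for y is allocated automatically and returned
y = cuda.elementwise(
    'T x',
    'T y',
    'y = 2 * x;',
    'double_fwd',
)(x)

# two outputs: the return value is the tuple (y1, y2)
y1, y2 = cuda.elementwise(
    'T x',
    'T y1, T y2',
    'y1 = x + 1; y2 = x - 1;',
    'plus_minus_fwd',
)(x)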

Pass values to out_params

Let's also pass a value for out_params. Just pass the value corresponding to out_params as an argument to the kernel invocation function.

x = xp.asarray([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)

y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample2_fwd',
)(x, y)

print(y)

The execution result is as follows: x has been added to the original y.

[[ 2.  3.  4.]
 [ 6.  7.  8.]]

Broadcasting

The array will be broadcast automatically.

x = xp.asarray([1, 2, 3], dtype=np.float32)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)

y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample3_fwd',
)(x, y)

print(y)

Execution result:

[[ 2.  3.  4.]
 [ 3.  4.  5.]]

If the array sizes do not match and broadcasting is impossible, an error occurs.

x = xp.asarray([1, 2], dtype=np.float32)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)

y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample4_fwd',
)(x, y)

print(y)

Execution result:

Traceback (most recent call last):
  File "elementwise_sample.py", line 61, in <module>
    )(x, y)
  File "cupy\core\elementwise.pxi", line 508, in cupy.core.core.ElementwiseKernel.__call__ (cupy\core\core.cpp:34118)
  File "cupy\core\elementwise.pxi", line 334, in cupy.core.core._broadcast (cupy\core\core.cpp:31734)
  File "cupy\core\core.pyx", line 1504, in cupy.core.core.broadcast.__init__ (cupy\core\core.cpp:50697)
ValueError: Broadcasting failed

Indexing

Often you want to use an index when working with an array. You can do this by adding the raw qualifier to an argument's type declaration and then using the variable i, which holds the index of the element currently being processed, and _ind.size(), which gives the total number of elements.

Here is a sample that reverses the elements of an array.

x = xp.asarray([1, 2, 3, 4], dtype=np.float32)
y = xp.zeros_like(x, dtype=np.float32)

y = cuda.elementwise(
    'raw T x',
    'T y',
    'y = x[_ind.size() - i - 1];',
    'sample5_fwd',
)(x, y)

print(y)

The execution result is as follows.

[ 4.  3.  2.  1.]

You can think of the above code as the GPU equivalent of the following NumPy code.

x = np.asarray([1, 2, 3, 4], dtype=np.float32)
y = np.zeros_like(x, dtype=np.float32)
i = np.arange(4)

y = x[4 - i - 1]

Note that you need to pass y to the kernel invocation function. If you do not pass y, as shown below, you get the error ValueError: Loop size is Undecided. This seems to happen because the loop size (the index range) cannot be determined from raw arguments alone.

x = xp.asarray([1, 2, 3, 4], dtype=np.float32)
y = xp.zeros_like(x, dtype=np.float32)

y = cuda.elementwise(
    'raw T x',
    'T y',
    'y = x[_ind.size() - i - 1];',
    'sample6_fwd',
)(x)

print(y)

Slightly more complicated indexing

Consider getting x[i, t[i]] (i = 0, 1, 2, ...) when x is a two-dimensional array and t is a one-dimensional array. This can be written as follows.

x = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
t = xp.asarray([0, 2, 1], dtype=np.int32)

y = cuda.elementwise(
    'raw T x, S t',
    'T y',
    'int ind[] = {i, t}; y = x[ind];',
    'sample7_fwd',
)(x, t)

print(y)

Execution result:

[ 1.  6.  8.]

`int ind[] = {i, t};` generates an index corresponding to [(0, t[0]), (1, t[1]), (2, t[2])].
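Roughly speaking, this corresponds to the following NumPy fancy indexing (a sketch for comparison, not part of the kernel):

x = np.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
t = np.asarray([0, 2, 1], dtype=np.int32)
i = np.arange(3)

y = x[i, t]  # picks x[0, 0], x[1, 2], x[2, 1] -> [1. 6. 8.]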

for loop

You can use C syntax (to be exact, the CUDA C compiled by nvcc) such as for and while. As an example, let's compute the cumulative sum of each column of x, i.e. make y[i, j] the sum of x[0, j] through x[i, j].

x = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
y = xp.zeros_like(x)

y = cuda.elementwise(
    'raw T x, int32 c',
    'raw T y',
    '''
    int ind[] = {0, i};
    y[ind] = x[ind];
    for (int j = 1; j < c; j++) {
        int ind[] = {j, i};
        int prev_ind[] = {j - 1, i};
        y[ind] = y[prev_ind] + x[ind];
    }
    ''',
    'sample8_fwd',
)(x, x.shape[0], y, size=x.shape[1])

print(y)

Execution result:

[[  1.   2.   3.]
 [  5.   7.   9.]
 [ 12.  15.  18.]]
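For comparison, the same column-wise cumulative sum can be written in NumPy as follows (a sketch, not what the kernel actually executes):

x = np.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
y = np.cumsum(x, axis=0)  # cumulative sum down each column, same result as above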

CUDA functions

You can also use CUDA functions. However, it is unclear how many of them are supported. Let's use `atomicAdd` as an example.


x = xp.zeros((3, 3), dtype=np.float32)
t = xp.asarray([0, 1, 2], dtype=np.int32)

y = cuda.elementwise(
    'S t',
    'raw T x',
    'int ind[] = {i, t}; atomicAdd(&x[ind], 1);',
    'sample9_fwd',
)(t, x)

print(y)

Execution result:

[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
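For comparison, the kernel above is roughly equivalent to the following NumPy scatter-add (a sketch; np.add.at plays the role of atomicAdd here):

x = np.zeros((3, 3), dtype=np.float32)
t = np.asarray([0, 1, 2], dtype=np.int32)
np.add.at(x, (np.arange(3), t), 1)  # adds 1 at positions (i, t[i])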

Finally

Understanding cuda.elementwise will deepen your understanding of Chainer. cuda.elementwise and cuda.reduce are used frequently inside Chainer, so if you want to know more, their usage there is a good reference.
