When reading the code of Chainer functions, you may see calls to cuda.elementwise or cuda.reduce.
These are methods for running your own processing on the GPU.
They are indispensable for implementing Chainer functions, and understanding them seems necessary for becoming an intermediate Chainer user, so I investigated them.
This article deals with cuda.elementwise and does not discuss cuda.reduce.
A description of cuda.elementwise can be found here:
http://docs.chainer.org/en/stable/cupy-reference/kernel.html
There is also a commentary by Mr. Okuda of Preferred Networks on SlideShare:
http://www.slideshare.net/ryokuta/cupy
cuda.elementwise defines a CUDA kernel. A CUDA kernel is a program that runs on a CUDA-capable GPU.
Calling cuda.elementwise returns a kernel invocation function for running the CUDA kernel, and calling that kernel invocation function executes the CUDA kernel on the GPU.
The substance of the kernel invocation function is an ElementwiseKernel object, which is callable.
As a first example, we will increment all the elements of a given array.
First, import the required modules as shown below and define xp as a reference to cuda.cupy.
Subsequent sample code assumes that you are running the following code:
import numpy as np
import chainer
from chainer import cuda
xp = cuda.cupy
Then use cuda.elementwise to increment the array elements.
x = xp.asarray([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
y = cuda.elementwise(
    'T x',
    'T y',
    'y = x + 1;',
    'sample1_fwd',
)(x)
print(y)
The output looks like this. You can see that all the elements have been incremented. It takes some time before the output appears, but this seems to be because compilation by nvcc is happening behind the scenes.
[[ 2. 3. 4.]
[ 5. 6. 7.]]
cuda.elementwise is used in the following two stages:
1. Call cuda.elementwise to generate the kernel invocation function.
2. Call the kernel invocation function to run the CUDA kernel.
In actual code, however, the two steps are often written together, as in cuda.elementwise(...)(x).
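The kernel invocation function can also be kept and reused, as in this minimal sketch (the kernel name increment_kernel is made up for this example):
increment = cuda.elementwise(
    'T x',
    'T y',
    'y = x + 1;',
    'increment_kernel',  # hypothetical name, not used inside Chainer
)
a = increment(xp.asarray([1, 2, 3], dtype=np.float32))         # [ 2.  3.  4.]
b = increment(xp.asarray([[4, 5], [6, 7]], dtype=np.float32))  # [[ 5.  6.] [ 7.  8.]]
Since the kernel is defined once, repeated calls do not have to repeat the definition.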
Generating the kernel invocation function with cuda.elementwise
The arguments and return value of cuda.elementwise are documented not under this method itself but in the [documentation of cupy.ElementwiseKernel](http://docs.chainer.org/en/stable/cupy-reference/kernel.html#cupy.ElementwiseKernel).
cuda.elementwise internally calls cupy.ElementwiseKernel, and the arguments of the two are the same (except that name is required in cuda.elementwise).
cuda.elementwise takes the following arguments. There are other optional arguments, but I have not fully understood them yet, so I will not explain them here.
in_params
Specifies the string that declares the input arguments. Each argument requires a type and an argument name. If the type string is a single character, it is treated as a **type placeholder**. The actual type represented by a type placeholder is the type of the variable passed when the kernel invocation function is executed. This is useful when you want several arguments to share the same type (see the sketch after this list of arguments).
In the sample code, in_params was as follows.
'T x',
This string declares one input argument whose type is the placeholder T and whose name is x.
out_params
Specifies the string that declares the output arguments. Like in_params, each argument requires a type and an argument name.
In the sample code, out_params was as follows.
'T y',
This string declares one output argument whose type is the placeholder T and whose name is y.
operation
Specifies the string that defines the processing to be executed.
In the sample code, it assigns x + 1 to y.
'y = x + 1;',
name
The name of the kernel. Looking at the implementations of the modules under chainer.functions, the convention is "function_name_fwd" for forward processing and "function_name_bwd" for backward processing.
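Here is a small sketch of type placeholders (the kernel name placeholder_sample is made up for this example): using the same placeholder T for several arguments makes them share one dtype, while a different placeholder such as S allows another dtype; both are determined by the arrays passed at call time.
x = xp.asarray([1, 2, 3], dtype=np.float32)   # T becomes float32
t = xp.asarray([10, 20, 30], dtype=np.int32)  # S becomes int32
z = cuda.elementwise(
    'T x, S t',
    'T z',
    'z = x + t;',
    'placeholder_sample',  # hypothetical name
)(x, t)
print(z)  # expected: [ 11.  22.  33.]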
Running the kernel invocation function
Running the kernel invocation function executes the defined CUDA kernel. Pass the variables corresponding to in_params as arguments. The variables corresponding to out_params can be omitted, but they can also be passed explicitly after the in_params variables. The return value consists of the arguments specified by out_params; if out_params specifies multiple arguments, the return value is a tuple of them.
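For example, a kernel with two output arguments returns a tuple (a sketch; the kernel name sum_diff_sample is made up for this example):
x = xp.asarray([1, 2, 3], dtype=np.float32)
y = xp.asarray([10, 20, 30], dtype=np.float32)
s, d = cuda.elementwise(
    'T x, T y',
    'T s, T d',
    's = x + y; d = x - y;',
    'sum_diff_sample',  # hypothetical name
)(x, y)
print(s)  # expected: [ 11.  22.  33.]
print(d)  # expected: [ -9. -18. -27.]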
Let's also pass a value for out_params. Simply pass the value corresponding to out_params as an argument to the kernel invocation function.
x = xp.asarray([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)
y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample2_fwd',
)(x, y)
print(y)
The execution result is as follows, and x is added to the original y.
[[ 2. 3. 4.]
[ 6. 7. 8.]]
Broadcasting
Arrays are broadcast automatically.
x = xp.asarray([1, 2, 3], dtype=np.float32)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)
y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample3_fwd',
)(x, y)
print(y)
Execution result:
[[ 2. 3. 4.]
[ 3. 4. 5.]]
If the array shapes do not match and broadcasting is not possible, an error occurs.
x = xp.asarray([1, 2], dtype=np.float32)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)
y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample4_fwd',
)(x, y)
print(y)
Execution result:
Traceback (most recent call last):
File "elementwise_sample.py", line 61, in <module>
)(x, y)
File "cupy\core\elementwise.pxi", line 508, in cupy.core.core.ElementwiseKernel.__call__ (cupy\core\core.cpp:34118)
File "cupy\core\elementwise.pxi", line 334, in cupy.core.core._broadcast (cupy\core\core.cpp:31734)
File "cupy\core\core.pyx", line 1504, in cupy.core.core.broadcast.__init__ (cupy\core\core.cpp:50697)
ValueError: Broadcasting failed
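If you reshape x to (2, 1), the shapes become broadcastable and the same kernel runs (a sketch; the kernel name sample4b_fwd is made up for this example):
x = xp.asarray([1, 2], dtype=np.float32).reshape(2, 1)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)
y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample4b_fwd',  # hypothetical name
)(x, y)
print(y)  # expected: [[ 2.  2.  2.]
          #            [ 4.  4.  4.]]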
Indexing
Often you want to use an index when working with an array. You can use indexes as follows:
- Prefix raw to the type of the variable you want to access by index.
- i represents the current index.
- _ind.size() represents the number of indexes.
Here is a sample that reverses the elements of an array.
x = xp.asarray([1, 2, 3, 4], dtype=np.float32)
y = xp.zeros_like(x, dtype=np.float32)
y = cuda.elementwise(
    'raw T x',
    'T y',
    'y = x[_ind.size() - i - 1];',
    'sample5_fwd',
)(x, y)
print(y)
The execution result is as follows.
[ 4. 3. 2. 1.]
You can think of the above code as running the following NumPy code on the GPU.
x = np.asarray([1, 2, 3, 4], dtype=np.float32)
y = np.zeros_like(x, dtype=np.float32)
i = np.arange(4)
y = x[4 - i - 1]
Note that you need to pass y to the kernel invocation function.
If you do not pass y, as shown below, you will get the error ValueError: Loop size is Undecided.
This seems to happen because the loop size cannot be determined from raw arguments alone.
x = xp.asarray([1, 2, 3, 4], dtype=np.float32)
y = xp.zeros_like(x, dtype=np.float32)
y = cuda.elementwise(
    'raw T x',
    'T y',
    'y = x[_ind.size() - i - 1];',
    'sample6_fwd',
)(x)
print(y)
Consider getting x[i, t[i]] (i = 0, 1, 2, ...) when x is a two-dimensional array and t is a one-dimensional array.
This can be written as follows.
x = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
t = xp.asarray([0, 2, 1], dtype=np.int32)
y = cuda.elementwise(
    'raw T x, S t',
    'T y',
    'int ind[] = {i, t}; y = x[ind];',
    'sample7_fwd',
)(x, t)
print(y)
Execution result:
[ 1. 6. 8.]
`int ind[] = {i, t};` generates indexes corresponding to [(0, t[0]), (1, t[1]), (2, t[2])].
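You can think of this as roughly equivalent to the following NumPy fancy indexing:
x = np.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
t = np.asarray([0, 2, 1], dtype=np.int32)
i = np.arange(3)
y = x[i, t]  # picks x[0, 0], x[1, 2], x[2, 1] -> [ 1.  6.  8.]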
You can use C syntax (nvcc syntax, to be exact) such as for and while.
As an example, let's calculate the cumulative sum of each column of x: y[i, j] is the sum of x[0, j] through x[i, j].
x = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
y = xp.zeros_like(x)
y = cuda.elementwise(
    'raw T x, int32 c',
    'raw T y',
    '''
    int ind[] = {0, i};
    y[ind] = x[ind];
    for (int j = 1; j < c; j++) {
        int ind[] = {j, i};
        int prev_ind[] = {j - 1, i};
        y[ind] = y[prev_ind] + x[ind];
    }
    ''',
    'sample8_fwd',
)(x, x.shape[0], y, size=x.shape[1])
print(y)
Execution result:
[[ 1. 2. 3.]
[ 5. 7. 9.]
[ 12. 15. 18.]]
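This corresponds roughly to np.cumsum along axis 0:
x = np.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
y = np.cumsum(x, axis=0)
# [[  1.   2.   3.]
#  [  5.   7.   9.]
#  [ 12.  15.  18.]]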
You can also use CUDA functions, although it is unclear how many of them are supported. Let's use `atomicAdd` as an example.
x = xp.zeros((3, 3), dtype=np.float32)
t = xp.asarray([0, 1, 2], dtype=np.int32)
y = cuda.elementwise(
    'S t',
    'raw T x',
    'int ind[] = {i, t}; atomicAdd(&x[ind], 1);',
    'sample9_fwd',
)(t, x)
print(y)
Execution result:
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]
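atomicAdd matters when several threads may write to the same element. As a sketch (the kernel name count_sample is made up for this example), it can be used to count how many times each value appears in t:
count = xp.zeros(3, dtype=np.float32)
t = xp.asarray([0, 1, 1, 2, 1], dtype=np.int32)
cuda.elementwise(
    'S t',
    'raw T count',
    'atomicAdd(&count[t], 1);',  # several threads may increment the same count[t]
    'count_sample',  # hypothetical name
)(t, count)
print(count)  # expected: [ 1.  3.  1.]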
Understanding cuda.elementwise will deepen your understanding of Chainer.
cuda.elementwise and cuda.reduce are used frequently inside Chainer, so their uses there are a good reference if you want to learn more.