When reading the code of Chainer functions, you may see calls to cuda.elementwise or cuda.reduce.
These are methods for running your own processing on the GPU.
They are indispensable for implementing Chainer functions, and understanding them seems necessary for becoming an intermediate Chainer user, so I investigated them.
This article deals with cuda.elementwise and does not discuss cuda.reduce.
A description of cuda.elementwise can be found here:
http://docs.chainer.org/en/stable/cupy-reference/kernel.html
There is also a commentary by Mr. Okuda of Preferred Networks on SlideShare:
http://www.slideshare.net/ryokuta/cupy
cuda.elementwise defines a CUDA kernel. A CUDA kernel is a program that runs on a CUDA-capable GPU.
Calling cuda.elementwise returns a kernel invocation function for running the CUDA kernel, and calling that kernel invocation function executes the CUDA kernel on the GPU.
The substance of the kernel invocation function is an ElementwiseKernel object, which is callable.
As a first example, we will increment all the elements of a given array.
First, import the required modules as shown below and define xp as a reference to cuda.cupy.
Subsequent sample code assumes that you are running the following code:
import numpy as np
import chainer
from chainer import cuda
xp = cuda.cupy
Then use cuda.elementwise to increment the array elements.
x = xp.asarray([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
y = cuda.elementwise(
    'T x',
    'T y',
    'y = x + 1;',
    'sample1_fwd',
)(x)
print(y)
The output looks like this. You can see that all the elements have been incremented. It takes some time before the output appears, but this seems to be because compilation by nvcc is happening behind the scenes.
[[ 2. 3. 4.]
[ 5. 6. 7.]]
cuda.elementwise is used in the following two stages:
1. Call cuda.elementwise to generate the kernel invocation function.
2. Call the kernel invocation function to run the CUDA kernel.
In actual code, however, the two steps are often written together, as in cuda.elementwise(...)(x).
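The kernel invocation function can also be kept and reused, as in this minimal sketch (the kernel name increment_kernel is made up for this example):
increment = cuda.elementwise(
    'T x',
    'T y',
    'y = x + 1;',
    'increment_kernel',  # hypothetical name, not used inside Chainer
)
a = increment(xp.asarray([1, 2, 3], dtype=np.float32))         # [ 2.  3.  4.]
b = increment(xp.asarray([[4, 5], [6, 7]], dtype=np.float32))  # [[ 5.  6.] [ 7.  8.]]
Since the kernel is defined once, repeated calls do not have to repeat the definition.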
Generating the kernel invocation function with cuda.elementwise
The arguments and return value of cuda.elementwise are documented not under this method itself but in the [documentation of cupy.ElementwiseKernel](http://docs.chainer.org/en/stable/cupy-reference/kernel.html#cupy.ElementwiseKernel).
cuda.elementwise internally calls cupy.ElementwiseKernel, and the arguments of the two are the same (except that name is required in cuda.elementwise).
cuda.elementwise takes the following arguments. There are other optional arguments, but I have not fully understood them yet, so I will not explain them here.
in_params
Specifies the string that declares the input arguments. Each argument requires a type and an argument name. If the type string is a single character, it is treated as a **type placeholder**. The actual type represented by a type placeholder is the type of the variable passed when the kernel invocation function is executed. This is useful when you want several arguments to share the same type (see the sketch after this list of arguments).
In the sample code, in_params was as follows.
'T x',
This string declares one input argument whose type is the placeholder T and whose name is x.
out_params
Specifies the string that declares the output arguments. Like in_params, each argument requires a type and an argument name.
In the sample code, out_params was as follows.
'T y',
This string declares one output argument whose type is the placeholder T and whose name is y.
operation
Specifies the string that defines the processing to be executed.
In the sample code, it assigns x + 1 to y.
'y = x + 1;',
name
The name of the kernel. Looking at the implementations of the modules under chainer.functions, the convention is "function_name_fwd" for forward processing and "function_name_bwd" for backward processing.
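Here is a small sketch of type placeholders (the kernel name placeholder_sample is made up for this example): using the same placeholder T for several arguments makes them share one dtype, while a different placeholder such as S allows another dtype; both are determined by the arrays passed at call time.
x = xp.asarray([1, 2, 3], dtype=np.float32)   # T becomes float32
t = xp.asarray([10, 20, 30], dtype=np.int32)  # S becomes int32
z = cuda.elementwise(
    'T x, S t',
    'T z',
    'z = x + t;',
    'placeholder_sample',  # hypothetical name
)(x, t)
print(z)  # expected: [ 11.  22.  33.]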
Running the kernel invocation function
Running the kernel invocation function executes the defined CUDA kernel. Pass the variables corresponding to in_params as arguments. The variables corresponding to out_params can be omitted, but they can also be passed explicitly after the in_params variables. The return value consists of the arguments specified by out_params; if out_params specifies multiple arguments, the return value is a tuple of them.
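For example, a kernel with two output arguments returns a tuple (a sketch; the kernel name sum_diff_sample is made up for this example):
x = xp.asarray([1, 2, 3], dtype=np.float32)
y = xp.asarray([10, 20, 30], dtype=np.float32)
s, d = cuda.elementwise(
    'T x, T y',
    'T s, T d',
    's = x + y; d = x - y;',
    'sum_diff_sample',  # hypothetical name
)(x, y)
print(s)  # expected: [ 11.  22.  33.]
print(d)  # expected: [ -9. -18. -27.]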
Let's also pass a value for out_params. Simply pass the value corresponding to out_params as an argument to the kernel invocation function.
x = xp.asarray([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)
y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample2_fwd',
)(x, y)
print(y)
The execution result is as follows, and x is added to the original y.
[[ 2. 3. 4.]
[ 6. 7. 8.]]
Broadcasting
Arrays are broadcast automatically.
x = xp.asarray([1, 2, 3], dtype=np.float32)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)
y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample3_fwd',
)(x, y)
print(y)
Execution result:
[[ 2. 3. 4.]
[ 3. 4. 5.]]
If the array shapes do not match and broadcasting is not possible, an error occurs.
x = xp.asarray([1, 2], dtype=np.float32)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)
y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample4_fwd',
)(x, y)
print(y)
Execution result:
Traceback (most recent call last):
File "elementwise_sample.py", line 61, in <module>
)(x, y)
File "cupy\core\elementwise.pxi", line 508, in cupy.core.core.ElementwiseKernel.__call__ (cupy\core\core.cpp:34118)
File "cupy\core\elementwise.pxi", line 334, in cupy.core.core._broadcast (cupy\core\core.cpp:31734)
File "cupy\core\core.pyx", line 1504, in cupy.core.core.broadcast.__init__ (cupy\core\core.cpp:50697)
ValueError: Broadcasting failed
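If you reshape x to (2, 1), the shapes become broadcastable and the same kernel runs (a sketch; the kernel name sample4b_fwd is made up for this example):
x = xp.asarray([1, 2], dtype=np.float32).reshape(2, 1)
y = xp.asarray([[1, 1, 1], [2, 2, 2]], dtype=np.float32)
y = cuda.elementwise(
    'T x',
    'T y',
    'y += x;',
    'sample4b_fwd',  # hypothetical name
)(x, y)
print(y)  # expected: [[ 2.  2.  2.]
          #            [ 4.  4.  4.]]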
Indexing
Often you want to use an index when working with an array. You can use indexes as follows:
- Prefix raw to the type of the variable you want to access by index.
- i represents the current index.
- _ind.size() represents the number of indexes.
Here is a sample that reverses the elements of an array.
x = xp.asarray([1, 2, 3, 4], dtype=np.float32)
y = xp.zeros_like(x, dtype=np.float32)
y = cuda.elementwise(
    'raw T x',
    'T y',
    'y = x[_ind.size() - i - 1];',
    'sample5_fwd',
)(x, y)
print(y)
The execution result is as follows.
[ 4. 3. 2. 1.]
You can think of the above code as running the following NumPy code on the GPU.
x = np.asarray([1, 2, 3, 4], dtype=np.float32)
y = np.zeros_like(x, dtype=np.float32)
i = np.arange(4)
y = x[4 - i - 1]
Note that you need to pass y to the kernel invocation function.
If you do not pass y, as shown below, you will get the error ValueError: Loop size is Undecided.
This seems to happen because the loop size cannot be determined from raw arguments alone.
x = xp.asarray([1, 2, 3, 4], dtype=np.float32)
y = xp.zeros_like(x, dtype=np.float32)
y = cuda.elementwise(
    'raw T x',
    'T y',
    'y = x[_ind.size() - i - 1];',
    'sample6_fwd',
)(x)
print(y)
Consider getting x[i, t[i]] (i = 0, 1, 2, ...) when x is a two-dimensional array and t is a one-dimensional array.
This can be written as follows.
x = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
t = xp.asarray([0, 2, 1], dtype=np.int32)
y = cuda.elementwise(
    'raw T x, S t',
    'T y',
    'int ind[] = {i, t}; y = x[ind];',
    'sample7_fwd',
)(x, t)
print(y)
Execution result:
[ 1. 6. 8.]
`int ind[] = {i, t};` generates indexes corresponding to [(0, t[0]), (1, t[1]), (2, t[2])].
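You can think of this as roughly equivalent to the following NumPy fancy indexing:
x = np.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
t = np.asarray([0, 2, 1], dtype=np.int32)
i = np.arange(3)
y = x[i, t]  # picks x[0, 0], x[1, 2], x[2, 1] -> [ 1.  6.  8.]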
You can use C syntax (nvcc syntax, to be exact) such as for and while.
As an example, let's calculate the cumulative sum of each column of x: y[i, j] is the sum of x[0, j] through x[i, j].
x = xp.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
y = xp.zeros_like(x)
y = cuda.elementwise(
    'raw T x, int32 c',
    'raw T y',
    '''
    int ind[] = {0, i};
    y[ind] = x[ind];
    for (int j = 1; j < c; j++) {
        int ind[] = {j, i};
        int prev_ind[] = {j - 1, i};
        y[ind] = y[prev_ind] + x[ind];
    }
    ''',
    'sample8_fwd',
)(x, x.shape[0], y, size=x.shape[1])
print(y)
Execution result:
[[ 1. 2. 3.]
[ 5. 7. 9.]
[ 12. 15. 18.]]
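This corresponds roughly to np.cumsum along axis 0:
x = np.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.float32)
y = np.cumsum(x, axis=0)
# [[  1.   2.   3.]
#  [  5.   7.   9.]
#  [ 12.  15.  18.]]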
You can also use CUDA functions, although it is unclear how many of them are supported. Let's use `atomicAdd` as an example.
x = xp.zeros((3, 3), dtype=np.float32)
t = xp.asarray([0, 1, 2], dtype=np.int32)
y = cuda.elementwise(
    'S t',
    'raw T x',
    'int ind[] = {i, t}; atomicAdd(&x[ind], 1);',
    'sample9_fwd',
)(t, x)
print(y)
Execution result:
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]
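atomicAdd matters when several threads may write to the same element. As a sketch (the kernel name count_sample is made up for this example), it can be used to count how many times each value appears in t:
count = xp.zeros(3, dtype=np.float32)
t = xp.asarray([0, 1, 1, 2, 1], dtype=np.int32)
cuda.elementwise(
    'S t',
    'raw T count',
    'atomicAdd(&count[t], 1);',  # several threads may increment the same count[t]
    'count_sample',  # hypothetical name
)(t, count)
print(count)  # expected: [ 1.  3.  1.]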
Understanding cuda.elementwise will deepen your understanding of Chainer.
cuda.elementwise and cuda.reduce are used frequently inside Chainer, so their uses there are a good reference if you want to learn more.