I investigated how execution speed changes depending on how a Cython function is written. The factors compared are summarized below [^1].
# | Factor | Choices |
---|---|---|
1 | Syntax to use | ① Python for statement ② Cython for statement |
2 | Argument type | ① Arbitrary iterable ② Numpy array ③ Typed memory view |
3 | Element type specification for the argument | ① No ② Yes |
4 | Element access method | ① Without index (`v in vs`) ② With index (`vs[i]`) |
5 | Check functions on array access | ① Yes ② No |
**Conclusion: in Cython, it seems better to use index access of the form `for i in range(vs.shape[0]):` with `vs[i]`. You should avoid writing `for v in vs:` as you would when iterating over a plain Python list.**

The benchmark task is the sum of squares of the following sequence, prepared in two versions: a Python list and a Numpy array.
```python
import numpy as np

vs_list = [1.0,] * 10**6
vs_ndarray = np.ones((10**6,), dtype=np.double)
```
The performance of the original Python for loop was as follows [^2]. It takes 300-400 ms to compute the sum of squares of $10^6$ real numbers [^3]. **In pure Python, it seems faster to simply use the iterator rather than index access.**
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
1 | Python for statement | Any iterable | No | `v in vs` | Yes | 302 |
2 | Python for statement | Any iterable | No | `vs[i]` | Yes | 381 |
**1**

```python
def square_sum_py(vs):
    s = 0.0
    for v in vs:
        s += v**2
    return s
```

```python
%timeit square_sum_py(vs_list)
```
**2**

```python
def square_sum_py_range(vs):
    s = 0.0
    for i in range(len(vs)):
        s += vs[i]**2
    return s
```

```python
%timeit square_sum_py_range(vs_list)
```
Not surprisingly, a Numpy broadcast is dramatically faster. However, Cython is typically needed precisely when the computation cannot be expressed as a broadcast, so we still have to consider other approaches (a sketch of such a loop follows examples 3-5 below). You can also see that iterating over a Numpy array with a Python for statement is even slower than iterating over a list.
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
3 | Numpy broadcast | Numpy array | Yes | N/A | Yes | 17 |
4 | Python for statement | Numpy array | Yes | `v in vs` | Yes | 1640 |
5 | Python for statement | Numpy array | Yes | `vs[i]` | Yes | 1950 |
**3**

```python
def square_sum_np(vs):
    return np.sum(vs**2)
```

```python
%timeit square_sum_np(vs_ndarray)
```

**4**

```python
# Function defined above
%timeit square_sum_py(vs_ndarray)
```

**5**

```python
# Function defined above
%timeit square_sum_py_range(vs_ndarray)
```
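For illustration, here is a hypothetical example of a loop that cannot be reduced to a single broadcast expression because each element depends on the previous one; this is the kind of code for which Cython is usually considered. The function name `ema_py` and the recurrence are mine, not part of the original measurements:

```python
import numpy as np

def ema_py(xs, alpha=0.1):
    # Exponential moving average: ys[i] depends on ys[i-1],
    # so the loop cannot be replaced by one broadcast expression.
    ys = np.empty_like(xs)
    ys[0] = xs[0]
    for i in range(1, len(xs)):
        ys[i] = alpha * xs[i] + (1.0 - alpha) * ys[i - 1]
    return ys
```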
So next, let's use Cython to speed up the Python for statement. There are two ways to pass an array with low-level memory access to a Cython function: [specifying a Numpy array directly as the argument](http://docs.cython.org/en/latest/src/tutorial/numpy.html) and specifying a typed memory view as the argument. Here we start with the former, more intuitive method. From the fragmentary information found on the net, one might think that adding type declarations is all it takes to speed up Cython. However, as the table below shows, merely declaring the function argument as a Numpy array is not much faster than the original Python code (1 vs. 6), and at this stage specifying the element type of the Numpy array does little either (7). It is of course faster than the Python code iterating over a Numpy array (4), but that alone is no reason to use Cython.
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
6 | Cython for statement | Numpy array | No | `v in vs` | Yes | 378 |
7 | Cython for statement | Numpy array | Yes | `v in vs` | Yes | 362 |
**6**

```python
%%cython
cimport numpy as np

def square_sum_cy_np(np.ndarray vs):
    cdef double v, s = 0.0
    for v in vs:
        s += v**2
    return s
```

```python
%timeit square_sum_cy_np(vs_ndarray)
```
**7**

```python
%%cython
cimport numpy as np

def square_sum_cy_np_typed(np.ndarray[np.double_t, ndim=1] vs):
    cdef double v, s = 0.0
    for v in vs:
        s += v**2
    return s
```

```python
%timeit square_sum_cy_np_typed(vs_ndarray)
```
What makes the difference is how the array elements are accessed, i.e., how the for statement is written. Index access without the element type declaration is even slower (8), but index access combined with the element type declaration is dramatically faster (9).
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
8 | Cython for statement | Numpy array | No | `vs[i]` | Yes | 1610 |
9 | Cython for statement | Numpy array | Yes | `vs[i]` | Yes | 28 |
**8**

```python
%%cython
cimport numpy as np

def square_sum_cy_np_range(np.ndarray vs):
    cdef double s = 0.0
    for i in range(vs.shape[0]):
        s += vs[i]**2
    return s
```

```python
%timeit square_sum_cy_np_range(vs_ndarray)
```
**9**

```python
%%cython
cimport numpy as np

def square_sum_cy_np_typed_range(np.ndarray[np.double_t, ndim=1] vs):
    cdef double s = 0.0
    for i in range(vs.shape[0]):
        s += vs[i]**2
    return s
```

```python
%timeit square_sum_cy_np_typed_range(vs_ndarray)
```
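To see why 9 is so much faster than 8, it can help to compile with Cython's annotate option (`-a` on the `%%cython` magic), which produces a report highlighting the lines that still go through the Python C-API. A minimal sketch; the function name is mine and this was not part of the measurements above:

```python
%%cython -a
# The generated report shades lines that call into the Python C-API;
# with a typed buffer and an integer index, the inner loop compiles to a plain C loop.
cimport numpy as np

def square_sum_annotated(np.ndarray[np.double_t, ndim=1] vs):
    cdef double s = 0.0
    cdef Py_ssize_t i
    for i in range(vs.shape[0]):
        s += vs[i]**2
    return s
```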
The official documentation also describes how to skip the checks performed on array access (10-13), but the difference from leaving the checks enabled (6-9) was small. It is true that this is about 10-20% faster, but it is only effective as a final push.
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
10 | Cython for statement | Numpy array | No | `v in vs` | No | 315 |
11 | Cython for statement | Numpy array | Yes | `v in vs` | No | 313 |
12 | Cython for statement | Numpy array | No | `vs[i]` | No | 1610 |
13 | Cython for statement | Numpy array | Yes | `vs[i]` | No | 25 |
**10**

```python
%%cython
cimport numpy as np
from cython import boundscheck, wraparound

def square_sum_cy_np_nocheck(np.ndarray vs):
    cdef double v, s = 0.0
    with boundscheck(False), wraparound(False):
        for v in vs:
            s += v**2
    return s
```

```python
%timeit square_sum_cy_np_nocheck(vs_ndarray)
```
**11**

```python
%%cython
cimport numpy as np
from cython import boundscheck, wraparound

def square_sum_cy_np_typed_nocheck(np.ndarray[np.double_t, ndim=1] a):
    cdef double d, s = 0.0
    with boundscheck(False), wraparound(False):
        for d in a:
            s += d**2
    return s
```

```python
%timeit square_sum_cy_np_typed_nocheck(vs_ndarray)
```
**12**

```python
%%cython
cimport numpy as np
from cython import boundscheck, wraparound

def square_sum_cy_np_range_nocheck(np.ndarray a):
    cdef double s = 0.0
    with boundscheck(False), wraparound(False):
        for i in range(a.shape[0]):
            s += a[i]**2
    return s
```

```python
%timeit square_sum_cy_np_range_nocheck(vs_ndarray)
```
**13**

```python
%%cython
cimport numpy as np
from cython import boundscheck, wraparound

def square_sum_cy_np_typed_range_nocheck(np.ndarray[np.double_t, ndim=1] a):
    cdef double s = 0.0
    with boundscheck(False), wraparound(False):
        for i in range(a.shape[0]):
            s += a[i]**2
    return s
```

```python
%timeit square_sum_cy_np_typed_range_nocheck(vs_ndarray)
```
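For reference, the same checks can also be disabled with function decorators (or a module-wide directive) instead of the `with` blocks used above. A minimal sketch following the Cython documentation; the function name is mine and this variant was not measured here:

```python
%%cython
cimport cython
cimport numpy as np

@cython.boundscheck(False)  # skip bounds checking on buffer indexing
@cython.wraparound(False)   # skip negative-index handling
def square_sum_cy_np_typed_range_deco(np.ndarray[np.double_t, ndim=1] vs):
    cdef double s = 0.0
    cdef Py_ssize_t i
    for i in range(vs.shape[0]):
        s += vs[i]**2
    return s
```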
The other way to pass an array with low-level memory access to a Cython function is to specify a typed memory view as the argument. This is the method recommended in the official documentation. As the results below show, what matters here too is how the array elements are accessed.
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
14 | Cython for statement | Typed memory view | Yes (required) | `v in vs` | Yes | 519 |
15 | Cython for statement | Typed memory view | Yes (required) | `vs[i]` | Yes | 26 |
16 | Cython for statement | Typed memory view | Yes (required) | `v in vs` | No | 516 |
17 | Cython for statement | Typed memory view | Yes (required) | `vs[i]` | No | 24 |
**14**

```python
%%cython
def square_sum_cy_mv(double[:] vs):
    cdef double v, s = 0.0
    for v in vs:
        s += v**2
    return s
```

```python
%timeit square_sum_cy_mv(vs_ndarray)
```
**15**

```python
%%cython
def square_sum_cy_mv_range(double[:] vs):
    cdef double s = 0.0
    for i in range(vs.shape[0]):
        s += vs[i]**2
    return s
```

```python
%timeit square_sum_cy_mv_range(vs_ndarray)
```
**16**

```python
%%cython
from cython import boundscheck, wraparound

def square_sum_cy_mv_nocheck(double[:] vs):
    cdef double v, s = 0.0
    with boundscheck(False), wraparound(False):
        for v in vs:
            s += v**2
    return s
```

```python
%timeit square_sum_cy_mv_nocheck(vs_ndarray)
```
**17**

```python
%%cython
from cython import boundscheck, wraparound

def square_sum_cy_mv_range_nocheck(double[:] vs):
    cdef double s = 0.0
    with boundscheck(False), wraparound(False):
        for i in range(vs.shape[0]):
            s += vs[i]**2
    return s
```

```python
%timeit square_sum_cy_mv_range_nocheck(vs_ndarray)
```
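As a side note, the Cython documentation also allows declaring the memory view as C-contiguous (`double[::1]`), which can enable further optimization when the argument is a contiguous Numpy array. This variant was not measured here; a minimal sketch would look like this:

```python
%%cython
from cython import boundscheck, wraparound

def square_sum_cy_mv_contig(double[::1] vs):  # requires a C-contiguous buffer
    cdef double s = 0.0
    cdef Py_ssize_t i
    with boundscheck(False), wraparound(False):
        for i in range(vs.shape[0]):
            s += vs[i]**2
    return s
```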
For reference, here is the same measurement for a two-dimensional array with a typed memory view (the setup is repeated so this part can be run on its own):

```python
import numpy as np
%load_ext Cython
```

```python
vs = np.ones((10**3, 10**3), dtype=np.double)
```

```python
%%cython
from cython import boundscheck, wraparound

cpdef double square_sum(double[:, :] vs):
    # Declaring this with a plain def gives almost the same result in this case.
    cdef:
        double s = 0.0
        Py_ssize_t nx = vs.shape[0]
        Py_ssize_t ny = vs.shape[1]
        Py_ssize_t i, j
    with boundscheck(False), wraparound(False):
        for i in range(nx):
            for j in range(ny):
                s += vs[i, j]**2
    return s
```

```python
%timeit square_sum(vs)
```
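As a simple sanity check, the Cython result can be compared against the Numpy broadcast result on the same data (this check is mine and was not part of the measurements):

```python
import numpy as np

# vs is the (1000, 1000) array of ones defined above, so the sum of squares should be 10**6.
assert abs(square_sum(vs) - np.sum(vs**2)) < 1e-6
```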
[^1]: The number of code variants is not $2 \times 3 \times 2 \times 2 \times 2 = 24$ because the choices are correlated and some code is included only for reference.
[^2]: The values measured with %timeit varied by up to about ±20%, but since we are interested here in differences of several times to several tens of times, differences of a few percent are not a concern.
[^3]: The CPU is an Intel Celeron.