I investigated how execution speed changes depending on how a Cython function is written. The factors compared are summarized below [^1].
# | Factor | Choices |
---|---|---|
1 | Syntax to use | ① Python for statement ② Cython for statement |
2 | Argument type | ① Arbitrary iterable ② Numpy array ③ Typed memory view |
3 | Element type specification for the argument | ① No ② Yes |
4 | Element access method | ① Without index (`v in vs`) ② With index (`vs[i]`) |
5 | Check functions on array access | ① Yes ② No |
**Conclusion: in Cython, it seems better to use index access of the form `for i in range(vs.shape[0]):` with `vs[i]`. You should avoid writing `for v in vs:` as you would when iterating over a plain Python list.**

The benchmark task is the sum of squares of the following sequence, prepared in two versions: a Python list and a Numpy array.
```python
import numpy as np

vs_list = [1.0,] * 10**6
vs_ndarray = np.ones((10**6,), dtype=np.double)
```
The performance of the original Python for loop was as follows [^2]. It takes 300-400 ms to compute the sum of squares of $10^6$ real numbers [^3]. **In pure Python, it seems faster to simply use the iterator rather than index access.**
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
1 | Python for statement | Any iterable | No | `v in vs` | Yes | 302 |
2 | Python for statement | Any iterable | No | `vs[i]` | Yes | 381 |
**1**

```python
def square_sum_py(vs):
    s = 0.0
    for v in vs:
        s += v**2
    return s
```

```python
%timeit square_sum_py(vs_list)
```
**2**

```python
def square_sum_py_range(vs):
    s = 0.0
    for i in range(len(vs)):
        s += vs[i]**2
    return s
```

```python
%timeit square_sum_py_range(vs_list)
```
Not surprisingly, a Numpy broadcast is dramatically faster. However, Cython is typically needed precisely when the computation cannot be expressed as a broadcast, so we still have to consider other approaches (a sketch of such a loop follows examples 3-5 below). You can also see that iterating over a Numpy array with a Python for statement is even slower than iterating over a list.
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
3 | Numpy broadcast | Numpy array | Yes | N/A | Yes | 17 |
4 | Python for statement | Numpy array | Yes | `v in vs` | Yes | 1640 |
5 | Python for statement | Numpy array | Yes | `vs[i]` | Yes | 1950 |
**3**

```python
def square_sum_np(vs):
    return np.sum(vs**2)
```

```python
%timeit square_sum_np(vs_ndarray)
```

**4**

```python
# Function defined above
%timeit square_sum_py(vs_ndarray)
```

**5**

```python
# Function defined above
%timeit square_sum_py_range(vs_ndarray)
```
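For illustration, here is a hypothetical example of a loop that cannot be reduced to a single broadcast expression because each element depends on the previous one; this is the kind of code for which Cython is usually considered. The function name `ema_py` and the recurrence are mine, not part of the original measurements:

```python
import numpy as np

def ema_py(xs, alpha=0.1):
    # Exponential moving average: ys[i] depends on ys[i-1],
    # so the loop cannot be replaced by one broadcast expression.
    ys = np.empty_like(xs)
    ys[0] = xs[0]
    for i in range(1, len(xs)):
        ys[i] = alpha * xs[i] + (1.0 - alpha) * ys[i - 1]
    return ys
```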
So next, let's use Cython to speed up the Python for statement. There are two ways to pass an array with low-level memory access to a Cython function: [specifying a Numpy array directly as the argument](http://docs.cython.org/en/latest/src/tutorial/numpy.html) and specifying a typed memory view as the argument. Here we start with the former, more intuitive method. From the fragmentary information found on the net, one might think that adding type declarations is all it takes to speed up Cython. However, as the table below shows, merely declaring the function argument as a Numpy array is not much faster than the original Python code (1 vs. 6), and at this stage specifying the element type of the Numpy array does little either (7). It is of course faster than the Python code iterating over a Numpy array (4), but that alone is no reason to use Cython.
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
6 | Cython for statement | Numpy array | No | `v in vs` | Yes | 378 |
7 | Cython for statement | Numpy array | Yes | `v in vs` | Yes | 362 |
**6**

```python
%%cython
cimport numpy as np

def square_sum_cy_np(np.ndarray vs):
    cdef double v, s = 0.0
    for v in vs:
        s += v**2
    return s
```

```python
%timeit square_sum_cy_np(vs_ndarray)
```
**7**

```python
%%cython
cimport numpy as np

def square_sum_cy_np_typed(np.ndarray[np.double_t, ndim=1] vs):
    cdef double v, s = 0.0
    for v in vs:
        s += v**2
    return s
```

```python
%timeit square_sum_cy_np_typed(vs_ndarray)
```
What makes the difference is how the array elements are accessed, i.e., how the for statement is written. Index access without the element type declaration is even slower (8), but index access combined with the element type declaration is dramatically faster (9).
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
8 | Cython for statement | Numpy array | No | `vs[i]` | Yes | 1610 |
9 | Cython for statement | Numpy array | Yes | `vs[i]` | Yes | 28 |
**8**

```python
%%cython
cimport numpy as np

def square_sum_cy_np_range(np.ndarray vs):
    cdef double s = 0.0
    for i in range(vs.shape[0]):
        s += vs[i]**2
    return s
```

```python
%timeit square_sum_cy_np_range(vs_ndarray)
```
**9**

```python
%%cython
cimport numpy as np

def square_sum_cy_np_typed_range(np.ndarray[np.double_t, ndim=1] vs):
    cdef double s = 0.0
    for i in range(vs.shape[0]):
        s += vs[i]**2
    return s
```

```python
%timeit square_sum_cy_np_typed_range(vs_ndarray)
```
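To see why 9 is so much faster than 8, it can help to compile with Cython's annotate option (`-a` on the `%%cython` magic), which produces a report highlighting the lines that still go through the Python C-API. A minimal sketch; the function name is mine and this was not part of the measurements above:

```python
%%cython -a
# The generated report shades lines that call into the Python C-API;
# with a typed buffer and an integer index, the inner loop compiles to a plain C loop.
cimport numpy as np

def square_sum_annotated(np.ndarray[np.double_t, ndim=1] vs):
    cdef double s = 0.0
    cdef Py_ssize_t i
    for i in range(vs.shape[0]):
        s += vs[i]**2
    return s
```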
The official documentation also describes how to skip the checks performed on array access (10-13), but the difference from leaving the checks enabled (6-9) was small. It is true that this is about 10-20% faster, but it is only effective as a final push.
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
10 | Cython for statement | Numpy array | No | `v in vs` | No | 315 |
11 | Cython for statement | Numpy array | Yes | `v in vs` | No | 313 |
12 | Cython for statement | Numpy array | No | `vs[i]` | No | 1610 |
13 | Cython for statement | Numpy array | Yes | `vs[i]` | No | 25 |
**10**

```python
%%cython
cimport numpy as np
from cython import boundscheck, wraparound

def square_sum_cy_np_nocheck(np.ndarray vs):
    cdef double v, s = 0.0
    with boundscheck(False), wraparound(False):
        for v in vs:
            s += v**2
    return s
```

```python
%timeit square_sum_cy_np_nocheck(vs_ndarray)
```
**11**

```python
%%cython
cimport numpy as np
from cython import boundscheck, wraparound

def square_sum_cy_np_typed_nocheck(np.ndarray[np.double_t, ndim=1] a):
    cdef double d, s = 0.0
    with boundscheck(False), wraparound(False):
        for d in a:
            s += d**2
    return s
```

```python
%timeit square_sum_cy_np_typed_nocheck(vs_ndarray)
```
**12**

```python
%%cython
cimport numpy as np
from cython import boundscheck, wraparound

def square_sum_cy_np_range_nocheck(np.ndarray a):
    cdef double s = 0.0
    with boundscheck(False), wraparound(False):
        for i in range(a.shape[0]):
            s += a[i]**2
    return s
```

```python
%timeit square_sum_cy_np_range_nocheck(vs_ndarray)
```
**13**

```python
%%cython
cimport numpy as np
from cython import boundscheck, wraparound

def square_sum_cy_np_typed_range_nocheck(np.ndarray[np.double_t, ndim=1] a):
    cdef double s = 0.0
    with boundscheck(False), wraparound(False):
        for i in range(a.shape[0]):
            s += a[i]**2
    return s
```

```python
%timeit square_sum_cy_np_typed_range_nocheck(vs_ndarray)
```
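For reference, the same checks can also be disabled with function decorators (or a module-wide directive) instead of the `with` blocks used above. A minimal sketch following the Cython documentation; the function name is mine and this variant was not measured here:

```python
%%cython
cimport cython
cimport numpy as np

@cython.boundscheck(False)  # skip bounds checking on buffer indexing
@cython.wraparound(False)   # skip negative-index handling
def square_sum_cy_np_typed_range_deco(np.ndarray[np.double_t, ndim=1] vs):
    cdef double s = 0.0
    cdef Py_ssize_t i
    for i in range(vs.shape[0]):
        s += vs[i]**2
    return s
```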
The other way to pass an array with low-level memory access to a Cython function is to specify a typed memory view as the argument. This is the method recommended in the official documentation. As the results below show, what matters here too is how the array elements are accessed.
# | Syntax | Argument type | Element type specification | Element access | Check function | Execution time [ms] |
---|---|---|---|---|---|---|
14 | Cython for statement | Typed memory view | Yes (required) | `v in vs` | Yes | 519 |
15 | Cython for statement | Typed memory view | Yes (required) | `vs[i]` | Yes | 26 |
16 | Cython for statement | Typed memory view | Yes (required) | `v in vs` | No | 516 |
17 | Cython for statement | Typed memory view | Yes (required) | `vs[i]` | No | 24 |
**14**

```python
%%cython
def square_sum_cy_mv(double[:] vs):
    cdef double v, s = 0.0
    for v in vs:
        s += v**2
    return s
```

```python
%timeit square_sum_cy_mv(vs_ndarray)
```
**15**

```python
%%cython
def square_sum_cy_mv_range(double[:] vs):
    cdef double s = 0.0
    for i in range(vs.shape[0]):
        s += vs[i]**2
    return s
```

```python
%timeit square_sum_cy_mv_range(vs_ndarray)
```
**16**

```python
%%cython
from cython import boundscheck, wraparound

def square_sum_cy_mv_nocheck(double[:] vs):
    cdef double v, s = 0.0
    with boundscheck(False), wraparound(False):
        for v in vs:
            s += v**2
    return s
```

```python
%timeit square_sum_cy_mv_nocheck(vs_ndarray)
```
**17**

```python
%%cython
from cython import boundscheck, wraparound

def square_sum_cy_mv_range_nocheck(double[:] vs):
    cdef double s = 0.0
    with boundscheck(False), wraparound(False):
        for i in range(vs.shape[0]):
            s += vs[i]**2
    return s
```

```python
%timeit square_sum_cy_mv_range_nocheck(vs_ndarray)
```
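As a side note, the Cython documentation also allows declaring the memory view as C-contiguous (`double[::1]`), which can enable further optimization when the argument is a contiguous Numpy array. This variant was not measured here; a minimal sketch would look like this:

```python
%%cython
from cython import boundscheck, wraparound

def square_sum_cy_mv_contig(double[::1] vs):  # requires a C-contiguous buffer
    cdef double s = 0.0
    cdef Py_ssize_t i
    with boundscheck(False), wraparound(False):
        for i in range(vs.shape[0]):
            s += vs[i]**2
    return s
```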
For reference, here is the same measurement for a two-dimensional array with a typed memory view (the setup is repeated so this part can be run on its own):

```python
import numpy as np
%load_ext Cython
```

```python
vs = np.ones((10**3, 10**3), dtype=np.double)
```

```python
%%cython
from cython import boundscheck, wraparound

cpdef double square_sum(double[:, :] vs):
    # Declaring this with a plain def gives almost the same result in this case.
    cdef:
        double s = 0.0
        Py_ssize_t nx = vs.shape[0]
        Py_ssize_t ny = vs.shape[1]
        Py_ssize_t i, j
    with boundscheck(False), wraparound(False):
        for i in range(nx):
            for j in range(ny):
                s += vs[i, j]**2
    return s
```

```python
%timeit square_sum(vs)
```
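As a simple sanity check, the Cython result can be compared against the Numpy broadcast result on the same data (this check is mine and was not part of the measurements):

```python
import numpy as np

# vs is the (1000, 1000) array of ones defined above, so the sum of squares should be 10**6.
assert abs(square_sum(vs) - np.sum(vs**2)) < 1e-6
```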
[^1]: The number of code variants is not $2 \times 3 \times 2 \times 2 \times 2 = 24$ because the choices are correlated and some code is included only for reference.
[^2]: The values measured with %timeit varied by up to about ±20%, but since we are interested here in differences of several times to several tens of times, differences of a few percent are not a concern.
[^3]: The CPU is an Intel Celeron.