Languages "Python" that are classified as dynamically typed include "Cython" that converts Python code to C / C ++ and compiles it, and "Numba" that accelerates Python using a JIT compiler. I wanted to improve efficiency by incorporating these into the Python code used in research and internships, and in order to deepen my understanding, I tried to compare and verify the processing speed for actual simple processing on jupyter. ..
This article is not an introduction to Cython. If you want to learn what Cython is and how to use it, I recommend the following articles.
- Introduction to cython
- Introduction to Cython without going deep
Please feel free to comment if you have any opinions or spot any mistakes. The experimental code is here.
As a baseline I used the simple `is_prime` function (primality test) shown below and tried to speed it up. In the code, the parts that are identical to the baseline are omitted. For Cython, I verified how the speed changes depending on where the type declarations are placed.
Baseline

def find_factor(n):
    # Smallest factor of n that is >= 2; returns n itself when n is prime
    answer = n
    for i in range(2, n):
        if n % i == 0:
            answer = i
            break
    return answer

def is_prime(n):
    # n is prime exactly when its smallest factor (>= 2) is n itself
    if find_factor(n) != n:
        return False
    else:
        return True
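As a quick sanity check of the baseline (my own addition, not part of the measured experiment), 131071 = 2**17 - 1 is a Mersenne prime, so the functions should behave as follows:

# Quick check (my own addition, not part of the timed experiment)
print(find_factor(131071))   # 131071: no factor between 2 and n-1 was found
print(is_prime(131071))      # True: 131071 = 2**17 - 1 is prime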
Numba

from numba import jit

@jit
def find_factor0(n):
    # (same as the baseline find_factor, omitted)

def is_prime0(n):
    # (same as the baseline is_prime, omitted)
Cython1: just compile with Cython without changing the code

%load_ext Cython
%%cython
def find_factor1(n):
    # (same as baseline, omitted)

def is_prime1(n):
    # (same as baseline, omitted)
Cython2: type all variables

%load_ext Cython
%%cython
def find_factor2(int n):
    cdef int answer = n
    cdef int i
    # (rest same as baseline, omitted)

def is_prime2(n):
    # (same as baseline, omitted)
Cython2-1: type only the loop variable i

%load_ext Cython
%%cython
def find_factor2_1(n):
    answer = n
    cdef int i
    # (rest same as baseline, omitted)

def is_prime2_1(n):
    # (same as baseline, omitted)
Cython2-2: type only answer

%load_ext Cython
%%cython
def find_factor2_2(n):
    cdef int answer = n
    # (rest same as baseline, omitted)

def is_prime2_2(n):
    # (same as baseline, omitted)
Cython2-3: type only the argument n

%load_ext Cython
%%cython
def find_factor2_3(int n):
    # (same as baseline, omitted)

def is_prime2_3(int n):
    # (same as baseline, omitted)
Cython2-4: type everything except the argument n

%load_ext Cython
%%cython
def find_factor2_4(n):
    cdef int answer = n
    cdef int i
    # (rest same as baseline, omitted)

def is_prime2_4(n):
    # (same as baseline, omitted)
Cython3: define the find_factor function itself with cdef

%load_ext Cython
%%cython
cdef int find_factor3(int n):
    cdef int answer = n
    cdef int i
    # (rest same as baseline, omitted)

def is_prime3(n):
    # (same as baseline, omitted)
In all experiments, the time required to test the primality of 131071 was measured with `%timeit`. The averages and standard deviations over 100 runs of each implementation are as follows.
 | Baseline | Numba | Cython1 (no code change) | Cython2 (all variables in the function) | Cython3 (the function itself)
---|---|---|---|---|---
time (ms) | 8.69±0.2 | 0.49±0.0 | 5.34±0.1 | 0.43±0.0 | 0.43±0.0
 | Cython2-1 (loop variable i only) | Cython2-2 (answer only) | Cython2-3 (argument n only) | Cython2-4 (everything except n)
---|---|---|---|---
time (ms) | 4.81±0.1 | 5.27±0.4 | 6.62±0.1 | 4.85±0.1
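For reference, the timing cells presumably took roughly the following form; this is my reconstruction, and the exact `%timeit` options used to obtain the 100 runs are not shown in the article:

# Assumed form of the measurement cells (reconstruction, not from the original)
%timeit is_prime(131071)     # baseline
%timeit is_prime0(131071)    # Numba
%timeit is_prime1(131071)    # Cython1
%timeit is_prime2(131071)    # Cython2
%timeit is_prime3(131071)    # Cython3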
The Cython2 result shows that Cython is fastest when all the type information is declared. Now let's look at how the speed changes with where the types are declared. Comparing Cython2-1 and Cython2-2, most of the benefit of typing the variables comes from the former, i.e. from typing the loop variable `i`. In pure Python, the increment performed on every iteration of the for loop goes through a type check for each operation (it is equivalent to calling `int.__add__` every time), so when it is executed many times the overhead adds up; declaring the type lets `i + 1` be computed directly in C, which presumably eliminates this overhead.
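As a small illustration of this point (my own example, not from the original article), in pure Python the loop increment dispatches through the integer's `__add__` method on every iteration, whereas `cdef int i` turns it into a plain C addition with no per-iteration dispatch:

# In pure Python, "i + 1" effectively goes through int.__add__ each time:
i = 0
i = i.__add__(1)   # what the interpreter does for an untyped int increment
print(i)           # 1
# With "cdef int i" in Cython the same increment compiles to a raw C "i + 1".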
Looking at Cython2-3, typing only the argument is actually slower than Cython1. I expected it to be faster because declaring the argument type should remove the dynamic type check on the argument when `find_factor` is called, but it is in fact slower. Suspecting that declaring the argument type adds some call overhead that outweighs those benefits, I defined the following functions `hoge1` and `hoge2`, which do nothing and differ only in whether their arguments are typed, and compared their speeds; there was indeed a slight difference.
%%cython
def hoge1(n, a, b, c):
    return

def hoge2(int n, int a, int b, int c):
    return
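The comparison was presumably done with cells like the following (my reconstruction; the argument values are arbitrary):

# Assumed timing cells (reconstruction; argument values are arbitrary)
%timeit hoge1(1, 2, 3, 4)   # untyped arguments
%timeit hoge2(1, 2, 3, 4)   # typed arguments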
However, even for these functions the difference was only about 10 ns (the loop runs 10 million times, and judging from the standard deviation the difference is not just measurement noise), and with a single argument there was almost no difference at all, so this overhead cannot explain the gap. Cython2-4 also shows that typing the argument is faster once all the other variables in `find_factor` are typed. From the above, my conclusion is that the slowdown in Cython2-3 comes from the dynamic-check overhead of the `n % i == 0` operation being larger when only one of `n` and `i` is typed than when neither is typed, but I could not work out why that should be.
The Cython3 result, in which `find_factor` itself is declared with `cdef`, was not much different from Cython2. The arithmetic inside `find_factor` is the same as in Cython2, so that makes sense. However, Cython3 should also remove the overhead of calling `find_factor`, so while a single call makes almost no difference, I expected Cython3 to be faster when the function is called many times. I created the following functions to verify this.
%%cython
cdef int find_factor3_1(int n):
    # (same as find_factor3, omitted)

def find_factor3_2(int n):
    # (same body, but defined with def; omitted)

def is_prime_inf1(n, m):
    # call the cdef version m times
    for i in range(m):
        find_factor3_1(n)

def is_prime_inf2(n, m):
    # call the def version m times
    for i in range(m):
        find_factor3_2(n)
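The two versions were then timed with something like the following; this is my reconstruction, and the call count `m` used for the measurement is not stated in the article, so the value below is only an example:

# Assumed timing cells (reconstruction; m = 10000 is an arbitrary example)
%timeit is_prime_inf1(131071, 10000)   # repeatedly calls the cdef version
%timeit is_prime_inf2(131071, 10000)   # repeatedly calls the def version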
 | find_factor3_1 (cdef) | find_factor3_2 (def)
---|---|---
time | 158 μs | 3.1 s
As expected, calling `find_factor` many times made a significant difference. This shows that defining `find_factor` itself with `cdef` removes the call overhead.
Next, I took the matrix product as a baseline and tried to speed it up.
import numpy as np

def dot(a, b):
    # Naive triple-loop matrix product: c = a @ b
    n = len(a)
    l = len(a[0])
    m = len(b[0])
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for k in range(l):
                c[i][j] += a[i][k] * b[k][j]
    return c
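As a quick correctness check of this baseline (my own addition, not in the original), the result can be compared against NumPy's built-in matrix product:

# Sanity check of the naive dot against NumPy (my own addition)
a = np.random.rand(4, 5)
b = np.random.rand(5, 3)
print(np.allclose(dot(a, b), np.dot(a, b)))   # True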
from numba import jit

@jit
def nm_dot(a, b):
    # (same as the baseline dot, omitted)
For a first attempt, I implemented it in Cython with type declarations based on what we have learned so far. A few additional tricks are used to speed things up: indexing with `c[i, j]` instead of `c[i][j]`, declaring the types together, replacing `range` with `xrange`, and turning off the bounds and wraparound checks.
%%cython
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cpdef np.ndarray c_dot(np.ndarray a, np.ndarray b):
    cdef np.ndarray c
    cdef int n, l, m, i, j, k
    n = len(a)
    l = len(a[0])
    m = len(b[0])
    c = np.zeros((n, m))
    for i in xrange(n):
        for j in xrange(m):
            for k in xrange(l):
                c[i, j] += a[i, k] * b[k, j]
    return c
I then made the type declarations for the NumPy arrays stricter, referring to the official documentation.

cpdef np.ndarray[dtype=double, ndim=2] c_dot2(np.ndarray[dtype=double, ndim=2] a, np.ndarray[dtype=double, ndim=2] b):
    cdef np.ndarray[dtype=double, ndim=2] c
    # (rest same as c_dot, omitted)
In this experiment, I generated random 100 × 100 NumPy matrices and measured with `%timeit` as in Experiment 1. The averages and standard deviations are taken over 10 runs for the slow pure-Python and Cython1 implementations and over 100 runs for the others, as follows.
 | Python | Python (NumPy) | Numba | Cython1 | Cython2
---|---|---|---|---|---
time (ms) | 1220±54 | 0.033 | 1.1 | 729±16 | 2.5
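For reference, the measurement setup presumably looked something like this (my reconstruction; the article only states that 100 × 100 random NumPy matrices were used and that the timing was done with `%timeit`):

# Assumed measurement setup (reconstruction, not from the original)
a = np.random.rand(100, 100)
b = np.random.rand(100, 100)
%timeit dot(a, b)       # pure Python
%timeit np.dot(a, b)    # NumPy
%timeit nm_dot(a, b)    # Numba
%timeit c_dot(a, b)     # Cython1
%timeit c_dot2(a, b)    # Cython2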
Cython1 is somewhat faster than the naive Python implementation, but Cython2 is far faster. In Cython1, even though each array is typed as `np.ndarray`, its actual dtype and number of dimensions are still undetermined, so every element access is subject to dynamic checks after all and the processing is essentially equivalent to not using Cython; this is presumably why the gain is small.

As the results show, NumPy was overwhelmingly fast at computing the matrix product. I had been using NumPy a lot without knowing why it is so fast, so I looked into it a little (I should have done that first): internally it is implemented in C, and here too typing turned out to be the key, which drove home its importance. At least for operations that NumPy already provides, such as the matrix product, it seems best to use NumPy as it is, and to use Cython only when the operation is not available as a NumPy function and cannot be realized without accessing the NumPy array element by element many times.
Through these experiments, I felt that when introducing Cython in practice, the right approach is not to add types blindly, but to identify the parts where they actually pay off and improve performance efficiently there, while preserving Python's strengths such as code flexibility, readability, and the use of excellent libraries like NumPy.
- Introduction to cython
- Introduction to Cython without going deep