Languages "Python" that are classified as dynamically typed include "Cython" that converts Python code to C / C ++ and compiles it, and "Numba" that accelerates Python using a JIT compiler. I wanted to improve efficiency by incorporating these into the Python code used in research and internships, and in order to deepen my understanding, I tried to compare and verify the processing speed for actual simple processing on jupyter. ..
This article is not an introduction to Cython. If you want to learn what Cython is and how to use it, I recommend the following articles.
- Introduction to cython
- Introduction to Cython without going deep
Please feel free to comment if you have any opinions or spot any mistakes. The experimental code is here.
As a baseline I used the simple `is_prime` function (primality test) shown below and tried to speed it up. In the code, the parts that are identical to the baseline are omitted. For Cython, I verified how the speed changes depending on where the type declarations are placed.
Baseline

def find_factor(n):
    # Smallest factor of n that is >= 2; returns n itself when n is prime
    answer = n
    for i in range(2, n):
        if n % i == 0:
            answer = i
            break
    return answer

def is_prime(n):
    # n is prime exactly when its smallest factor (>= 2) is n itself
    if find_factor(n) != n:
        return False
    else:
        return True
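As a quick sanity check of the baseline (my own addition, not part of the measured experiment), 131071 = 2**17 - 1 is a Mersenne prime, so the functions should behave as follows:

# Quick check (my own addition, not part of the timed experiment)
print(find_factor(131071))   # 131071: no factor between 2 and n-1 was found
print(is_prime(131071))      # True: 131071 = 2**17 - 1 is prime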
Numba

from numba import jit

@jit
def find_factor0(n):
    # (same as the baseline find_factor, omitted)

def is_prime0(n):
    # (same as the baseline is_prime, omitted)
Cython1: just compile with Cython without changing the code

%load_ext Cython
%%cython
def find_factor1(n):
    # (same as baseline, omitted)

def is_prime1(n):
    # (same as baseline, omitted)
Cython2: type all variables

%load_ext Cython
%%cython
def find_factor2(int n):
    cdef int answer = n
    cdef int i
    # (rest same as baseline, omitted)

def is_prime2(n):
    # (same as baseline, omitted)
Cython2-1: type only the loop variable i

%load_ext Cython
%%cython
def find_factor2_1(n):
    answer = n
    cdef int i
    # (rest same as baseline, omitted)

def is_prime2_1(n):
    # (same as baseline, omitted)
Cython2-2: type only answer

%load_ext Cython
%%cython
def find_factor2_2(n):
    cdef int answer = n
    # (rest same as baseline, omitted)

def is_prime2_2(n):
    # (same as baseline, omitted)
Cython2-3: type only the argument n

%load_ext Cython
%%cython
def find_factor2_3(int n):
    # (same as baseline, omitted)

def is_prime2_3(int n):
    # (same as baseline, omitted)
Cython2-4: type everything except the argument n

%load_ext Cython
%%cython
def find_factor2_4(n):
    cdef int answer = n
    cdef int i
    # (rest same as baseline, omitted)

def is_prime2_4(n):
    # (same as baseline, omitted)
Cython3: define the find_factor function itself with cdef

%load_ext Cython
%%cython
cdef int find_factor3(int n):
    cdef int answer = n
    cdef int i
    # (rest same as baseline, omitted)

def is_prime3(n):
    # (same as baseline, omitted)
In all experiments, the time required to test the primality of 131071 was measured with `%timeit`. The averages and standard deviations over 100 runs of each implementation are as follows.
 | Baseline | Numba | Cython1 (no code change) | Cython2 (all variables in the function) | Cython3 (the function itself)
---|---|---|---|---|---
time (ms) | 8.69±0.2 | 0.49±0.0 | 5.34±0.1 | 0.43±0.0 | 0.43±0.0
 | Cython2-1 (loop variable i only) | Cython2-2 (answer only) | Cython2-3 (argument n only) | Cython2-4 (everything except n)
---|---|---|---|---
time (ms) | 4.81±0.1 | 5.27±0.4 | 6.62±0.1 | 4.85±0.1
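For reference, the timing cells presumably took roughly the following form; this is my reconstruction, and the exact `%timeit` options used to obtain the 100 runs are not shown in the article:

# Assumed form of the measurement cells (reconstruction, not from the original)
%timeit is_prime(131071)     # baseline
%timeit is_prime0(131071)    # Numba
%timeit is_prime1(131071)    # Cython1
%timeit is_prime2(131071)    # Cython2
%timeit is_prime3(131071)    # Cython3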
The Cython2 result shows that Cython is fastest when all the type information is declared. Now let's look at how the speed changes with where the types are declared. Comparing Cython2-1 and Cython2-2, most of the benefit of typing the variables comes from the former, i.e. from typing the loop variable `i`. In pure Python, the increment performed on every iteration of the for loop goes through a type check for each operation (it is equivalent to calling `int.__add__` every time), so when it is executed many times the overhead adds up; declaring the type lets `i + 1` be computed directly in C, which presumably eliminates this overhead.
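As a small illustration of this point (my own example, not from the original article), in pure Python the loop increment dispatches through the integer's `__add__` method on every iteration, whereas `cdef int i` turns it into a plain C addition with no per-iteration dispatch:

# In pure Python, "i + 1" effectively goes through int.__add__ each time:
i = 0
i = i.__add__(1)   # what the interpreter does for an untyped int increment
print(i)           # 1
# With "cdef int i" in Cython the same increment compiles to a raw C "i + 1".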
Looking at Cython2-3, typing only the argument is actually slower than Cython1. I expected it to be faster because declaring the argument type should remove the dynamic type check on the argument when `find_factor` is called, but it is in fact slower. Suspecting that declaring the argument type adds some call overhead that outweighs those benefits, I defined the following functions `hoge1` and `hoge2`, which do nothing and differ only in whether their arguments are typed, and compared their speeds; there was indeed a slight difference.
%%cython
def hoge1(n, a, b, c):
    return

def hoge2(int n, int a, int b, int c):
    return
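The comparison was presumably done with cells like the following (my reconstruction; the argument values are arbitrary):

# Assumed timing cells (reconstruction; argument values are arbitrary)
%timeit hoge1(1, 2, 3, 4)   # untyped arguments
%timeit hoge2(1, 2, 3, 4)   # typed arguments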
However, even for these functions the difference was only about 10 ns (the loop runs 10 million times, and judging from the standard deviation the difference is not just measurement noise), and with a single argument there was almost no difference at all, so this overhead cannot explain the gap. Cython2-4 also shows that typing the argument is faster once all the other variables in `find_factor` are typed. From the above, my conclusion is that the slowdown in Cython2-3 comes from the dynamic-check overhead of the `n % i == 0` operation being larger when only one of `n` and `i` is typed than when neither is typed, but I could not work out why that should be.
The Cython3 result, in which `find_factor` itself is declared with `cdef`, was not much different from Cython2. The arithmetic inside `find_factor` is the same as in Cython2, so that makes sense. However, Cython3 should also remove the overhead of calling `find_factor`, so while a single call makes almost no difference, I expected Cython3 to be faster when the function is called many times. I created the following functions to verify this.
%%cython
cdef int find_factor3_1(int n):
    # (same as find_factor3, omitted)

def find_factor3_2(int n):
    # (same body, but defined with def; omitted)

def is_prime_inf1(n, m):
    # call the cdef version m times
    for i in range(m):
        find_factor3_1(n)

def is_prime_inf2(n, m):
    # call the def version m times
    for i in range(m):
        find_factor3_2(n)
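The two versions were then timed with something like the following; this is my reconstruction, and the call count `m` used for the measurement is not stated in the article, so the value below is only an example:

# Assumed timing cells (reconstruction; m = 10000 is an arbitrary example)
%timeit is_prime_inf1(131071, 10000)   # repeatedly calls the cdef version
%timeit is_prime_inf2(131071, 10000)   # repeatedly calls the def version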
 | find_factor3_1 (cdef) | find_factor3_2 (def)
---|---|---
time | 158 μs | 3.1 s
As expected, calling `find_factor` many times made a significant difference. This shows that defining `find_factor` itself with `cdef` removes the call overhead.
Next, I took the matrix product as a baseline and tried to speed it up.
import numpy as np

def dot(a, b):
    # Naive triple-loop matrix product: c = a @ b
    n = len(a)
    l = len(a[0])
    m = len(b[0])
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for k in range(l):
                c[i][j] += a[i][k] * b[k][j]
    return c
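As a quick correctness check of this baseline (my own addition, not in the original), the result can be compared against NumPy's built-in matrix product:

# Sanity check of the naive dot against NumPy (my own addition)
a = np.random.rand(4, 5)
b = np.random.rand(5, 3)
print(np.allclose(dot(a, b), np.dot(a, b)))   # True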
from numba import jit

@jit
def nm_dot(a, b):
    # (same as the baseline dot, omitted)
For a first attempt, I implemented it in Cython with type declarations based on what we have learned so far. A few additional tricks are used to speed things up: indexing with `c[i, j]` instead of `c[i][j]`, declaring the types together, replacing `range` with `xrange`, and turning off the bounds and wraparound checks.
%%cython
# cython: boundscheck=False
# cython: wraparound=False
import numpy as np
cimport numpy as np

cpdef np.ndarray c_dot(np.ndarray a, np.ndarray b):
    cdef np.ndarray c
    cdef int n, l, m, i, j, k
    n = len(a)
    l = len(a[0])
    m = len(b[0])
    c = np.zeros((n, m))
    for i in xrange(n):
        for j in xrange(m):
            for k in xrange(l):
                c[i, j] += a[i, k] * b[k, j]
    return c
I then made the type declarations for the NumPy arrays stricter, referring to the official documentation.

cpdef np.ndarray[dtype=double, ndim=2] c_dot2(np.ndarray[dtype=double, ndim=2] a, np.ndarray[dtype=double, ndim=2] b):
    cdef np.ndarray[dtype=double, ndim=2] c
    # (rest same as c_dot, omitted)
In this experiment, I generated random 100 × 100 NumPy matrices and measured with `%timeit` as in Experiment 1. The averages and standard deviations are taken over 10 runs for the slow pure-Python and Cython1 implementations and over 100 runs for the others, as follows.
 | Python | Python (NumPy) | Numba | Cython1 | Cython2
---|---|---|---|---|---
time (ms) | 1220±54 | 0.033 | 1.1 | 729±16 | 2.5
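For reference, the measurement setup presumably looked something like this (my reconstruction; the article only states that 100 × 100 random NumPy matrices were used and that the timing was done with `%timeit`):

# Assumed measurement setup (reconstruction, not from the original)
a = np.random.rand(100, 100)
b = np.random.rand(100, 100)
%timeit dot(a, b)       # pure Python
%timeit np.dot(a, b)    # NumPy
%timeit nm_dot(a, b)    # Numba
%timeit c_dot(a, b)     # Cython1
%timeit c_dot2(a, b)    # Cython2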
Cython1 is somewhat faster than the naive Python implementation, but Cython2 is far faster. In Cython1, even though each array is typed as `np.ndarray`, its actual dtype and number of dimensions are still undetermined, so every element access is subject to dynamic checks after all and the processing is essentially equivalent to not using Cython; this is presumably why the gain is small.

As the results show, NumPy was overwhelmingly fast at computing the matrix product. I had been using NumPy a lot without knowing why it is so fast, so I looked into it a little (I should have done that first): internally it is implemented in C, and here too typing turned out to be the key, which drove home its importance. At least for operations that NumPy already provides, such as the matrix product, it seems best to use NumPy as it is, and to use Cython only when the operation is not available as a NumPy function and cannot be realized without accessing the NumPy array element by element many times.
Through these experiments, I felt that when introducing Cython in practice, the right approach is not to add types blindly, but to identify the parts where they actually pay off and improve performance efficiently there, while preserving Python's strengths such as code flexibility, readability, and the use of excellent libraries like NumPy.
- Introduction to cython
- Introduction to Cython without going deep