Hello. Good evening. My name is Tameoka. I have been working as a machine learning engineer at Globis Co., Ltd. since April 2020. At Globis, I am in charge of projects that use machine learning technology and projects that improve the operation of data infrastructure.
I think there are various systems that use machine learning technology, In the current Globis, machine learning is performed in response to user requests. We don't deal with systems that need to return results immediately, We are only dealing with systems that take a certain amount of time to learn and estimate the results asynchronously to the application. Therefore, for now, when writing logic using machine learning technology, we don't really care about speed.
On the other hand, I also do competitive programming in my spare time, I personally like to think about and write fast code.
This time, I will talk about Python code that is often used in machine learning. I have summarized the procedure while actually trying to speed up using various means. I hope you can see it if you like.
--Mac Book Pro 2019 model --Processor: 2.6 GHz 6 core Intel Core i7 --Memory: 16 GB
--Use python: 3.9.1-buster
for the Docker image of the Python runtime environment.
--Resources are set to be fully available.
The directory structure looks like this. The contents of each file will be described later.
I mount it under the fast_python
directory to/
in the Docker container and use it.
tree
fast_python
├── cython
│ ├── prime.c
│ ├── prime.cpython-37m-x86_64-linux-gnu.so
│ ├── prime.py
│ ├── prime.pyx
│ └── setup.py
├── pybind11
│ ├── prime.cpp
│ ├── prime.cpython-37m-x86_64-linux-gnu.so
│ └── prime.py
└── python
├── fast_prime.py
└── prime.py
This time, we will consider the following problems and try to speed up the code that answers them.
For the integer X given as standard input
Find the smallest prime number above X and output to standard output.
However, 2 ≤ X ≤ 10^Let it be 14.
First of all, let's put aside the speedup and just think about how to solve the problem.
Except for the constraint of 2 ≤ X ≤ 10 ^ 14, for integers greater than or equal to X, "whether or not they are prime numbers" are judged one by one. If it is a prime number, I think it will be a simple problem of returning that value. If you express this in the code, it looks like this.
/fast_python/python/prime.py
def is_prime(x: int) -> bool: #A function that determines whether it is a prime number
#I haven't thought about it yet
def minimum_prime_number():
X = int(input().strip())
answer = 0
for i in range(X, (10 ** 14 + 32)):
if is_prime(i):
answer = i
break
print(answer)
if __name__ == '__main__':
minimum_prime_number()
Here, in the process of minimum_prime_number ()
, the stop of range
of the for statement is set to 10 ^ 14 + 32
.
The reason is that the smallest prime number greater than or equal to 10 ^ 14 is 10 ^ 14 + 31.
If 10 ^ 14, which is the maximum value of X under the constraint condition, is input, 10 ^ 14 + 31 should be output, so
This range
should be sufficient in this case.
Next, let us consider the logic for determining whether it is a prime number.
Because a prime number is "a natural number greater than 1 and the only positive divisor is 1 and the number itself" I think the logic to determine whether the integer X is a prime number is as follows. This time we have a constraint of 2 ≤ X, so we don't consider cases where X is less than or equal to 1.
For the integer i from 2 to X, calculate the remainder when X is divided by i.
If there is an integer i with a remainder of 0, it returns False.
Returns True if it does not exist.
When I dropped the above logic into a Python program, it became as follows.
/fast_python/python/prime.py
def is_prime(x: int) -> bool:
if x <= 1:
return False
for i in range(2, x):
if x % i == 0:
return False
return True
def minimum_prime_number():
X = int(input().strip())
answer = 0
for i in range(X, (10 ** 14 + 32)):
if is_prime(i):
answer = i
break
print(answer)
if __name__ == '__main__':
minimum_prime_number()
Let's actually run the program.
bash
# @In a Docker container
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
2 #input
2 #output
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
3 #input
3 #output
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
4 #input
5 #output
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
5 #input
5 #output
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
6 #input
7 #output
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
7 #input
7 #output
... (The following is omitted)
Of the prime numbers greater than or equal to the input integer, the smallest one is output, which looks good.
Next, enter 10 ^ 14 as the corner case. We expect 10 ^ 14 + 31 as output.
bash
root@xxxxxxxxxxxx:/fast_python/python# python prime.py
1000000000000 #input
...
Oh, no results are returned.
I waited for about 10 minutes, but no results were returned. It seems that the number entered is too large and the for statement loops a lot, which takes a lot of time.
There is no upper limit on the processing time for this issue, but it's a bit annoying that no results are returned after 10 minutes. I'd like to somehow speed up this code so that the results come back quickly.
As a first step in speeding up your code, it's a good idea to identify where the processing is slow. There are likely to be several specific methods, but the profiler gives you a detailed picture of how long each process takes.
cProfile
It seems that there are various Python profiling tools as well,
There is a built-in tool called cProfile
.
Let's use this to see how long each function takes to execute.
If you refer to the Official Document, just execute the following command It seems that you can profile the specified Python code file.
bash
root@xxxxxxxxxxxx:/fast_python/python# python -m cProfile prime.py
This time I want to check the time taken for each process, so sort in the order of tottime
Execute with the -s
option as shown below.
bash
root@xxxxxxxxxxxx:/fast_python/python# python -m cProfile -s tottime prime.py
What is tottime
?
"The total time spent on a given function (excluding the time spent on calling sub-functions)".
When I actually executed the above command, I got the following output. For the standard input at the time of execution, I specified 10 ^ 7, which takes some time to process but returns the output properly.
bash
root@xxxxxxxxxxxx:/fast_python/python# python -m cProfile -s tottime prime.py
10000000
10000019
237 function calls in 2.342 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.675 1.675 1.675 1.675 {built-in method builtins.input}
20 0.664 0.033 0.664 0.033 prime.py:1(is_prime)
4 0.000 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:1438(find_spec)
16 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:62(_path_join)
16 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:64(<listcomp>)
... (The following is omitted)
Looking at the output result of the profiler, the process that takes the longest time is as follows.
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.675 1.675 1.675 1.675 {built-in method builtins.input}
However, this includes the waiting time for standard input. It means that my standard input typing is slow. In short, I think this can be ignored.
The problem is the second line.
ncalls tottime percall cumtime percall filename:lineno(function)
20 0.664 0.033 0.664 0.033 prime.py:1(is_prime)
is_prime ()
is the slowest function in the actual process, which takes more than 0.6 seconds overall.
Apparently, the time taken for other processing is 0.001 seconds or less and can be ignored, and this is the bottleneck.
I think there are some programs that are difficult to speed up, but in this case we will try to speed up by modifying the logic.
If the processing is taking a long time due to the large number of loops of the for statement, the first thing that comes to mind for speeding up the processing is
I think it's about reducing the number of loops.
In fact, in this case, you can reduce the number of loops while meeting the requirements for answering the question.
In is_prime ()
, the number of loops can be from 2 to "the largest integer less than or equal to the square root of X".
The proof comes out when you google, but this site was easy to understand.
The following is the actual application of this to the code.
/fast_python/python/fast_prime.py
import math
def is_prime(x: int) -> bool:
if x <= 1:
return False
for i in range(2, (math.floor(math.sqrt(x)) + 1)): #Set the maximum integer less than or equal to the square root as the upper limit
if x % i == 0:
return False
return True
def minimum_prime_number():
X = int(input().strip())
answer = 0
for i in range(X, (10 ** 14 + 32)):
if is_prime(i):
answer = i
break
print(answer)
if __name__ == '__main__':
minimum_prime_number()
When I actually executed it, it became as follows. The standard input is unchanged from 10 ^ 7.
root@xxxxxxxxxxxx:/fast_python/python# python -m cProfile -s tottime fast_prime.py
10000000
10000019
277 function calls in 1.744 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.738 1.738 1.738 1.738 {built-in method builtins.input}
20 0.001 0.000 0.001 0.000 fast_prime.py:4(is_prime) # is_prime()Processing time
4 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:1438(find_spec)
16 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:62(_path_join)
16 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:64(<listcomp>)
... (The following is omitted)
Like this, the processing time of is_prime ()
has been reduced to 0.001 seconds.
The code is_prime ()
before speeding up took 0.664 seconds, so
The processing time is actually 1/664 compared to before the speedup.
In this case, even if you give 10 ^ 14, which took more than 10 minutes to process, to the standard input, the result is likely to be returned.
bash
root@xxxxxxxxxxxx:/fast_python/python# python fast_prime.py
100000000000000
100000000000031
When I actually tried it, the result came back in a few seconds. I'm happy. Let's profile and measure the processing time.
bash
root@xxxxxxxxxxxx:/fast_python/python# python -m cProfile -s tottime fast_prime.py
100000000000000
100000000000031
318 function calls in 3.529 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.602 2.602 2.602 2.602 {built-in method builtins.input}
32 0.922 0.029 0.922 0.029 fast_prime.py:4(is_prime) # is_prime()Processing time
4 0.000 0.000 0.002 0.001 <frozen importlib._bootstrap_external>:1356(find_spec)
16 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:62(_path_join)
16 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:64(<listcomp>)
... (The following is omitted)
The processing time for is_prime ()
was 0.922 seconds.
Before speeding up, the result was not returned even after 10 minutes, so it is a big difference in comparison.
In this way, we were able to reduce the number of loops while satisfying the necessary conditions for solving the problem, Let's think about how to make it even faster.
It seems that speeding up by modifying the logic can not be expected any more, By using Cython, you may be able to speed up the process without changing the existing logic.
To use it, first do pip install
as shown below.
bash
root@xxxxxxxxxxxx:/fast_python/cython# pip3 install cython
Next, cut out the processing you want to speed up as shown below into a file with the extension .pyx
.
I will write the code using Cython here.
As a general approach for speeding up using Cython, the variables to be used are as follows.
There is a way to declare it as a C language variable in the form cdef
. I will try this this time.
/fast_python/cython/prime.pyx
import cython
import math
def is_prime(x: int) -> bool:
cdef:
long i, stop #Declare a variable by specifying the C language type
stop = math.floor(math.sqrt(x) + 1)
if x <= 1:
return False
for i in range(2, stop):
if x % i == 0:
return False
return True
Next, prepare the following setup file.
/fast_python/cython/setup.py
from distutils.core import setup, Extension
from Cython.Build import cythonize
ext = Extension("prime", sources=["prime.pyx"])
setup(name="prime", ext_modules=cythonize([ext]))
Once these are ready, execute the following command.
bash
root@xxxxxxxxxxxx:/fast_python/cython# python setup.py build_ext --inplace
When executed, it will have a C file called prime.c
in the current directory.
A shared library file called prime.cpython-37m-x86_64-linux-gnu.so
will be created.
As a result, the is_prime ()
defined using Cython is changed as shown below.
It will be available by import
in your Python code.
/fast_python/cython/prime.py
from prime import is_prime #Is defined using Cython_prime()Import
def minimum_prime_number():
X = int(input().strip())
answer = 0
for i in range(X, (10 ** 14 + 32)):
if is_prime(i):
answer = i
break
print(answer)
if __name__ == '__main__':
minimum_prime_number()
Let's run this code and try profiling.
bash
root@xxxxxxxxxxxx:/fast_python/cython# python -m cProfile -s tottime prime.py
100000000000000
100000000000031
355 function calls (348 primitive calls) in 3.416 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.755 2.755 2.755 2.755 {built-in method builtins.input}
32 0.655 0.020 0.655 0.020 {prime.is_prime} # is_prime()Processing time
2 0.001 0.001 0.001 0.001 {built-in method _imp.create_dynamic}
5 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
17 0.000 0.000 0.001 0.000 <frozen importlib._bootstrap_external>:56(_path_join)
... (The following is omitted)
The bottleneck is_prime ()
now takes 0.655 seconds to process.
Compared to the previous processing time of 0.922 seconds, the processing time is about 2/3, which means that the speed has been further increased.
The processing time was shortened by speeding up the code and using Cython, but Further speeding up may be required. (I should say that it may not be so much.) In such a situation, writing in a faster programming language may be one of the approaches.
This time, the code of the is_prime ()
function written in Python,
After rewriting in C ++, a fast programming language, try calling it from Python code and executing it.
The code for rewriting the is_prime ()
function in C ++ is as follows.
/fast_python/pybind11/prime.cpp
bool is_prime(long x) {
for (long i = 2; i <= sqrt(x); i++) {
if (x % i == 0)
return false;
}
return true;
}
I want to call this process from Python code.
There seem to be several ways to do this, but this time I used pybind11
, which is easy to use.
As with Cython, first do pip install
as shown below.
bash
root@xxxxxxxxxxxx:/fast_python/pybind11# pip3 install pybind11
Next, add the code for binding as shown below to the C ++ code created earlier.
/fast_python/pybind11/prime.cpp
#include <pybind11/pybind11.h> //Add here
bool is_prime(long x) {
for (long i = 2; i <= sqrt(x); i++) {
if (x % i == 0)
return false;
}
return true;
}
PYBIND11_MODULE(prime, m) { //from here
m.def("is_prime", &is_prime); //
} //Add up to here
Finally, execute the following command to compile.
bash
root@xxxxxxxxxxxx:/fast_python/pybind11# g++ -O2 -Wall -shared -std=c++11 -fPIC `python3 -m pybind11 --includes` prime.cpp -o prime`python3-config --extension-suffix`
When executed, it will be called prime.cpython-37m-x86_64-linux-gnu.so
in the current directory as in Cython.
A shared library file is created.
You can now use C ++-defined functions by import
in your Python code.
/fast_python/pybind11/prime.py
from prime import is_prime # C++Function defined in is_prime()Import
def minimum_prime_number():
X = int(input().strip())
answer = 0
for i in range(X, (10 ** 14 + 32)):
if is_prime(i):
answer = i
break
print(answer)
if __name__ == '__main__':
minimum_prime_number()
When I executed this and actually measured the processing time using cProfile
, it became as follows.
bash
root@xxxxxxxxxxxx:/fast_python/pybind11# python -m cProfile -s tottime prime.py
100000000000000
100000000000031
138 function calls in 2.760 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.670 2.670 2.670 2.670 {built-in method builtins.input}
32 0.085 0.003 0.085 0.003 {built-in method prime.is_prime} # is_prime()Processing time
1 0.002 0.002 0.002 0.002 {built-in method _imp.create_dynamic}
1 0.000 0.000 2.756 2.756 prime.py:4(minimum_prime_number)
1 0.000 0.000 0.001 0.001 <frozen importlib._bootstrap>:882(_find_spec)
... (The following is omitted)
The processing time for is_prime ()
is now 0.085 seconds.
The processing time of is_prime ()
when using Cython was 0.655 seconds, so
From there, we were able to reduce the processing time to about 1/8.
There seems to be other ways to speed up, but this time I'll stop here.
The correspondence table of the processing time of the bottleneck (is_prime ()
) for each means is summarized below.
All of these processing times are for 10 ^ 14 as input.
means | Bottleneck processing time(Seconds) |
---|---|
No special action | 600 seconds or more |
for Speeding up to reduce the number of loops | 0.922 seconds |
for Speeding up to reduce the number of loops&Use Cython | 0.655 seconds |
for Speeding up to reduce the number of loops& C++ & pybind11 use |
0.085 seconds |
As you can see, there are many ways to speed up Python code.
Also, if you use C ++ and pybind11
, you can speed up the parts that can be speeded up in the logic.
It turned out that a considerable speedup can be expected.
In fact, compared to the untouched code, the processing time when using C ++ and pybind11
is less than 1/7000.
This time, the verification is limited to the viewpoint of speed, so if you look only at this
It seems that it will be faster if you use pybind11
hard, but
At the actual site, I think it is better to decide the policy while considering various factors such as maintainability, man-hours, and resource status.
Until the end Thank you for reading.
-High Performance Python --O'Reilly Japan
Recommended Posts