A program implemented in Python may be too slow to meet its requirements. This article presents four kinds of countermeasures for such cases.
Every speed-up trades something away. Typical trade-offs involve expressiveness, readability, dependencies, memory usage, and CPU usage. Apply each technique in the right place and in the right amount.
Before optimizing, use a profiling tool to find out which part of the script is actually slow. Optimizing code that is not a bottleneck has little effect on the overall execution time. Profiling procedures are outside the scope of this article.
This article introduces the following four kinds of speed-up techniques.
Strictly speaking, the later measures such as "using a library" and "creating a library" are also forms of "rewriting a script". This section, however, covers only what can be done within Python's language specification and standard library. Because everything stays within the standard library, these techniques add almost no dependencies (although some depend on the Python version).
The Python interpreter performs little optimization. Since there are usually many ways to write the same thing, you can often gain speed simply by choosing a faster one.
In a real application, however, the stylistic differences shown here rarely matter much. Often you cannot think of such a rewrite, or the slow part is likely to be elsewhere.
Of course, small savings add up. If, for example, you tried to implement a database in pure Python, the kinds of slowness shown here would accumulate until the whole thing became unusable.
For example, consider summing the integers below n. Implement it three ways, (1) adding in a for loop, (2) using the built-in [sum](https://docs.python.org/ja/3/library/functions.html#sum), and (3) using the closed-form summation formula, and measure the time; the results are as follows.
```python
In [1]: %%timeit
   ...: total = 0  # renamed from `sum` to avoid shadowing the built-in used below
   ...: for n in range(1000000):
   ...:     total += n
87.3 ms ± 492 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [2]: %%timeit
   ...: sum(range(1000000))
33.3 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: ((1000000-1)*1000000)//2
13 ns ± 0.538 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
```
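For reference, the closed form used in (3) is $\sum_{k=0}^{n-1} k = \frac{n(n-1)}{2}$, which for $n = 1000000$ is exactly ((1000000-1)*1000000)//2. CPython folds this constant expression at compile time, which is why the measurement shows nanoseconds: only a precomputed constant is being loaded.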
In this example the summation formula wins by far, but such a rewrite will not always occur to you.
When building a list, implement it four ways: (1) a [list comprehension](https://docs.python.org/ja/3/reference/expressions.html#displays-for-lists-sets-and-dictionaries), (2) append in a for loop, (3) append in a for loop with the append method cached, and (4) the built-in list function. Measuring the time gives the following.
```python
In [1]: %%timeit
   ...: l = [n for n in range(1000000)]
81.7 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [2]: %%timeit
   ...: l = []
   ...: for n in range(1000000):
   ...:     l.append(n)
130 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: l = []
   ...: l_append = l.append
   ...: for n in range(1000000):
   ...:     l_append(n)
97.9 ms ± 2.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: l = list(range(1000000))
58.1 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
This example merely builds a list; in practice you usually do some processing on each element before appending it. That per-element processing is often the slower part, so speeding it up tends to pay off more.
Algorithms and data structures have been studied for a long time. Whether an algorithm's complexity is $O(n)$ or $O(n^2)$ makes an enormous difference in execution time. You can implement algorithms and data structures yourself, or use the convenient classes provided by the standard library.
Here are some examples of algorithms and data structures found in the standard library.
If you need to check many times whether a list contains a particular element, it is faster to convert the list to a [set](https://docs.python.org/ja/3/library/stdtypes.html#set-types-set-frozenset) first and then test membership with in. If you only check once, the time spent building the set outweighs the benefit.
```python
In [1]: %%timeit
   ...: l = list(range(10000))
   ...: for n in range(10000):
   ...:     n in l
1.04 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: %%timeit
   ...: l = list(range(10000))
   ...: s = set(l)
   ...: for n in range(10000):
   ...:     n in s
1.45 ms ± 14.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Here a list of only 10,000 elements was searched just 10,000 times, and the difference was already this large. Workloads of that size appear often in real applications.
Use collections.deque when you repeatedly add values to the front of a list.
```python
In [1]: from collections import deque

In [2]: %%timeit
   ...: l = []
   ...: for n in range(100000):
   ...:     l.insert(0, n)
6.29 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %%timeit
   ...: dq = deque()
   ...: for n in range(100000):
   ...:     dq.appendleft(n)
11.7 ms ± 93.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
A Python list is essentially a variable-length array. It makes no assumptions about element type or ordering, which makes it versatile, but that leaves room for speed and efficiency gains in specific workloads (list.insert(0, x) is $O(n)$, while deque.appendleft(x) is $O(1)$). Besides collections.deque, the standard library offers modules such as heapq and [bisect](https://docs.python.org/ja/3/library/bisect.html) for handling collections of data.
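As a small illustration of those two modules, here is a sketch (the variable names are mine; the calls are the documented standard-library APIs):

```python
import bisect
import heapq

# bisect: O(log n) binary search over a list kept in sorted order
sorted_list = [1, 3, 5, 7, 9]
bisect.insort(sorted_list, 4)             # insert 4, preserving sort order
idx = bisect.bisect_left(sorted_list, 5)  # index where 5 is found

# heapq: O(log n) push/pop, with the smallest element always popped first
heap = []
for value in [5, 1, 4, 2]:
    heapq.heappush(heap, value)
smallest = heapq.heappop(heap)  # 1
```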
Caching means keeping the result of a computation in memory and reusing it, or dumping data fetched from a database to disk in pickle format. Anything that always returns the same result for the same arguments, such as a mathematical function, can be cached. In some cases reading a local file is faster than a round trip to the database.
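As a sketch of the pickle-based approach: fetch_rows and the cache file name below are hypothetical placeholders; only pickle.dump and pickle.load are the actual standard-library API.

```python
import os
import pickle

CACHE_PATH = "query_cache.pkl"  # hypothetical cache file name

def cached_rows(fetch_rows):
    """Return rows from the local pickle cache, querying only on a miss."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    rows = fetch_rows()  # hypothetical slow database query
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(rows, f)
    return rows
```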
The standard library provides functools.lru_cache, a decorator that memoizes a function: the result of each call is cached, and when the same arguments appear again, the cached result is reused.
```python
In [1]: from functools import lru_cache

In [2]: def fib(n):
   ...:     return n if n <= 1 else fib(n-2) + fib(n-1)

In [3]: %%timeit
   ...: fib(30)
448 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: @lru_cache()
   ...: def fib(n):
   ...:     return n if n <= 1 else fib(n-2) + fib(n-1)

In [5]: %%timeit -n1 -r1
   ...: fib(30)
24.2 μs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
```
All you need to do is add the @lru_cache decorator. Of course, you rarely need Fibonacci numbers in a real application, and if you did you would probably implement them differently; a more typical use is caching expensive computations such as integrals in machine learning.
You can speed things up by parallelizing with the standard library's threading and [multiprocessing](https://docs.python.org/ja/3/library/multiprocessing.html) modules, or with concurrent.futures, added in Python 3.2. If the CPU is not the bottleneck, use threading; otherwise use multiprocessing. Threads share data more easily, but because of the GIL (Global Interpreter Lock) they cannot exceed 100% of one CPU core, so if you want to saturate a multi-core CPU, use multiprocessing.
concurrent.futures is easy to use. ProcessPoolExecutor runs tasks in parallel across multiple processes; there is also a multithreaded ThreadPoolExecutor.
```python
from concurrent.futures import ProcessPoolExecutor, wait
from random import uniform
from time import sleep

def f(x):
    sleep(uniform(0, 1) * 10)  # simulate a slow task
    return x

# On some platforms this must run under `if __name__ == "__main__":`.
with ProcessPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(f, x) for x in range(10)]
    done, not_done = wait(futures)
    results = [future.result() for future in futures]
```
Just wrap the work you want to parallelize in a function, specify the number of workers, and submit it. This pattern is worth memorizing as boilerplate.
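For I/O-bound work, the same pattern runs on threads simply by swapping in ThreadPoolExecutor. A minimal sketch, reusing the f defined above:

```python
from concurrent.futures import ThreadPoolExecutor

# Threads suit I/O-bound tasks (network or disk waits); because of the
# GIL they do not help CPU-bound work.
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(f, range(10)))
```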
If all you want is multiple processes, the shell can do the job instead of Python. Make the function f above runnable from the command line and launch it in parallel from the shell.
For example, suppose the input file (input) looks like this:
input

```
0
1
2
3
4
5
6
7
8
9
```
Use the split command to split appropriately.
```shell
$ split -l 1 input
```
Run the Python script on each of the split files:
```shell
$ for file in x*; do python f.py ${file} & done
```
Combine the divided execution results.
```shell
$ cat x*.result > result
```
You could do all of this in Python, but in some situations it is easier to get multiple processes just by appending & in the shell.
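The worker script itself is not shown here, so below is a hypothetical minimal f.py consistent with the commands above: it reads the given chunk file and writes its contents to <file>.result.

```python
# f.py -- hypothetical worker for the shell pipeline above
import sys

infile = sys.argv[1]
with open(infile) as f:
    x = f.read().strip()  # stand-in for real per-chunk processing

with open(infile + ".result", "w") as f:
    f.write(x + "\n")
```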
Python ships with a rich standard library, and far more third-party libraries are published on PyPI, where they can be installed easily with the pip command.
Numerical code can run fast by using a library specialized for numerical computation, such as NumPy, and there are libraries optimized for other specific kinds of computation as well. CuPy makes it easy to run GPU computations from Python via CUDA.
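As a sketch of the CuPy side (assuming the cupy package and a CUDA-capable GPU; not benchmarked here), the API deliberately mirrors NumPy:

```python
import cupy as cp  # requires a CUDA-capable GPU

x = cp.arange(100000)
y = cp.arange(100000)
z = x + y               # element-wise addition runs on the GPU
z_host = cp.asnumpy(z)  # copy the result back to host memory
```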
NumPy speeds up matrix and vector computation. Not only is it faster, the code is also simpler and easier to read. Here we add two sequences of 100,000 elements each (think of a flattened 100x1000 matrix).
```python
In [1]: import numpy as np

In [2]: %%timeit
   ...: X = range(100000)
   ...: Y = range(100000)
   ...: Z = [x + y for x, y in zip(X, Y)]
13.7 ms ± 140 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %%timeit
   ...: X = np.arange(100000)
   ...: Y = np.arange(100000)
   ...: Z = X + Y
416 μs ± 2.33 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
NumPy is overwhelmingly faster. A Python list must accommodate arbitrary element types and mutating list methods, whereas a NumPy array has a fixed element type and can delegate work to optimized libraries such as BLAS.
A library's implementation language can also make a large difference in speed. Python 2, which will soon be unsupported, shipped two libraries, pickle and [cPickle](https://docs.python.org/ja/2.7/library/pickle.html#module-cPickle), with a considerable gap in execution time between them. In Python 3, cPickle was renamed to _pickle, and importing pickle automatically uses it internally when it is available.
```python
In [1]: import pickle

In [2]: import cPickle

In [3]: l = range(1000000)

In [4]: %%timeit
   ...: with open("pickle.pkl", "wb") as f:
   ...:     pickle.dump(l, f)
1 loops, best of 3: 3.75 s per loop

In [5]: %%timeit
   ...: with open("cPickle.pkl", "wb") as f:
   ...:     cPickle.dump(l, f)
1 loops, best of 3: 376 ms per loop
```
Because this session ran under IPython on Python 2, the output format differs from the other examples, but the execution time differs by roughly a factor of 10.
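On Python 3 you can check which implementation you are getting; on CPython builds where the C accelerator is available, the following prints _pickle:

```python
import pickle

# pickle re-exports the C implementation (_pickle) when it is available,
# so Pickler's defining module reveals which one is in use.
print(pickle.Pickler.__module__)
```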
Recent CPUs also offer various acceleration features. Execution time can differ depending on whether those features are enabled and whether the library you call knows how to exploit them. The following page explains this in detail.
If a Python library for your task already exists, you can simply install and use it, but one does not always exist. There is also overhead that cannot be eliminated as long as the code is written in Python. If you still need more speed, the following approaches are available.
Python's standard library includes [ctypes](https://docs.python.org/ja/3/library/ctypes.html), which can load a shared library (a Windows .dll or a Linux .so file) and call the functions inside it. It is useful when a fast implementation of something slow in Python already exists but has no Python bindings.
Python's math module has no cbrt function for cube roots, but the C standard math library libm does. To call it, do the following:
```python
In [1]: from ctypes import cdll, c_double

In [2]: libm = cdll.LoadLibrary("libm.so.6")

In [3]: libm.cbrt.restype = c_double

In [4]: libm.cbrt.argtypes = (c_double,)

In [5]: libm.cbrt(8.0)
Out[5]: 2.0
```
This particular example is not very compelling, but it shows how little code the call takes. Since looping on the Python side is slow, the real gains come from calling a native function that performs the loop itself.
Python's C extension mechanism lets you build libraries in C or C++ that can be called from Python. CPython is, in a sense, a wrapper around libraries written in C/C++, so here we write the side being wrapped.
Let's create an addition function as a C extension. First, write the following C source code.
samplemodule.c
```c
#include <Python.h>

/* Parse two ints from the Python arguments, add them,
   and return the sum as a Python integer. */
static PyObject *sample_add(PyObject *self, PyObject *args) {
    int x, y, z;
    if (!PyArg_ParseTuple(args, "ii", &x, &y)) {
        return NULL;
    }
    z = x + y;
    return PyLong_FromLong(z);
}

static PyObject *SampleError;

/* Method table: maps the Python name "add" to sample_add. */
static PyMethodDef SampleMethods[] = {
    {"add", sample_add, METH_VARARGS, "Sum x and y then return the sum."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef samplemodule = {
    PyModuleDef_HEAD_INIT, "sample", NULL, -1, SampleMethods
};

/* Module initializer, called on `import sample`. */
PyMODINIT_FUNC PyInit_sample(void) {
    PyObject *m;
    m = PyModule_Create(&samplemodule);
    if (m == NULL) {
        return NULL;
    }
    SampleError = PyErr_NewException("sample.error", NULL, NULL);
    Py_XINCREF(SampleError);
    if (PyModule_AddObject(m, "error", SampleError) < 0) {
        Py_XDECREF(SampleError);
        Py_CLEAR(SampleError);
        Py_DECREF(m);
        return NULL;
    }
    return m;
}
```
Compile with gcc, passing the directory that contains the Python header files.
```shell
$ gcc -I`python -c 'from distutils.sysconfig import *; print(get_python_inc())'` -DPIC -shared -fPIC -o sample.so samplemodule.c
```
Now that sample.so is complete, start the Python interpreter (IPython) and load it.
```python
In [1]: import sample

In [2]: sample.add(1, 2)
Out[2]: 3
```
Boost.Python lets you implement a Python library with much less code, and you rarely have to worry about differences between Python versions. In exchange it depends on the Boost library, so the resulting module will not load in an environment without Boost. Depending on how it is written, a library built with Boost.Python can be as fast as one implemented directly in C.
Let's create the same addition function with Boost.Python. The following C++ source is all it takes.
sample.cpp
```cpp
#include <boost/python.hpp>

int add(int x, int y) {
    return x + y;
}

// Expose add() to Python as sample.add.
BOOST_PYTHON_MODULE(sample) {
    using namespace boost::python;
    def("add", &add);
}
```
Compile with g++, passing the Python header directory and linking against Boost.Python.
```shell
$ g++ -I`python -c 'from distutils.sysconfig import *; print(get_python_inc())'` -lboost_python -DPIC -shared -fPIC -o sample.so sample.cpp
```
Now that sample.so is complete, start the Python interpreter (IPython) and load it.
```python
In [1]: import sample

In [2]: sample.add(1, 2)
Out[2]: 3
```
The code is far shorter than writing against the raw C API from scratch. Nice.
With Cython, you write source code in a slightly extended dialect of Python and generate a library about as fast as one written in C. The procedure is: write a pyx file in the Python-like language, generate C source from it, and compile that.
Let's write a Fibonacci function in Cython. First, create the pyx file. The syntax is almost the same as Python, with a little type information added.
fib.pyx
```cython
def fib(int n):
    return n if n <= 1 else fib(n-2) + fib(n-1)
```
Convert the pyx file to C source code with the cython command. Then compile with gcc.
```shell
$ cython fib.pyx
$ gcc -I`python -c 'from distutils.sysconfig import *; print(get_python_inc())'` -DPIC -shared -fPIC -o fib.so fib.c
```
Now that fib.so is complete, start the Python interpreter (IPython) and load it.
```python
In [1]: from fib import fib

In [2]: %%timeit
   ...: fib(30)
304 ns ± 2.53 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Compared with the pure-Python version's 448 ms ± 4.22 ms, this is a huge improvement.
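Cython usually gets faster still as you add static types. A sketch of a more fully typed variant (mine, not benchmarked): cpdef gives the function a fast C-level entry point in addition to its Python wrapper, so the recursive calls avoid Python object overhead.

```cython
# Hypothetical typed variant of fib.pyx
cpdef long fib_typed(long n):
    return n if n <= 1 else fib_typed(n-2) + fib_typed(n-1)
```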
So far we have sped things up by writing libraries in C or Cython, which meant learning languages other than Python, and learning a new language can be a daunting task. This section introduces techniques that speed up an existing Python program as it is.
Numba uses LLVM to speed up functions: it translates Python bytecode into LLVM IR and compiles that to machine code. As described in "What you can do before a Python script is executed" (Qiita), Python is a VM-based scripting language whose processing works by straightforwardly executing intermediate instructions (bytecode).
According to the guide on the official site, Numba does not merely optimize the bytecode; it generates machine instructions and replaces the function calls with them. Impressive...
```python
In [1]: from numba import jit

In [2]: @jit
   ...: def fib(n):
   ...:     return n if n <= 1 else fib(n-2) + fib(n-1)

In [3]: %%timeit
   ...: fib(30)
12.6 ms ± 187 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
PyPy is a Python implementation written in RPython, a restricted subset of Python that is easier to optimize, and it is compatible with Python 2.7 and Python 3.6. According to the Python compatibility page on the PyPy site, well-known libraries such as Django and Flask are confirmed to work, and according to PyPy Speed its execution speed is excellent.
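Using PyPy requires no code changes: after installing it, run the same script with the pypy (or pypy3) executable instead of python, e.g. `pypy3 fib.py` for a hypothetical fib.py.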
In this article, I introduced the four kinds of speed-up methods described above.
Which method is right for a slow piece of code depends on the situation. Ideally you consider the options and apply the one that fits best.
The benchmarks here used only the Fibonacci sequence, which is hardly a typical application workload; depending on what you want to speed up, Cython may beat Numba or vice versa. Also, the implementation here is naively recursive; removing the recursion in the first place would be faster still...
May your Quality of Python Life be wonderful.
In writing this article, I referred to the following article.
I have also been helped by many wonderful articles and books beyond the ones above while learning Python; I am sorry I cannot introduce them all here. Thank you to everyone who publishes such great information. I am sure it helps others learning Python as well.