I recently had the opportunity to write an orchestration-layer (BFF) application in Python.
Asyncio was introduced in Python 3.4, so I/O-bound work can be handled efficiently even on a single thread, but CPU-bound work is still constrained by the GIL: within a single process, parallel execution is limited.
From this it is clear that, as a language characteristic, Python is better suited to handling many I/O-bound operations than CPU-bound ones. That is an important factor when choosing a language, and I felt I first needed to properly understand how the GIL works, so I investigated it.
What is the GIL in the first place?
Formally the Global Interpreter Lock, it is an exclusive locking mechanism found in languages such as Python and Ruby. Looking at just these two languages, it may seem like a trait of dynamically typed languages, but it is really tied to their integration with C.
Before explaining the GIL itself, note that the blanket statement "Python has a GIL" invites misunderstanding, so let's unpack it a little more carefully.
To begin with, the Python language has multiple implementations. The most widely used is CPython, implemented in C, and it is usually what people implicitly mean when they describe "Python's" language characteristics.
Other well-known implementations are Jython, implemented in Java, and IronPython, which runs on the .NET Framework; neither has a GIL. Hearing only that, you might wonder why CPython is the one everyone uses. Major libraries such as NumPy are often implemented in C, and CPython also receives the most frequent language updates, which is presumably why it dominates.
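If you want to check which implementation your interpreter is, the standard library can tell you; a small sketch:

```python
import platform

# Report which Python implementation is running; the GIL discussion
# in this article applies to "CPython".
impl = platform.python_implementation()
print(impl)
```

On the reference interpreter this prints `CPython`; on Jython or PyPy it would print those names instead.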
With that background in place, the explanation of the GIL below is based on the CPython implementation.
Now, back to the main subject. Roughly speaking, the GIL means that **even under multiple threads, bytecode can be executed only by the single thread holding the lock, while the other threads wait**. The lock is released at regular intervals, and whichever thread newly acquires it runs next.
The locking mechanism is described later; for now, the key point is that CPU-bound work is executed by only one thread at a time, so parallelization is limited.
Let's actually see the effect of the GIL in code. First, run a simple program that counts down from a large number.
countdown.py
```python
from datetime import datetime
from threading import Thread

def countdown():
    n = 10000000
    while n > 0:
        n -= 1

if __name__ == '__main__':
    start = datetime.now()
    t1 = Thread(target=countdown)
    t2 = Thread(target=countdown)
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    end = datetime.now()
    print(f"Time: {end - start}")
```
This program runs the 10-million countdown in 2 threads, and the execution time was about 1.1 seconds in my environment. So what about running a single countdown in one thread?
```python
if __name__ == '__main__':
    start = datetime.now()
    countdown()
    end = datetime.now()
    print(f"Time: {end - start}")
```
This finished in about 0.53 seconds. The two-thread run took twice as long to do twice the work, which means the threads were not actually running in parallel: CPU-bound processing is executed by only one thread at a time.
But what about processing that is not CPU-bound? Replace the countdown with a sleep and run it in 2 threads.
sleep.py
```python
import time

def sleep():
    time.sleep(2.0)
```
This time the whole run completes in about 2 seconds. If sleeping behaved like CPU-bound work, 2 threads × 2 seconds would take 4 seconds, but it finishes in half that. This is because the lock is released when sleep is called, the waiting thread starts its own sleep immediately afterwards, and the two waits overlap, running effectively in parallel.
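For completeness, here is a runnable version of the two-thread sleep experiment; the thread setup mirrors the countdown example above (the function is named `sleep_task` here only to avoid shadowing `time.sleep`):

```python
import time
from datetime import datetime
from threading import Thread

def sleep_task():
    # time.sleep releases the GIL, so both threads can wait concurrently
    time.sleep(2.0)

start = datetime.now()
t1 = Thread(target=sleep_task)
t2 = Thread(target=sleep_task)
t1.start()
t2.start()
t1.join()
t2.join()
elapsed = (datetime.now() - start).total_seconds()
print(f"Time: {elapsed:.2f}s")  # close to 2 seconds, not 4
```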
Incidentally, the lock is held while executing Python bytecode, not necessarily while using the CPU.
So why does the GIL, which limits parallel processing, exist in the first place? This is not a conclusion I derived by reading the CPython source myself, but the following seem to be the main reasons.
- To simplify low-level mechanisms such as memory management and C integration.
- CPython makes it easy to work with libraries implemented in C, but those libraries are usually not thread-safe.
For these reasons, CPython requires that only one thread at a time can execute bytecode, and the GIL is the mechanism that enforces this.
However, this is a property not of the Python language itself but of CPython, which is implemented in C. For example, Jython, implemented in Java, has no GIL, because the JVM's thread management prevents such conflicts even under multithreading.
Even so, CPython remains dominant, presumably because the ability to leverage C-language assets and its active development are judged to outweigh the benefit of avoiding the GIL.
(The description in this section is based on Understanding the Python GIL.)
The CPython GIL mechanism changed in version 3.2; the current scheme revolves around a lock-release request called **gil_drop_request**.
For example, if there is only one thread, it simply keeps executing until it finishes, because no release request ever arrives.
Things are different with multiple threads. A suspended thread waits 5 ms by default and then sets gil_drop_request to 1. The running thread then releases the lock and signals that it has done so.
When a thread waiting for the lock receives the signal, it acquires the lock and in turn sends a signal announcing the acquisition. The thread that just released the lock enters the suspended state on receiving that signal.
(* All images are quoted from Understanding the Python GIL)
After its timeout, each waiting thread again sets gil_drop_request to try to acquire the lock, and the threads repeat this cycle of acquisition and release.
The 5 ms default that waiting threads use can be read from Python code with sys.getswitchinterval(), and changed with sys.setswitchinterval(interval).
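For example (the 0.005-second default assumes CPython 3.2 or later):

```python
import sys

# Read the current GIL switch interval; the default is 5 ms (0.005 s)
print(sys.getswitchinterval())

# Raise it to 10 ms: fewer GIL handoffs, at the cost of coarser fairness
sys.setswitchinterval(0.01)
interval = sys.getswitchinterval()
print(interval)  # 0.01
```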
From Python 3.2 the lock is released via gil_drop_request, but before that, the lock was released per execution unit called a "tick". The tick threshold can still be read with sys.getcheckinterval(), but since it is no longer used after the change of locking method, doing so emits the following warning:

```
DeprecationWarning: sys.getcheckinterval() and sys.setcheckinterval() are deprecated. Use sys.getswitchinterval() instead.
```
So why did the lock release method change?
As mentioned, a waiting thread now sends the release request; previously, the running thread itself released the lock after 100 ticks by default. That scheme had several problems on multi-core machines.
Let's start with the single-core case. When a running thread releases the lock, it signals one of the waiting threads. The signaled thread is placed in the run queue, and the OS scheduler decides by priority whether the thread that just released the lock or the signaled thread runs next.
(The same thread may acquire the lock repeatedly, which can actually be desirable given the overhead of context switching.)
On multi-core machines, however, there is more than one runnable thread trying to acquire the lock, and one of them will fail. These futile acquisition attempts are pure overhead, and, worse, waiting threads can be almost starved of the lock.
A waiting thread resumes with some lag, so by the time it tries to acquire the lock, the thread that just released it has often already re-acquired it. In a long-running job, a single thread can reportedly hold on to the lock for many minutes at a stretch.
In addition, when I/O operations that complete immediately thanks to OS buffering occur frequently, the lock is released and re-acquired on every I/O wait, which adds significant load.
Given these issues, the current scheme, in which the waiting thread sends the request, is an improvement.
So is the current GIL free of problems? Not quite. Understanding the Python GIL introduces two drawbacks.
First, with three or more threads, the thread that issued the release request may fail to acquire the lock, which can be snatched by a later-arriving thread.
(* Quoted from Understanding the Python GIL)
In the image above, thread 2 requests the lock release after its timeout, and thread 1 releases the lock and signals accordingly. Thread 2 should have acquired the lock, but in the meantime thread 3 joined the queue and acquired it first.
Thus, depending on timing, lock acquisition can be biased toward particular threads, making parallel processing inefficient.
Also, when a CPU-bound thread and an I/O-bound thread run at the same time, an inefficiency known as the convoy effect can occur.
From the viewpoint of the process as a whole, it is efficient to give the I/O-bound thread the lock first, hand control to the CPU-bound thread while the I/O is in flight, and give the lock back to the I/O-bound thread when the I/O completes. If, instead, only the CPU-bound thread holds the lock, the I/O-bound work piles up and total execution time grows by the accumulated I/O wait time.
However, threads have no priorities, so there is no way to control which thread acquires the lock first. If two threads are waiting, the CPU-bound thread may well win.
Moreover, even if an I/O operation completes immediately, the requesting thread must still wait out the timeout. When many I/O waits occur, the CPU-bound work may finish while the I/O threads queue up behind successive timeouts, leaving only I/O waits at the end.
This is called the "convoy effect": because the lock is only handed over after a timeout, overall throughput can suffer from the perspective of whole-process optimization.
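The mix of thread types above can be sketched in code. This is a rough, timing-dependent illustration, not a reliable reproduction of the convoy effect: the loop sizes and the 0.02-second waits are arbitrary choices, and the measured gap varies by machine.

```python
import time
from datetime import datetime
from threading import Thread

def cpu_bound():
    # Pure bytecode execution: holds the GIL while running
    n = 5_000_000
    while n > 0:
        n -= 1

def io_bound():
    # Many short waits: each wake-up must reacquire the GIL,
    # possibly sitting behind the CPU-bound thread each time
    for _ in range(10):
        time.sleep(0.02)

# Baseline: the I/O-bound thread on its own
start = datetime.now()
t = Thread(target=io_bound)
t.start()
t.join()
io_alone = (datetime.now() - start).total_seconds()

# The same I/O-bound thread competing with a CPU-bound thread
start = datetime.now()
cpu = Thread(target=cpu_bound)
io = Thread(target=io_bound)
cpu.start()
io.start()
cpu.join()
io.join()
mixed = (datetime.now() - start).total_seconds()
print(f"I/O alone: {io_alone:.3f}s, with CPU thread: {mixed:.3f}s")
```

On many machines the mixed run's I/O completion is delayed beyond the pure sum of its sleeps, since each wake-up may wait for a GIL handoff.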
As many of you know, CPU-bound processing can be parallelized by using multiple processes instead of threads. This works because each process holds its own interpreter, and the GIL exists per interpreter.
Let's run the earlier multi-threaded countdown as multiple processes instead.
countdown.py
```python
from datetime import datetime
from multiprocessing import Process

def countdown():
    n = 10000000
    while n > 0:
        n -= 1

if __name__ == '__main__':
    start = datetime.now()
    t1 = Process(target=countdown)
    t2 = Process(target=countdown)
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    end = datetime.now()
    print(f"Time: {end - start}")
```
The processing that took about 1.1 seconds with multiple threads now takes about 0.65 seconds with multiple processes. Even CPU-bound work can be executed in parallel this way.
Although it carries more overhead than threads, multiprocessing still allows values to be shared between processes through explicit mechanisms, and is useful when executing CPU-bound processing in parallel.
At the time of writing, sub-interpreters were tentatively implemented in the just-released Python 3.8. Sub-interpreters are proposed in PEP 554, which has not yet been accepted.
As mentioned earlier, the GIL exists per interpreter, and sub-interpreters allow multiple interpreters within the same process.
It is a promising idea, but since CPython keeps state at the runtime level, moving that state into each interpreter apparently still involves many problems.
You can actually try it by upgrading to Python 3.8 and importing _xxsubinterpreters, but it is probably still difficult to use at the production level.
As a methodological aside that strays from the main point of explaining the GIL: in current Python, when many I/O waits occur, it may be more practical to use the event loop provided by `asyncio`.
`asyncio` offers I/O multiplexing that processes many I/O operations efficiently on a single thread, delivering benefits close to those of multithreading.
Besides using less memory than multithreading, there is no need to reason about lock acquisition and release across threads, and native coroutines with async/await can be written intuitively, reducing the cognitive load on programmers.
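A minimal sketch: three simulated one-second I/O waits complete in roughly one second overall on a single thread (the coroutine names are arbitrary, and `asyncio.sleep` stands in for a real network or database call):

```python
import asyncio
from datetime import datetime

async def fetch(i):
    # Simulated I/O wait; in a real BFF this would be an HTTP or DB call
    await asyncio.sleep(1.0)
    return i

async def main():
    # gather runs the coroutines concurrently on a single thread
    return await asyncio.gather(*(fetch(i) for i in range(3)))

start = datetime.now()
results = asyncio.run(main())  # asyncio.run requires Python 3.7+
elapsed = (datetime.now() - start).total_seconds()
print(results, f"{elapsed:.2f}s")  # [0, 1, 2] in about 1 second
```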
Python coroutines will be covered in detail in a separate article.
The second half wandered a bit, but this article has covered the main topics related to the GIL.
It may not be something you think about daily when writing application-level code, but I felt that knowing what constraints apply when performing parallel processing is useful for language selection, so I wrote this article.