- I want to process a large amount of data in parallel.
- I want to make effective use of the CPU, since it has many cores.
- I already wrote the script as a quick first pass, and I can't afford to rewrite it for parallel execution.
After researching various options, subprocess.Popen() looked easy to use, so I experimented with it.
Reference article: https://qiita.com/HidKamiya/items/e192a55371a2961ca8a4
Environment: Windows 10 (64-bit), Python 3.7.6, CPU: AMD Ryzen 9 3950X (16 cores / 32 threads).
The experiments use the code below, which prints only the last two digits of the Fibonacci number specified by a command-line argument.
fib_sample.py
import sys

def fib(n):
    a, b = 0, 1
    for i in range(n):
        a, b = b, a + b
    return a

if __name__ == "__main__":
    num = int(sys.argv[1])
    result = fib(num)
    print("n = {0}, result % 100 = {1}".format(num, result % 100))
For example, running python fib_sample.py 10000 prints n = 10000, result % 100 = 75 and exits.
First, let's execute it sequentially with subprocess.run(). Each iteration invokes python with an argument via subprocess.run(['python', r".\fib_sample.py", str(500000 + i)]). Varying the command-line argument from 500000 to 500063 executes it 64 times:
batch_sequential.py
from time import time
import subprocess

start = time()
loop_num = 64
for i in range(loop_num):
    subprocess.run(['python', r".\fib_sample.py", str(500000 + i)])
end = time()
print("%f sec" % (end - start))
> python .\batch_sequential.py
n = 500000, result % 100 = 25
n = 500001, result % 100 = 26
n = 500002, result % 100 = 51
(Omitted)
n = 500061, result % 100 = 86
n = 500062, result % 100 = 31
n = 500063, result % 100 = 17
130.562213 sec
It took a little over two minutes. Since only one process runs at a time, most of the CPU cores are clearly sitting idle.
Now let's execute the same processing in parallel with subprocess.Popen(). Unlike subprocess.run(), subprocess.Popen() does not wait for the spawned process to finish. The following code repeats the cycle: launch max_process processes, wait for all of them to finish, then launch the next batch, and so on.
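As a minimal illustration of that difference (my own snippet, not from the reference article):

import subprocess

# run() blocks: this call does not return until the child process exits.
subprocess.run(['python', '-c', 'print("run() finished")'])

# Popen() returns immediately; the child keeps running in the background
# until we explicitly wait() for it.
proc = subprocess.Popen(['python', '-c', 'print("Popen() finished")'])
proc.wait()  # block here until the child exits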
batch_parallel.py
from time import time
import subprocess

start = time()
# Maximum number of processes to run in parallel
max_process = 16
proc_list = []
loop_num = 64
for i in range(loop_num):
    proc = subprocess.Popen(['python', r".\fib_sample.py", str(500000 + i)])
    proc_list.append(proc)
    if (i + 1) % max_process == 0 or (i + 1) == loop_num:
        # Wait for every process in the current batch of max_process to finish
        for subproc in proc_list:
            subproc.wait()
        proc_list = []
end = time()
print("%f sec" % (end - start))
Here is the result of running 16 processes in parallel, matching the number of physical cores of the Ryzen 9 3950X.
> python .\batch_parallel.py
n = 500002, result % 100 = 51
n = 500004, result % 100 = 28
n = 500001, result % 100 = 26
(Omitted)
n = 500049, result % 100 = 74
n = 500063, result % 100 = 17
n = 500062, result % 100 = 31
8.165289 sec
Because the processes run in parallel, they finish out of order. The execution time dropped from 130.562 seconds to 8.165 seconds, almost exactly 16 times faster. You can see that all the cores are in use and the processes really do run in parallel.
Incidentally, running 32 processes in parallel, matching the logical core count rather than the physical core count, is no faster; if anything, it is sometimes slower. The graph below shows the average execution time over three runs for each level of parallelism. Various applications were running in the background, so the numbers are not very precise, but I believe the trend is correct.
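The measurements for the graph could be collected with a loop like the following (my own sketch, not the article's actual script; it reuses the same batching logic as batch_parallel.py above):

from time import time
import subprocess

loop_num = 64

def run_batch(max_process):
    # Same batching logic as batch_parallel.py: launch max_process
    # children, wait for all of them, then launch the next batch.
    start = time()
    proc_list = []
    for i in range(loop_num):
        proc = subprocess.Popen(['python', r".\fib_sample.py", str(500000 + i)])
        proc_list.append(proc)
        if (i + 1) % max_process == 0 or (i + 1) == loop_num:
            for subproc in proc_list:
                subproc.wait()
            proc_list = []
    return time() - start

for max_process in [1, 2, 4, 8, 16, 32]:
    # Run each parallelism level three times and average the wall-clock time
    times = [run_batch(max_process) for _ in range(3)]
    print("max_process = %2d: %.3f sec (average of 3 runs)" % (max_process, sum(times) / 3))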
Speeding this up turned out to be fairly easy. With the code above, each batch waits for its slowest process before the next batch starts, so if one process happens to take a long time, the others sit idle and the waiting overhead grows. Ideally, the code would keep the degree of parallelism constant by launching the next process as soon as any running process finishes; a sketch of that idea follows below. In reality, though, my goal was to apply the same signal processing to a large number of data files of the same length, so I expected the per-process times to be roughly equal and let this slide. For now, I'm satisfied, since the goal of speeding things up was achieved easily.
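As a sketch of that "launch the next process as soon as one finishes" approach (my own code, not from the article), a thread pool with max_process workers, each blocking on one subprocess.run() call, keeps the number of running child processes constant:

from time import time
import subprocess
from concurrent.futures import ThreadPoolExecutor

max_process = 16
loop_num = 64

def run_one(n):
    # subprocess.run() blocks this worker thread until its child exits,
    # so the pool keeps at most max_process child processes alive at once.
    subprocess.run(['python', r".\fib_sample.py", str(n)])

start = time()
with ThreadPoolExecutor(max_workers=max_process) as executor:
    # map() hands each worker the next argument as soon as it becomes free,
    # so one slow process no longer holds up an entire batch.
    executor.map(run_one, range(500000, 500000 + loop_num))
    # leaving the with-block waits for all remaining tasks to finish
end = time()
print("%f sec" % (end - start))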