This article provides an introduction to parallel processing. It also touches on Ractor, the new parallel execution unit being developed for Ruby.
First, I will summarize the terms that are often confused when discussing this topic.
In **parallel processing**, multiple tasks are running at the same moment in time. **Concurrent processing** runs multiple tasks in turn by time-slicing; unlike parallel processing, only one task is actually running at any given moment.
If the timing at which each task runs is plotted in chronological order, it looks like the image below. (A task is running only where the blue line is drawn.)
This article deals with the behavior of parallel processing, but be aware that even if you write code for parallel processing, it may end up behaving as concurrent processing. (For example, a single-core CPU cannot run two or more tasks in parallel.) The OS and the VM handle this scheduling for you.
In general, there are two main ways to achieve parallel processing: **multi-process** and **multi-thread**. Multi-process creates multiple processes and has each process execute one task at a time. Multi-thread creates multiple threads within a single process and has each thread execute one task at a time.
In the multi-process case, each process has its own separate memory space, so it is basically impossible to pass variables between processes. This is also safer, since it prevents unintended interactions between processes via shared memory. The disadvantage is that, because every process has its own memory space, total memory usage tends to grow. (On Linux, however, memory is shared between processes as much as possible through a mechanism called copy-on-write.)
In the multi-thread case, one process holds multiple threads, so memory is shared between the threads. Memory usage can therefore be kept down, and depending on the implementation, creating and switching threads is cheaper than creating and switching processes. However, because threads can affect each other through shared memory, bugs such as data races tend to occur. In general, multithreaded programming has many pitfalls and is hard to implement correctly.
The unit in which work is executed in parallel is called the **parallel execution unit**. For multi-process it is the process; for multi-thread it is the thread.
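To make the difference concrete, here is a minimal sketch using only Ruby's standard library (no gems); it assumes a Unix-like OS where `Process.fork` is available:

```ruby
# Multi-process: the child gets its own copy of the memory space.
counter = 0
pid = Process.fork do
  counter += 1          # modifies the child's copy only
end
Process.wait(pid)
puts counter            # => 0 (the parent's copy is unchanged)

# Multi-thread: threads share the process's memory space.
threads = 4.times.map do
  Thread.new { counter += 1 }  # all threads mutate the same variable
end
threads.each(&:join)
puts counter            # => 4 (the shared variable was updated)
```

The same variable is invisible to the child process but fully shared between threads, which is exactly the trade-off described above.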
There are two main ways to implement threads: **native threads** and **green threads**. Native threads realize multithreading by using the OS's thread implementation directly. Since thread scheduling (deciding which thread runs now) is left to the OS, the language implementation stays simple. On the other hand, thread creation and switching (so-called context switching) is relatively heavy. (Strictly speaking, "native thread" is a concept that combines kernel threads and lightweight processes, but I will omit the details; native threads and kernel threads are often conflated.)
Green threads are threads implemented inside the language's virtual machine (for example, CRuby's YARV or Java's JVM) rather than by the OS. Go's goroutines are a kind of green thread, famous for how lightweight they are. CRuby used green threads before 1.9 but has since switched to native threads. Green threads are also called user threads.
As an example, here is how to implement parallel processing in Ruby. With the parallel gem, parallel processing is easy to write.
The multi-process code looks like this:
multi_process.rb

```ruby
require 'parallel'

Parallel.each(1..10, in_processes: 10) do |i|
  sleep 10
  puts i
end
```
If you run this code and look at the process list, it looks like the following: one main process and 10 child processes.
```
$ ps aux | grep ruby
PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND PRI STIME UTIME
79050 9.7 0.1 4355568 14056 s005 S+ 2:39PM 0:00.28 ruby mp.rb
79072 0.0 0.0 4334968 1228 s005 S+ 2:39PM 0:00.00 ruby mp.rb
79071 0.0 0.0 4334968 1220 s005 S+ 2:39PM 0:00.00 ruby mp.rb
79070 0.0 0.0 4334968 1244 s005 S+ 2:39PM 0:00.00 ruby mp.rb
79069 0.0 0.0 4334968 1244 s005 S+ 2:39PM 0:00.00 ruby mp.rb
79068 0.0 0.0 4334968 1172 s005 S+ 2:39PM 0:00.00 ruby mp.rb
79067 0.0 0.0 4334968 1180 s005 S+ 2:39PM 0:00.00 ruby mp.rb
79066 0.0 0.0 4334968 1208 s005 S+ 2:39PM 0:00.00 ruby mp.rb
79065 0.0 0.0 4334968 1252 s005 S+ 2:39PM 0:00.00 ruby mp.rb
79064 0.0 0.0 4334968 1168 s005 S+ 2:39PM 0:00.00 ruby mp.rb
79063 0.0 0.0 4334968 1168 s005 S+ 2:39PM 0:00.00 ruby mp.rb
```
The multithreaded code looks like this:
multi_threads.rb

```ruby
require 'parallel'

Parallel.each(1..10, in_threads: 10) do |i|
  sleep 10
  puts i
end
```
Let's look at the thread list here as well. Adding `-L` to the `ps` command makes each thread appear like a process: without `-L` only one process is shown, but with `-L` 11 lines are displayed. In addition, the `NLWP` column shows the number of threads in the process; since it is 11 (1 main thread + 10 worker threads), we can confirm that multithreading is being used.
```
$ ps aux | grep mt.rb
4419 1.0 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
$ ps aux -L | grep mt.rb
PID LWP %CPU NLWP %MEM VSZ RSS TTY STAT START TIME COMMAND
4419 4419 6.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
4419 4453 0.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
4419 4454 0.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
4419 4455 0.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
4419 4456 0.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
4419 4457 0.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
4419 4458 0.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
4419 4460 0.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
4419 4461 0.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
4419 4462 0.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
4419 4463 0.0 11 0.6 850176 12384 pts/1 Sl+ 15:41 0:00 ruby mt.rb
```
In multithreaded processing, various problems can occur because multiple threads execute in parallel while sharing memory. One of the main problems is the **data race**.
Data races can occur with code like the one below. This code tries to compute the sum of the integers from 1 to 10, but because of a data race it may not compute the sum correctly.
```ruby
require 'parallel'

sum = 0
Parallel.each(1..10, in_threads: 10) do |i|
  add = sum + i
  sum = add
end
puts sum
```
In this code every thread shares the variable `sum`, and the threads read and write `sum` concurrently. As a result, a value written by one thread may be overwritten by another thread, so the code above may fail to compute the sum correctly.
A common way to solve data races is to take an exclusive lock between threads.
```ruby
require 'parallel'

sum = 0
m = Mutex.new
Parallel.each(1..10, in_threads: 10) do |i|
  m.lock
  add = sum + i
  sum = add
  m.unlock
end
puts sum
```
As a result, only the thread holding the lock can run the critical section at any one time, which eliminates the data race.
Code that properly handles these issues and works correctly under multithreading is called **thread-safe**.
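Incidentally, in practice `Mutex#synchronize` is usually preferred over explicit `lock`/`unlock`, because it releases the lock even if the block raises an exception. A sketch of the same summation using only the standard library (plain `Thread` instead of the parallel gem):

```ruby
sum = 0
m = Mutex.new

threads = (1..10).map do |i|
  Thread.new do
    # synchronize acquires the lock, runs the block, and always
    # releases the lock afterwards, even on exceptions.
    m.synchronize { sum += i }
  end
end
threads.each(&:join)

puts sum  # => 55
```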
The **GIL** (Global Interpreter Lock) often comes up when discussing multithreading in scripting languages such as Ruby and Python. In Ruby, by the way, it is called the GVL (Giant VM Lock).
The GIL performs exclusive control between threads so that multiple threads never execute at the same time: only one thread can run at any moment within a single interpreter/VM. The reasons for and benefits of this design include the following.
Thanks to the GIL, the multithreaded Ruby code shown earlier tends to work even without a Mutex (strictly speaking, the GVL does not make every compound operation atomic, but it greatly reduces the chance of races). This behavior fits Ruby's basic philosophy of making programming easier.
However, the fact that only one thread can run at a time means that true parallel processing is impossible. This is why Ruby and Python are often said to be unsuited to parallel computation.
As an exception, a thread releases the GIL while waiting for I/O, so multiple threads can make progress at the same time. For this reason, multithreading is practically useful even under a GIL for I/O-heavy workloads such as web servers.
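This effect is easy to observe: `sleep`, like other blocking I/O, releases the GVL, so several threads can wait simultaneously. A small sketch (the 0.5-second duration is just an illustrative choice):

```ruby
require 'benchmark'

# Four threads each "wait on I/O" (simulated with sleep) for 0.5s.
# Because the GVL is released while sleeping, the waits overlap and
# the total wall-clock time is roughly 0.5s rather than 2s.
elapsed = Benchmark.realtime do
  4.times.map { Thread.new { sleep 0.5 } }.each(&:join)
end
puts format('4 x 0.5s sleeps took %.2fs', elapsed)
```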
Since an HTTP server usually needs to handle many requests at the same time, it is often implemented with parallel processing. Typical HTTP servers in Ruby are **unicorn** and **puma**; the former is a multi-process implementation and the latter is multi-threaded.
Performance of unicorn and puma is compared in this blog.
The conclusions of that blog are convincing given the mechanisms described above.
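For reference, puma can even combine both models in its "clustered mode". A sketch of a typical `config/puma.rb` (the worker and thread counts are arbitrary example values):

```ruby
# config/puma.rb
workers 2          # 2 worker processes (multi-process)
threads 1, 5       # 1 to 5 threads per worker (multi-thread)
preload_app!       # load the app before forking to benefit from copy-on-write
```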
So far we have covered how parallel processing is realized, shown implementations in Ruby, and looked at their performance. Multithreading in Ruby cannot reach its full performance because of the GVL. Ractor (formerly called Guild) is a new parallel processing mechanism for Ruby, created to solve this problem.
Ractor achieves true parallel performance while keeping the ease of thread programming that the GVL traditionally provided.
Let me explain the mechanism.
Data races occur because threads share memory, which lets multiple threads read and write the same variable. Ractor solves this by introducing a new parallel execution unit, the Ractor. A Ruby process has one or more Ractors, and each Ractor has one or more threads. Since each Ractor operates on a separate memory space, the memory-sharing problems of conventional threads do not arise.
Source: https://www.slideshare.net/KoichiSasada/guild-prototype
In addition, Ruby code written before the introduction of Ractor remains backward compatible, because it simply runs inside a single Ractor.
Since Ractors do not share memory, passing information between them might seem cumbersome. To solve this, there is a feature called a `channel` that enables communication between Ractors. Objects you want to share can only be passed via the `channel`.
Objects are classified into **shareable objects** and **non-shareable objects**.
A shareable object is one, such as a frozen (read-only) constant, that cannot cause data races even when shared between Ractors. Shareable objects can be passed freely through the channel.
Non-shareable objects are ordinary mutable objects. Passing such an object through the channel triggers either a deep copy or move semantics. With a deep copy, copying cost and memory usage increase, but it is as safe and easy to reason about as multi-process. With move semantics, ownership of the object is transferred to the other Ractor: the original Ractor can no longer reference the object, but unlike a deep copy, processing cost and memory usage stay low.
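As released in Ruby 3.x, the communication API is `Ractor#send` / `Ractor.receive` / `Ractor#take` rather than an explicit channel object (the "channel" naming comes from the earlier Guild prototype). A minimal sketch of copy vs. move, noting that Ractor is experimental and prints a warning on first use:

```ruby
# Deep copy (the default for non-shareable objects):
r = Ractor.new do
  numbers = Ractor.receive  # the sent array arrives as a deep copy
  numbers.sum               # the block's result is retrieved with #take
end
data = [1, 2, 3]
r.send(data)                # `data` remains usable in this Ractor
sum = r.take
puts sum                    # => 6

# Move semantics (ownership is transferred, nothing is copied):
r2 = Ractor.new { Ractor.receive.upcase }
s = "hello"
r2.send(s, move: true)      # `s` can no longer be used in this Ractor
upcased = r2.take
puts upcased                # => "HELLO"
```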
In summary: shareable (immutable) objects can be referenced freely from any Ractor, while non-shareable (mutable) objects must be deep-copied or moved when passed between Ractors. In this way, Ractor enables easy multithreaded-style programming while maintaining thread safety.
Ractor is a parallel execution unit positioned between processes and threads. By having the developer choose exactly which information to share between Ractors, parallel processing can be achieved without the RAM overhead of multi-process and without the GIL-induced performance loss of multi-threading.
Ractor is attracting a lot of attention as a new feature of Ruby 3. Ractor itself is still under development, and it will be a while before it is within reach of ordinary Ruby users. In the future, Ruby's multithreaded libraries may well be reimplemented on top of Ractor, and the day when Ractor-based HTTP servers replace puma as the mainstream may not be far off.
I wrote this article as a study summary. I would appreciate it if you could point out any mistakes!