I tried to make full use of the CPU core in Ruby

I've been a programmer for Tabelog since 2008, and I'm still a programmer Oishii Tsukasa. The reason why the name is hiragana is to prevent people in the company from revealing their activities in an era when it was rare to be active on the Internet. I have a black history that had a homepage called "Tsukasa's Room" in 1995. It is not great? Wouldn't it be super bad if someone in the company found out? : cat2: I tried to put a cat emoji without any meaning. I tried to play a little. Maybe I'm not going to read it anymore?

Make full use of CPU core in Ruby

Even if thread programming is done in Ruby, the core of the CPU cannot be utilized. This is because Ruby always has only one thread running at the same time due to Giant VM lock (GVL). When waiting for I / O, the GVL is released, allowing multiple threads to run at the same time. It will work effectively when you hit multiple URLs to fetch data.

This is not the case when performing numerical calculations. For example, let's say you want to calculate the product of two 512x512 matrices.

#Make two 512x512 matrices
def build(num, max)
  m = []
  
  num.times do |i|
    m[i] = []
    num.times do |j|
      m[i][j] = rand(max)
    end
  end
  
  m
end

num = 512
max = 256

a = build(num, max)
b = build(num, max)

Calculate the product of these two matrices, ʻa and b`. To be honest, the amount of calculation is $ O (n ^ {3}) $ because it requires triple loops.

def m_and(a, b, num)
  m = []
  
  num.times do |i|
    m[i] = []
    num.times do |j|
      m[i][j] = 0

      num.times do |k|
        m[i][j] += a[i][k] * b[k][j]
      end
    end
  end
  
  m
end

I will measure the processing time.

require 'benchmark'
puts Benchmark.realtime { m_and(a, b, num) }
18.133936062455177

It took about 18 seconds. The machine I ran has four CPU cores. Let's take a look at the CPU usage during processing.

CPU     %user     %nice   %system   %iowait    %steal     %idle
all     25.00      0.00      0.25      0.00      0.00     74.75
  0      0.00      0.00      0.00      0.00      0.00    100.00
  1      0.00      0.00      1.00      0.00      0.00     99.00
  2    100.00      0.00      0.00      0.00      0.00      0.00
  3      0.00      0.00      0.00      0.00      0.00    100.00

Only one core is doing its best.

Process with multiple threads

Let's process with multithreading. Calculation of one row x column can be performed independently in parallel, so let's divide the data in that unit and let each thread process it.

def t_m_and(a, b, num, t_size)
  m = []
  queue = Queue.new
  threads = []

  num.times do |i|
    m[i] = []
    num.times do |j|
      m[i][j] = 0
      queue.push([i, j])
    end
  end

  t_size.times do
    threads << Thread.new(queue) do |q|
      until q.empty?
        begin
          ti, tj = q.pop(true)
          num.times do |k|
            m[ti][tj] += a[ti][k] * b[k][tj]
          end
        rescue ThreadError
          raise unless q.empty?
        end
      end
    end
  end
  threads.each(&:join)

  m
end

Since there are 4 CPU cores, it is executed with 4 threads.

puts Benchmark.realtime { t_m_and(a, b, num, 4) }
18.22166531533003

This also took about 18 seconds. The CPU usage during execution was as follows.

CPU     %user     %nice   %system   %iowait    %steal     %idle
all     25.06      0.00      0.00      0.00      0.00     74.94
  0     29.70      0.00      0.99      0.00      0.00     69.31
  1     28.00      0.00      0.00      0.00      0.00     72.00
  2     22.00      0.00      0.00      0.00      0.00     78.00
  3     20.20      0.00      0.00      0.00      0.00     79.80

I'm using all the cores properly, but since only one thread can move at the same time, the total time is the same as when processing sequentially. I will manage to make full use of all the cores.

Write with C extension library

If you release GVL, you can run multiple threads at the same time. In this matrix calculation, the processing of each thread should not affect each other, so it can be expected that there is no problem even if GVL is released. In order to release GVL, it is necessary to write an extension library in C, so first write the process of matrix multiplication in C language.

static VALUE
and(VALUE self, VALUE a, VALUE b, VALUE num)
{
  long *ma, *mb, *mc;
  long n;
  VALUE result;

  n = NUM2LONG(num);
  ma = ary_to_cary(a, n);
  mb = ary_to_cary(b, n);

  mc = and_matrix(ma, mb, n);
  result = cary_to_ary(mc, n);

  ruby_xfree(ma);
  ruby_xfree(mb);
  ruby_xfree(mc);

  return result;
}

void
Init_fullcore_matrix(void)
{
  VALUE cFullcoreMatrix = rb_define_class("FullcoreMatrix", rb_cObject);

  rb_define_method(cFullcoreMatrix, "m_and", and, 3);
}

The two-dimensional array is converted to a one-dimensional array for ease of use. ʻAry_to_caryconverts ruby's Array to c'slong * (not a standard function, I wrote it for this, though I haven't posted it). The ʻand_matrix function is as follows.

static long *
and_matrix(long *a, long *b, long num)
{
  long *m;
  long i, j, k;
  long index;

  m = (long*)ruby_xmalloc(sizeof(long) * num * num);

  for(i = 0; i < num; i++) {
    for(j = 0; j < num; j++) {
      index = i * num + j;
      m[index] = 0;
      for(k = 0; k < num; k++) {
        m[index] += a[i * num + k] * b[k * num + j];
      }
    }
  }

  return m;
}

Let's do this.

fm = FullcoreMatrix.new 
puts Benchmark.realtime { fm.m_and(a, b, num) }
0.39124061167240143

Overwhelmingly fast! The process that took 18 seconds is now 391 milliseconds.

The CPU usage is as follows, only one core is working hard.

CPU     %user     %nice   %system   %iowait    %steal     %idle
all     25.00      0.00      0.25      0.00      0.00     74.75
  0      0.00      0.00      1.00      0.00      0.00     99.00
  1      0.00      0.00      0.00      0.00      0.00    100.00
  2    100.00      0.00      0.00      0.00      0.00      0.00
  3      0.00      0.00      0.00      0.00      0.00    100.00

Rewrite this code into multithreaded processing.

Make it multi-threaded

Since it is troublesome, I will write the process quickly with OpenMP.

static long *
and_matrix(long *a, long *b, long num)
{
  long *m;
  long i, j, k;
  long index;

  m = (long*)ruby_xmalloc(sizeof(long) * num * num);

  #pragma omp parallel for private(j, k, index)
  for(i = 0; i < num; i++) {
    for(j = 0; j < num; j++) {
      index = i * num + j;
      m[index] = 0;
      for(k = 0; k < num; k++) {
        m[index] += a[i * num + k] * b[k * num + j];
      }
    }
  }

  return m;
}

When I run it,

0.13279788196086884

Well, it's faster even though I haven't released GVL. That's because it doesn't go through Ruby Thread. Considering that the value is close to a quarter, it seems that the four cores are being used effectively. It's too fast to measure properly with sar, so let's calculate with a 2048x2048 matrix.

10.213115755468607

It takes 10 seconds as expected. The CPU usage is as follows.

CPU     %user     %nice   %system   %iowait    %steal     %idle
all     99.50      0.00      0.50      0.00      0.00      0.00
  0    100.00      0.00      0.00      0.00      0.00      0.00
  1    100.00      0.00      0.00      0.00      0.00      0.00
  2    100.00      0.00      0.00      0.00      0.00      0.00
  3    100.00      0.00      0.00      0.00      0.00      0.00

We are now making full use of all cores.

It seems like you're not making full use of all the cores in Ruby, you're just writing thread programming in C, but are you dreaming about it? I feel like I lost somehow, so I will try to make full use of all the cores after using Thread of Ruby. I will release GVL as originally planned.

Write a C extension library for releasing GVL

In order to release GVL, you need to write it in the C extension library, but in addition, you should not touch Ruby functions while releasing GVL. For this reason, we provide a method to perform one column x one row calculation by having two matrices converted to long * in the instance of the C extension library. It is an image that calls this method in the Thread process of Ruby.

typedef struct _matrix {
  long *a;
  long *b;
  long num;
} *matrix;

struct and_data {
  matrix m;
  long i;
  long j;
  long result;
};

...

static VALUE
and(VALUE self, VALUE i, VALUE j)
{
  matrix m;
  struct and_data data;

  Data_Get_Struct(self, struct _matrix, m);
  data.m = m;
  data.i = NUM2LONG(i);
  data.j = NUM2LONG(j);

  and_matrix(&data);

  return LONG2NUM(data.result);
}

static VALUE
new(VALUE klass, VALUE a, VALUE b, VALUE num)
{
  matrix m;
  VALUE obj;

  obj = Data_Make_Struct(klass, struct _matrix, NULL, destroy_matix, m);

  m->num = NUM2LONG(num);
  m->a = ary_to_cary(a, m->num);
  m->b = ary_to_cary(b, m->num);

  return obj;
}

void
Init_nonblock_matrix(void)
{
  VALUE cNBMatrix = rb_define_class("NonblockMatrix", rb_cObject);

  rb_define_singleton_method(cNBMatrix, "new", new, 3);
  rb_define_method(cNBMatrix, "m_and", and, 2);
}

I've omitted the destroy_matix function (just call it ruby_xfree).

The ʻand_matrix` function is as follows.

static void *
and_matrix(void *data)
{
  long i, j, num, k, result;
  long *a, *b;
  struct and_data *d;

  d = (struct and_data*)data;

  num = d->m->num;
  a = d->m->a;
  b = d->m->b;
  i = d->i;
  j = d->j;

  result = 0;
  for(k = 0; k < num; k++) {
    result += a[i * num + k] * b[k * num + j];
  }

  d->result = result;

  return NULL;
}

The reason why ʻand_matrix receives an argument with void * and returns a value with void * `is to make it easier to insert the GVL release process later.

This library is used in thread processing code written in Ruby.

require './nonblock_matrix'
def t_m_and2(a, b, num, t_size)
  m = []
  queue = Queue.new
  threads = []
  nb = NonblockMatrix.new(a, b, num) #Use C extension library

  num.times do |i|
    m[i] = []
    num.times do |j|
      queue.push([i, j])
    end
  end

  t_size.times do
    threads << Thread.new(queue) do |q|
      until q.empty?
        begin
          ti, tj = q.pop(true)
          m[ti][tj] = nb.m_and(ti, tj) #Calculation processing of one column x one row implemented in the C extension library
        rescue ThreadError
          raise unless q.empty?
        end
      end
    end
  end
  threads.each(&:join)

  m
end

I will try it.

0.48769768700003624

It's 488 milliseconds! Even just changing the process here to C has made it considerably faster. It is as fast as writing all the processing in C. Naturally, the threads run one by one, so the CPU usage is as follows.

CPU     %user     %nice   %system   %iowait    %steal     %idle
all     25.00      0.00      0.25      0.00      0.00     74.75
  0     26.00      0.00      1.00      0.00      0.00     73.00
  1     20.00      0.00      0.00      0.00      0.00     80.00
  2     30.00      0.00      0.00      0.00      0.00     70.00
  3     23.23      0.00      0.00      0.00      0.00     76.77

GVL release!

Finally, release GVL. Call the ʻand_matrixfunction implemented in the C extension library after releasing the GVL. To free the GVL and call a function, use therb_thread_call_without_gvl` function.

rb_thread_call_without_gvl(and_matrix, &data, RUBY_UBF_PROCESS, NULL);

The function specified by the rb_thread_call_without_gvl function must take an argument withvoid *and return a value withvoid *. It's easy because ʻand_matrix` defined the interface for this.

I will try it.

1.4591152295470238

That's late.

Let's look at the CPU usage.

CPU     %user     %nice   %system   %iowait    %steal     %idle
all     56.11      0.00     10.28      0.00      0.00     33.61
  0     60.44      0.00     14.29      0.00      0.00     25.27
  1     61.80      0.00      8.99      0.00      0.00     29.21
  2     42.86      0.00      9.89      0.00      0.00     47.25
  3     58.43      0.00      7.87      0.00      0.00     33.71

We didn't make full use of the CPU core, but we were able to exceed 60%. However, it seems that it is doing a lot of useless things because it is slow even though it uses CPU resources. The value of system is jumping up.

There was comment like below in thread.c of ruby.

 NOTE: Releasing GVL and re-acquiring GVL may be expensive operations
       for a short running `func()'. Be sure to benchmark and use this
       mechanism when `func()' consumes enough time.

It seems that the cost will be higher if you call a function that ends immediately like this time many times.

The end

By the way, it took 1467 seconds to calculate the 2048x2048 matrix with the first ruby code. : cat2:

Tomorrow's article is @ maguhiro's "I made a Bitrise Step for sending messages to MS Teams".

Recommended Posts

I tried to make full use of the CPU core in Ruby
I tried to make a parent class of a value object in Ruby
I tried to summarize the basic grammar of Ruby briefly
I tried to make a client of RESAS-API in Java
I tried to solve the problem of "multi-stage selection" with Ruby
I tried to make Numeron which is not good in Ruby
I want to change the value of Attribute in Selenium of Ruby
I tried to organize the session in Rails
I want to use arrow notation in Ruby
I tried to make the "Select File" button of the sample application created in the Rails tutorial cool
I tried to solve the tribonacci sequence problem in Ruby, with recursion.
[Ruby] I tried to summarize the methods that frequently appear in paiza
[Ruby] I tried to summarize the methods that frequently appear in paiza ②
I want to get the value in Ruby
I tried to make a sample program using the problem of database specialist in Domain Driven Design
Use hashes well in Ruby to calculate the total amount of an order
I tried to solve the tribonatch sequence problem in Ruby (time limit 10 minutes)
I tried to organize the cases used in programming
I tried to summarize the state transition of docker
05. I tried to stub the source of Spring Boot
I tried to make a login function in Java
I tried to implement the Euclidean algorithm in Java
Since the du command used when the capacity is full is difficult to use, I tried wrapping it with ruby
[Ruby] I want to make a program that displays today's day of the week!
I tried to make an application in 3 months from inexperienced
I tried to summarize the basics of kotlin and java
[Swift] I tried to implement the function of the vending machine
What I did in the version upgrade from Ruby 2.5.2 to 2.7.1
I tried to build the environment of WSL2 + Docker + VSCode
[Ruby] I want to reverse the order of the hash table
I finished watching The Rose of Versailles, so I tried to reproduce the ending song in Java
I tried to solve the Ruby karaoke machine problem (there is an example of the answer)
I tried to explain the method
I tried to solve the Ruby bonus drink problem (there is an example of the answer)
I tried to develop the cache function of Application Container Cloud Service in the local environment
I tried to summarize the words that I often see in docker-compose.yml
I tried to illuminate the Christmas tree in a life game
I tried to sort the data in descending order, ascending order / Rails
I tried to build the environment of PlantUML Server with Docker
I tried to write code like a type declaration in Ruby
[Ruby] Tonight, I tried to summarize the loop processing [times, break ...]
I tried setting Java beginners to use shortcut keys in eclipse
I tried to check the operation of gRPC server with grpcurl
I tried to summarize the methods of Java String and StringBuilder
I tried to solve the problem of Google Tech Dev Guide
I tried to make it possible to set the delay for the UDP client of Android by myself
I tried to solve the Ruby bingo card creation problem (there is an example of the answer)
After learning Progate, I tried to make an SNS application using Rails in the local environment
I tried a calendar problem in Ruby
I tried to summarize the methods used
I tried the new era in Java
I tried to implement the Iterator pattern
I tried to summarize the Stream API
I tried the AutoValue library in Intellij
I tried to build Ruby 3.0.0 from source
I want to use @Autowired in Servlet
I tried to use Selenium like JQuery
[Ruby] I want to output only the odd-numbered characters in the character string
[IOS] Make full use of CIFilter to convert negative film photos to negative / positive
Procedure to make the value of the property file visible in Spring Boot
[Ruby] I tried to diet the if statement code with the ternary operator