I've been a programmer at Tabelog since 2008, and I'm still a programmer: Oishii Tsukasa. I write my name in hiragana so that people at the company won't discover my online activities, a habit from an era when being active on the Internet was still unusual. I also have the embarrassing past of running a homepage called "Tsukasa's Room" back in 1995. Pretty great, right? Wouldn't it be super awkward if someone at work found out? :cat2: I put in a cat emoji for no particular reason, just to play around a little. Maybe nobody is even reading this far?
Even if you write multithreaded code in Ruby, you cannot take advantage of multiple CPU cores. This is because of the Giant VM Lock (GVL): Ruby only ever runs one thread at a time. The GVL is released while a thread is waiting on I/O, so in that case multiple threads can make progress simultaneously. Threads work well, for example, when you are fetching data from multiple URLs.
Numerical calculations are a different story. For example, let's say you want to calculate the product of two 512x512 matrices.
# Make two 512x512 matrices
def build(num, max)
  m = []
  num.times do |i|
    m[i] = []
    num.times do |j|
      m[i][j] = rand(max)
    end
  end
  m
end

num = 512
max = 256
a = build(num, max)
b = build(num, max)
Let's calculate the product of these two matrices, `a` and `b`. Implemented straightforwardly, the computation cost is $O(n^{3})$ because it requires a triple loop.
def m_and(a, b, num)
  m = []
  num.times do |i|
    m[i] = []
    num.times do |j|
      m[i][j] = 0
      num.times do |k|
        m[i][j] += a[i][k] * b[k][j]
      end
    end
  end
  m
end
I will measure the processing time.
require 'benchmark'
puts Benchmark.realtime { m_and(a, b, num) }
18.133936062455177
It took about 18 seconds. The machine I ran it on has four CPU cores. Let's take a look at the CPU usage during processing.
CPU %user %nice %system %iowait %steal %idle
all 25.00 0.00 0.25 0.00 0.00 74.75
0 0.00 0.00 0.00 0.00 0.00 100.00
1 0.00 0.00 1.00 0.00 0.00 99.00
2 100.00 0.00 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 100.00
Only one core is doing its best.
Now let's process it with multiple threads. The calculation of each row x column element can be performed independently, so let's divide the work into those units and have each thread process them.
def t_m_and(a, b, num, t_size)
  m = []
  queue = Queue.new
  threads = []
  num.times do |i|
    m[i] = []
    num.times do |j|
      m[i][j] = 0
      queue.push([i, j])
    end
  end
  t_size.times do
    threads << Thread.new(queue) do |q|
      until q.empty?
        begin
          ti, tj = q.pop(true)
          num.times do |k|
            m[ti][tj] += a[ti][k] * b[k][tj]
          end
        rescue ThreadError
          raise unless q.empty?
        end
      end
    end
  end
  threads.each(&:join)
  m
end
Since there are four CPU cores, let's run it with four threads.
puts Benchmark.realtime { t_m_and(a, b, num, 4) }
18.22166531533003
This also took about 18 seconds. The CPU usage during execution was as follows.
CPU %user %nice %system %iowait %steal %idle
all 25.06 0.00 0.00 0.00 0.00 74.94
0 29.70 0.00 0.99 0.00 0.00 69.31
1 28.00 0.00 0.00 0.00 0.00 72.00
2 22.00 0.00 0.00 0.00 0.00 78.00
3 20.20 0.00 0.00 0.00 0.00 79.80
All the cores are being used, but since only one thread can run at a time, the total time is the same as the sequential version. I want to find a way to make full use of every core.
If you release the GVL, multiple threads really can run at the same time. In this matrix calculation the threads' work does not affect each other, so it should be safe to release the GVL. Releasing the GVL requires writing a C extension library, so first let's write the matrix multiplication itself in C.
static VALUE
and(VALUE self, VALUE a, VALUE b, VALUE num)
{
    long *ma, *mb, *mc;
    long n;
    VALUE result;

    n = NUM2LONG(num);
    ma = ary_to_cary(a, n);
    mb = ary_to_cary(b, n);
    mc = and_matrix(ma, mb, n);
    result = cary_to_ary(mc, n);
    ruby_xfree(ma);
    ruby_xfree(mb);
    ruby_xfree(mc);
    return result;
}
void
Init_fullcore_matrix(void)
{
    VALUE cFullcoreMatrix = rb_define_class("FullcoreMatrix", rb_cObject);
    rb_define_method(cFullcoreMatrix, "m_and", and, 3);
}
The two-dimensional arrays are flattened into one-dimensional arrays to make them easier to handle in C. `ary_to_cary` converts a Ruby Array into a C `long *`, and `cary_to_ary` converts it back; neither is a standard function (I wrote them for this article but haven't included them here; a rough sketch appears after the next listing). The `and_matrix` function is as follows.
static long *
and_matrix(long *a, long *b, long num)
{
    long *m;
    long i, j, k;
    long index;

    m = (long*)ruby_xmalloc(sizeof(long) * num * num);
    for(i = 0; i < num; i++) {
        for(j = 0; j < num; j++) {
            index = i * num + j;
            m[index] = 0;
            for(k = 0; k < num; k++) {
                m[index] += a[i * num + k] * b[k * num + j];
            }
        }
    }
    return m;
}
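For reference, `ary_to_cary` and `cary_to_ary` might look roughly like this. This is only an illustration of the flattened `long *` layout used above, not the exact helper code behind the measurements in this article:

/* Illustrative sketches only -- the actual helpers are not included in this article. */
static long *
ary_to_cary(VALUE ary, long num)
{
    long *m;
    long i, j;
    VALUE row;

    m = (long*)ruby_xmalloc(sizeof(long) * num * num);
    for(i = 0; i < num; i++) {
        row = rb_ary_entry(ary, i);
        for(j = 0; j < num; j++) {
            m[i * num + j] = NUM2LONG(rb_ary_entry(row, j));
        }
    }
    return m;
}

static VALUE
cary_to_ary(long *m, long num)
{
    VALUE ary, row;
    long i, j;

    ary = rb_ary_new_capa(num);
    for(i = 0; i < num; i++) {
        row = rb_ary_new_capa(num);
        for(j = 0; j < num; j++) {
            rb_ary_push(row, LONG2NUM(m[i * num + j]));
        }
        rb_ary_push(ary, row);
    }
    return ary;
}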
Now let's run the C version.
fm = FullcoreMatrix.new
puts Benchmark.realtime { fm.m_and(a, b, num) }
0.39124061167240143
Overwhelmingly fast! The process that took 18 seconds is now 391 milliseconds.
The CPU usage is as follows; as before, only one core is doing all the work.
CPU %user %nice %system %iowait %steal %idle
all 25.00 0.00 0.25 0.00 0.00 74.75
0 0.00 0.00 1.00 0.00 0.00 99.00
1 0.00 0.00 0.00 0.00 0.00 100.00
2 100.00 0.00 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 100.00
Now let's rewrite this code to run on multiple threads.
Writing the thread handling by hand is a hassle, so I'll parallelize it quickly with OpenMP (note that the extension then needs to be compiled with an OpenMP flag such as gcc's `-fopenmp`).
static long *
and_matrix(long *a, long *b, long num)
{
    long *m;
    long i, j, k;
    long index;

    m = (long*)ruby_xmalloc(sizeof(long) * num * num);
    #pragma omp parallel for private(j, k, index)
    for(i = 0; i < num; i++) {
        for(j = 0; j < num; j++) {
            index = i * num + j;
            m[index] = 0;
            for(k = 0; k < num; k++) {
                m[index] += a[i * num + k] * b[k * num + j];
            }
        }
    }
    return m;
}
When I run it,
0.13279788196086884
Huh, it's faster even though I haven't released the GVL. That's because OpenMP's threads don't go through Ruby's Thread at all, so the GVL doesn't constrain them. Given that the time dropped to nearly a quarter, the four cores seem to be used effectively.
It finishes too quickly to measure properly with sar, so let's try a 2048x2048 matrix instead.
10.213115755468607
It takes 10 seconds as expected. The CPU usage is as follows.
CPU %user %nice %system %iowait %steal %idle
all 99.50 0.00 0.50 0.00 0.00 0.00
0 100.00 0.00 0.00 0.00 0.00 0.00
1 100.00 0.00 0.00 0.00 0.00 0.00
2 100.00 0.00 0.00 0.00 0.00 0.00
3 100.00 0.00 0.00 0.00 0.00 0.00
We are now making full use of all cores.
But this feels less like making full use of all the cores in Ruby and more like simply doing thread programming in C, doesn't it? Somehow that feels like losing, so next I'll try to use every core while still going through Ruby's Thread. As originally planned, I'll release the GVL.
Releasing the GVL requires a C extension library, and in addition you must not touch Ruby functions while the GVL is released.
For this reason, the C extension keeps the two matrices inside its instance, already converted to `long *`, and exposes a method that computes a single row x column element. The idea is to call this method from Ruby's Thread code.
typedef struct _matrix {
    long *a;
    long *b;
    long num;
} *matrix;

struct and_data {
    matrix m;
    long i;
    long j;
    long result;
};
...
static VALUE
and(VALUE self, VALUE i, VALUE j)
{
    matrix m;
    struct and_data data;

    Data_Get_Struct(self, struct _matrix, m);
    data.m = m;
    data.i = NUM2LONG(i);
    data.j = NUM2LONG(j);
    and_matrix(&data);
    return LONG2NUM(data.result);
}
static VALUE
new(VALUE klass, VALUE a, VALUE b, VALUE num)
{
    matrix m;
    VALUE obj;

    obj = Data_Make_Struct(klass, struct _matrix, NULL, destroy_matix, m);
    m->num = NUM2LONG(num);
    m->a = ary_to_cary(a, m->num);
    m->b = ary_to_cary(b, m->num);
    return obj;
}
void
Init_nonblock_matrix(void)
{
    VALUE cNBMatrix = rb_define_class("NonblockMatrix", rb_cObject);
    rb_define_singleton_method(cNBMatrix, "new", new, 3);
    rb_define_method(cNBMatrix, "m_and", and, 2);
}
I've omitted the `destroy_matix` function (it just calls `ruby_xfree`).
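For completeness, here is a rough sketch of what it might look like, freeing the two flattened matrices along with the struct itself; this is just an illustration, not the exact code used for the measurements:

/* Illustrative sketch -- the actual destroy_matix is omitted in this article. */
static void
destroy_matix(void *ptr)
{
    matrix m = (matrix)ptr;

    ruby_xfree(m->a);
    ruby_xfree(m->b);
    ruby_xfree(m);
}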
The `and_matrix` function is as follows.
static void *
and_matrix(void *data)
{
    long i, j, num, k, result;
    long *a, *b;
    struct and_data *d;

    d = (struct and_data*)data;
    num = d->m->num;
    a = d->m->a;
    b = d->m->b;
    i = d->i;
    j = d->j;
    result = 0;
    for(k = 0; k < num; k++) {
        result += a[i * num + k] * b[k * num + j];
    }
    d->result = result;
    return NULL;
}
The reason `and_matrix` takes its argument as a `void *` and returns a `void *` is to make it easier to insert the GVL-release step later.
This library is used in thread processing code written in Ruby.
require './nonblock_matrix'
def t_m_and2(a, b, num, t_size)
  m = []
  queue = Queue.new
  threads = []
  nb = NonblockMatrix.new(a, b, num) # Use the C extension library
  num.times do |i|
    m[i] = []
    num.times do |j|
      queue.push([i, j])
    end
  end
  t_size.times do
    threads << Thread.new(queue) do |q|
      until q.empty?
        begin
          ti, tj = q.pop(true)
          m[ti][tj] = nb.m_and(ti, tj) # Row x column calculation implemented in the C extension library
        rescue ThreadError
          raise unless q.empty?
        end
      end
    end
  end
  threads.each(&:join)
  m
end
I will try it.
0.48769768700003624
It's 488 milliseconds! Even just moving this part of the processing to C makes it considerably faster; it's about as fast as writing all of the processing in C. Naturally, the threads still run one at a time, so the CPU usage looks like this.
CPU %user %nice %system %iowait %steal %idle
all 25.00 0.00 0.25 0.00 0.00 74.75
0 26.00 0.00 1.00 0.00 0.00 73.00
1 20.00 0.00 0.00 0.00 0.00 80.00
2 30.00 0.00 0.00 0.00 0.00 70.00
3 23.23 0.00 0.00 0.00 0.00 76.77
Finally, let's release the GVL.
Call the `and_matrix` function implemented in the C extension library with the GVL released. To release the GVL and call a function, use the `rb_thread_call_without_gvl` function.
rb_thread_call_without_gvl(and_matrix, &data, RUBY_UBF_PROCESS, NULL);
The function passed to `rb_thread_call_without_gvl` must take its argument as a `void *` and return a `void *`. That's easy here, because `and_matrix` was defined with exactly that interface.
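Concretely, the only change is inside the `and` method shown earlier: the direct `and_matrix(&data)` call is swapped for the `rb_thread_call_without_gvl` call. A sketch of the modified method (assuming `ruby/thread.h` is included for the declaration):

#include <ruby/thread.h> /* rb_thread_call_without_gvl is declared here */

/* Sketch of the modified `and` method: identical to before, except that
 * and_matrix now runs with the GVL released. */
static VALUE
and(VALUE self, VALUE i, VALUE j)
{
    matrix m;
    struct and_data data;

    Data_Get_Struct(self, struct _matrix, m);
    data.m = m;
    data.i = NUM2LONG(i);
    data.j = NUM2LONG(j);
    rb_thread_call_without_gvl(and_matrix, &data, RUBY_UBF_PROCESS, NULL);
    return LONG2NUM(data.result);
}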
I will try it.
1.4591152295470238
That's slow.
Let's look at the CPU usage.
CPU %user %nice %system %iowait %steal %idle
all 56.11 0.00 10.28 0.00 0.00 33.61
0 60.44 0.00 14.29 0.00 0.00 25.27
1 61.80 0.00 8.99 0.00 0.00 29.21
2 42.86 0.00 9.89 0.00 0.00 47.25
3 58.43 0.00 7.87 0.00 0.00 33.71
We're not making full use of the CPU cores, but usage did exceed 60%.
However, it is slow despite consuming CPU resources, so it seems to be doing a lot of wasted work. The `%system` value in particular has jumped up.
There is a comment like the following in Ruby's thread.c:
NOTE: Releasing GVL and re-acquiring GVL may be expensive operations
for a short running `func()'. Be sure to benchmark and use this
mechanism when `func()' consumes enough time.
In other words, releasing and re-acquiring the GVL many times around a function that finishes almost immediately, as we did here, costs more than it gains.
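One way to follow that advice (not tried in this article; just a sketch of the idea) would be to give each GVL release a whole row's worth of work instead of a single element, so that the released function runs long enough to amortize the cost:

/* Sketch only: compute one whole row per GVL release instead of one element. */
struct row_data {
    matrix m;
    long i;      /* row index */
    long *row;   /* output buffer of length num */
};

static void *
and_matrix_row(void *data)
{
    struct row_data *d = (struct row_data *)data;
    long num = d->m->num;
    long *a = d->m->a;
    long *b = d->m->b;
    long j, k, sum;

    for(j = 0; j < num; j++) {
        sum = 0;
        for(k = 0; k < num; k++) {
            sum += a[d->i * num + k] * b[k * num + j];
        }
        d->row[j] = sum;
    }
    return NULL;
}

On the Ruby side, the queue would then hold row indices instead of (i, j) pairs.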
By the way, calculating the 2048x2048 matrix with the very first pure-Ruby code took 1467 seconds. :cat2:
Tomorrow's article is @maguhiro's "I made a Bitrise Step for sending messages to MS Teams".