I've been a programmer at Tabelog since 2008, and I'm still a programmer: Oishii Tsukasa. I write my name in hiragana so that people at the company won't discover my online activities, a habit from an era when being active on the Internet was still unusual. I also have the embarrassing past of running a homepage called "Tsukasa's Room" back in 1995. Pretty great, right? Wouldn't it be super awkward if someone at work found out? :cat2: I put in a cat emoji for no particular reason, just to play around a little. Maybe nobody is even reading this far?
Even if you write multithreaded code in Ruby, you cannot take advantage of multiple CPU cores. This is because of the Giant VM Lock (GVL): Ruby only ever runs one thread at a time. The GVL is released while a thread is waiting on I/O, so in that case multiple threads can make progress simultaneously. Threads work well, for example, when you are fetching data from multiple URLs.
Numerical calculations are a different story. For example, let's say you want to calculate the product of two 512x512 matrices.
# Make two 512x512 matrices
def build(num, max)
  m = []
  num.times do |i|
    m[i] = []
    num.times do |j|
      m[i][j] = rand(max)
    end
  end
  m
end

num = 512
max = 256
a = build(num, max)
b = build(num, max)
Let's calculate the product of these two matrices, `a` and `b`. Implemented straightforwardly, the computation cost is $O(n^{3})$ because it requires a triple loop.
def m_and(a, b, num)
  m = []
  num.times do |i|
    m[i] = []
    num.times do |j|
      m[i][j] = 0
      num.times do |k|
        m[i][j] += a[i][k] * b[k][j]
      end
    end
  end
  m
end
I will measure the processing time.
require 'benchmark'
puts Benchmark.realtime { m_and(a, b, num) }
18.133936062455177
It took about 18 seconds. The machine I ran it on has four CPU cores. Let's take a look at the CPU usage during processing.
CPU %user %nice %system %iowait %steal %idle
all 25.00 0.00 0.25 0.00 0.00 74.75
0 0.00 0.00 0.00 0.00 0.00 100.00
1 0.00 0.00 1.00 0.00 0.00 99.00
2 100.00 0.00 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 100.00
Only one core is doing its best.
Now let's process it with multiple threads. The calculation of each row x column element can be performed independently, so let's divide the work into those units and have each thread process them.
def t_m_and(a, b, num, t_size)
  m = []
  queue = Queue.new
  threads = []
  num.times do |i|
    m[i] = []
    num.times do |j|
      m[i][j] = 0
      queue.push([i, j])
    end
  end
  t_size.times do
    threads << Thread.new(queue) do |q|
      until q.empty?
        begin
          ti, tj = q.pop(true)
          num.times do |k|
            m[ti][tj] += a[ti][k] * b[k][tj]
          end
        rescue ThreadError
          raise unless q.empty?
        end
      end
    end
  end
  threads.each(&:join)
  m
end
Since there are four CPU cores, let's run it with four threads.
puts Benchmark.realtime { t_m_and(a, b, num, 4) }
18.22166531533003
This also took about 18 seconds. The CPU usage during execution was as follows.
CPU %user %nice %system %iowait %steal %idle
all 25.06 0.00 0.00 0.00 0.00 74.94
0 29.70 0.00 0.99 0.00 0.00 69.31
1 28.00 0.00 0.00 0.00 0.00 72.00
2 22.00 0.00 0.00 0.00 0.00 78.00
3 20.20 0.00 0.00 0.00 0.00 79.80
All the cores are being used, but since only one thread can run at a time, the total time is the same as the sequential version. I want to find a way to make full use of every core.
If you release the GVL, multiple threads really can run at the same time. In this matrix calculation the threads' work does not affect each other, so it should be safe to release the GVL. Releasing the GVL requires writing a C extension library, so first let's write the matrix multiplication itself in C.
static VALUE
and(VALUE self, VALUE a, VALUE b, VALUE num)
{
    long *ma, *mb, *mc;
    long n;
    VALUE result;

    n = NUM2LONG(num);
    ma = ary_to_cary(a, n);
    mb = ary_to_cary(b, n);
    mc = and_matrix(ma, mb, n);
    result = cary_to_ary(mc, n);
    ruby_xfree(ma);
    ruby_xfree(mb);
    ruby_xfree(mc);
    return result;
}
void
Init_fullcore_matrix(void)
{
    VALUE cFullcoreMatrix = rb_define_class("FullcoreMatrix", rb_cObject);
    rb_define_method(cFullcoreMatrix, "m_and", and, 3);
}
The two-dimensional arrays are flattened into one-dimensional arrays to make them easier to handle in C. `ary_to_cary` converts a Ruby Array into a C `long *`, and `cary_to_ary` converts it back; neither is a standard function (I wrote them for this article but haven't included them here; a rough sketch appears after the next listing). The `and_matrix` function is as follows.
static long *
and_matrix(long *a, long *b, long num)
{
    long *m;
    long i, j, k;
    long index;

    m = (long*)ruby_xmalloc(sizeof(long) * num * num);
    for(i = 0; i < num; i++) {
        for(j = 0; j < num; j++) {
            index = i * num + j;
            m[index] = 0;
            for(k = 0; k < num; k++) {
                m[index] += a[i * num + k] * b[k * num + j];
            }
        }
    }
    return m;
}
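For reference, `ary_to_cary` and `cary_to_ary` might look roughly like this. This is only an illustration of the flattened `long *` layout used above, not the exact helper code behind the measurements in this article:

/* Illustrative sketches only -- the actual helpers are not included in this article. */
static long *
ary_to_cary(VALUE ary, long num)
{
    long *m;
    long i, j;
    VALUE row;

    m = (long*)ruby_xmalloc(sizeof(long) * num * num);
    for(i = 0; i < num; i++) {
        row = rb_ary_entry(ary, i);
        for(j = 0; j < num; j++) {
            m[i * num + j] = NUM2LONG(rb_ary_entry(row, j));
        }
    }
    return m;
}

static VALUE
cary_to_ary(long *m, long num)
{
    VALUE ary, row;
    long i, j;

    ary = rb_ary_new_capa(num);
    for(i = 0; i < num; i++) {
        row = rb_ary_new_capa(num);
        for(j = 0; j < num; j++) {
            rb_ary_push(row, LONG2NUM(m[i * num + j]));
        }
        rb_ary_push(ary, row);
    }
    return ary;
}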
Now let's run the C version.
fm = FullcoreMatrix.new
puts Benchmark.realtime { fm.m_and(a, b, num) }
0.39124061167240143
Overwhelmingly fast! The process that took 18 seconds is now 391 milliseconds.
The CPU usage is as follows; as before, only one core is doing all the work.
CPU %user %nice %system %iowait %steal %idle
all 25.00 0.00 0.25 0.00 0.00 74.75
0 0.00 0.00 1.00 0.00 0.00 99.00
1 0.00 0.00 0.00 0.00 0.00 100.00
2 100.00 0.00 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00 0.00 100.00
Now let's rewrite this code to run on multiple threads.
Writing the thread handling by hand is a hassle, so I'll parallelize it quickly with OpenMP (note that the extension then needs to be compiled with an OpenMP flag such as gcc's `-fopenmp`).
static long *
and_matrix(long *a, long *b, long num)
{
    long *m;
    long i, j, k;
    long index;

    m = (long*)ruby_xmalloc(sizeof(long) * num * num);
    #pragma omp parallel for private(j, k, index)
    for(i = 0; i < num; i++) {
        for(j = 0; j < num; j++) {
            index = i * num + j;
            m[index] = 0;
            for(k = 0; k < num; k++) {
                m[index] += a[i * num + k] * b[k * num + j];
            }
        }
    }
    return m;
}
When I run it,
0.13279788196086884
Huh, it's faster even though I haven't released the GVL. That's because OpenMP's threads don't go through Ruby's Thread at all, so the GVL doesn't constrain them. Given that the time dropped to nearly a quarter, the four cores seem to be used effectively.
It finishes too quickly to measure properly with sar, so let's try a 2048x2048 matrix instead.
10.213115755468607
It takes 10 seconds as expected. The CPU usage is as follows.
CPU %user %nice %system %iowait %steal %idle
all 99.50 0.00 0.50 0.00 0.00 0.00
0 100.00 0.00 0.00 0.00 0.00 0.00
1 100.00 0.00 0.00 0.00 0.00 0.00
2 100.00 0.00 0.00 0.00 0.00 0.00
3 100.00 0.00 0.00 0.00 0.00 0.00
We are now making full use of all cores.
But this feels less like making full use of all the cores in Ruby and more like simply doing thread programming in C, doesn't it? Somehow that feels like losing, so next I'll try to use every core while still going through Ruby's Thread. As originally planned, I'll release the GVL.
Releasing the GVL requires a C extension library, and in addition you must not touch Ruby functions while the GVL is released.
For this reason, the C extension keeps the two matrices inside its instance, already converted to `long *`, and exposes a method that computes a single row x column element. The idea is to call this method from Ruby's Thread code.
typedef struct _matrix {
    long *a;
    long *b;
    long num;
} *matrix;

struct and_data {
    matrix m;
    long i;
    long j;
    long result;
};
...
static VALUE
and(VALUE self, VALUE i, VALUE j)
{
    matrix m;
    struct and_data data;

    Data_Get_Struct(self, struct _matrix, m);
    data.m = m;
    data.i = NUM2LONG(i);
    data.j = NUM2LONG(j);
    and_matrix(&data);
    return LONG2NUM(data.result);
}
static VALUE
new(VALUE klass, VALUE a, VALUE b, VALUE num)
{
    matrix m;
    VALUE obj;

    obj = Data_Make_Struct(klass, struct _matrix, NULL, destroy_matix, m);
    m->num = NUM2LONG(num);
    m->a = ary_to_cary(a, m->num);
    m->b = ary_to_cary(b, m->num);
    return obj;
}
void
Init_nonblock_matrix(void)
{
    VALUE cNBMatrix = rb_define_class("NonblockMatrix", rb_cObject);
    rb_define_singleton_method(cNBMatrix, "new", new, 3);
    rb_define_method(cNBMatrix, "m_and", and, 2);
}
I've omitted the `destroy_matix` function (it just calls `ruby_xfree`).
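For completeness, here is a rough sketch of what it might look like, freeing the two flattened matrices along with the struct itself; this is just an illustration, not the exact code used for the measurements:

/* Illustrative sketch -- the actual destroy_matix is omitted in this article. */
static void
destroy_matix(void *ptr)
{
    matrix m = (matrix)ptr;

    ruby_xfree(m->a);
    ruby_xfree(m->b);
    ruby_xfree(m);
}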
The `and_matrix` function is as follows.
static void *
and_matrix(void *data)
{
    long i, j, num, k, result;
    long *a, *b;
    struct and_data *d;

    d = (struct and_data*)data;
    num = d->m->num;
    a = d->m->a;
    b = d->m->b;
    i = d->i;
    j = d->j;
    result = 0;
    for(k = 0; k < num; k++) {
        result += a[i * num + k] * b[k * num + j];
    }
    d->result = result;
    return NULL;
}
The reason `and_matrix` takes its argument as a `void *` and returns a `void *` is to make it easier to insert the GVL-release step later.
This library is used in thread processing code written in Ruby.
require './nonblock_matrix'
def t_m_and2(a, b, num, t_size)
  m = []
  queue = Queue.new
  threads = []
  nb = NonblockMatrix.new(a, b, num) # Use the C extension library
  num.times do |i|
    m[i] = []
    num.times do |j|
      queue.push([i, j])
    end
  end
  t_size.times do
    threads << Thread.new(queue) do |q|
      until q.empty?
        begin
          ti, tj = q.pop(true)
          m[ti][tj] = nb.m_and(ti, tj) # Row x column calculation implemented in the C extension library
        rescue ThreadError
          raise unless q.empty?
        end
      end
    end
  end
  threads.each(&:join)
  m
end
I will try it.
0.48769768700003624
It's 488 milliseconds! Even just moving this part of the processing to C makes it considerably faster; it's about as fast as writing all of the processing in C. Naturally, the threads still run one at a time, so the CPU usage looks like this.
CPU %user %nice %system %iowait %steal %idle
all 25.00 0.00 0.25 0.00 0.00 74.75
0 26.00 0.00 1.00 0.00 0.00 73.00
1 20.00 0.00 0.00 0.00 0.00 80.00
2 30.00 0.00 0.00 0.00 0.00 70.00
3 23.23 0.00 0.00 0.00 0.00 76.77
Finally, let's release the GVL.
Call the `and_matrix` function implemented in the C extension library with the GVL released. To release the GVL and call a function, use the `rb_thread_call_without_gvl` function.
rb_thread_call_without_gvl(and_matrix, &data, RUBY_UBF_PROCESS, NULL);
The function passed to `rb_thread_call_without_gvl` must take its argument as a `void *` and return a `void *`. That's easy here, because `and_matrix` was defined with exactly that interface.
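Concretely, the only change is inside the `and` method shown earlier: the direct `and_matrix(&data)` call is swapped for the `rb_thread_call_without_gvl` call. A sketch of the modified method (assuming `ruby/thread.h` is included for the declaration):

#include <ruby/thread.h> /* rb_thread_call_without_gvl is declared here */

/* Sketch of the modified `and` method: identical to before, except that
 * and_matrix now runs with the GVL released. */
static VALUE
and(VALUE self, VALUE i, VALUE j)
{
    matrix m;
    struct and_data data;

    Data_Get_Struct(self, struct _matrix, m);
    data.m = m;
    data.i = NUM2LONG(i);
    data.j = NUM2LONG(j);
    rb_thread_call_without_gvl(and_matrix, &data, RUBY_UBF_PROCESS, NULL);
    return LONG2NUM(data.result);
}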
I will try it.
1.4591152295470238
That's slow.
Let's look at the CPU usage.
CPU %user %nice %system %iowait %steal %idle
all 56.11 0.00 10.28 0.00 0.00 33.61
0 60.44 0.00 14.29 0.00 0.00 25.27
1 61.80 0.00 8.99 0.00 0.00 29.21
2 42.86 0.00 9.89 0.00 0.00 47.25
3 58.43 0.00 7.87 0.00 0.00 33.71
We're not making full use of the CPU cores, but usage did exceed 60%.
However, it is slow despite consuming CPU resources, so it seems to be doing a lot of wasted work. The `%system` value in particular has jumped up.
There is a comment like the following in Ruby's thread.c:
NOTE: Releasing GVL and re-acquiring GVL may be expensive operations
for a short running `func()'. Be sure to benchmark and use this
mechanism when `func()' consumes enough time.
In other words, releasing and re-acquiring the GVL many times around a function that finishes almost immediately, as we did here, costs more than it gains.
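One way to follow that advice (not tried in this article; just a sketch of the idea) would be to give each GVL release a whole row's worth of work instead of a single element, so that the released function runs long enough to amortize the cost:

/* Sketch only: compute one whole row per GVL release instead of one element. */
struct row_data {
    matrix m;
    long i;      /* row index */
    long *row;   /* output buffer of length num */
};

static void *
and_matrix_row(void *data)
{
    struct row_data *d = (struct row_data *)data;
    long num = d->m->num;
    long *a = d->m->a;
    long *b = d->m->b;
    long j, k, sum;

    for(j = 0; j < num; j++) {
        sum = 0;
        for(k = 0; k < num; k++) {
            sum += a[d->i * num + k] * b[k * num + j];
        }
        d->row[j] = sum;
    }
    return NULL;
}

On the Ruby side, the queue would then hold row indices instead of (i, j) pairs.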
By the way, calculating the 2048x2048 matrix with the very first pure-Ruby code took 1467 seconds. :cat2:
Tomorrow's article is @maguhiro's "I made a Bitrise Step for sending messages to MS Teams".