This article was supposed to be posted to the Raspberry Pi Advent Calendar 2015 a week ago; I am very sorry for missing the schedule. It is a continuation of Matrix multiplication with GPU of Raspberry Pi (1).

Last time I built the part that computes the 16x64 block shown in the figure below in parallel with SIMD, so this time I will build the remaining parts and complete the matrix multiplication code.
# Implementation of the i-loop and j-loop

The i-loop and j-loop execute very few times compared to the k-loop, so they are easy to deal with. Consider the j-loop:
```c
for (j = 0; j < r; j += 64) {
    body
}
```
Translated naively into assembly style, it becomes:
```
    j = 0
j_loop_begin:
    if (j >= r) goto j_loop_exit
    body
    j += 64
    goto j_loop_begin
j_loop_exit:
```
Written this way, however, there is a wasteful jump in every iteration. Jumps take time, and if they push entries out of the instruction cache, the performance of other parts suffers as well. The loop can be rewritten so that it contains only one jump per iteration:
```
    j = 0
    if (j >= r) goto j_loop_exit
j_loop_begin:
    body
    j += 64
    if (j < r) goto j_loop_begin
j_loop_exit:
```
The very first test `j >= r` never succeeds (assuming r > 0), so it can be omitted:
```
    j = 0
j_loop_begin:
    body
    j += 64
    if (j < r) goto j_loop_begin
j_loop_exit:
```
In other words, the loop takes the form of a do-while loop:
```c
j = 0;
do {
    body
    j += 64;
} while (j < r);
```
However, rather than the form above, I think the following form is better:
```c
j = r;
do {
    body
    j -= 64;
} while (j > 0);
```
The point is that the latter reduces the number of accesses to r. (As a bonus, the number of operations drops by one: the former needs two per iteration, the addition `j + 64` and the comparison `j - r`, while the latter needs only the single subtraction `j - 64`, whose result doubles as the loop condition.)
```c
j = 0;
do {
    body
    j += 64;
} while (j < r);   // r is read here every iteration

j = r;             // r is read only once, here
do {
    body
    j -= 64;
} while (j > 0);
```
Variables that are accessed this rarely cost little even if kept somewhere slow, so they can be spilled to memory or stored by the method introduced last time. The registers and accumulators freed this way can then be put to work in the loop body, which indirectly improves performance. The i-loop is written in the same way.
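For concreteness, here is a minimal sketch of how such a count-down j-loop might be emitted in py-videocore. It assumes r is a positive multiple of 64 arriving as a uniform; the register numbers are arbitrary choices of mine, not the article's code.

```python
ldi(rb10, 64)                           # loop step, loaded once outside the loop
mov(ra10, uniform)                      # j = r; the only read of r
L.j_loop
# ... loop body (the 16x64 block computation) ...
isub(ra10, ra10, rb10, set_flags=True)  # j -= 64, updating the Z flag
jzc(L.j_loop)                           # branch back while j != 0
nop(); nop(); nop()                     # branch delay slots (can hold real work)
```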
I considered doing the software pipelining I mentioned last time, but I ran out of steam on the exclusive control described later, so I have not done it.

Let me explain what I intended to do. In pseudocode, the program looks like the following. As explained last time, the 16x16 result is accumulated in four blocks, but there is a stretch with no computation at the point where the fourth block is transferred to the host.
```
for (j) {
    address computation and cache initialization
    the part of the work that protrudes in front of the k-loop (prologue)
    for (k) {
        compute the product of the vectors of A and B
    }
    compute the part that protrudes behind the k-loop (epilogue) while loading block 1
    compute block 1 while loading block 2
    compute block 2 while storing block 1 and loading block 3
    compute block 3 while storing block 2 and loading block 4
    compute block 4 while storing block 3
    store block 4                  <- nothing is being computed here!
    conditional branch (back to the top, or out of the loop)
}
```
Software pipelining is the technique of filling such waiting time (latency) by pulling instructions in from the next iteration, or in some cases from iterations even further ahead. VideoCore IV (probably) does not execute instructions out of order the way expensive GPUs do, so I think this kind of effort is important.
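To make the idea concrete, here is a runnable host-side sketch of the pattern; the `issue_load`/`compute`/`store` stubs are hypothetical stand-ins of mine for the real instruction sequences, and the point is only how work from iteration i+1 is started during iteration i.

```python
def issue_load(i): print(f"load    block {i}")  # stand-in for the DMA/TMU load
def compute(i):    print(f"compute block {i}")  # stand-in for the QPU arithmetic
def store(i):      print(f"store   block {i}")  # stand-in for the VPM/DMA store

def pipelined(n):
    """Software pipelining: each iteration starts the next iteration's load."""
    issue_load(0)                  # prologue: the first load is hoisted out
    for i in range(n):
        if i + 1 < n:
            issue_load(i + 1)      # pulled forward from the next iteration
        compute(i)                 # would overlap with the in-flight load
        store(i)                   # epilogue work for iteration i

pipelined(4)
```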
Of course, latency can also be hidden with multithreading; that is what warps are on NVIDIA GPUs. A VideoCore IV QPU can run two threads at the same time.
Up to this point the code runs on a single QPU, so I took a benchmark. The machine was the Raspberry Pi Zero I bought the other day; its CPU and GPU are the same as the Raspberry Pi 1's, only the CPU clock is raised to 1GHz.

The source code used is here:
https://github.com/nineties/py-videocore/blob/master/examples/sgemm_1thread.py
The results are below. Following pi-gemm, the shape of A is 96x363 and the shape of B is 363x3072.
Implementation | Number of threads | Execution time | Measured performance
---|---|---|---
numpy (BLAS) | CPU, 1 thread | 3.05 seconds | 0.070 Gflops
pi-gemm | QPU, 12 threads | 0.21 seconds | 1.02 Gflops
mine | QPU, 1 thread | 0.23 seconds | 0.95 Gflops
With just one QPU I was able to nearly catch up with pi-gemm. However, the theoretical peak of one QPU is 2 Gflops, so it is a pity that I only reached about 50% of it. Measuring various things, the Uniforms Cache turned out slower than expected: even with a single QPU, one load costs about 2 instructions. That roughly doubles the length of the k-loop body, which explains the roughly 50%. Increasing the number of QPUs will add cache misses and lower the efficiency further. I had hoped for better to some extent, since each load is only 4 bytes and the distances inside an SoC are short.
```python
L.k_loop
fadd(ra1, ra1, r0).fmul(r0, r4, uniform)   # each of these takes about 7-8 clocks (2 instructions)
fadd(rb1, rb1, r0).fmul(r0, r4, uniform)
...
fadd(rb31, rb31, r0, sig='load tmu0').mov(uniforms_address, r2)
iadd(r2, r2, r3).mov(tmu0_s, r1)
jzc(L.k_loop)
iadd(r1, r1, 4).fmul(r0, r4, uniform)      # delay slot; this one seems to take about 30 clocks
fadd(ra0, ra0, r0).fmul(r0, r4, uniform)   # delay slot
fadd(rb0, rb0, r0).fmul(r0, r4, uniform)   # delay slot
```
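As a rough sanity check (my own arithmetic, assuming the commonly quoted 250MHz QPU clock): a QPU issues one 4-lane ALU instruction per clock, and a fused `fadd`/`fmul` instruction does 2 flops per lane, so the peak is $250 \times 10^6 \times 4 \times 2 = 2$ Gflops. If roughly every other instruction slot is spent feeding loads instead of arithmetic, the expected throughput halves to about 1 Gflops, which is consistent with the measured 0.95 Gflops.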
I'm wondering how to improve this.
The TMU has an L1 cache and appears to be faster than the Uniforms Cache, but it spends ALU instructions on address calculation, so it is poorly suited to streaming in one different vector after another. One possible approach is to read both vectors with the TMU and compute their outer product while repeatedly rotating and broadcasting one of them. Rotation also consumes the ALU, but since it reduces the number of cache accesses, the performance with more QPUs might end up better.
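As a conceptual illustration of that broadcast idea (a NumPy sketch of my own, not the article's kernel): keep one 16-element vector resident in the lanes, and at each step broadcast a single element of the other vector to all lanes to produce one column of the 16x16 outer product.

```python
import numpy as np

a = np.random.rand(16).astype(np.float32)   # stays put, one element per lane
b = np.random.rand(16).astype(np.float32)   # one element broadcast per step
C = np.zeros((16, 16), dtype=np.float32)
for j in range(16):
    C[:, j] = a * b[j]                      # b[j] is broadcast to all 16 lanes
assert np.allclose(C, np.outer(a, b))
```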
Between the QPU and the VPM, things were actually not so slow: with a single QPU, both reads and writes went through without stalls. However, the VPM is shared by all QPUs, so increasing the QPU count will slow it down eventually. Also, reading matrix A or B through the VPM has the problem that the number of DMA transfers grows significantly.

For now, I think the first step is to measure the performance of each cache and examine the internal structure. That is left as future work; let us move on.
[Addendum] I had thought that the Uniforms Cache could not be software-prefetched because of how it works, but since the L2 cache is shared with the TMU, it seems that uniforms can be pulled up into L2 in advance by touching them through the TMU.
(Figure: cache hierarchy, quoted from the VideoCore IV 3D Architecture Reference Guide)
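A sketch of how that prefetch might look in py-videocore, under the assumption that issuing a TMU fetch for an address and discarding the result is enough to warm L2; the register holding the upcoming uniforms address, `ra_prefetch`, is hypothetical, and the exact incantation is untested:

```python
mov(tmu0_s, ra_prefetch)   # ask TMU0 to fetch the upcoming uniforms address
nop(sig='load tmu0')       # retire the fetch into r4; the value is discarded,
                           # but the cache line should now sit in the shared L2
```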
VideoCore has 12 QPUs. This time each QPU runs one thread, for a total of 12 threads. As shown in the figure below, the matrix $A$ is split horizontally and $B$ is split vertically into several pieces, and each QPU is assigned the product of one piece of $A$ with one piece of $B$.
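For concreteness, here is a host-side NumPy sketch of such an assignment; this is my own illustration, with the 2x6 split matching the benchmark at the end, and the helper name is not from the article.

```python
import numpy as np

def assign_blocks(A, B, n_split_a=2, n_split_b=6):
    """Yield (thread, row slice, column slice) for C = A @ B split across QPUs."""
    p, r = A.shape[0], B.shape[1]
    for thread in range(n_split_a * n_split_b):
        ia, ib = divmod(thread, n_split_b)
        rows = slice(ia * p // n_split_a, (ia + 1) * p // n_split_a)
        cols = slice(ib * r // n_split_b, (ib + 1) * r // n_split_b)
        yield thread, rows, cols   # this thread computes A[rows, :] @ B[:, cols]
```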
Like any ordinary multithreaded program, this requires exclusive control over access to shared resources. For that purpose, VideoCore IV provides one mutex and sixteen 4-bit semaphores.

Synchronization is written using these, but this time I cut corners and simply wrapped each section that needs synchronization in the mutex as a whole. Of course, this should cost some performance...
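Sketched in py-videocore terms, that coarse locking looks something like the following. I am assuming the hardware mutex is exposed as a register named `mutex`, where reading acquires (stalling until free) and writing releases; treat that mapping as an assumption rather than confirmed API.

```python
mov(null, mutex)    # acquire: the read stalls until the mutex becomes free
# ... touch the shared VPM / DMA setup registers here ...
mov(mutex, null)    # release the mutex for the next thread
```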
The resources shared by the QPUs are mainly the VPM, the DMA controller, and so on, but there are restrictions such as:

- Only two VPM read setups can be issued at a time; further requests are ignored while the queue is full. (Reference Guide, p. 56)
- A DMA load cannot be issued until the previous one has finished. The same applies to stores. (ibid., p. 56)

Neither of them offers the kindness of stalling until the previous operation completes: the request is silently ignored, or the Raspberry Pi itself locks up. There also seem to be restrictions not mentioned in the reference guide (especially around the extra stride setup register). Combined with bugs in my own assembler, this was quite an ordeal.
Semaphores are used to handle situations where a resource has a limited count, like the former restriction's "only up to two". Designate one of the threads as the master thread; it first raises the semaphore by 2. A thread that wants to read from the VPM lowers this semaphore by one and raises it by one when it is done. A thread that tries to lower the semaphore below 0 is made to wait, so at most two threads can use the resource at the same time. Locking can also be implemented with a semaphore.
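A minimal sketch of that counting pattern, using the `sema_up`/`sema_down` helpers that appear in the code below; the semaphore index `VPM_SETUP` is a hypothetical name of mine.

```python
# Master thread, once at startup: make two slots available.
sema_up(VPM_SETUP)
sema_up(VPM_SETUP)

# Any thread, around each VPM read setup:
sema_down(VPM_SETUP)   # blocks while two setups are already outstanding
# ... issue the VPM read setup and consume the data ...
sema_up(VPM_SETUP)     # give the slot back
```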
Since I wrapped everything in the mutex this time, I did not use a semaphore for that part, but I do use one where thread termination is synchronized. The master thread must wait until all threads have finished their computation before issuing the interrupt to the host. With 12 threads, the procedure is: each thread raises the COMPLETED semaphore when it finishes, and thread 0 lowers it twelve times, so it proceeds only after everyone is done. It is very simple. Below is the code for that part.
```python
sema_up(COMPLETED)  # notify thread 0 of completion
...
mov(null, uniform, set_flags=True)  # thread index
jzc(L.skip_fin)
nop(); nop(); nop()

# Only thread 0 enters here.
for i in range(n_threads):
    sema_down(COMPLETED)  # wait for completion of all threads
interrupt()

L.skip_fin

exit(interrupt=False)
```
Actually, I was thinking of pipelining the processing of the four blocks as shown in the figure below. I may try it when I have time.
The code is below.
https://github.com/nineties/py-videocore/blob/master/examples/sgemm.py
As before, A is 96x363 and B is 363x3072. With 12 threads, splitting A in two and B in six was fastest. I imagine that around this point the cache used for A (TMU) and the cache used for B (Uniforms) are well balanced. I will investigate in detail before long.
Implementation | Number of threads | Divisions of A | Divisions of B | Execution time | Measured performance
---|---|---|---|---|---
numpy (BLAS) | CPU, 1 thread | - | - | 3.05 seconds | 0.070 Gflops
pi-gemm | QPU, 12 threads | - | - | 0.21 seconds | 1.02 Gflops
mine | QPU, 1 thread | 1 | 1 | 0.23 seconds | 0.95 Gflops
mine | QPU, 12 threads | 2 | 6 | 0.026 seconds | 8.32 Gflops
- I tried single-precision matrix multiplication on the GPU of the Raspberry Pi. It reached about 8 Gflops, roughly 33% of the theoretical peak. I think there is still room for improvement, so I will keep at it.
- That is about 120 times faster than numpy + BLAS and about 8 times faster than pi-gemm.
For comparison, numpy + BLAS on my laptop (Core i7-5600U, 2.6GHz), computing in double precision, is about the same speed. In other words, however hard you push the Raspberry Pi's GPU, an ordinary computer is unfortunately still faster. That said, a Pi Zero costs about 600 yen, so in terms of price/performance it may well come out ahead.