I have created a Python library called **PyVideoCore** for GPGPU on the Raspberry Pi, and this post introduces it.
The Raspberry Pi series is equipped with Broadcom's **VideoCore IV** mobile GPU. Broadcom has officially documented this GPU in a reference guide (https://docs.broadcom.com/docs/12358545), published in February 2014 as a birthday present to the Raspberry Pi Foundation. Thanks to this document, it is possible to hack VideoCore.
VideoCore IV has 12 Quad Processing Units (**QPUs**). Each QPU is a 16-way SIMD processor that executes one 16-element vector operation per instruction as 4 words × 4 cycles. Each QPU can also issue an add-family and a multiply-family operation at the same time, so the GPU as a whole can perform up to 12 × 4 × 2 = 96 operations per cycle. At a 250 MHz clock, the theoretical peak is 96 × 0.25 = 24 GFLOPS, single precision only. The Raspberry Pi 2 can apparently be overclocked to about 500 MHz.
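Spelled out as a quick sanity check (plain Python arithmetic, just restating the figures above):

```python
# Back-of-the-envelope peak throughput from the figures above
qpus = 12          # Quad Processing Units
lanes = 4          # 4 words per cycle (x 4 cycles = 16-element vectors)
alus = 2           # add ALU and mul ALU issue together
clock_ghz = 0.25   # 250 MHz

ops_per_cycle = qpus * lanes * alus       # 96 single-precision operations
peak_gflops = ops_per_cycle * clock_ghz   # 24.0 GFLOPS
print(ops_per_cycle, peak_gflops)
```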
In addition, there are three Special Function Units (SFUs) independent of the ALUs, which can compute RECIP, RECIPSQRT, EXP2, and LOG2. However, each such operation takes 4 instructions (16 cycles), cannot be pipelined, and has poor accuracy (I have not experimented properly, but it seems to be about 4 significant digits except for LOG2), so the SFUs' contribution to overall computing power is minimal; a short sketch of what these primitives can build follows below. Each QPU can run up to two hardware threads, so up to 24 threads can run at the same time. Threads are assigned to QPUs dynamically by the VideoCore scheduler, and there are one mutex and 16 semaphores for synchronizing them. There are also several kinds of memory with different uses, but explaining them here would make this long, so I will cover them later.
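As an aside, these four SFU primitives are enough to compose several common functions. Here is a NumPy illustration (ordinary host-side Python, not QPU code; the variable names are my own):

```python
import numpy as np

x = np.float32(3.7)
y = np.float32(2.5)

recip = np.float32(1.0) / x               # what RECIP computes
recipsqrt = np.float32(1.0) / np.sqrt(x)  # what RECIPSQRT computes
exp2 = np.exp2(x)                         # what EXP2 computes
log2 = np.log2(x)                         # what LOG2 computes

# A general power can be composed from the last two: x**y = EXP2(y * LOG2(x))
print(np.exp2(y * np.log2(x)), x ** y)    # both roughly 26.3
```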
That is roughly the overview. The performance is not high because it is a mobile part, but I think it is an interesting thing to play with: the complete documentation is available, the board costs only a few thousand yen, and GPGPU on non-NVIDIA hardware is rare. And of course, if you are building some kind of project around a Raspberry Pi or a smartphone equipped with VideoCore IV, this computing power would be welcome.
PyVideoCore
Unfortunately, VideoCore IV does not (as far as I know) have a GPGPU development environment like CUDA or OpenCL, so you have to program the QPUs in assembly language. In fact, there is not even an official language or assembler. The following projects have tackled this in the past, and each seems to have developed its own assembler:
- Implementation of FFT
- Implementation of SHA-256
- Port of the Deep Belief image recognition SDK (matrix multiplication (GEMM) runs fast on the GPU)
**PyVideoCore** implements this assembly language as an internal DSL in Python to make it a little easier to write. The following sample just adds two 16-element float vectors, but note that the host-side code and the GPU-side code live in a single file that runs as a normal Python script, with no separate compilation step.
```python
import numpy as np

from videocore.assembler import qpu
from videocore.driver import Driver

@qpu
def hello_world(asm):
    # Load two vectors of length 16 from the host memory (address=uniforms[0]) to VPM
    setup_dma_load(nrows=2)
    start_dma_load(uniform)
    wait_dma_load()

    # Set up VPM read/write operations
    setup_vpm_read(nrows=2)
    setup_vpm_write()

    # Compute a + b
    mov(r0, vpm)
    mov(r1, vpm)
    fadd(vpm, r0, r1)

    # Store the result vector from VPM to the host memory (address=uniforms[1])
    setup_dma_store(nrows=1)
    start_dma_store(uniform)
    wait_dma_store()

    # Finish the thread
    exit()

with Driver() as drv:
    # Input vectors
    a = np.random.random(16).astype('float32')
    b = np.random.random(16).astype('float32')

    # Copy vectors to shared memory for DMA transfer
    inp = drv.copy(np.r_[a, b])
    out = drv.alloc(16, 'float32')

    # Run the program
    drv.execute(
        n_threads=1,
        program=drv.program(hello_world),
        uniforms=[inp.address, out.address]
    )

    print(' a '.center(80, '='))
    print(a)
    print(' b '.center(80, '='))
    print(b)
    print(' a+b '.center(80, '='))
    print(out)
    print(' error '.center(80, '='))
    print(np.abs(a + b - out))
```
The GPU code is the function with the `@qpu` decorator. Currently you have to write raw assembly, but the GPU code itself is an ordinary Python function and each instruction is also an ordinary function, so I think you can build up a library of frequently used patterns out of plain Python functions.
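For example, here is a hypothetical sketch of that idea (the function name is my own; it assumes only the DSL instructions already used in the sample above): a Python-level loop simply unrolls into repeated instructions at assembly time.

```python
@qpu
def sum_four_rows(asm):
    # DMA four 16-element rows from host memory into the VPM
    setup_dma_load(nrows=4)
    start_dma_load(uniform)
    wait_dma_load()
    setup_vpm_read(nrows=4)
    setup_vpm_write()

    mov(r0, vpm)
    # Ordinary Python loop: unrolls into three mov/fadd pairs at assembly time
    for _ in range(3):
        mov(r1, vpm)
        fadd(r0, r0, r1)
    mov(vpm, r0)

    # DMA the single result row back to host memory
    setup_dma_store(nrows=1)
    start_dma_store(uniform)
    wait_dma_store()
    exit()
```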
The repository is here: https://github.com/nineties/py-videocore. Please give it a try.
I am planning to take benchmarks and build software on top of this next, so I will write more about it in the future.