GPGPU with Raspberry Pi

I have written a Python library called **PyVideoCore** for GPGPU programming on the Raspberry Pi, so I would like to introduce it here.

GPU of Raspberry Pi

The Raspberry Pi series is equipped with Broadcom's **VideoCore IV** mobile GPU. The GPU is officially documented in a reference guide (https://docs.broadcom.com/docs/12358545), which Broadcom released in February 2014 as a birthday present to the Raspberry Pi Foundation. Thanks to this document, it is possible to hack on VideoCore.

VideoCore IV has 12 Quad Processing Units (**QPU**s). Each QPU is a 16-way SIMD processor: a single instruction performs a vector operation on 16 elements, processed as 4 words x 4 cycles. Each QPU can also execute two operations at the same time, one on the add ALU and one on the multiply ALU. The whole GPU can therefore perform up to 12 x 4 x 2 = 96 operations per cycle. At a clock of 250 MHz, the theoretical peak is 96 x 0.25 = 24 GFLOPS, in single precision only. The Raspberry Pi 2 can reportedly be overclocked to about 500 MHz.
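
The peak-performance arithmetic above can be written out as a short script. This is only a sketch of the calculation in the text; the constant and function names are made up for illustration, and the 500 MHz figure simply applies the same formula to the overclock mentioned above rather than reporting a measurement.

```python
# Figures taken from the paragraph above; the names are illustrative only.
NUM_QPUS = 12          # Quad Processing Units on VideoCore IV
WORDS_PER_CYCLE = 4    # a 16-element vector is processed 4 words per cycle
ALUS_PER_QPU = 2       # add ALU and multiply ALU can issue together

ops_per_cycle = NUM_QPUS * WORDS_PER_CYCLE * ALUS_PER_QPU


def peak_gflops(clock_ghz):
    """Theoretical single-precision peak for a given clock in GHz."""
    return ops_per_cycle * clock_ghz


print('operations per cycle: %d' % ops_per_cycle)
print('peak at 250 MHz: %.1f GFLOPS' % peak_gflops(0.25))
print('peak at 500 MHz: %.1f GFLOPS' % peak_gflops(0.5))
```

These are upper bounds: they assume every QPU issues both an add and a multiply on every cycle, which real kernels rarely achieve.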

In addition, there are three Special Function Units (SFUs), independent of the ALUs, that can compute RECIP, RECIPSQRT, EXP2, and LOG2. However, each of these takes 4 instructions (16 cycles), cannot be pipelined, and has poor accuracy (I have not tested this properly, but apart from LOG2 the results seem to be good to about 4 digits), so the SFUs contribute very little to the overall computing power.

Each QPU can run up to two hardware threads, so up to 24 threads can run at the same time. Threads are assigned to QPUs dynamically by the VideoCore scheduler. One mutex and 16 semaphores are available for synchronizing threads. There are also several kinds of memory for different purposes, but explaining them would take a while, so I will cover them another time.
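
The "about 4 digits" remark on SFU accuracy could be checked with a small helper like the one below. This is a minimal sketch, not a measurement: `matching_digits` is a hypothetical helper, and because no GPU is involved here, the "approximate" values are produced by rounding to float16 purely as a stand-in for real SFU output.

```python
import numpy as np


def matching_digits(approx, exact):
    """Worst-case number of matching decimal digits.

    Computed as -log10 of the largest relative error.
    Assumes no element of `exact` is zero.
    """
    approx = np.asarray(approx, dtype=np.float64)
    exact = np.asarray(exact, dtype=np.float64)
    rel_err = np.max(np.abs(approx - exact) / np.abs(exact))
    return -np.log10(rel_err)


# Inputs chosen so that log2(x) stays within [1, 10] and never hits zero.
x = np.linspace(2.0, 1024.0, 1000)
exact = np.log2(x)

# Stand-in only: float16 rounding replaces a real SFU result here.
approx = exact.astype(np.float16).astype(np.float64)

digits = matching_digits(approx, exact)
print('worst-case matching digits: %.2f' % digits)
```

To test the real hardware, `approx` would be replaced by values read back from a QPU program that uses the SFU.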

That is the general picture. Since this is a mobile GPU, its performance is not high, but I think it is fun to play with: complete documentation is available, the hardware costs only a few thousand yen, and GPGPU on anything other than NVIDIA is rare. And of course, if you are working on a project that uses a Raspberry Pi or a smartphone equipped with VideoCore IV, this extra computing power should be welcome.

PyVideoCore

Unfortunately, VideoCore IV does not (as far as I know) have a GPGPU development environment like CUDA or OpenCL, so you have to develop in QPU assembly language. In fact, there is not even an official language or assembler. The following projects have been done in the past, but each of them seems to have developed its own assembler.

- FFT implementation
- SHA256 implementation
- Port of the Deep Belief image recognition SDK (matrix multiplication (GEMM) is fast on the GPU)

**PyVideoCore** implements the assembly language as an internal DSL in Python to make it a little easier to write. The sample below simply adds two float vectors of length 16, but note that the host-side code and the GPU-side code live in one file, which runs as an ordinary Python script without a compilation step.

import numpy as np

from videocore.assembler import qpu
from videocore.driver import Driver

@qpu
def hello_world(asm):
    # Load two vectors of length 16 from the host memory (address=uniforms[0]) to VPM
    setup_dma_load(nrows=2)
    start_dma_load(uniform)
    wait_dma_load()

    # Setup VPM read/write operations
    setup_vpm_read(nrows=2)
    setup_vpm_write()

    # Compute a + b
    mov(r0, vpm)
    mov(r1, vpm)
    fadd(vpm, r0, r1)

    # Store the result vector from VPM to the host memory (address=uniforms[1])
    setup_dma_store(nrows=1)
    start_dma_store(uniform)
    wait_dma_store()

    # Finish the thread
    exit()

with Driver() as drv:
    # Input vectors
    a = np.random.random(16).astype('float32')
    b = np.random.random(16).astype('float32')

    # Copy vectors to shared memory for DMA transfer
    inp = drv.copy(np.r_[a, b])
    out = drv.alloc(16, 'float32')

    # Run the program
    drv.execute(
            n_threads=1,
            program=drv.program(hello_world),
            uniforms=[inp.address, out.address]
            )

    # Show the inputs, the GPU result, and the absolute error
    print(' a '.center(80, '='))
    print(a)
    print(' b '.center(80, '='))
    print(b)
    print(' a+b '.center(80, '='))
    print(out)
    print(' error '.center(80, '='))
    print(np.abs(a+b-out))
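
For readers without a Raspberry Pi at hand, the data flow of the sample can be mimicked on the host with NumPy alone. This is a sketch under the assumption that the kernel treats the input buffer as two rows of 16 floats, as its comments describe; `hello_world_reference` is a hypothetical name and nothing here touches the GPU or PyVideoCore.

```python
import numpy as np


def hello_world_reference(inp):
    """Host-side stand-in for the kernel: add the two 16-element rows."""
    rows = np.asarray(inp, dtype=np.float32).reshape(2, 16)
    r0 = rows[0]    # corresponds to the first mov(r0, vpm)
    r1 = rows[1]    # corresponds to the second mov(r1, vpm)
    return r0 + r1  # corresponds to fadd(vpm, r0, r1)


rng = np.random.RandomState(0)
a = rng.random_sample(16).astype('float32')
b = rng.random_sample(16).astype('float32')

# Same layout as drv.copy(np.r_[a, b]) in the sample: a followed by b.
out = hello_world_reference(np.r_[a, b])
print(np.abs(a + b - out))
```

A reference like this is handy for checking GPU output once a kernel grows beyond a single addition.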

The assembly code is marked with a decorator called @qpu. For now you have to write raw assembly, but the GPU code is itself an ordinary function, and each instruction is also an ordinary function, so I think frequently used patterns can be packaged into a library of Python functions.
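
To illustrate why "instructions as ordinary functions" makes reusable patterns possible, here is a self-contained toy sketch. `ToyAssembler` and `sum_rows` are hypothetical names invented for this illustration and are not part of PyVideoCore; the toy only records instructions as tuples instead of encoding them for the QPU.

```python
class ToyAssembler(object):
    """Toy internal DSL: every method call records one instruction."""

    def __init__(self):
        self.program = []

    def __getattr__(self, mnemonic):
        # Only reached for names that are not real attributes.
        if mnemonic.startswith('_'):
            raise AttributeError(mnemonic)

        def emit(*operands):
            self.program.append((mnemonic,) + operands)
        return emit


def sum_rows(asm, nrows):
    """Reusable pattern: read nrows vectors from 'vpm' and write their sum."""
    asm.mov('r0', 'vpm')
    for _ in range(nrows - 1):
        asm.mov('r1', 'vpm')
        asm.fadd('r0', 'r0', 'r1')
    asm.mov('vpm', 'r0')


asm = ToyAssembler()
sum_rows(asm, 3)  # an ordinary Python loop unrolls into straight-line code

for instruction in asm.program:
    print(' '.join(instruction))
```

Because the helper is plain Python, loops and parameters are evaluated while the program is being assembled, so one function can generate many variants of the same instruction pattern.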

Below is the repository. Please give it a try.

I plan to run benchmarks and build software with it from here on, so I will write more later.
