Introduction

I compared the speed of programs created with BASIC, C and assembler with IchigoJam, and tried to optimize for play. The program is a BASIC program that displays Mandelbrot set with IchigoJam [Creative Commons License](https://ja.wikipedia.org/ wiki /% E3% 82% AF% E3% 83% AA% E3% 82% A8% E3% 82% A4% E3% 83% 86% E3% 82% A3% E3% 83% 96% E3% 83% BB % E3% 82% B3% E3% 83% A2% E3% 83% B3% E3% 82% BA% E3% 83% BB% E3% 83% A9% E3% 82% A4% E3% 82% BB% E3 There is a person who is published at% 83% B3% E3% 82% B9), and I am using this because it was convenient for this comparison. The processing system of each language used is

language	Processing system	Edition	Public address
BASIC	IchigoJam BASIC	ver 1.2.1 for US keyboard	http://ichigojam.net/download.html
C	gcc	5.4.1	https://launchpad.net/gcc-arm-embedded
C	clang	3.9.1	http://releases.llvm.org/download.html
assembler	GNU Binutils(What comes with gcc)	2.26.2	https://launchpad.net/gcc-arm-embedded

I am using the above. BASIC Enter Original Source in IchigoJam, delete the 20th line, add the following line and execute it.

`python`


5 CLS:CLT
50 LC0,23:?TICK()

Measures the time from when the screen is erased until the Mandelbrot set is displayed. The time is measured by the function TICK () of IchigoJam BASIC, and the unit is about 1/60 second. The execution result is as follows. mandelbrot.basic.1.jpg The result is 31008, but since the unit is about 1/60 seconds, if you convert this to time, 31008 * 1/60 = 516.8, which means that it took about 8 minutes and 37 seconds. The author of this program is written in Article that "it takes about 15 minutes to draw everything", but at the time this article was written It seems that the execution speed of IchigoJam BASIC has been greatly improved by the version upgrade. C

Try gcc

The following is a port of the above BASIC program to C faithfully except for the size of the variable.

`mandelbrot.c`


#include <stdint.h>

int mandelbrot(int16_t arg, uint8_t* ram, const uint8_t* font)
{
    (void)arg; (void)font;
    int n = 32;
    int w = 64, h = 48;
    int a = 2 << 6, b = a * a, c = (3 << 12) / w;
    for (int y = 0; y < h; y++) {
        int v = (y - h / 2) * c;
        for (int x = 0; x < w; x++) {
            int u = (x - (w * 3) / 4) * c;
            uint8_t* z = &ram[0x900 + (x / 2) + (y / 2) * 32];
            int j = 1 << ((y % 2) * 2 + (x % 2));
            int k = j | 0x80 | *z;
            *z = k;
            int p = u, q = v, i = 0;
            do {
                int r = p / 64; p = r * r; if (p < 0) goto l40;
                int s = q / 64; q = s * s; if (q < 0) goto l40;
                if (p + q < 0 || p + q > b) goto l40;
                p = p - q + u;
                q = 2 * r * s + v;
            } while (++i < n);
            *z = k ^ j;
          l40:;
        }
    }
    return 0;
}

Compile this into machine language with the following Makefile and Perl script, convert it to BASIC comment form, and create mandelbrot.apo. The optimizing option of the compiler specifies `-Ofast'because it emphasizes speed.

`Makefile`


target: mandelbrot.apo

mandelbrot.s: mandelbrot.c
	arm-none-eabi-gcc -Wall -W -mcpu=cortex-m0 -mthumb -fpic -Ofast -S $<

mandelbrot.o: mandelbrot.s
	arm-none-eabi-as $< -o$@ -a=$(<:.s=.prn)

mandelbrot.x: mandelbrot.o
	arm-none-eabi-ld --entry 0x700 -Ttext 0x700 $< -o $@ -Map $(@:.x=.map)

mandelbrot.mot: mandelbrot.x
	arm-none-eabi-objcopy -O srec $< $@

mandelbrot.apo: mandelbrot.mot
	perl hex2apo.pl $< > $@

`hex2apo.pl`


#!/usr/bin/perl
$start = 0xffffffff;
$end = 0x00000000;
while(<>) {
    $S = substr($_, 0, 1);
    $type = substr($_, 1, 1);
    die if ($S != "S" || !($type >= 0 && $type <= 9));
    $size = hex(substr($_, 1 + 1, 2));
    for ($sum = 0, $i = 0; $i <= $size; $i++) {
        $sum += hex(substr($_, 1 + 1 + 2 * $i, 2));
    }
    die if (($sum & 0xff) != 0xff);
    if ($type >= 1 && $type <= 3) {
        $size = $size - 1 - ($type + 1);
        $address = hex(substr($_, 1 + 1 + 2, 2 * ($type + 1)));
        if ($start > $address) {
            $start = $address;
        }
        if ($end < $address + $size - 1) {
            $end = $address + $size - 1;
        }
        for ($i = 0; $i < $size; $i++) {
            $mem[$address + $i] = hex(substr($_, 1 + 1 + 2 + 2 * ($type + 1) + 2 * $i, 2));
        }
    }
}
$line = 10;
for ($i = $start; $i <= $end;) {
    printf("%d '", $line);
    for ($j = 0; $j < 15; $j++) {
        $n = 0;
        for ($k = 0; $k < 8; $k++) {
            $n = ($n << 8) & 0xff00;
            if ($k < 7) {
                $n += $mem[$i++];
            }
            &putc((($n >> ($k + 1)) & 0x7f));
            if ($i > $end) {
                if ($k < 7) {
                    &putc((($n << (6 - $k)) & 0x7f));
                }
                last;
            }
        }
        if ($i > $end) {
            last;
        }
    }
    printf("\n");
    $line += 1;
}

sub putc {
    my ($c) = @_;
    $c += 0x21;
    if ($c >= 0x7f) {
        $c += (0xa1 - 0x7f);
    }
    printf("%c", $c);
}

`mandelbrot.apo`


10 'Sa NL Ki Sz-eD4IekP, Um2a#=?"&j"%^y%Chi Hn2aj)%c&n-㽬 9bW")q!J;M!$ja%Z{;ACNrcA-r-#n7 ･';3Ig#GmUEr-{DH)a*-Ga"ｩ g1%f-"23 js#Cga)jUQ'ｦ
11 'D'I#"1UdBra:1#G*C CDA%㽬!!i2iU9%Sae-1 a"Qxq ｧ&BZ,![A1D.9/12neA1!Oh!bBwiYOh､KfｶXA*|J.9D9Oh､4ｧUｳ!i-ca eX.)|DQ'ｧ"Oer ")seQ7X ”C!GQMm ”L
12 's key A#!?"].picture)K;egvr.,Ju J-"nu'39GｳｴZ"/Kochi!A`Tuu

The following is a merged version of a BASIC program that reads this, writes it to memory, executes it as a machine language program, and uses TICK () to measure and display the execution time.

`text:mandelbrot.gcc.1.bas`


1 ' Mandelbrot set(gcc)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:Z=USR(#700,0):LC0,23:?TICK()
10 'Sa NL Ki Sz-eD4IekP, Um2a#=?"&j"%^y%Chi Hn2aj)%c&n-㽬 9bW")q!J;M!$ja%Z{;ACNrcA-r-#n7 ･';3Ig#GmUEr-{DH)a*-Ga"ｩ g1%f-"23 js#Cga)jUQ'ｦ
11 'D'I#"1UdBra:1#G*C CDA%㽬!!i2iU9%Sae-1 a"Qxq ｧ&BZ,![A1D.9/12neA1!Oh!bBwiYOh､KfｶXA*|J.9D9Oh､4ｧUｳ!i-ca eX.)|DQ'ｧ"Oer ")seQ7X ”C!GQMm ”L
12 's key A#!?"].picture)K;egvr.,Ju J-"nu'39GｳｴZ"/Kochi!A`Tuu

I entered this into IchigoJam and executed it as follows. mandelbrot.gcc.0.jpg The numerical value of the result is too small, and the magnitude of the value after the decimal point truncated at the time of measurement is too large as an error to evaluate the result, so change the program

`python`


6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()

Run 100 times and then evaluate the result. mandelbrot.gcc.1.jpg It says 293, but remove Z = USR (# 700,0) from the previous line

`python`


6 CLS:CLT:FORI=1TO100:NEXT:LC0,23:?TICK()

The result of executing only the loop 100 times without executing the machine language part is mandelbrot.100loop.jpg Since it is 9, (293-9) / 100 = 2.84 is the approximate execution time per execution.

Try clang

The line calling arm-none-eabi-gcc in the previous Makefile

	clang -Wall -W -target arm-eabi -mcpu=cortex-m0 -fpic -Ofast -S $<

You can change the C compiler to use to clang by changing to. The optimization option specifies `-Ofast'as with gcc. The BASIC program that incorporates the code generated by clang is as follows.

`text:mandelbrot.clang.1.bas`


1 ' Mandelbrot set(clang)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'NRl!'C!1"S!C<!A#E-%:V3a-#s/)^c9&!D o=㽬'*2g ”N!"(153bE!#JB.3!!5a,>Key d-2cIA$mB$!-"'~!jw5<EU#nCD!}{=a%FgU=!c:Bb o 1+u*:%ｨN59㽬8j
11 '-!$1Ua=g&H!z i mP<(]ｦ*ye]g\1 So^So>ｧ-ｬ ・ 4 ・:idq'BZui(1B$\(!"-<Pq i!!KV%-<Dc"Xl|%G!Shi k&!'m!O!m{=A,X-)#Y/-&kGe!A'M?,Oh$~ﾂﾂﾂ｢ﾂｾSo"}
12 'Twip Tachi"!!!!

If you do this, the result will be mandelbrot.clang.1.jpg It became 307. , (307-9) / 100 = 2.98 is the approximate execution time per run. Unfortunately for the clang sect, this test is slightly (about 5%) inferior to gcc.

Try manual optimization

In this Makefile, the assembler source is output once at the time of build, assembled and linked. The contents of the assembler source generated in the case of the previous gcc are as follows.

`mandelbrot.s`


	.syntax unified
	.cpu cortex-m0
	.fpu softvfp
	.eabi_attribute 23, 1
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 1
	.eabi_attribute 30, 2
	.eabi_attribute 34, 0
	.eabi_attribute 18, 4
	.thumb
	.syntax unified
	.file	"mandelbrot.c"
	.text
	.align	2
	.global	mandelbrot
	.code	16
	.thumb_func
	.type	mandelbrot, %function
mandelbrot:
	@ args = 0, pretend = 0, frame = 24
	@ frame_needed = 0, uses_anonymous_args = 0
	push	{r4, r5, r6, r7, lr}
	mov	r7, fp
	mov	r6, r10
	mov	r4, r8
	mov	r5, r9
	ldr	r3, .L12
	push	{r4, r5, r6, r7}
	mov	r8, r3
	movs	r3, #0
	sub	sp, sp, #28
	str	r3, [sp, #20]
	movs	r3, #128
	lsls	r3, r3, #7
	movs	r7, #63
	mov	r10, r3
	str	r1, [sp, #16]
.L6:
	movs	r1, #1
	ldr	r2, [sp, #20]
	ldr	r6, .L12+4
	asrs	r3, r2, #1
	lsls	r3, r3, #5
	ands	r1, r2
	str	r3, [sp, #8]
	lsls	r3, r1, #1
	str	r3, [sp, #12]
	movs	r3, #0
	mov	fp, r3
.L5:
	movs	r2, #144
	mov	r3, fp
	lsls	r2, r2, #4
	mov	ip, r2
	ldr	r2, [sp, #8]
	asrs	r3, r3, #1
	add	r3, r3, ip
	mov	ip, r2
	ldr	r2, [sp, #16]
	add	r3, r3, ip
	mov	ip, r2
	mov	r2, fp
	add	ip, ip, r3
	movs	r3, #1
	ands	r3, r2
	ldr	r2, [sp, #12]
	movs	r1, r6
	mov	r9, r2
	movs	r2, #1
	add	r3, r3, r9
	lsls	r2, r2, r3
	mov	r9, r2
	movs	r2, #128
	mov	r3, r9
	orrs	r2, r3
	mov	r3, ip
	ldrb	r3, [r3]
	movs	r4, #32
	orrs	r2, r3
	mov	r3, r8
	str	r2, [sp, #4]
	str	r3, [sp]
	b	.L4
.L2:
	subs	r1, r1, r2
	ldr	r2, [sp]
	lsls	r0, r0, #1
	mov	r8, r2
	muls	r3, r0
	subs	r4, r4, #1
	adds	r1, r6, r1
	add	r3, r3, r8
	cmp	r4, #0
	beq	.L11
.L4:
	asrs	r0, r1, #31
	asrs	r2, r3, #31
	ands	r0, r7
	ands	r2, r7
	adds	r1, r0, r1
	adds	r3, r2, r3
	asrs	r0, r1, #6
	asrs	r3, r3, #6
	movs	r1, r0
	movs	r2, r3
	muls	r1, r0
	muls	r2, r3
	adds	r5, r1, r2
	cmp	r5, r10
	ble	.L2
	ldr	r3, [sp]
	mov	r2, sp
	mov	r8, r3
	mov	r3, ip
	ldrb	r2, [r2, #4]
	strb	r2, [r3]
.L3:
	movs	r3, #1
	mov	ip, r3
	add	fp, fp, ip
	mov	r3, fp
	adds	r6, r6, #192
	cmp	r3, #64
	bne	.L5
	movs	r2, #192
	mov	ip, r2
	ldr	r3, [sp, #20]
	add	r8, r8, ip
	adds	r3, r3, #1
	str	r3, [sp, #20]
	cmp	r3, #48
	bne	.L6
	movs	r0, #0
	add	sp, sp, #28
	@ sp needed
	pop	{r2, r3, r4, r5}
	mov	r8, r2
	mov	r9, r3
	mov	r10, r4
	mov	fp, r5
	pop	{r4, r5, r6, r7, pc}
.L11:
	mov	r2, r9
	ldr	r3, [sp, #4]
	eors	r3, r2
	mov	r2, ip
	strb	r3, [r2]
	b	.L3
.L13:
	.align	2
.L12:
	.word	-4608
	.word	-9216
	.size	mandelbrot, .-mandelbrot
	.ident	"GCC: (GNU Tools for ARM Embedded Processors) 5.4.1 20160919 (release) [ARM/embedded-5-branch revision 240496]"

Since `-Ofast'is specified in the optimization option, it is assumed that the code that emphasizes execution speed is spit out, but after all it is due to the modern compiler, so there is a wasteful part. One of the theories of speeding up is "focus on the inside of the loop." The C program this time has a triple loop structure, but the innermost loop part

`python`


            int p = u, q = v, i = 0;
            do {
                int r = p / 64; p = r * r; if (p < 0) goto l40;
                int s = q / 64; q = s * s; if (q < 0) goto l40;
                if (p + q < 0 || p + q > b) goto l40;
                p = p - q + u;
                q = 2 * r * s + v;
            } while (++i < n);

I will consider speeding up only for. The innermost loop part is the upper assembler source

`python`


	movs	r4, #32
	orrs	r2, r3
	mov	r3, r8
	str	r2, [sp, #4]
	str	r3, [sp]
	b	.L4
.L2:
	subs	r1, r1, r2
	ldr	r2, [sp]
	lsls	r0, r0, #1
	mov	r8, r2
	muls	r3, r0
	subs	r4, r4, #1
	adds	r1, r6, r1
	add	r3, r3, r8
	cmp	r4, #0
	beq	.L11
.L4:
	asrs	r0, r1, #31
	asrs	r2, r3, #31
	ands	r0, r7
	ands	r2, r7
	adds	r1, r0, r1
	adds	r3, r2, r3
	asrs	r0, r1, #6
	asrs	r3, r3, #6
	movs	r1, r0
	movs	r2, r3
	muls	r1, r0
	muls	r2, r3
	adds	r5, r1, r2
	cmp	r5, r10
	ble	.L2

It has been converted to the code. In the C source, the process of getting out of the loop with goto when the variables p and q become negative was a countermeasure against the multiplication overflow that can occur because the variable size is 16 bits in the original BASIC program. is. When porting to C, the size of the variable is 32bit, and the judgment of overflow should not be logically necessary, but it is incorporated for the purpose of faithfully porting from the BASIC program. I was surprised that the output code of the above compiler had that part removed cleanly by optimization. [Added 2017.9.3] C language does not consider signed integer overflow, so the code below

                int r = p / 64; p = r * r; if (p < 0) goto l40;
                int s = q / 64; q = s * s; if (q < 0) goto l40;

Negative results of squared r and s are not considered, and it seems that the optimization removed them as "useless processing".

Besides, as a quick look

Flow is not efficient
There are many mov instructions, and there seems to be useless assignment.
Slow reading memory on stack frame in loop
R7 is used to mask the sign bit when reducing the signed value to 1/64, but not required for instruction combinations
After decrementing the loop counter, it is compared with # 0, but the comparison itself is useless because the flag changes with decrementing.

I was concerned about the above points. The previous part

`python`


	movs	r4, #32
	orrs	r2, r3
	mov	r3, r8
	str	r2, [sp, #4]
	str	r3, [sp]
	movs	r7, r3
.L4:
	asrs	r0, r1, #6
	asrs	r2, r3, #6
	lsrs	r0, r0, #26
	lsrs	r2, r2, #26
	adds	r1, r0, r1
	adds	r3, r2, r3
	asrs	r1, r1, #6
	asrs	r3, r3, #6
	movs	r0, r1
	muls	r0, r3
	muls	r1, r1
	muls	r3, r3
	adds	r5, r1, r3
	cmp	r5, r10
	bgt	1f
	subs	r1, r1, r3
	lsls	r3, r0, #1
	adds	r1, r6, r1
	adds	r3, r7
	subs	r4, r4, #1
	bne	.L4
	b	.L11
1:

By changing to, these points will be resolved. The following is a build of mandelbrot.s with this modification and converted to a BASIC program.

`text:mandelbrot.gcc.opt.bas`


1 ' Mandelbrot set(hand optimized)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'Sa NL Ki Sz-eD4I ｦ cP, Um2a#=?"&j"%^y%Chi Hn2aj)%c&n-ji 9bW")q!J;M!$ja%Z{;ACNrcA-r-#n7 ･';3Ig#GmUEr-{DH)a*-Ga"ｩ g1%f-"23 js#Cga)jUQ'ｦ
11 'D'I#"1UdBra:1#G@!C#:qg!/j$i2gmR*)ｩ ・ 1a"QdEq YO U?UQvCb|mi'!Yg<Ri%Shu h=!%ー ugm2mUQj;Y!EZ ・<geNry$S!x\Wow%JC9,<C2!497GQ6;1"!0Q
12 '?P3%U.CDK&5g!U U$*-4YY}So#9""Atsu".Echi

The result of doing this is mandelbrot.gcc.opt.jpg It became 253. (253-9) / 100 = 2.44 is the approximate time for each session. What was 2.84 before manual optimization is now 2.44, which is about 16.4% faster. I think the effect was great for the optimization performed in a limited range.

Try to make small changes to the logic

In the previous C program, the address of VRAM is calculated for each dot drawn. There are 4 dots in one address of VRAM, but it is useless to calculate the same address 4 times, so let's think about improvement. This point should be mitigated by calculating the four dots that exist at the same VRAM address in succession. By taking advantage of the continuous VRAM address, the VRAM address calculation itself can be omitted. The following is an application of this to the C program.

`mandelbrot.c`


#include <stdint.h>

int mandelbrot(int16_t arg, uint8_t* ram, const uint8_t* font)
{
    (void)arg; (void)font;
    int n = 32;
    int w = 64, h = 48;
    int a = 2 << 6, b = a * a, c = (3 << 12) / w;
    uint8_t* z = ram + 0x900;
    int v = (-h / 2) * c;
    for (int y = 0; y < h; y += 2) {
        int u = -((w * 3) / 4) * c;
        for (int x = 0; x < w; x += 2) {
            #define mandelbrot2(u, v) ({ \
                int p = u; \
                int q = v; \
                int a = 0; \
                int i = 0; \
                do { \
                    int r = p / 64; p = r * r; \
                    int s = q / 64; q = s * s; \
                    if (p + q < 0 || p + q > b) { \
                        a = 1; \
                        break; \
                    } \
                    p = p - q + u; \
                    q = 2 * r * s + v; \
                } while (++i < n); \
                (a); \
            })
            int k = 0x80;
            if (mandelbrot2(u, v)) {
                k += 1;
            }
            v += c;
            if (mandelbrot2(u, v)) {
                k += 4;
            }
            u += c;
            if (mandelbrot2(u, v)) {
                k += 8;
            }
            v -= c;
            if (mandelbrot2(u, v)) {
                k += 2;
            }
            *z++ = k;
            u += c;
        }
        v += 2 * c;
    }
    return 0;
}

The VRAM address is set first and only incremented every 4 dots written. The following is a compilation of this with gcc and a BASIC program.

`mandelbrot.gcc.2.bas`


1 ' Mandelbrot set(gcc)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'i)¡,KV-eD8 ㆍ S9E"&a3eXc 㽦 CA!-ngl@j45W!glCH)"--Ao!E UR!apDM)"p"3!Fq 㽹 Bs*)"=&a.y ｧ!I!UU(iOCAED>q}ea gg22-C/!'a%O.Ud{g65U!l ー
11 '!I.?Ke#DD2x5U,-ｦ D-A#"!0 "`a0U=EU"=CD7;1!I+U:E_OCA*Oh(Sa G9 "eTR"*ｧ!;!?IVs/;9v2q#PU$KC\qB/UgOh2'Cormorant"!fa<㽩;9 o!,!!Chi#!!dB ･ DA*Set 3:
12 'Rr!#Ka 4Bi,C o#!eaAg;ReAD<)aA"q"#dN1 erfN.!3 s A;{8l~%)Uu U.qC)%E{9 o!z2j ｧ V-8!"So%!"'dega4 su F*<-a&Ka Wd18 ー&ｧ%"*!b-U ｦ)agW2!a$A#&'{
13 'B'D+{;!Fcaa VUP8[#2IS;Ugo 2)EQ#X"<'o!*m"4 ")Tuka Q$?B!;'(=$<1-":?!g|!H)㽬"0ceNaT5I$Ao mosquito a%!>a shi]i2t5V*.LD]8|+:Mg A Z#;:.ZDG)ｩ p>Mg
14 'xZ service%Q$~ﾂ ﾂ a<p

The result of execution is mandelbrot.gcc.2.jpg (269-9) / 100 = 2.60, which is an improvement of about 9% compared to 2.84 before changing the logic. It's not as good as the manual optimization above, but I think it's a small effect as an easy method. By the way, if you build with clang, the program will

`mandelbrot.clang.2.bas`


1 ' Mandelbrot set(clang)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'Sa NR\!Ea!"#$!"&Mk"EA*)CFa:"O FCI$1"GAj)7A"1A!Qua&>%GB:1z.i,"""a･#2&2kE@aP!(-$)Aemg\1 ke"e%g+p a)Aw"49a'q, kR'"!#45),{=a)ｧ#B 0W!
11 '3cTA]Set A8a$c"e%C+D5i`!¡!/9'1b*"-aB, d!u chiso c&W":IS3.4)W!!0:D'.r,!|Ea]!B!'d%ｩ 9$!!m'i!Different "]ﾂ A8a$c"e1CCD5i`a¡!/='1r*B-aB, ｦ!u ｩ?c
12 '&W!KiQ3.:IWa!0:Dg/,,!~Ea]!b!'d"ｧ ”#Q;[)n[#Sa!Yq9c%EuTV(q,"",!･!2"2kE;eS$4ﾂ"1uｱSc&Ke!ElER..9)U!!2:B'/,,!~Aka 18!%A"ｩ A597Q"
13 'Y#TN1*:eh)"･!s!-;=Wee!:"2a*mEY"m)'$"E"{L Shi "#!O, U.)"(\ﾂ ﾂ!X@Tasso d Tata`aA!!!

And the result of execution is mandelbrot.clang.2.jpg The result was (294-9) / 100 = 2.85. It's better than the previous example 2.98 in clang, but it's about 4% better, not as much as in gcc.

assembler

I wrote the program in assembler based on the experience of manual optimization and logic change.

`mandelbrot.s`


	.syntax unified
	.cpu cortex-m0
	.thumb

	.macro	mandelbrot2 bit
	mov	r2, r8
	mov	r3, r9
	movs	r4, #n
0:
	asrs	r0, r2, #6
	lsrs	r0, r0, #26
	adds	r2, r2, r0
	asrs	r2, r2, #6
	asrs	r0, r3, #6
	lsrs	r0, r0, #26
	adds	r3, r3, r0
	asrs	r3, r3, #6
	movs	r1, r2
	muls	r1, r3
	muls	r2, r2
	muls	r3, r3
	adds	r0, r2, r3
	cmp	r0, r10
	bgt	9f
	subs	r2, r2, r3
	add	r2, r8
	lsls	r3, r1, #1
	add	r3, r9
	subs	r4, #1
	bne	0b
	subs	r5, #\bit
9:
	.endm

	.text
	.align	1
	.global	mandelbrot
mandelbrot:
	.equiv	vramoffset, 0x900
	.equiv	n, 32
	.equiv	w, 64
	.equiv	h, 48
	.equiv	a, 2<<6
	.equiv	b, a*a
	.equiv	c, (3<<12)/w
	push	{r4, r5, r6, r7, lr}
	mov	r3, r8
	mov	r4, r9
	mov	r5, r10
	mov	r6, r11
	mov	r7, r12
	push	{r3, r4, r5, r6, r7}

	movs	r0, #vramoffset>>4
	lsls	r0, r0, #4
	adds	r6, r0, r1

	movs	r0, #c
	mov	r12, r0
	rsbs	r0, #0
	mov	r14, r0

	movs	r0, #b>>12
	lsls	r0, r0, #12
	mov	r10, r0

	ldr	r7, uend
	ldr	r0, vend
	mov	r11, r0

	ldr	r0, vstart
	mov	r9, r0
yloop:
	ldr	r0, ustart
	mov	r8, r0
xloop:
	movs	r5, #0x8f
	mandelbrot2 1
	add	r9, r12
	mandelbrot2 4
	add	r8, r12
	mandelbrot2 8
	add	r9, r14
	mandelbrot2 2
	add	r8, r12

	strb	r5, [r6]
	adds	r6, #1
	cmp	r7, r8
	bne	xloop

	add	r9, r12
	add	r9, r12
	cmp	r11, r9
	bne	yloop

	pop	{r3, r4, r5, r6, r7}
	mov	r8, r3
	mov	r9, r4
	mov	r10, r5
	mov	r11, r6
	mov	r12, r7
	movs	r0, #0
	pop	{r4, r5, r6, r7, pc}

	.align	2
vstart:
	.word	(-h/2)*c
vend:
	.word	(h/2)*c
ustart:
	.word	-((w*3)/4)*c
uend:
	.word	((w*(4-3))/4)*c

	.end

As a result of trying to make the best use of registers, it is no longer necessary to put data in the stack frame that existed in the output code of the C compiler. The following is a BASIC program.

`mandelbrot.asm.1.bas`


1 ' Mandelbrot set(asm)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'Sa NIUSR-vD8i[<E Ui)!!+9Ra1B)ｧ#+-g#)!!=*-dHq*);9 Soiarh key".03qi ｧ{9aEi%Q!W iR3)ｩ#9!YW9neC1#I'sBw^!, Af$X;Bs*)l!3*Ka vr!p]5C*-l
11 'D)%j!g!/*'3B-"D!('ｦ*yeC!MQE;-(q-5)qW Ka E;B2*Q#N)"?[{1B co aeB2jUR!j1)"", 1 ･ E2m%Q!Wu R<)eA&k.EdNq"cb+'UdE327!Fr!4 LC)?So Ie3:7g
12 '1*3"-!=3-Ec:ag!/.g4R)e!zBuiV{0A9I2A-iUcC3 "%{1#]UAD")VY!G[7Fr ur=5GN,-Ochi 8jc;cgv2wUV-!1]8}#9""Atsu""1!!!A`Tuu!Q!!

The execution result is mandelbrot.asm.1.jpg (215-9) / 100 = 2.06, which is the fastest result so far.

Summary

The table below shows the above results.

Processing system	program	processing time(The unit is about 1/60 seconds)	processing time最短のものと比べての倍率
IchigoJam BASIC	original	31008	15052.43
gcc 5.4.1	Faithful porting from BASIC programs	2.84	1.38
clang 3.9.1	Faithful porting from BASIC programs	2.98	1.45
gcc 5.4.1 +Manual optimization	Faithful porting from BASIC programs	2.44	1.18
gcc 5.4.1	Logic minor changes	2.60	1.26
clang 3.9.1	Logic minor changes	2.85	1.38
GNU Binutils 2.26.2		2.06	1.00

Summary,

Compared to IchigoJam BASIC, machine language programs built with C and assembler are very fast.
Comparing gcc and clang, gcc was faster in this test.
Hand-coded code was better in this test than gcc or clang-generated code

That's it. In thinking about improving the execution speed of the program

Devise logic
Use assembler

Etc. are annoying, but you should decide by considering the balance between the cost and effect required for coding and maintenance of the program. This time it's a complete play, so the cost is out of consideration. If you write the program in a high-level language such as C, it is easy to devise the logic for speeding up, and conversely, once you write it in assembler, it will be difficult to change the logic, so first of all, C etc. Writing in a high-level language and devising logic seems to be an effective means of optimization. The gcc output code I saw this time was clearly useless. I would like to expect further improvements to gcc and clang.

I came up with a further speedup, so I added it

About the division by 64 where the variables r and s are calculated in the C program

`python`


                int r = p / 64;
                int s = q / 64;

In the gcc output code

`python`


.L4:
    asrs    r0, r1, #31
    asrs    r2, r3, #31
    ands    r0, r7
    ands    r2, r7
    adds    r1, r0, r1
    adds    r3, r2, r3
    asrs    r0, r1, #6
    asrs    r3, r3, #6

The sign bit of the variables p and q is extended to 32bit, masked with the value of r7 (63), added to the original value, and 6bit arithmetic right-shifted. Although the assembler version of the program uses registers differently, the idea is the same.

`python`


0:
    asrs    r0, r2, #6
    lsrs    r0, r0, #26
    adds    r2, r2, r0
    asrs    r2, r2, #6
    asrs    r0, r3, #6
    lsrs    r0, r0, #26
    adds    r3, r3, r0
    asrs    r3, r3, #6

This is a correction to a negative value when dividing a signed value by a power of 2 to make it converge to 0 regardless of whether the dividend is positive or negative. Simple arithmetic Using only right shift will converge to 0 for positive values, but will converge to -1 for negative values, so we are taking measures against that. In this C program, the size of the variable has been changed from 16bit to 32bit from the original BASIC program, and since there is a margin in accuracy, the range of values that can be taken for the variables p and q in the program is 15bit on the MSB side. Always have the same value. If you use this, the upper part of the previous assembler program will be

`python`


0:
    lsrs    r0, r2, #26
    adds    r2, r2, r0
    asrs    r2, r2, #6
    lsrs    r0, r3, #26
    adds    r3, r3, r0
    asrs    r3, r3, #6

It is possible to write. The following is the version converted into a BASIC program after making this change.

`text:mandelbrot.asm.2.bas`


1 ' Mandelbrot set(asm)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'Sa NIUSR-vD8i[<E Ui)!!+9Ra1B)ｧ#+-g#)!!=*-`Hp*);9 key iarhUe".03qi ｧ{9aEi$cBeiD9('ｦ*yeC!MQE;-(q-5)qW Ka E;B2*Q#N)"?\;1*Co ceB2jUR!j1
11 '(%d*1gQ/.g4R)e!zBuiV{0A9I2A-iUcC3 "%{1#]UAd")cD3"3A/*'3B-"=<-G ・ 2)"SdJ1 YO U?aQqCb|n1%eFa*UA%Shoc("+2%gFre#Ea=3-Ec:aYW9
12 'neC1#I'sBw^!, Af$X;Bs*)l!3*Ka zr"0]%Bv ""<2 ozwhce ur:UMtD service. G)Shu+;ugzrxeQ!b u. u.)"(\ﾂ ﾂ!%a!!$Zﾂ ﾂa"a!!

Doing this resulted in (198-9) / 100 = 1.89 (image capture omitted). The following is an updated version of the summary table.

Processing system	program	processing time(The unit is about 1/60 seconds)	processing time最短のものと比べての倍率
IchigoJam BASIC	original	31008	16406.35
gcc 5.4.1	Faithful porting from BASIC programs	2.84	1.50
clang 3.9.1	Faithful porting from BASIC programs	2.98	1.58
gcc 5.4.1 +Manual optimization	Faithful porting from BASIC programs	2.44	1.29
gcc 5.4.1	Logic minor changes	2.60	1.38
clang 3.9.1	Logic minor changes	2.85	1.51
GNU Binutils 2.26.2		2.06	1.09
GNU Binutils 2.26.2	Speed up division	1.89	1.00

BASIC and C and assembler speed comparison and optimization play with IchigoJam

Introduction

python

Try gcc

mandelbrot.c

Makefile

hex2apo.pl

mandelbrot.apo

text:mandelbrot.gcc.1.bas

python

python

Try clang

text:mandelbrot.clang.1.bas

Try manual optimization

mandelbrot.s

python

python

python

text:mandelbrot.gcc.opt.bas

Try to make small changes to the logic

mandelbrot.c

mandelbrot.gcc.2.bas

mandelbrot.clang.2.bas

assembler

mandelbrot.s

mandelbrot.asm.1.bas

Summary

I came up with a further speedup, so I added it

python

python

python

python

text:mandelbrot.asm.2.bas

`python`

`mandelbrot.c`

`Makefile`

`hex2apo.pl`

`mandelbrot.apo`

`text:mandelbrot.gcc.1.bas`

`python`

`python`

`text:mandelbrot.clang.1.bas`

`mandelbrot.s`

`python`

`python`

`python`

`text:mandelbrot.gcc.opt.bas`

`mandelbrot.c`

`mandelbrot.gcc.2.bas`

`mandelbrot.clang.2.bas`

`mandelbrot.s`

`mandelbrot.asm.1.bas`

`python`

`python`

`python`

`python`

`text:mandelbrot.asm.2.bas`