I compared the speed of programs created with BASIC, C and assembler with IchigoJam, and tried to optimize for play. The program is a BASIC program that displays Mandelbrot set with IchigoJam [Creative Commons License](https://ja.wikipedia.org/ wiki /% E3% 82% AF% E3% 83% AA% E3% 82% A8% E3% 82% A4% E3% 83% 86% E3% 82% A3% E3% 83% 96% E3% 83% BB % E3% 82% B3% E3% 83% A2% E3% 83% B3% E3% 82% BA% E3% 83% BB% E3% 83% A9% E3% 82% A4% E3% 82% BB% E3 There is a person who is published at% 83% B3% E3% 82% B9), and I am using this because it was convenient for this comparison. The processing system of each language used is
language | Processing system | Edition | Public address |
---|---|---|---|
BASIC | IchigoJam BASIC | ver 1.2.1 for US keyboard | http://ichigojam.net/download.html |
C | gcc | 5.4.1 | https://launchpad.net/gcc-arm-embedded |
C | clang | 3.9.1 | http://releases.llvm.org/download.html |
assembler | GNU Binutils(What comes with gcc) | 2.26.2 | https://launchpad.net/gcc-arm-embedded |
I am using the above. BASIC Enter Original Source in IchigoJam, delete the 20th line, add the following line and execute it.
python
5 CLS:CLT
50 LC0,23:?TICK()
Measures the time from when the screen is erased until the Mandelbrot set is displayed. The time is measured by the function TICK () of IchigoJam BASIC, and the unit is about 1/60 second. The execution result is as follows. The result is 31008, but since the unit is about 1/60 seconds, if you convert this to time, 31008 * 1/60 = 516.8, which means that it took about 8 minutes and 37 seconds. The author of this program is written in Article that "it takes about 15 minutes to draw everything", but at the time this article was written It seems that the execution speed of IchigoJam BASIC has been greatly improved by the version upgrade. C
The following is a port of the above BASIC program to C faithfully except for the size of the variable.
mandelbrot.c
#include <stdint.h>
int mandelbrot(int16_t arg, uint8_t* ram, const uint8_t* font)
{
(void)arg; (void)font;
int n = 32;
int w = 64, h = 48;
int a = 2 << 6, b = a * a, c = (3 << 12) / w;
for (int y = 0; y < h; y++) {
int v = (y - h / 2) * c;
for (int x = 0; x < w; x++) {
int u = (x - (w * 3) / 4) * c;
uint8_t* z = &ram[0x900 + (x / 2) + (y / 2) * 32];
int j = 1 << ((y % 2) * 2 + (x % 2));
int k = j | 0x80 | *z;
*z = k;
int p = u, q = v, i = 0;
do {
int r = p / 64; p = r * r; if (p < 0) goto l40;
int s = q / 64; q = s * s; if (q < 0) goto l40;
if (p + q < 0 || p + q > b) goto l40;
p = p - q + u;
q = 2 * r * s + v;
} while (++i < n);
*z = k ^ j;
l40:;
}
}
return 0;
}
Compile this into machine language with the following Makefile and Perl script, convert it to BASIC comment form, and create mandelbrot.apo. The optimizing option of the compiler specifies `-Ofast'because it emphasizes speed.
Makefile
target: mandelbrot.apo
mandelbrot.s: mandelbrot.c
arm-none-eabi-gcc -Wall -W -mcpu=cortex-m0 -mthumb -fpic -Ofast -S $<
mandelbrot.o: mandelbrot.s
arm-none-eabi-as $< -o$@ -a=$(<:.s=.prn)
mandelbrot.x: mandelbrot.o
arm-none-eabi-ld --entry 0x700 -Ttext 0x700 $< -o $@ -Map $(@:.x=.map)
mandelbrot.mot: mandelbrot.x
arm-none-eabi-objcopy -O srec $< $@
mandelbrot.apo: mandelbrot.mot
perl hex2apo.pl $< > $@
hex2apo.pl
#!/usr/bin/perl
$start = 0xffffffff;
$end = 0x00000000;
while(<>) {
$S = substr($_, 0, 1);
$type = substr($_, 1, 1);
die if ($S != "S" || !($type >= 0 && $type <= 9));
$size = hex(substr($_, 1 + 1, 2));
for ($sum = 0, $i = 0; $i <= $size; $i++) {
$sum += hex(substr($_, 1 + 1 + 2 * $i, 2));
}
die if (($sum & 0xff) != 0xff);
if ($type >= 1 && $type <= 3) {
$size = $size - 1 - ($type + 1);
$address = hex(substr($_, 1 + 1 + 2, 2 * ($type + 1)));
if ($start > $address) {
$start = $address;
}
if ($end < $address + $size - 1) {
$end = $address + $size - 1;
}
for ($i = 0; $i < $size; $i++) {
$mem[$address + $i] = hex(substr($_, 1 + 1 + 2 + 2 * ($type + 1) + 2 * $i, 2));
}
}
}
$line = 10;
for ($i = $start; $i <= $end;) {
printf("%d '", $line);
for ($j = 0; $j < 15; $j++) {
$n = 0;
for ($k = 0; $k < 8; $k++) {
$n = ($n << 8) & 0xff00;
if ($k < 7) {
$n += $mem[$i++];
}
&putc((($n >> ($k + 1)) & 0x7f));
if ($i > $end) {
if ($k < 7) {
&putc((($n << (6 - $k)) & 0x7f));
}
last;
}
}
if ($i > $end) {
last;
}
}
printf("\n");
$line += 1;
}
sub putc {
my ($c) = @_;
$c += 0x21;
if ($c >= 0x7f) {
$c += (0xa1 - 0x7f);
}
printf("%c", $c);
}
mandelbrot.apo
10 'Sa NL Ki Sz-eD4IekP, Um2a#=?"&j"%^y%Chi Hn2aj)%c&n-㽬 9bW")q!J;M!$ja%Z{;ACNrcA-r-#n7 ・';3Ig#GmUEr-{DH)a*-Ga"ゥ g1%f-"23 js#Cga)jUQ'ヲ
11 'D'I#"1UdBra:1#G*C CDA%㽬!!i2iU9%Sae-1 a"Qxq ァ&BZ,![A1D.9/12neA1!Oh!bBwiYOh、KfカXA*|J.9D9Oh、4ァUウ!i-ca eX.)|DQ'ァ"Oer ")seQ7X ”C!GQMm ”L
12 's key A#!?"].picture)K;egvr.,Ju J-"nu'39GウエZ"/Kochi!A`Tuu
The following is a merged version of a BASIC program that reads this, writes it to memory, executes it as a machine language program, and uses TICK () to measure and display the execution time.
text:mandelbrot.gcc.1.bas
1 ' Mandelbrot set(gcc)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:Z=USR(#700,0):LC0,23:?TICK()
10 'Sa NL Ki Sz-eD4IekP, Um2a#=?"&j"%^y%Chi Hn2aj)%c&n-㽬 9bW")q!J;M!$ja%Z{;ACNrcA-r-#n7 ・';3Ig#GmUEr-{DH)a*-Ga"ゥ g1%f-"23 js#Cga)jUQ'ヲ
11 'D'I#"1UdBra:1#G*C CDA%㽬!!i2iU9%Sae-1 a"Qxq ァ&BZ,![A1D.9/12neA1!Oh!bBwiYOh、KfカXA*|J.9D9Oh、4ァUウ!i-ca eX.)|DQ'ァ"Oer ")seQ7X ”C!GQMm ”L
12 's key A#!?"].picture)K;egvr.,Ju J-"nu'39GウエZ"/Kochi!A`Tuu
I entered this into IchigoJam and executed it as follows. The numerical value of the result is too small, and the magnitude of the value after the decimal point truncated at the time of measurement is too large as an error to evaluate the result, so change the program
python
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
Run 100 times and then evaluate the result. It says 293, but remove Z = USR (# 700,0) from the previous line
python
6 CLS:CLT:FORI=1TO100:NEXT:LC0,23:?TICK()
The result of executing only the loop 100 times without executing the machine language part is Since it is 9, (293-9) / 100 = 2.84 is the approximate execution time per execution.
The line calling arm-none-eabi-gcc in the previous Makefile
clang -Wall -W -target arm-eabi -mcpu=cortex-m0 -fpic -Ofast -S $<
You can change the C compiler to use to clang by changing to. The optimization option specifies `-Ofast'as with gcc. The BASIC program that incorporates the code generated by clang is as follows.
text:mandelbrot.clang.1.bas
1 ' Mandelbrot set(clang)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'NRl!'C!1"S!C<!A#E-%:V3a-#s/)^c9&!D o=㽬'*2g ”N!"(153bE!#JB.3!!5a,>Key d-2cIA$mB$!-"'~!jw5<EU#nCD!}{=a%FgU=!c:Bb o 1+u*:%ィN59㽬8j
11 '-!$1Ua=g&H!z i mP<(]ヲ*ye]g\1 So^So>ァ-ャ ・ 4 ・:idq'BZui(1B$\(!"-<Pq i!!KV%-<Dc"Xl|%G!Shi k&!'m!O!m{=A,X-)#Y/-&kGe!A'M?,Oh$~ツツツ「ツセSo"}
12 'Twip Tachi"!!!!
If you do this, the result will be It became 307. , (307-9) / 100 = 2.98 is the approximate execution time per run. Unfortunately for the clang sect, this test is slightly (about 5%) inferior to gcc.
In this Makefile, the assembler source is output once at the time of build, assembled and linked. The contents of the assembler source generated in the case of the previous gcc are as follows.
mandelbrot.s
.syntax unified
.cpu cortex-m0
.fpu softvfp
.eabi_attribute 23, 1
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.thumb
.syntax unified
.file "mandelbrot.c"
.text
.align 2
.global mandelbrot
.code 16
.thumb_func
.type mandelbrot, %function
mandelbrot:
@ args = 0, pretend = 0, frame = 24
@ frame_needed = 0, uses_anonymous_args = 0
push {r4, r5, r6, r7, lr}
mov r7, fp
mov r6, r10
mov r4, r8
mov r5, r9
ldr r3, .L12
push {r4, r5, r6, r7}
mov r8, r3
movs r3, #0
sub sp, sp, #28
str r3, [sp, #20]
movs r3, #128
lsls r3, r3, #7
movs r7, #63
mov r10, r3
str r1, [sp, #16]
.L6:
movs r1, #1
ldr r2, [sp, #20]
ldr r6, .L12+4
asrs r3, r2, #1
lsls r3, r3, #5
ands r1, r2
str r3, [sp, #8]
lsls r3, r1, #1
str r3, [sp, #12]
movs r3, #0
mov fp, r3
.L5:
movs r2, #144
mov r3, fp
lsls r2, r2, #4
mov ip, r2
ldr r2, [sp, #8]
asrs r3, r3, #1
add r3, r3, ip
mov ip, r2
ldr r2, [sp, #16]
add r3, r3, ip
mov ip, r2
mov r2, fp
add ip, ip, r3
movs r3, #1
ands r3, r2
ldr r2, [sp, #12]
movs r1, r6
mov r9, r2
movs r2, #1
add r3, r3, r9
lsls r2, r2, r3
mov r9, r2
movs r2, #128
mov r3, r9
orrs r2, r3
mov r3, ip
ldrb r3, [r3]
movs r4, #32
orrs r2, r3
mov r3, r8
str r2, [sp, #4]
str r3, [sp]
b .L4
.L2:
subs r1, r1, r2
ldr r2, [sp]
lsls r0, r0, #1
mov r8, r2
muls r3, r0
subs r4, r4, #1
adds r1, r6, r1
add r3, r3, r8
cmp r4, #0
beq .L11
.L4:
asrs r0, r1, #31
asrs r2, r3, #31
ands r0, r7
ands r2, r7
adds r1, r0, r1
adds r3, r2, r3
asrs r0, r1, #6
asrs r3, r3, #6
movs r1, r0
movs r2, r3
muls r1, r0
muls r2, r3
adds r5, r1, r2
cmp r5, r10
ble .L2
ldr r3, [sp]
mov r2, sp
mov r8, r3
mov r3, ip
ldrb r2, [r2, #4]
strb r2, [r3]
.L3:
movs r3, #1
mov ip, r3
add fp, fp, ip
mov r3, fp
adds r6, r6, #192
cmp r3, #64
bne .L5
movs r2, #192
mov ip, r2
ldr r3, [sp, #20]
add r8, r8, ip
adds r3, r3, #1
str r3, [sp, #20]
cmp r3, #48
bne .L6
movs r0, #0
add sp, sp, #28
@ sp needed
pop {r2, r3, r4, r5}
mov r8, r2
mov r9, r3
mov r10, r4
mov fp, r5
pop {r4, r5, r6, r7, pc}
.L11:
mov r2, r9
ldr r3, [sp, #4]
eors r3, r2
mov r2, ip
strb r3, [r2]
b .L3
.L13:
.align 2
.L12:
.word -4608
.word -9216
.size mandelbrot, .-mandelbrot
.ident "GCC: (GNU Tools for ARM Embedded Processors) 5.4.1 20160919 (release) [ARM/embedded-5-branch revision 240496]"
Since `-Ofast'is specified in the optimization option, it is assumed that the code that emphasizes execution speed is spit out, but after all it is due to the modern compiler, so there is a wasteful part. One of the theories of speeding up is "focus on the inside of the loop." The C program this time has a triple loop structure, but the innermost loop part
python
int p = u, q = v, i = 0;
do {
int r = p / 64; p = r * r; if (p < 0) goto l40;
int s = q / 64; q = s * s; if (q < 0) goto l40;
if (p + q < 0 || p + q > b) goto l40;
p = p - q + u;
q = 2 * r * s + v;
} while (++i < n);
I will consider speeding up only for. The innermost loop part is the upper assembler source
python
movs r4, #32
orrs r2, r3
mov r3, r8
str r2, [sp, #4]
str r3, [sp]
b .L4
.L2:
subs r1, r1, r2
ldr r2, [sp]
lsls r0, r0, #1
mov r8, r2
muls r3, r0
subs r4, r4, #1
adds r1, r6, r1
add r3, r3, r8
cmp r4, #0
beq .L11
.L4:
asrs r0, r1, #31
asrs r2, r3, #31
ands r0, r7
ands r2, r7
adds r1, r0, r1
adds r3, r2, r3
asrs r0, r1, #6
asrs r3, r3, #6
movs r1, r0
movs r2, r3
muls r1, r0
muls r2, r3
adds r5, r1, r2
cmp r5, r10
ble .L2
It has been converted to the code. In the C source, the process of getting out of the loop with goto when the variables p and q become negative was a countermeasure against the multiplication overflow that can occur because the variable size is 16 bits in the original BASIC program. is. When porting to C, the size of the variable is 32bit, and the judgment of overflow should not be logically necessary, but it is incorporated for the purpose of faithfully porting from the BASIC program. I was surprised that the output code of the above compiler had that part removed cleanly by optimization. [Added 2017.9.3] C language does not consider signed integer overflow, so the code below
int r = p / 64; p = r * r; if (p < 0) goto l40;
int s = q / 64; q = s * s; if (q < 0) goto l40;
Negative results of squared r and s are not considered, and it seems that the optimization removed them as "useless processing".
Besides, as a quick look
I was concerned about the above points. The previous part
python
movs r4, #32
orrs r2, r3
mov r3, r8
str r2, [sp, #4]
str r3, [sp]
movs r7, r3
.L4:
asrs r0, r1, #6
asrs r2, r3, #6
lsrs r0, r0, #26
lsrs r2, r2, #26
adds r1, r0, r1
adds r3, r2, r3
asrs r1, r1, #6
asrs r3, r3, #6
movs r0, r1
muls r0, r3
muls r1, r1
muls r3, r3
adds r5, r1, r3
cmp r5, r10
bgt 1f
subs r1, r1, r3
lsls r3, r0, #1
adds r1, r6, r1
adds r3, r7
subs r4, r4, #1
bne .L4
b .L11
1:
By changing to, these points will be resolved. The following is a build of mandelbrot.s with this modification and converted to a BASIC program.
text:mandelbrot.gcc.opt.bas
1 ' Mandelbrot set(hand optimized)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'Sa NL Ki Sz-eD4I ヲ cP, Um2a#=?"&j"%^y%Chi Hn2aj)%c&n-ji 9bW")q!J;M!$ja%Z{;ACNrcA-r-#n7 ・';3Ig#GmUEr-{DH)a*-Ga"ゥ g1%f-"23 js#Cga)jUQ'ヲ
11 'D'I#"1UdBra:1#G@!C#:qg!/j$i2gmR*)ゥ ・ 1a"QdEq YO U?UQvCb|mi'!Yg<Ri%Shu h=!%ー ugm2mUQj;Y!EZ ・<geNry$S!x\Wow%JC9,<C2!497GQ6;1"!0Q
12 '?P3%U.CDK&5g!U U$*-4YY}So#9""Atsu".Echi
The result of doing this is It became 253. (253-9) / 100 = 2.44 is the approximate time for each session. What was 2.84 before manual optimization is now 2.44, which is about 16.4% faster. I think the effect was great for the optimization performed in a limited range.
In the previous C program, the address of VRAM is calculated for each dot drawn. There are 4 dots in one address of VRAM, but it is useless to calculate the same address 4 times, so let's think about improvement. This point should be mitigated by calculating the four dots that exist at the same VRAM address in succession. By taking advantage of the continuous VRAM address, the VRAM address calculation itself can be omitted. The following is an application of this to the C program.
mandelbrot.c
#include <stdint.h>
int mandelbrot(int16_t arg, uint8_t* ram, const uint8_t* font)
{
(void)arg; (void)font;
int n = 32;
int w = 64, h = 48;
int a = 2 << 6, b = a * a, c = (3 << 12) / w;
uint8_t* z = ram + 0x900;
int v = (-h / 2) * c;
for (int y = 0; y < h; y += 2) {
int u = -((w * 3) / 4) * c;
for (int x = 0; x < w; x += 2) {
#define mandelbrot2(u, v) ({ \
int p = u; \
int q = v; \
int a = 0; \
int i = 0; \
do { \
int r = p / 64; p = r * r; \
int s = q / 64; q = s * s; \
if (p + q < 0 || p + q > b) { \
a = 1; \
break; \
} \
p = p - q + u; \
q = 2 * r * s + v; \
} while (++i < n); \
(a); \
})
int k = 0x80;
if (mandelbrot2(u, v)) {
k += 1;
}
v += c;
if (mandelbrot2(u, v)) {
k += 4;
}
u += c;
if (mandelbrot2(u, v)) {
k += 8;
}
v -= c;
if (mandelbrot2(u, v)) {
k += 2;
}
*z++ = k;
u += c;
}
v += 2 * c;
}
return 0;
}
The VRAM address is set first and only incremented every 4 dots written. The following is a compilation of this with gcc and a BASIC program.
mandelbrot.gcc.2.bas
1 ' Mandelbrot set(gcc)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'i)¡,KV-eD8 ㆍ S9E"&a3eXc 㽦 CA!-ngl@j45W!glCH)"--Ao!E UR!apDM)"p"3!Fq 㽹 Bs*)"=&a.y ァ!I!UU(iOCAED>q}ea gg22-C/!'a%O.Ud{g65U!l ー
11 '!I.?Ke#DD2x5U,-ヲ D-A#"!0 "`a0U=EU"=CD7;1!I+U:E_OCA*Oh(Sa G9 "eTR"*ァ!;!?IVs/;9v2q#PU$KC\qB/UgOh2'Cormorant"!fa<㽩;9 o!,!!Chi#!!dB ・ DA*Set 3:
12 'Rr!#Ka 4Bi,C o#!eaAg;ReAD<)aA"q"#dN1 erfN.!3 s A;{8l~%)Uu U.qC)%E{9 o!z2j ァ V-8!"So%!"'dega4 su F*<-a&Ka Wd18 ー&ァ%"*!b-U ヲ)agW2!a$A#&'{
13 'B'D+{;!Fcaa VUP8[#2IS;Ugo 2)EQ#X"<'o!*m"4 ")Tuka Q$?B!;'(=$<1-":?!g|!H)㽬"0ceNaT5I$Ao mosquito a%!>a shi]i2t5V*.LD]8|+:Mg A Z#;:.ZDG)ゥ p>Mg
14 'xZ service%Q$~ツ ツ a<p
The result of execution is (269-9) / 100 = 2.60, which is an improvement of about 9% compared to 2.84 before changing the logic. It's not as good as the manual optimization above, but I think it's a small effect as an easy method. By the way, if you build with clang, the program will
mandelbrot.clang.2.bas
1 ' Mandelbrot set(clang)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'Sa NR\!Ea!"#$!"&Mk"EA*)CFa:"O FCI$1"GAj)7A"1A!Qua&>%GB:1z.i,"""a・#2&2kE@aP!(-$)Aemg\1 ke"e%g+p a)Aw"49a'q, kR'"!#45),{=a)ァ#B 0W!
11 '3cTA]Set A8a$c"e%C+D5i`!¡!/9'1b*"-aB, d!u chiso c&W":IS3.4)W!!0:D'.r,!|Ea]!B!'d%ゥ 9$!!m'i!Different "]ツ A8a$c"e1CCD5i`a¡!/='1r*B-aB, ヲ!u ゥ?c
12 '&W!KiQ3.:IWa!0:Dg/,,!~Ea]!b!'d"ァ ”#Q;[)n[#Sa!Yq9c%EuTV(q,"",!・!2"2kE;eS$4ツ"1uアSc&Ke!ElER..9)U!!2:B'/,,!~Aka 18!%A"ゥ A597Q"
13 'Y#TN1*:eh)"・!s!-;=Wee!:"2a*mEY"m)'$"E"{L Shi "#!O, U.)"(\ツ ツ!X@Tasso d Tata`aA!!!
And the result of execution is The result was (294-9) / 100 = 2.85. It's better than the previous example 2.98 in clang, but it's about 4% better, not as much as in gcc.
I wrote the program in assembler based on the experience of manual optimization and logic change.
mandelbrot.s
.syntax unified
.cpu cortex-m0
.thumb
.macro mandelbrot2 bit
mov r2, r8
mov r3, r9
movs r4, #n
0:
asrs r0, r2, #6
lsrs r0, r0, #26
adds r2, r2, r0
asrs r2, r2, #6
asrs r0, r3, #6
lsrs r0, r0, #26
adds r3, r3, r0
asrs r3, r3, #6
movs r1, r2
muls r1, r3
muls r2, r2
muls r3, r3
adds r0, r2, r3
cmp r0, r10
bgt 9f
subs r2, r2, r3
add r2, r8
lsls r3, r1, #1
add r3, r9
subs r4, #1
bne 0b
subs r5, #\bit
9:
.endm
.text
.align 1
.global mandelbrot
mandelbrot:
.equiv vramoffset, 0x900
.equiv n, 32
.equiv w, 64
.equiv h, 48
.equiv a, 2<<6
.equiv b, a*a
.equiv c, (3<<12)/w
push {r4, r5, r6, r7, lr}
mov r3, r8
mov r4, r9
mov r5, r10
mov r6, r11
mov r7, r12
push {r3, r4, r5, r6, r7}
movs r0, #vramoffset>>4
lsls r0, r0, #4
adds r6, r0, r1
movs r0, #c
mov r12, r0
rsbs r0, #0
mov r14, r0
movs r0, #b>>12
lsls r0, r0, #12
mov r10, r0
ldr r7, uend
ldr r0, vend
mov r11, r0
ldr r0, vstart
mov r9, r0
yloop:
ldr r0, ustart
mov r8, r0
xloop:
movs r5, #0x8f
mandelbrot2 1
add r9, r12
mandelbrot2 4
add r8, r12
mandelbrot2 8
add r9, r14
mandelbrot2 2
add r8, r12
strb r5, [r6]
adds r6, #1
cmp r7, r8
bne xloop
add r9, r12
add r9, r12
cmp r11, r9
bne yloop
pop {r3, r4, r5, r6, r7}
mov r8, r3
mov r9, r4
mov r10, r5
mov r11, r6
mov r12, r7
movs r0, #0
pop {r4, r5, r6, r7, pc}
.align 2
vstart:
.word (-h/2)*c
vend:
.word (h/2)*c
ustart:
.word -((w*3)/4)*c
uend:
.word ((w*(4-3))/4)*c
.end
As a result of trying to make the best use of registers, it is no longer necessary to put data in the stack frame that existed in the output code of the C compiler. The following is a BASIC program.
mandelbrot.asm.1.bas
1 ' Mandelbrot set(asm)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'Sa NIUSR-vD8i[<E Ui)!!+9Ra1B)ァ#+-g#)!!=*-dHq*);9 Soiarh key".03qi ァ{9aEi%Q!W iR3)ゥ#9!YW9neC1#I'sBw^!, Af$X;Bs*)l!3*Ka vr!p]5C*-l
11 'D)%j!g!/*'3B-"D!('ヲ*yeC!MQE;-(q-5)qW Ka E;B2*Q#N)"?[{1B co aeB2jUR!j1)"", 1 ・ E2m%Q!Wu R<)eA&k.EdNq"cb+'UdE327!Fr!4 LC)?So Ie3:7g
12 '1*3"-!=3-Ec:ag!/.g4R)e!zBuiV{0A9I2A-iUcC3 "%{1#]UAD")VY!G[7Fr ur=5GN,-Ochi 8jc;cgv2wUV-!1]8}#9""Atsu""1!!!A`Tuu!Q!!
The execution result is (215-9) / 100 = 2.06, which is the fastest result so far.
The table below shows the above results.
Processing system | program | processing time(The unit is about 1/60 seconds) | processing time最短のものと比べての倍率 |
---|---|---|---|
IchigoJam BASIC | original | 31008 | 15052.43 |
gcc 5.4.1 | Faithful porting from BASIC programs | 2.84 | 1.38 |
clang 3.9.1 | Faithful porting from BASIC programs | 2.98 | 1.45 |
gcc 5.4.1 +Manual optimization | Faithful porting from BASIC programs | 2.44 | 1.18 |
gcc 5.4.1 | Logic minor changes | 2.60 | 1.26 |
clang 3.9.1 | Logic minor changes | 2.85 | 1.38 |
GNU Binutils 2.26.2 | 2.06 | 1.00 |
Summary,
That's it. In thinking about improving the execution speed of the program
Etc. are annoying, but you should decide by considering the balance between the cost and effect required for coding and maintenance of the program. This time it's a complete play, so the cost is out of consideration. If you write the program in a high-level language such as C, it is easy to devise the logic for speeding up, and conversely, once you write it in assembler, it will be difficult to change the logic, so first of all, C etc. Writing in a high-level language and devising logic seems to be an effective means of optimization. The gcc output code I saw this time was clearly useless. I would like to expect further improvements to gcc and clang.
About the division by 64 where the variables r and s are calculated in the C program
python
int r = p / 64;
int s = q / 64;
In the gcc output code
python
.L4:
asrs r0, r1, #31
asrs r2, r3, #31
ands r0, r7
ands r2, r7
adds r1, r0, r1
adds r3, r2, r3
asrs r0, r1, #6
asrs r3, r3, #6
The sign bit of the variables p and q is extended to 32bit, masked with the value of r7 (63), added to the original value, and 6bit arithmetic right-shifted. Although the assembler version of the program uses registers differently, the idea is the same.
python
0:
asrs r0, r2, #6
lsrs r0, r0, #26
adds r2, r2, r0
asrs r2, r2, #6
asrs r0, r3, #6
lsrs r0, r0, #26
adds r3, r3, r0
asrs r3, r3, #6
This is a correction to a negative value when dividing a signed value by a power of 2 to make it converge to 0 regardless of whether the dividend is positive or negative. Simple arithmetic Using only right shift will converge to 0 for positive values, but will converge to -1 for negative values, so we are taking measures against that. In this C program, the size of the variable has been changed from 16bit to 32bit from the original BASIC program, and since there is a margin in accuracy, the range of values that can be taken for the variables p and q in the program is 15bit on the MSB side. Always have the same value. If you use this, the upper part of the previous assembler program will be
python
0:
lsrs r0, r2, #26
adds r2, r2, r0
asrs r2, r2, #6
lsrs r0, r3, #26
adds r3, r3, r0
asrs r3, r3, #6
It is possible to write. The following is the version converted into a BASIC program after making this change.
text:mandelbrot.asm.2.bas
1 ' Mandelbrot set(asm)
2 Z=#C03:Y=#700
3 X=0
4 IFPEEK(Z)=39X=X-1:W=33-PEEK(Z-X):IFW<1V=V<<7-W+W/96*34:POKEY,V>>(X&7):Y=Y+(X&7<7):GOTO4
5 Z=Z+PEEK(Z-1)+4:IFPEEK(Z-3)+PEEK(Z-2)GOTO3
6 CLS:CLT:FORI=1TO100:Z=USR(#700,0):NEXT:LC0,23:?TICK()
10 'Sa NIUSR-vD8i[<E Ui)!!+9Ra1B)ァ#+-g#)!!=*-`Hp*);9 key iarhUe".03qi ァ{9aEi$cBeiD9('ヲ*yeC!MQE;-(q-5)qW Ka E;B2*Q#N)"?\;1*Co ceB2jUR!j1
11 '(%d*1gQ/.g4R)e!zBuiV{0A9I2A-iUcC3 "%{1#]UAd")cD3"3A/*'3B-"=<-G ・ 2)"SdJ1 YO U?aQqCb|n1%eFa*UA%Shoc("+2%gFre#Ea=3-Ec:aYW9
12 'neC1#I'sBw^!, Af$X;Bs*)l!3*Ka zr"0]%Bv ""<2 ozwhce ur:UMtD service. G)Shu+;ugzrxeQ!b u. u.)"(\ツ ツ!%a!!$Zツ ツa"a!!
Doing this resulted in (198-9) / 100 = 1.89 (image capture omitted). The following is an updated version of the summary table.
Processing system | program | processing time(The unit is about 1/60 seconds) | processing time最短のものと比べての倍率 |
---|---|---|---|
IchigoJam BASIC | original | 31008 | 16406.35 |
gcc 5.4.1 | Faithful porting from BASIC programs | 2.84 | 1.50 |
clang 3.9.1 | Faithful porting from BASIC programs | 2.98 | 1.58 |
gcc 5.4.1 +Manual optimization | Faithful porting from BASIC programs | 2.44 | 1.29 |
gcc 5.4.1 | Logic minor changes | 2.60 | 1.38 |
clang 3.9.1 | Logic minor changes | 2.85 | 1.51 |
GNU Binutils 2.26.2 | 2.06 | 1.09 | |
GNU Binutils 2.26.2 | Speed up division | 1.89 | 1.00 |
Recommended Posts