Brainfuck-> Java bytecode compiler

I made a program to compile Brainfuck source code into Java bytecode, so make a note. I will not explain the Java bytecode in detail. Normally, I would use a library such as ASM, but since it's a big deal, I'll handwrite the bytes.

The Java Virtual Machine Specification See Chapter 4 for class files and Chapter 6 for instruction sets.

What is Brainfuck?

Wikipedia

A Turing-complete programming language with only eight instructions. Not practical but interesting. Roughly speaking, the Brainfuck execution environment has an array of bytes and pointers to its elements, and it feels like operating it with an instruction set. See the link for details.

Preparation

Since it is not possible to create byte arrays in various ways, first define a convenience method for writing bytes together.

/**
 * Helper class to create byte array fluently.
 */
public final class FluentByteWriter {

    private final OutputStream os;

    public FluentByteWriter(OutputStream os) {
        this.os = os;
    }

    public FluentByteWriter write(Object... bytes) {

        try {
            for (Object o : bytes) {
                //There seems to be some languages that cannot be pattern matched
                if (o instanceof Integer) {
                    os.write((Integer) o);  // Note high-order bits are ignored
                } else if (o instanceof Byte) {
                    os.write((Byte) o);
                } else if (o instanceof byte[]) {
                    os.write((byte[]) o);
                } else if (o instanceof String) {
                    os.write(((String) o).getBytes(StandardCharsets.UTF_8));
                } else {
                    throw new UnsupportedOperationException("Unwritable class: " + o.getClass().getCanonicalName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        return this;
    }

}

It receives an Object with a variable length argument and writes it to an OutputStream according to its type. ʻOutputStream.write (int)is a method that ignores the upper 24 bits of the argument int and writes (Java int is 4 bytes). Without it, Java hexadecimal literals are signed, so the most significant bit cannot write a valid value. (For example, when you try to write0xCA`, it is regarded as an int type instead of byte, and it cannot be used in a method that takes byte as an argument).

Next, define a helper that converts int to a 4-byte / 2-byte array and writes it. This also ignores the high-order bits.

public static byte[] toByteArray4(int n) {
    return ByteBuffer.allocate(4).putInt(n).array();
}
public static byte[] toByteArray2(short n) {
    return ByteBuffer.allocate(2).putShort(n).array();
}

For 4-byte version only, go to Guava [Ints.toByteArray ()](http://google.github.io/guava/releases/snapshot-jre/api/docs/com/google/common/primitives/Ints.html#toByteArray There is -int-), but since a 2-byte version is also required, it is uniquely defined. (When I think about it now, I'm happy with Shorts.toByteArray ())

compile

The specifications of the output bytecode are as follows.

Default package Main class
Since it has a public static void main (String [] args) method, it can be executed with the java command.
Brainfuck source code is compiled into the contents of the main method

In short, it allows you to execute bytecode converted by java Main.

Preparation

0xCA, 0xFE, 0xBA, 0xBE, // CAFEBABE
0x00, 0x00,  // miner version: 0
0x00, 0x31,  // major version: 49  // Version 49 doesn't require stack map
0x00, 0x20,  // constant pool count: 31 + 1
// constant pool
0x07, 0x00, 0x02,  // 1. class: Main
0x01, 0x00, 0x04,  // 2. utf8
"Main",
0x07, 0x00, 0x04,  // 3. class: java/lang/Object
0x01, 0x00, 0x10,  // 4. utf8
"java/lang/Object",
// System.out.print
0x09, 0x00, 0x06, 0x00, 0x08,  // 5. fieldref System.out
0x07, 0x00, 0x07,  // 6. class
0x01, 0x00, 0x10,  // 7. utf8
"java/lang/System",
0x0C, 0x00, 0x09, 0x00, 0x0A,  // 8. name and type
0x01, 0x00, 0x03,  // 9. utf8
"out",
0x01, 0x00, 0x15,  // 10. utf8
"Ljava/io/PrintStream;",
0x0A, 0x00, 0x0C, 0x00, 0x0E,  // 11. method PrintStream.print(int)
0x07, 0x00, 0x0D,  // 12. class
0x01, 0x00, 0x13,  // 13. utf8
"java/io/PrintStream",
0x0C, 0x00, 0x0F, 0x00, 0x10,  // 14. name and type
0x01, 0x00, 0x05,  // 15. utf8
"print",
0x01, 0x00, 0x04,  // 16. utf8
"(C)V",
// System.in.read(int)
0x09, 0x00, 0x06, 0x00, 0x12,  // 17. fieldref System.in
0x0C, 0x00, 0x13, 0x00, 0x14,  // 18. name and type
0x01, 0x00, 0x02,  // 19. utf8
"in",
0x01, 0x00, 0x15,  // 20. utf8
"Ljava/io/InputStream;",
0x0A, 0x00, 0x16, 0x00, 0x18,  // 21. method InputStream.read(int)
0x07, 0x00, 0x17,  // 22. class
0x01, 0x00, 0x13,  // 23. utf8
"java/io/InputStream",
0x0C, 0x00, 0x19, 0x00, 0x1A,  // 24. name and type
0x01, 0x00, 0x04,  // 25. utf8
"read",
0x01, 0x00, 0x3,  // 26. utf8
"()I",
// main
0x01, 0x00, 0x04,  // 27. utf8
"main",
0x01, 0x00, 0x16,  // 28. utf8
"([Ljava/lang/String;)V",
0x01, 0x00, 0x04,  // 29. utf8
"args",
0x01, 0x00, 0x13,  // 30. utf8
"[Ljava/lang/String;",
// "Code" for Attribute
0x01, 0x00, 0x04,  // 31. utf8
"Code",
0x00, 0x21,  // access_flags: ACC_SUPER ACC_PUBLIC
0x00, 0x01,  // this class
0x00, 0x03,  // super class
0x00, 0x00,  // interfaces count
// interfaces[]
//NOP
0x00, 0x00,  // fields count
// fields[]
// NOP
0x00, 0x01,  // method count

There is nothing special to mention. Write out ordinary Java bytecode.

It doesn't have <init> because it doesn't need to be instantiated. The bytecode version is 49 because it is troublesome to need a stack map from 50. (See here for stack maps (https://docs.oracle.com/javase/specs/jvms/se9/html/jvms-4.html#jvms-4.10))

        // methods[]
        // main
        0x00, 0x09,  // access flags: ACC_PUBLIC ACC_STATIC
        0x00, 0x1B,  // name index: main
        0x00, 0x1C,  // descriptor index
        0x00, 0x01,  // attributes count
        // attribute info
        0x00, 0x1F  // attribute name index: Code
);
byte[] code = compileCodes(is);
w.write(
        ByteUtils.toByteArray4(code.length + 12),  // attribute length
        // info
        0x00, 0x04,  // max stack
        0x00, 0x02,  // max locals
        // code length
        ByteUtils.toByteArray4(code.length),
        // code
        code,
        0x00, 0x00,  // exception table length
        // exception table
        // NOP
        0x00, 0x00,  // attribute count
        // attribute info[]
        // NOP
        // class attributes count
        0x00, 0x00
        // attributes
        // NOP

The compileCodes () method does the actual Brainfuck-Java bytecode conversion.

Brainfuck-Java bytecode conversion

First, prepare the environment.


w.write(
        // creates data buffer
        0x11, 0x75, 0x30,  // sipush 30000
        0xBC, 0x0A,        // newarray int
        0x4B,              // astore_0  // ignore application arguments (String[] args)
        // creates instruction pointer
        0x03,              // iconst_0
        0x3C               // istore_1
);
w.write(
        compileCodeElements(is),
        0xB1  // return
);

First make a buffer.

sipush 30000
newarray int
astore_0

What I made a mistake was making an array of ints instead of bytes. I don't care I don't care ...

Then make an instruction pointer.

iconst_0
istore_1

compileCodeElements()

Finally the actual code conversion.

while ((i = is.read()) >= 0) {
    switch (i) {
    case ('+'):
        // ...
        break;
    case ('-'):
        // ...
        break;
    /*
     * ....

It's easy because you don't need a lexer or parser. Read one character directly from the stream and branch the token with the switch.

0x2A,  // aload_0
0x1B,  // iload_1
0x5C,  // dup2
0x2E,  // iaload
0x04,  // iconst_1
0x60,  // iadd
0x4F   // iastore

dup2 is an instruction to duplicate the top two on the stack. Read the array values with ʻiaload` and save it.

All you're doing is just adding one. That's right.

0x2A,  // aload_0
0x1B,  // iload_1
0x5C,  // dup2
0x2E,  // iaload
0x02,  // iconst_m1
0x60,  // iadd
0x4F   // iastore

It's almost the same as + just by adding -1.

0x84, 0x01, 0x01  // iinc 1 1

0x84, 0x01, 0xFF  // iinc 1 -1

Local variables can be incremented / decremented with the ʻiinc` instruction, so they can be written with just one instruction.

0xB2, 0x00, 0x05,  // getstatic System.out
0x2A,              // aload_0
0x1B,              // iload_1
0x2E,              // iaload
0x92,              // i2c
0xB6, 0x00, 0x0B   // invokevirtual print(C)V

I want to output it as a character string, so convert int to char and then call System.out.print (char).

0x2A,              // aload_0
0x1B,              // iload_1
0xB2, 0x00, 0x11,  // getstatic System.in
0xB6, 0x00, 0x15,  // invokevirtual read()I
0x4F               // iastore

System.in.read () returns an int, so save it as is in the array.

[

w.write((Object) compileLoop(is));

The loop calls another method. If you call compileLoop () and compileElement () recursively, you can write the process neatly.

w.write(
        // load current pointer value
        0x2A,  // aload_0
        0x1B,  // iload_1
        0x2E   // iaload
);
byte[] insideLoop = compileCodeElements(is);
w.write(
        // if the current pointer indicates 0...
        // 6 = ifeq(3) + goto(3)
        0x99, ByteUtils.toByteArray2(insideLoop.length + 6),    // ifeq
        insideLoop,
        0xA7, ByteUtils.toByteArray2(-(insideLoop.length + 6))  // goto
);

First, read the value of the buffer pointed to by the current pointer. Check if it is 0 with the ʻifeqinstruction, if it is 0, jump to the end of the loop, otherwise process. At the end of processing,goto returns to the beginning of the loop. The reason for adding 6 to the code length is to exclude the length of the ʻifeq and goto instructions.

At first, I thought that the specified coordinates of the jump destination were absolute, so I was worried that it wouldn't work for some reason. Actually, specify the relative size from your current location. It's much easier to generate code. I was impressed that it was well thought out (small average feeling).

As mentioned above, the loops are processed using mutual recursion, so you can know the jump destination without counting the number of loops like the BF interpreter you often see.

]

case (']'):
    // exit loop
    return baos.toByteArray();

Exit the process and exit the loop. If it is called without a matching [, the process will end. Since I couldn't find the specification when the match was not found, I regarded it as undefined and made it easier. It may be kinder to make a compilation error.

optimisation

TODO

I will do it someday by referring to POSTD article.

Repeating the same instruction can be easily optimized. For example, with the + instruction, you only need to rewrite ʻiconst_1` according to the number of instructions.
The loop is fixed in position from the beginning.
It seems that "No. 3" ([-] etc.) of the article is not so difficult for the bytecode itself. If anything, it is more troublesome to operate the input.

Impressions

It was easier than I expected
There was no need to write in Java just because it outputs Java bytecode. Java is a hassle.

bonus

Original article (This Qiita article is a refinement of this article) Repository