I made a program to compile Brainfuck source code into Java bytecode, so make a note. I will not explain the Java bytecode in detail. Normally, I would use a library such as ASM, but since it's a big deal, I'll handwrite the bytes.
The Java Virtual Machine Specification See Chapter 4 for class files and Chapter 6 for instruction sets.
A Turing-complete programming language with only eight instructions. Not practical but interesting. Roughly speaking, the Brainfuck execution environment has an array of bytes and pointers to its elements, and it feels like operating it with an instruction set. See the link for details.
Since it is not possible to create byte arrays in various ways, first define a convenience method for writing bytes together.
/**
* Helper class to create byte array fluently.
*/
public final class FluentByteWriter {
private final OutputStream os;
public FluentByteWriter(OutputStream os) {
this.os = os;
}
public FluentByteWriter write(Object... bytes) {
try {
for (Object o : bytes) {
//There seems to be some languages that cannot be pattern matched
if (o instanceof Integer) {
os.write((Integer) o); // Note high-order bits are ignored
} else if (o instanceof Byte) {
os.write((Byte) o);
} else if (o instanceof byte[]) {
os.write((byte[]) o);
} else if (o instanceof String) {
os.write(((String) o).getBytes(StandardCharsets.UTF_8));
} else {
throw new UnsupportedOperationException("Unwritable class: " + o.getClass().getCanonicalName());
}
}
} catch (IOException e) {
throw new UncheckedIOException(e);
}
return this;
}
}
It receives an Object with a variable length argument and writes it to an OutputStream according to its type.
ʻOutputStream.write (int)is a method that ignores the upper 24 bits of the argument int and writes (Java int is 4 bytes). Without it, Java hexadecimal literals are signed, so the most significant bit cannot write a valid value. (For example, when you try to write
0xCA`, it is regarded as an int type instead of byte, and it cannot be used in a method that takes byte as an argument).
Next, define a helper that converts int to a 4-byte / 2-byte array and writes it. This also ignores the high-order bits.
public static byte[] toByteArray4(int n) {
return ByteBuffer.allocate(4).putInt(n).array();
}
public static byte[] toByteArray2(short n) {
return ByteBuffer.allocate(2).putShort(n).array();
}
For 4-byte version only, go to Guava [Ints.toByteArray ()](http://google.github.io/guava/releases/snapshot-jre/api/docs/com/google/common/primitives/Ints.html#toByteArray There is -int-), but since a 2-byte version is also required, it is uniquely defined.
(When I think about it now, I'm happy with Shorts.toByteArray ()
)
The specifications of the output bytecode are as follows.
Main
classpublic static void main (String [] args)
method, it can be executed with the java
command.In short, it allows you to execute bytecode converted by java Main
.
0xCA, 0xFE, 0xBA, 0xBE, // CAFEBABE
0x00, 0x00, // miner version: 0
0x00, 0x31, // major version: 49 // Version 49 doesn't require stack map
0x00, 0x20, // constant pool count: 31 + 1
// constant pool
0x07, 0x00, 0x02, // 1. class: Main
0x01, 0x00, 0x04, // 2. utf8
"Main",
0x07, 0x00, 0x04, // 3. class: java/lang/Object
0x01, 0x00, 0x10, // 4. utf8
"java/lang/Object",
// System.out.print
0x09, 0x00, 0x06, 0x00, 0x08, // 5. fieldref System.out
0x07, 0x00, 0x07, // 6. class
0x01, 0x00, 0x10, // 7. utf8
"java/lang/System",
0x0C, 0x00, 0x09, 0x00, 0x0A, // 8. name and type
0x01, 0x00, 0x03, // 9. utf8
"out",
0x01, 0x00, 0x15, // 10. utf8
"Ljava/io/PrintStream;",
0x0A, 0x00, 0x0C, 0x00, 0x0E, // 11. method PrintStream.print(int)
0x07, 0x00, 0x0D, // 12. class
0x01, 0x00, 0x13, // 13. utf8
"java/io/PrintStream",
0x0C, 0x00, 0x0F, 0x00, 0x10, // 14. name and type
0x01, 0x00, 0x05, // 15. utf8
"print",
0x01, 0x00, 0x04, // 16. utf8
"(C)V",
// System.in.read(int)
0x09, 0x00, 0x06, 0x00, 0x12, // 17. fieldref System.in
0x0C, 0x00, 0x13, 0x00, 0x14, // 18. name and type
0x01, 0x00, 0x02, // 19. utf8
"in",
0x01, 0x00, 0x15, // 20. utf8
"Ljava/io/InputStream;",
0x0A, 0x00, 0x16, 0x00, 0x18, // 21. method InputStream.read(int)
0x07, 0x00, 0x17, // 22. class
0x01, 0x00, 0x13, // 23. utf8
"java/io/InputStream",
0x0C, 0x00, 0x19, 0x00, 0x1A, // 24. name and type
0x01, 0x00, 0x04, // 25. utf8
"read",
0x01, 0x00, 0x3, // 26. utf8
"()I",
// main
0x01, 0x00, 0x04, // 27. utf8
"main",
0x01, 0x00, 0x16, // 28. utf8
"([Ljava/lang/String;)V",
0x01, 0x00, 0x04, // 29. utf8
"args",
0x01, 0x00, 0x13, // 30. utf8
"[Ljava/lang/String;",
// "Code" for Attribute
0x01, 0x00, 0x04, // 31. utf8
"Code",
0x00, 0x21, // access_flags: ACC_SUPER ACC_PUBLIC
0x00, 0x01, // this class
0x00, 0x03, // super class
0x00, 0x00, // interfaces count
// interfaces[]
//NOP
0x00, 0x00, // fields count
// fields[]
// NOP
0x00, 0x01, // method count
There is nothing special to mention. Write out ordinary Java bytecode.
It doesn't have <init>
because it doesn't need to be instantiated.
The bytecode version is 49 because it is troublesome to need a stack map from 50.
(See here for stack maps (https://docs.oracle.com/javase/specs/jvms/se9/html/jvms-4.html#jvms-4.10))
// methods[]
// main
0x00, 0x09, // access flags: ACC_PUBLIC ACC_STATIC
0x00, 0x1B, // name index: main
0x00, 0x1C, // descriptor index
0x00, 0x01, // attributes count
// attribute info
0x00, 0x1F // attribute name index: Code
);
byte[] code = compileCodes(is);
w.write(
ByteUtils.toByteArray4(code.length + 12), // attribute length
// info
0x00, 0x04, // max stack
0x00, 0x02, // max locals
// code length
ByteUtils.toByteArray4(code.length),
// code
code,
0x00, 0x00, // exception table length
// exception table
// NOP
0x00, 0x00, // attribute count
// attribute info[]
// NOP
// class attributes count
0x00, 0x00
// attributes
// NOP
The compileCodes ()
method does the actual Brainfuck-Java bytecode conversion.
First, prepare the environment.
w.write(
// creates data buffer
0x11, 0x75, 0x30, // sipush 30000
0xBC, 0x0A, // newarray int
0x4B, // astore_0 // ignore application arguments (String[] args)
// creates instruction pointer
0x03, // iconst_0
0x3C // istore_1
);
w.write(
compileCodeElements(is),
0xB1 // return
);
First make a buffer.
sipush 30000
newarray int
astore_0
What I made a mistake was making an array of ints instead of bytes. I don't care I don't care ...
Then make an instruction pointer.
iconst_0
istore_1
compileCodeElements()
Finally the actual code conversion.
while ((i = is.read()) >= 0) {
switch (i) {
case ('+'):
// ...
break;
case ('-'):
// ...
break;
/*
* ....
It's easy because you don't need a lexer or parser. Read one character directly from the stream and branch the token with the switch.
0x2A, // aload_0
0x1B, // iload_1
0x5C, // dup2
0x2E, // iaload
0x04, // iconst_1
0x60, // iadd
0x4F // iastore
dup2
is an instruction to duplicate the top two on the stack.
Read the array values with ʻiaload` and save it.
All you're doing is just adding one. That's right.
0x2A, // aload_0
0x1B, // iload_1
0x5C, // dup2
0x2E, // iaload
0x02, // iconst_m1
0x60, // iadd
0x4F // iastore
It's almost the same as +
just by adding -1
.
0x84, 0x01, 0x01 // iinc 1 1
<
0x84, 0x01, 0xFF // iinc 1 -1
Local variables can be incremented / decremented with the ʻiinc` instruction, so they can be written with just one instruction.
.
0xB2, 0x00, 0x05, // getstatic System.out
0x2A, // aload_0
0x1B, // iload_1
0x2E, // iaload
0x92, // i2c
0xB6, 0x00, 0x0B // invokevirtual print(C)V
I want to output it as a character string, so convert int to char and then call System.out.print (char)
.
,
0x2A, // aload_0
0x1B, // iload_1
0xB2, 0x00, 0x11, // getstatic System.in
0xB6, 0x00, 0x15, // invokevirtual read()I
0x4F // iastore
System.in.read ()
returns an int, so save it as is in the array.
[
w.write((Object) compileLoop(is));
The loop calls another method. If you call compileLoop ()
and compileElement ()
recursively, you can write the process neatly.
w.write(
// load current pointer value
0x2A, // aload_0
0x1B, // iload_1
0x2E // iaload
);
byte[] insideLoop = compileCodeElements(is);
w.write(
// if the current pointer indicates 0...
// 6 = ifeq(3) + goto(3)
0x99, ByteUtils.toByteArray2(insideLoop.length + 6), // ifeq
insideLoop,
0xA7, ByteUtils.toByteArray2(-(insideLoop.length + 6)) // goto
);
First, read the value of the buffer pointed to by the current pointer.
Check if it is 0 with the ʻifeqinstruction, if it is 0, jump to the end of the loop, otherwise process. At the end of processing,
goto returns to the beginning of the loop. The reason for adding 6 to the code length is to exclude the length of the ʻifeq
and goto
instructions.
At first, I thought that the specified coordinates of the jump destination were absolute, so I was worried that it wouldn't work for some reason. Actually, specify the relative size from your current location. It's much easier to generate code. I was impressed that it was well thought out (small average feeling).
As mentioned above, the loops are processed using mutual recursion, so you can know the jump destination without counting the number of loops like the BF interpreter you often see.
]
case (']'):
// exit loop
return baos.toByteArray();
Exit the process and exit the loop.
If it is called without a matching [
, the process will end.
Since I couldn't find the specification when the match was not found, I regarded it as undefined and made it easier.
It may be kinder to make a compilation error.
TODO
I will do it someday by referring to POSTD article.
+
instruction, you only need to rewrite ʻiconst_1` according to the number of instructions.[-]
etc.) of the article is not so difficult for the bytecode itself. If anything, it is more troublesome to operate the input.Original article (This Qiita article is a refinement of this article) Repository