1 Implement simple lexical analysis in Java

Introduction

There is lexical analysis in the processing process of the compiler implementation. I would like to implement a very primitive lexical analysis in Java.

I was inspired by the Compiler Study Group held on January 28, 2017. Although it is far from the advanced content of the presentation of this study session and I do not have such knowledge, I thought it would be fun if there was an article about the compiler implementation Hello, World !.

What you want to do with lexical analysis

When implementing, let's first confirm what you want to do. Lexical analysis breaks down a programmed string into what are called lexical or tokens. Let's look at an example. There is a programmed string that assigns the result of the addition to a variable.

ans1 = 10 + 20

The purpose of lexical analysis is to decompose this string into five parts: ʻans1, =, 10, + , and 20. The decomposed string is called a lexical or token. In addition, each of these five tokens is a meaningful unit. You can intuitively understand the meaning, but ʻans1 is a variable, = and + are operators, and 10 and 20 are numbers. It is also the purpose of lexical analysis to give such meanings to tokens. In the implementation, "string is ʻans1 and meaning is variable", "string is = `and meaning is operator", etc. It is more convenient to put the strings and their meanings together. Therefore, in the implementation, the character string and the meaning are combined into a token object. To summarize what I want to do with lexical analysis again, "I want to decompose the programmatic character string into a character string and a token that represents the meaning."

How to disassemble

Now that I know what I want to do, I think about how to do it. To disassemble, constrain the token and use it. The restrictions are as follows.

Consider this constraint on variable names, operators, and numbers. Determine the constraint for the first character.

It also shows the restriction that there is no duplication in the first character.

Let's use this constraint to break it down into tokens. Let's look at the previous example again.

ans1 = 10 + 20

Trace the character string in the example program character by character in order from the left. Trace in the order of ʻa ns 1... until the last0. When you trace the first ʻa here, ʻa is an alphabet, so Due to constraints, ʻa is determined to be the first character of the variable name. Since the variable name has been set, the following n s1 is also traced in order as a string of variable name strings. The character after 1 is not part of the variable name, so it can be decomposed into tokens as a string corresponding to the variable name. I will continue to trace. The space before = is not related to any token, so simply skip it. And I meet =. This is also determined as an operator from the constraint, and the operator token can be decomposed. To summarize how to decompose, create a token constraint, trace the character string in the program, and decompose it into tokens with the meaning that meets the constraint of the first character in order to the end.

Try to implement in Java

Move on to implementation. First is the class that represents the token. There are two field variables in the class: a kind that represents the meaning and a value that holds the string. In the class that performs lexical analysis, the character string that is the program is decomposed into this Token. Implemented toString () to check the contents.

Token.java



public class Token {

    public String kind;
    public String value;

    @Override
    public String toString() {
        return kind + " \"" + value + "\"";
    }

}

Let's look at a class that performs lexical analysis. Let's take a partial look. This is the beginning of the process. There is a text field that holds the programmatic string. And there is an i-field that serves as an index for tracing that string. Initialize them with the init method.

Lexer.java



import java.util.ArrayList;
import java.util.List;

public class Lexer {

    private String text;
    private int i;

    public Lexer init(String text) {
        i = 0;
        this.text = text;
        return this;
    }

This is the part of the mechanism that traces the character string. ʻIs EOTmeans that there are no more characters to trace. c ()is a programmatic string, the character at the point of interest while tracing. next ()` returns the character of interest and advances the position of interest to the next.

Lexer.java


    private boolean isEOT() {
        return text.length() <= i;
    }

    private char c() throws Exception {
        if (isEOT()) {
            throw new Exception("No more character");
        }
        return text.charAt(i);
    }

    private char next() throws Exception {
        char c = c();
        ++i;
        return c;
    }

I will explain in order the part that uses the mechanism of tracing the character string. skipSpace () skips characters that are not related to tokens, such as spaces. No matter how many spaces are in the beginning or end of the program string or in the middle of the expression, this is OK.

Lexer.java


    private void skipSpace() throws Exception {
        while (!isEOT() && Character.isWhitespace(c())) {
            next();
        }
    }

This is the part that determines the constraint of the first character. In order, it determines whether it is the first character of the operator, number, and variable name.

Lexer.java


    private boolean isSignStart(char c) {
        return c == '=' || c == '+' || c == '-' || c == '*' || c == '/';
    }

    private boolean isDigitStart(char c) throws Exception {
        return Character.isDigit(c);
    }

    private boolean isVariableStart(char c) throws Exception {
        return Character.isAlphabetic(c);
    }

This is the part that breaks down into tokens. In order, it decomposes into tokens of operators, numbers, and variable names. The position of interest ʻi in the programmed string is by next () `. If you can break it down into tokens, you can probably proceed.

Lexer.java


    private Token sign() throws Exception {
        Token t = new Token();
        t.kind = "sign";
        t.value = Character.toString(next());
        return t;
    }

    private Token digit() throws Exception {
        StringBuilder b = new StringBuilder();
        b.append(next());
        while (!isEOT() && Character.isDigit(c())) {
            b.append(next());
        }
        Token t = new Token();
        t.kind = "digit";
        t.value = b.toString();
        return t;
    }

    private Token variable() throws Exception {
        StringBuilder b = new StringBuilder();
        b.append(next());
        while (!isEOT() && (Character.isAlphabetic(c()) || Character.isDigit(c()))) {
            b.append(next());
        }
        Token t = new Token();
        t.kind = "variable";
        t.value = b.toString();
        return t;
    }

Using the part that judges the constraint of the first character above and the part that decomposes it into tokens, Returns the first token found from the position of interest. First the space is skipped, The character at the position of interest determines the token and decomposes it into tokens.

Lexer.java


    public Token nextToken() throws Exception {
        skipSpace();
        if (isEOT()) {
            return null;
        } else if (isSignStart(c())) {
            return sign();
        } else if (isDigitStart(c())) {
            return digit();
        } else if (isVariableStart(c())) {
            return variable();
        } else {
            throw new Exception("Not a character for tokens");
        }
    }

Use the above nextToken () to decompose all the programmatic strings into Tokens.

Lexer.java


    public List<Token> tokenize() throws Exception {
        List<Token> tokens = new ArrayList<>();
        Token t = nextToken();
        while (t != null) {
            tokens.add(t);
            t = nextToken();
        }
        return tokens;
    }

Using the above implementation, the string that is the example program

 ans1 = 10 + 20 

Is lexically analyzed and printed to standard output.

Lexer.java


    public static void main(String[] args) throws Exception {
        String text = " ans1 = 10 + 20 ";
        List<Token> tokens = new Lexer().init(text).tokenize();
        for (Token token : tokens) {
            System.out.println(token.toString());
        }
        // --> variable "ans1"
        // --> sign "="
        // --> digit "10"
        // --> sign "+"
        // --> digit "20"
    }

}

That's all for the implementation. Thank you very much.

in conclusion

The source is available here.

Calc https://github.com/quwahara/Calc/tree/lexer/Calc/src/main/java

There is a continuation article.

** Implement simple parsing in Java ** http://qiita.com/quwahara/items/9bf468ff4286b28d2a24

Just in case, I'll give you a summary of class Lexer.

Lexer.java



import java.util.ArrayList;
import java.util.List;

public class Lexer {

    private String text;
    private int i;

    public Lexer init(String text) {
        i = 0;
        this.text = text;
        return this;
    }

    private boolean isEOT() {
        return text.length() <= i;
    }

    private char c() throws Exception {
        if (isEOT()) {
            throw new Exception("No more character");
        }
        return text.charAt(i);
    }

    private char next() throws Exception {
        char c = c();
        ++i;
        return c;
    }

    private void skipSpace() throws Exception {
        while (!isEOT() && Character.isWhitespace(c())) {
            next();
        }
    }

    private boolean isSignStart(char c) {
        return c == '=' || c == '+' || c == '-' || c == '*' || c == '/';
    }

    private boolean isDigitStart(char c) throws Exception {
        return Character.isDigit(c);
    }

    private boolean isVariableStart(char c) throws Exception {
        return Character.isAlphabetic(c);
    }

    private Token sign() throws Exception {
        Token t = new Token();
        t.kind = "sign";
        t.value = Character.toString(next());
        return t;
    }

    private Token digit() throws Exception {
        StringBuilder b = new StringBuilder();
        b.append(next());
        while (!isEOT() && Character.isDigit(c())) {
            b.append(next());
        }
        Token t = new Token();
        t.kind = "digit";
        t.value = b.toString();
        return t;
    }

    private Token variable() throws Exception {
        StringBuilder b = new StringBuilder();
        b.append(next());
        while (!isEOT() && (Character.isAlphabetic(c()) || Character.isDigit(c()))) {
            b.append(next());
        }
        Token t = new Token();
        t.kind = "variable";
        t.value = b.toString();
        return t;
    }

    public Token nextToken() throws Exception {
        skipSpace();
        if (isEOT()) {
            return null;
        } else if (isSignStart(c())) {
            return sign();
        } else if (isDigitStart(c())) {
            return digit();
        } else if (isVariableStart(c())) {
            return variable();
        } else {
            throw new Exception("Not a character for tokens");
        }
    }

    public List<Token> tokenize() throws Exception {
        List<Token> tokens = new ArrayList<>();
        Token t = nextToken();
        while (t != null) {
            tokens.add(t);
            t = nextToken();
        }
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        String text = " ans1 = 10 + 20 ";
        List<Token> tokens = new Lexer().init(text).tokenize();
        for (Token token : tokens) {
            System.out.println(token.toString());
        }
        // --> variable ,"ans1"
        // --> sign "="
        // --> digit "10"
        // --> sign "+"
        // --> digit "20"
    }

}

Recommended Posts

1 Implement simple lexical analysis in Java
2 Implement simple parsing in Java
3 Implement a simple interpreter in Java
Creating lexical analysis in Java 8 (Part 2)
Creating lexical analysis in Java 8 (Part 1)
Simple htmlspecialchars in Java
Implement two-step verification in Java
Topic Analysis (LDA) in Java
Implement Basic authentication in Java
Implement math combinations in Java
Implement Email Sending in Java
Implement functional quicksort in Java
Implement rm -rf in Java.
Implement XML signature in Java
Implement Table Driven Test in Java 14
Very simple input reception in Java
Morphological analysis in Java with Kuromoji
Implement simple login function in Rails
Implement reCAPTCHA v3 in Java / Spring
Implement PHP implode function in Java
A simple sample callback in Java
Try to implement Yubaba in Java
How to implement date calculation in Java
Try to implement n-ary addition in Java
How to implement coding conventions in Java
Implement something like a stack in Java
Partization in Java
Changes in Java 11
Rock-paper-scissors in Java
Pi in Java
FizzBuzz in Java
Static code analysis with Checkstyle in Java + Gradle
I tried to implement deep learning in Java
NLP4J [001b] Morphological analysis in Java (using kuromoji)
[java] sort in list
Read JSON in Java
Interpreter implementation in Java
I made a simple calculation problem game in Java
Rock-paper-scissors app in Java
Constraint programming in Java
Put java8 in centos7
NVL-ish guy in Java
Combine arrays in Java
"Hello World" in Java
Callable Interface in Java
I tried to implement Firebase push notification in Java
Comments in Java source
Format XML in Java
[Personal memo] Make a simple deep copy in Java
Boyer-Moore implementation in Java
Hello World in Java
Use OpenCV in Java
webApi memorandum in java
Type determination in Java
Ping commands in Java
Various threads in java
Heapsort implementation (in java)
Zabbix API in Java
ASCII art in Java
Compare Lists in Java
Express failure in Java