2 Implement simple parsing in Java

Introduction

There is parsing in the process of the compiler implementation. Continuing from Implementing simple lexical analysis in Java, I would like to implement a very primitive parsing in Java.

What you want to do with parsing

Confirm what you want to do in implementation. Suppose that the character string in the program is an arithmetic operation. Parsing analyzes the order in which the four arithmetic operations are calculated. Consider the following formula as an example.

a = 3 + 4 * 5

As you can see at a glance, the calculation is done from the part in parentheses in the order below.

  1. (4 * 5)
  2. (3 + (4 * 5))
  3. (a = (3 + (4 * 5)))

Parsing does this ordering. In an actual program, there are function definitions and calls. For the sake of simplicity, we first aim to be able to analyze the four arithmetic operations and assignments to variables.

How to express ordering

I understand that I want to order. Think about how to express the ordered state in the implementation. Let's start with a part of the example formula, 4 * 5. In Previous article, I decomposed the expression into tokens. When 4 * 5 is decomposed into tokens, it becomes 4, *, and 5. Here, in order to express the state of 4 * 5, we added a field variable so that we can see that we have a 4 token on the left and a 5 token on the right for the*token. I'll give it to you. Specifically, I will add left and right to the implementation of the Token class in the previous article.

Token.java



public class Token {

    public String kind;
    public String value;
    public Token left;
    public Token right;

    @Override
    public String toString() {
        return kind + " \"" + value + "\"";
    }

    public String paren() {
        if (left == null && right == null) {
            return value;
        } else {
            StringBuilder b = new StringBuilder();
            b.append("(");
            if (left != null) {
                b.append(left.paren()).append(" ");
            }
            b.append(value);
            if (right != null) {
                b.append(" ").append(right.paren());
            }
            b.append(")");
            return b.toString();
        }
    }

}

Now you can express the state of 4 * 5 by giving the left of the * token the 4 token and the right the 5 token. Next, the state of 3 + (4 * 5) is expressed. In the same way, let the + token left have the 3 token and the right have the*token. Here the * token has the 4 and 5 tokens in the left and right in the previous description. In other words, it is 4 * 5. In summary, there are + tokens, 3 tokens in left, 4 * 5 tokens in right, and the state of 3 + (4 * 5) can be expressed. On a different note, I also added paren (), which represents the state in parentheses, to the Token class to check the contents.

How to order

Next, let's consider how to do this ordering. In the example expression, three operators appear. There are three: =, +, and *. It is these operators that are the basis of the order. In the order I checked for what I wanted to do in parsing, * was number 1, + was number 2, and = was number 3. Here, the operator that comes first in this order is assigned a numerical value of the degree that comes first. Specifically, assign 60 to*, 50 to +, and 10 to=. Based on this degree, we will put parentheses first from the one with the largest number. Since * is 60 and the degree is the largest, it is bracketed first, = Is 10, which is the smallest degree, so it will be parenthesized at the end.

Try to implement in Java

Move on to implementation. Let's take a look at some of the classes that do parsing.

The first defines how it works with respect to the meaning of the token. degrees defines the degrees of operator order. factorKinds defines the kind of tokens at the left and right ends of the expression, such as the ʻa token and the 3token. binaryKinds defines the kindof the token that comes in the middle of the expression, such as the=token or the+token. rightAssocs is an ordering to = `tokens that needs special treatment and is the definition for that.

Parser.java


public class Parser {

    private Map<String, Integer> degrees;
    private List<String> factorKinds;
    private List<String> binaryKinds;
    private List<String> rightAssocs;

    public Parser() {
        degrees = new HashMap<>();
        degrees.put("*", 60);
        degrees.put("/", 60);
        degrees.put("+", 50);
        degrees.put("-", 50);
        degrees.put("=", 10);
        factorKinds = Arrays.asList(new String[] { "digit", "variable" });
        binaryKinds = Arrays.asList(new String[] { "sign" });
        rightAssocs = Arrays.asList(new String[] { "=" });
    }

This is the initialization part before parsing. You will receive a list of tokens decomposed by lexical analysis. Parse this token list. Since it is convenient to handle the termination in parsing, add the ʻeob token that represents the termination to the end of the token list. Also, set the index ʻi to 0 to refer to the token list from the beginning to the end.

Parser.java


    private List<Token> tokens;
    private int i;
    
    public Parser init(List<Token> tokens) {
        i = 0;
        this.tokens = new ArrayList<Token>(tokens);
        Token eob = new Token();
        eob.kind = "eob";
        eob.value = "(eob)";
        this.tokens.add(eob);
        return this;
    }

It is a function to refer to the token list in order. token () is the token you are currently looking at in the token list. next () returns the token of interest and advances the position of interest to the next.

Parser.java


    private Token token() throws Exception {
        if (tokens.size() <= i) {
            throw new Exception("No more token");
        }
        return tokens.get(i);
    }

    private Token next() throws Exception {
        Token t = token();
        ++i;
        return t;
    }

I will go into the explanation of the part to be analyzed. lead () parses the tokens at the left and right ends of the expression, such as ʻa and 3`, and returns that token.

Parser.java


    private Token lead(Token token) throws Exception {
        if (factorKinds.contains(token.kind)) {
            return token;
        } else {
            throw new Exception("The token cannot place there.");
        }
    }

degree () returns the token priority. bind () allocates the left and right tokens to the operator token and returns the assigned operator token.

Parser.java


    private int degree(Token t) {
        if (degrees.containsKey(t.value)) {
            return degrees.get(t.value);
        } else {
            return 0;
        }
    }

    private Token bind(Token left, Token operator) throws Exception {
        if (binaryKinds.contains(operator.kind)) {
            operator.left = left;
            int leftDegree = degree(operator);
            if (rightAssocs.contains(operator.value)) {
                leftDegree -= 1;
            }
            operator.right = expression(leftDegree);
            return operator;
        } else {
            throw new Exception("The token cannot place there.");
        }
    }

Let's take a closer look at bind (). bind () takes the token to the left of the expression and the operator token of the expression as arguments. bind () first assigns the token to the left to the operator token ʻoperator.left. The token assigned to ʻoperator.right assigns the return value that called ʻexpression (). Let's explain what token ʻexpression () returns. When calling ʻexpression (), pass the precedence of the operator token as an argument. The degree passed is compared in ʻexpression () with the degree of the operator that follows in the token list. If more or less passed, ʻexpression ()returns the token before the operator token that follows. For example, in the formula below, consider the case wherebind ()is used as an argument,left is 6, and ʻoperator is+.

6 + 7 - 8

ʻExpression ()is called with a degree of+of50 and is compared to the subsequent degree of -``50. Since the degree is the same, it returns the token 7 that precedes -. Then it returns to bind ()and7 is assigned to ʻoperator.right. Now bind () returns the+associated with 6 + 7. If leftDegree is small in the degree comparison, it will be explained later.

ʻExpression () uses operator precedence to control token association. ʻExpress () calls lead () and bind () as described above.

Parser.java


    public Token expression(int leftDegree) throws Exception {
        Token left = lead(next());
        int rightDegree = degree(token());
        while (leftDegree < rightDegree) {
            Token operator = next();
            left = bind(left, operator);
            rightDegree = degree(token());
        }
        return left;
    }

Let's take the movement of ʻexpression ()` as an example of some token lists.

If the token list is only ʻa`

The first call to ʻexpression ()callsleftDegree with 0. For ʻa only, a call tolead ()returns ʻa and left is determined by ʻa. The next degree () returns the degree of(eob)added to make it easier to terminate, and rightDegree is 0. leftDegree is 0, rightDegree is 0, while is not established, and left ʻa is returned as it is. In other words, the token list of only ʻa could be parsed.

If the token list is ʻa = 3`

If ʻa = 3, the call to lead () returns ʻa and left is determined by ʻa. The next degree () returns the degree of = and rightDegreeis10. As in the previous case, ʻexpression () holds while when leftDegree is called with 0. Then it calls bind () with ʻaand=as arguments. As explained inbind (), bind () calls ʻexpression () to parse the token on the right side of the expression. When calling ʻexpression ()withbind (), it is possible to parse up to ʻa =, so the only token that remains in the token list is 3. This is the same situation as "when the token list is only ʻa" explained earlier. That is, ʻexpression () called by bind () returns 3, and ʻoperator.right is determined by 3. bind () returns = to the caller's ʻexpression (), The left of the caller's ʻexpression ()is determined by=. The degree ()on the line following thebind ()call returns the degree0 of (eob) , so exit while. With this, ʻexpression () returns the left determined to be=, and the analysis is completed. This explanation is also the explanation when leftDegree is small in the degree comparison that was explained later.

This is the final explanation of the part to be analyzed. block () calls ʻexpression ()until the token list becomes(eob)and adds the parsed result to the list. The analysis result is added toblk` as many as the number of expressions in the token list.

Parser.java


    public List<Token> block() throws Exception {
        List<Token> blk = new ArrayList<Token>();
        while (!token().kind.equals("eob")) {
            blk.add(expression(0));
        }
        return blk;
    }

Using the above implementation, the string that is the example program

a = 3 + 4 * 5

Parses and prints to standard output.

Parser.java


    public static void main(String[] args) throws Exception {
        String text = "a = 3 + 4 * 5";
        List<Token> tokens = new Lexer().init(text).tokenize();
        List<Token> blk = new Parser().init(tokens).block();
        for (Token ast : blk) {
            System.out.println(ast.paren());
        }
        // --> (a = (3 + (4 * 5)))
    }
}

That's all for the implementation. Thank you very much.

in conclusion

The source is available here.

Calc https://github.com/quwahara/Calc/tree/article-2-parser/Calc/src/main/java

There is a continuation article.

** Implement a simple interpreter in Java ** http://qiita.com/quwahara/items/30e93dfd2690913d66c0

Finally, I will give you a summary of the Parser classes.

Parser.java



import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Parser {

    private Map<String, Integer> degrees;
    private List<String> factorKinds;
    private List<String> binaryKinds;
    private List<String> rightAssocs;

    public Parser() {
        degrees = new HashMap<>();
        degrees.put("*", 60);
        degrees.put("/", 60);
        degrees.put("+", 50);
        degrees.put("-", 50);
        degrees.put("=", 10);
        factorKinds = Arrays.asList(new String[] { "digit", "variable" });
        binaryKinds = Arrays.asList(new String[] { "sign" });
        rightAssocs = Arrays.asList(new String[] { "=" });
    }

    private List<Token> tokens;
    private int i;
    
    public Parser init(List<Token> tokens) {
        i = 0;
        this.tokens = new ArrayList<Token>(tokens);
        Token eob = new Token();
        eob.kind = "eob";
        eob.value = "(eob)";
        this.tokens.add(eob);
        return this;
    }

    private Token token() throws Exception {
        if (tokens.size() <= i) {
            throw new Exception("No more token");
        }
        return tokens.get(i);
    }

    private Token next() throws Exception {
        Token t = token();
        ++i;
        return t;
    }

    private Token lead(Token token) throws Exception {
        if (factorKinds.contains(token.kind)) {
            return token;
        } else {
            throw new Exception("The token cannot place there.");
        }
    }

    private int degree(Token t) {
        if (degrees.containsKey(t.value)) {
            return degrees.get(t.value);
        } else {
            return 0;
        }
    }

    private Token bind(Token left, Token operator) throws Exception {
        if (binaryKinds.contains(operator.kind)) {
            operator.left = left;
            int leftDegree = degree(operator);
            if (rightAssocs.contains(operator.value)) {
                leftDegree -= 1;
            }
            operator.right = expression(leftDegree);
            return operator;
        } else {
            throw new Exception("The token cannot place there.");
        }
    }

    public Token expression(int leftDegree) throws Exception {
        Token left = lead(next());
        int rightDegree = degree(token());
        while (leftDegree < rightDegree) {
            Token operator = next();
            left = bind(left, operator);
            rightDegree = degree(token());
        }
        return left;
    }

    public List<Token> block() throws Exception {
        List<Token> blk = new ArrayList<Token>();
        while (!token().kind.equals("eob")) {
            blk.add(expression(0));
        }
        return blk;
    }

    public static void main(String[] args) throws Exception {
        String text = "a = 3 + 4 * 5";
        List<Token> tokens = new Lexer().init(text).tokenize();
        List<Token> blk = new Parser().init(tokens).block();
        for (Token ast : blk) {
            System.out.println(ast.paren());
        }
        // --> (a = (3 + (4 * 5)))
    }
}

Recommended Posts

2 Implement simple parsing in Java
1 Implement simple lexical analysis in Java
Simple htmlspecialchars in Java
Implement two-step verification in Java
Implement Basic authentication in Java
Implement Email Sending in Java
Implement functional quicksort in Java
Implement rm -rf in Java.
Implement XML signature in Java
Implement Table Driven Test in Java 14
Very simple input reception in Java
Implement simple login function in Rails
Implement reCAPTCHA v3 in Java / Spring
Implement PHP implode function in Java
A simple sample callback in Java
Parsing the COTOHA API in Java
Try to implement Yubaba in Java
How to implement Kalman filter in Java
Implement API Gateway Lambda Authorizer in Java Lambda
Partization in Java
Try to implement n-ary addition in Java
Changes in Java 11
Rock-paper-scissors in Java
Implement something like a stack in Java
Pi in Java
FizzBuzz in Java
I tried to implement deep learning in Java
Try using the COTOHA API parsing in Java
[java] sort in list
Read JSON in Java
Interpreter implementation in Java
Make Blackjack in Java
Rock-paper-scissors app in Java
Constraint programming in Java
Put java8 in centos7
NVL-ish guy in Java
Combine arrays in Java
"Hello World" in Java
Comments in Java source
Azure functions in java
Format XML in Java
Boyer-Moore implementation in Java
Hello World in Java
Use OpenCV in Java
Type determination in Java
Various threads in java
Heapsort implementation (in java)
Zabbix API in Java
ASCII art in Java
Compare Lists in Java
POST JSON in Java
Express failure in Java
Implement CustomView in code
Create JSON in Java
Date manipulation in Java 8
What's new in Java 8
Use PreparedStatement in Java
What's new in Java 9,10,11
Parallel execution in Java
Initializing HashMap in Java
Implement markdown in Rails