[JAVA] Kinx Library-Parser Combinator (Extra Edition)

Kinx Library-Parser Combinator (Extra Edition)

Introduction

** "Looks like JavaScript, brain (contents) is Ruby, (stability is AC / DC)" ** Scripting language Kinx ). This time is also a parser combinator.

Not long ago, I ran it with Kinx Library-Parser Combinator (Part 2). What is this time?

Yes, I would like to introduce the interface. I'm sorry, I forgot to do it last time.

Parsek

Instances of the Parsek class have the following methods: The $ in the statement indicates the actual Parsek instance that was instantiated.

Predefined parser

The following predefined parsers already exist in the instance of the Parsek class.

Predefined parser Contents
$.eof A parser that succeeds if it is EOF.
$.any A parser that succeeds with any single character.
$.all A parser that processes everything until the end and succeeds.
$.index Returns the current position ({ offset, line, column}). offset is 0 starting point, line/column is one starting point.
$.letter $.regex(/[a-zA-Z]/)Same as.
$.letters $.regex(/[a-zA-Z]*/)Same as.
$.digit $.regex(/[0-9]/)Same as.
$.digits $.regex(/[0-9]*/)Same as.
$.whitespace $.regex(/\s+/)Same as.
$.optWhitespace $.regex(/\s*/)Same as.
$.cr $.string("\r")Same as.
$.lf $.string("\n")Same as.
$.crlf $.string("\r\n")Same as.
$.newline $.alt($.crlf, $.lf, $.cr)Same as.
$.end $.alt($.newline, $.eof)Same as.

Parser generation function

An instance of the Parsek class has the ability to generate a parser with the following methods:

Parser generation function Contents
$.string(str) strCreates a parser that succeeds if it matches the string specified in.
$.regex(re, groupIndex) reCreates a parser that succeeds if it matches the regular expression specified in. value isgroupIndexContains the capture result of the group specified in.
$.succeed(result) resultTo create a parser that always succeeds.
$.fail(message) messageCreate a parser that fails by leaving a message.
$.oneOf(chars) charsCreate a parser that succeeds if it matches one character contained in.
$.noneOf(chars) charsCreate a parser that succeeds if it does not match one character contained in.
$.sepBy(content, separator) contentA string that can be parsed withseparatorCreate a parser that returns the result divided by the result that matches.
$.sepBy1(content, separator) sepByIs allowed 0,sepBy1Fails if at least one does not match.
$.lazy(description, f) fSpecifies a function that returns a parser to, and creates a parser when it is first used.
$.alt(...parsers) Creates a parser that succeeds if it matches any of the specified parsers. The application is done in the order specified.
$.takeWhile(predicate) The next character ispredicateCreate a parser that will continue to succeed while it is true.
$.seq(...parsers) Creates a parser that succeeds when multiple specified parsers are matched in sequence. Results are returned as an array of all results.
$.seqMap(...parsers, mapf) Creates a parser that succeeds when multiple specified parsers are matched in sequence. The result is an array of all the results, but it returns the value returned by the function specified by mapf.

ParsekParser

The parser generated by the method of the Parsek class is an instance of the ParsekParser class. This class defines methods for chaining parsers.

Execution of perspective

Once all the parsers are ready, run the parse with the following method.

Method Contents
parser.parseAll(target) targetAgainstparserTo parse using.

Parser generation

Parser generation function Contents
parser.desc(description) parserSet the message to be displayed when parsing by is unsuccessful.
parser.or(nextParser) parserIf parsing bynextParserCreate a parser that attempts to.
parser.and(nextParser) parserIf you succeed in parsing bynextParserCreate a parser to execute. The result is returned as an array.
parser.then(nextParser) parserIf you succeed in parsing bynextParserCreate a parser to execute. Result isnextParserReturn the oneparserThe result of is ignored.
parser.skip(nextParser) parserIf you succeed in parsing bynextParserCreate a parser to execute. Result isparserReturn the onenextParserThe result of is ignored.
parser.many() parserCreate a parser that parses the occurrence of 0 or more times.
parser.many1() parserCreate a parser that parses that appears more than once.
parser.times(min, max) parserButminMore than oncemaxCreate a parser that parses the appearance repeatedly no more than once. ..maxIf is omitted, justminSuccess in case of number of times.
parser.sepBy(separator) parserA string that can be parsed withseparatorCreate a parser that returns the result divided by the result that matches.
parser.sepBy1(separator) sepByIs allowed 0,sepBy1Fails if at least one does not match.
parser.map(f) parserUse the result of the above converted by the function f.
parser.chain(f) parserPass the result of to the function f, and connect the parsers to use the new parser returned by f as the next parser.
parser./(nextParser) parser.or(nextParser)Another name for.
parser.+(nextParser) nextParserWithout specifyingparser.+()When used asparser.many1()Another name for.nextParserIf you specifyparser.and(nextParser)Another name for.
parser.*() parser.many()Another name for.

Packrat Parsing

Supports Packrat Parsing.

Parser combinators usually proceed with analysis in the backtrack. So, for example, in the case of $ .alt (a, b, c), if the parser of ʻa fails, you can proceed with b`, but the same parser may be repeated many times along the way.

Therefore, when backtracking, performance improvement can be expected by caching some results and reusing them by using the property that ** if analyzed in the same place and under the same conditions, the same result will be obtained **. .. It's memoization.

Parsek supports this feature. By the way, it can be disabled by specifying {disablePackratParsing: true} in the constructor when creating a Parsek instance.

Let's see its performance with a benchmark. Refer to here to handle the grammar that causes a tremendous number of backtracks. The grammar is as follows.

S <- A
A <- P "+" A / P "-" A / P
P <- "(" A ")" / "1"

This will parse the following string:

(((((((((((((1)))))))))))))

The explanation is easy to understand, so I will quote it as it is.

The essence of this rule and string combination is that it must go through all A and P combinations before being finally parsed. In other words, the worst case of backtrack can be almost reproduced. So if you frankly backtrack, the processing time should increase exponentially with the number of parentheses correspondence.

No, it's amazing.

Now, the benchmark source code. Since the parser itself is the same, the same test is done with the $ D parser of{disablePackratParsing: true}and the $ E parser enabled.

using Parsek;

var $E = new Parsek();
var $D = new Parsek({ disablePackratParsing: true });
var tmr = new SystemTimer();

// rule:
//     S <- A
//     A <- P "+" A / P "-" A / P
//     P <- "(" A ")" / "1"
// input:
//     (((((((((((((1)))))))))))))

function generateParser($) {
    var S, A, P;
    S = $.lazy(&() => A);
    A = $.lazy(&() => $.alt($.seq(P, $.string('+'), A), $.seq(P, $.string('-'), A), P));
    P = $.alt($.seq($.string('('), A, $.string(')')), $.string('1'));
    return S;
}

function test($) {
    for (var i = 0; i <= 10; ++i) {
        tmr.restart();
        var r = generateParser($).parseAll(('(' * i) + '1' + (')' * i));
        var elapsed = tmr.elapsed();
        System.println(["%8.5f" % elapsed, r]);
    }
}

test($E);
test($D);

Let's do it!

If valid


[" 0.00090", {"position":1,"status":1,"value":"1"}]
[" 0.00041", {"position":3,"status":1,"value":["(","1",")"]}]
[" 0.00044", {"position":5,"status":1,"value":["(",["(","1",")"],")"]}]
[" 0.00061", {"position":7,"status":1,"value":["(",["(",["(","1",")"],")"],")"]}]
[" 0.00083", {"position":9,"status":1,"value":["(",["(",["(",["(","1",")"],")"],")"],")"]}]
[" 0.00055", {"position":11,"status":1,"value":["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"]}]
[" 0.00061", {"position":13,"status":1,"value":["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"]}]
[" 0.00071", {"position":15,"status":1,"value":["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"]}]
[" 0.00200", {"position":17,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"]}]
[" 0.00101", {"position":19,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"],")"]}]
[" 0.00196", {"position":21,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"],")"],")"]}]

When disabled


[" 0.00022", {"position":1,"status":1,"value":"1"}]
[" 0.00034", {"position":3,"status":1,"value":["(","1",")"]}]
[" 0.00097", {"position":5,"status":1,"value":["(",["(","1",")"],")"]}]
[" 0.00353", {"position":7,"status":1,"value":["(",["(",["(","1",")"],")"],")"]}]
[" 0.01142", {"position":9,"status":1,"value":["(",["(",["(",["(","1",")"],")"],")"],")"]}]
[" 0.02686", {"position":11,"status":1,"value":["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"]}]
[" 0.09525", {"position":13,"status":1,"value":["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"]}]
[" 0.26821", {"position":15,"status":1,"value":["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"]}]
[" 0.72229", {"position":17,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"]}]
[" 2.44629", {"position":19,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"],")"]}]
[" 7.48334", {"position":21,"status":1,"value":["(",["(",["(",["(",["(",["(",["(",["(",["(",["(","1",")"],")"],")"],")"],")"],")"],")"],")"],")"],")"]}]

If Packrat Parsing is enabled, 10 parentheses will complete within 2 milliseconds. On the other hand, when disabled, it takes 7.5 seconds when the last 10 parentheses are reached.

in conclusion

Parser combinator, it's convenient to try this. I haven't used it much, but I think I'll use it a little more in the future. You can easily parse strings that cannot be expressed by regular expressions.

You can use it to create a language processing system. Kinx uses yacc for parsing and handwriting for lexical analysis.

Originally, I like parsing itself and handwritten recursive descent parser, but I use yacc exclusively because it is difficult to change the grammar definition. However, the parser combinator can also perform lexical analysis, which is convenient.

see you!

Recommended Posts

Kinx Library-Parser Combinator (Extra Edition)
Kinx Library-Parser Combinator (Part 1)
Kinx Library-Parser Combinator (Part 2)
Kinx Library-JIT Compiler Library (Extra Edition)
Let's try Zoomdata (extra edition)