At work I needed to convert from one DSL to another, so I decided to write a compiler using pyparsing. First, in order to understand pyparsing, let's implement a parser using PL / 0 as an example.
PL / 0 is a language with a grammar similar to Pascal. It is a very small language specification for educational purposes. Actually, I also used it as a teaching material for compiler training when I was in my third year of college.
Below is a sample from Wikipedia.
ex1.pl0
VAR x, squ;
PROCEDURE square;
BEGIN
squ := x * x
END;
BEGIN
x := 1;
WHILE x <= 10 DO
BEGIN
CALL square;
! squ;
x := x + 1;
END
END.
To write a parser, you need a grammar definition. One of the grammar notations is [BNF](http://en.wikipedia.org/wiki/%E3%83%90%E3%83%83%E3%82%AB%E3%82%B9%E3 % 83% BB% E3% 83% 8A% E3% 82% A6% E3% 82% A2% E8% A8% 98% E6% B3% 95). The PL / 0 (E) BNF below was obtained from the linked Wikipedia. BNF consists of terminal symbols and production rules.
The PL / 0 BNF on Wikipedia did not contain an ident, so I added it.
Reserved words are in lowercase, but you'll need to capitalize to parse the sample above.
pl0_bnf.txt
ident = alpha { alpha | number | '_' } .
program = block "." .
block = [ "const" ident "=" number {"," ident "=" number} ";"]
[ "var" ident {"," ident} ";"]
{ "procedure" ident ";" block ";" } statement .
statement = [ ident ":=" expression | "call" ident |
"begin" statement {";" statement } "end" |
"if" condition "then" statement |
"while" condition "do" statement ].
condition = "odd" expression |
expression ("="|"#"|"<"|"<="|">"|">=") expression .
expression = [ "+"|"-"] term { ("+"|"-") term}.
term = factor {("*"|"/") factor}.
factor = ident | number | "(" expression ")".
It's hard to suddenly write a parser that completely satisfies the grammar. What should i do? It is a good idea to start near the terminal symbol, implement and test each minimum building block, and then combine and stack them.
Specifically, it is a reserved keyword, an identifier, an expression, a statement, a block, a declaration, and a program. Surprisingly, the basics such as variable declaration will be implemented near the end.
** It is very important to make a policy first. ** Implement a production rule If you find an unimplemented production rule while you're doing it, you're likely to get stuck in the depths of nested and recursive references if you implement it and go back and forth.
Let's implement it in the following order.
The reserved words for PL / 0 are const, var, procedure, call, begin, end, if, then, while, do, odd as far as they are read from BNF. pyparsing allows you to declare reserved words with Keyword ()
. Declare caseless reserved words with CaselessKeyword ().
CONST = CaselessKeyword('CONST')
VAR = CaselessKeyword('VAR')
:
Instead of writing, it seems that it is recommended to write as follows. Isn't there a more DRY way of writing ...
from pyparsing import CaselessKeyword, MatchFirst
(CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD) = map(CaselessKeyword,
"CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD".replace(",","").split())
keyword = MatchFirst((CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD))
The result was recognized as CONST regardless of case.
>>> print keyword.parseString('CONST')
['CONST']
>>> print keyword.parseString('const')
['CONST']
The definition of an identifier generally starts with an alphabet, followed by an alphabet or number and _, so we will follow that. Also, reserved words cannot be identifiers, so write ~ keyword
at the beginning.
from pyparsing import Word, alphas, alphanums
ident = ~keyword + Word(alphas, alphanums+"_")
As a result of parsing valid identifiers, invalid identifiers, and reserved words, only valid identifiers succeeded.
>>> print repr(ident.parseString('valid_id'))
(['valid_id'], {})
>>> print repr(ident.parseString('0123bad_id'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pyparsing.py", line 1041, in parseString
raise exc
pyparsing.ParseException: Expected W:(abcd...,abcd...) (at char 0), (line:1, col:1)
>>> print repr(ident.parseString('CONST'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pyparsing.py", line 1041, in parseString
raise exc
pyparsing.ParseException: Found unwanted token, {"CONST" | "VAR" | "PROCEDURE" | "CALL" | "BEGIN" | "END" | "IF" | "THEN" | "WHILE" | "DO" | "ODD"} (at char 0), (line:1, col:1)
The PL / 0 formula is complete with the following BNF:
expression = [ "+"|"-"] term { ("+"|"-") term}.
term = factor {("*"|"/") factor}.
factor = ident | number | "(" expression ")"
The implementation of pyparsing looks like this: When the production rule is recursively referenced, use Forward
to prepare the box first. Use <<
to set the production rule later. Using oneOf means that the operators have the same precedence.
<<Inadvertently with the usual glue=If you use, you will get an error of unknown cause and you will be troubled with debugging. Also<<When using, the right side()Let's wrap it up with.x << y | z
Whenx << (y | z)
は演算子の優先順位に起因して計算結果がこWhenななるため。当然利用したいのは後者です。これもパース時に原因不明のエラーWhenなり、デバッグに悩まされます。pyparsingの仕様なので、そういうものだWhen思って気をつけるしかないです。
number = Regex(r"\d+(\.\d*)?([eE][+-]?\d+)?")
term = Forward()
factor = Forward()
expression = Optional(oneOf("+ -")) + term + ZeroOrMore( oneOf("+ -") + term)
term << factor + ZeroOrMore(oneOf("* /") + factor)
factor << ident | number | "(" + expression + ")"
Let's test it. It seems that it is parsed correctly.
>>> expression.parseString('123')
(['123'], {})
>>> expression.parseString('123+456')
(['123', '+', '456'], {})
>>> expression.parseString('(x+y)*z')
(['(', 'x', '+', 'y', ')', '*', 'z'], {})
PL / 0 has a unary operator called "odd" for some reason. Other than that, it's just ordinary binary operators. pyparsing has a mechanism for generating a conditional expression parser called infixNotation, but here we will implement it faithfully to BNF.
condition = "odd" expression |
expression ("="|"#"|"<"|"<="|">"|">=") expression .
The implementation is as follows:
condition = ODD + expression | expression + oneOf('= # < <= > >=') + expression
test
>>> condition.parseString('odd 1')
(['ODD', '1'], {})
>>> condition.parseString('3 <= 1')
(['3', '<=', '1'], {})
The BNF for assignment is ʻident ":=" expression`. It's easy because both ident and expression are already implemented.
assign_statement = ident + ":=" + expression
Since the same thing is repeated, we will only implement it from here.
call_statement = CALL + ident
Define an IF statement. The statement that represents the whole sentence has not appeared yet, so I will declare it in Forward.
statement = Forward()
if_statement = IF + condition + THEN + statement
while_statement = WHILE + condition + DO + statement
Now that we have all the parts, let's finally define the sentence generation rules. BNF is as follows.
statement = [ ident ":=" expression | "call" ident |
"begin" statement {";" statement } "end" |
"if" condition "then" statement |
"while" condition "do" statement ].
Implement with pyparsing as it is.
statement = Optional(assign_statement
| call_statement
| BEGIN + statement + ZeroOrMore(";" + statement) + END
| if_statement
| while_statement
)
I'm getting tired, so I just implemented it.
const = CONST + ident + "=" + number + ZeroOrMore("," + ident + "=" + number) + ";"
var = VAR + ident + ZeroOrMore("," + ident) + ";"
Focus only on this part of BNF " procedure "ident"; "block"; "
. The outer repetition etc. is performed when the block is implemented.
block = Forward()
procedure = PROCEDURE + ident + ";" + block + ";"
block
block << Optional(const) + Optional(var) + ZeroOrMore(procedure) + statement
program This is the top-level production rule. Thank you for your hard work.
program = block + "."
Let's parse the sample program posted at the beginning.
$ python pl0_parser.py ex1.pl0
Traceback (most recent call last):
File "pl0_parser.py", line 64, in <module>
print program.parseString(txt)
File "/usr/lib/python2.7/dist-packages/pyparsing.py", line 1041, in parseString
raise exc
pyparsing.ParseException: Expected "." (at char 59), (line:8, col:1)
It doesn't pass. Why? Since it says line: 8, col: 1, BEGIN-END cannot be parsed, and it seems that he is expecting a "." At the end of the program. Can't you get a little more kind error around here? I will test it only with the statement parser for confirmation.
>>> statement.parseString('''\
... BEGIN
... x := 1;
... WHILE x <= 10 DO
... BEGIN
... CALL square;
... ! squ;
... x := x + 1;
... END
... END
... ''')
([], {})
The result is empty and certainly fails. Failures that do not crash are cumbersome to debug. If you look closely, you will find an unfamiliar sentence, ! Squ;
. This is an extended PL / 0 syntax and is not defined in the BNF of this implementation. Delete it and run it again.
$ python pl0_parser.py ex1.pl0
['VAR', 'x', ',', 'squ', ';', 'PROCEDURE', 'square', ';', 'BEGIN', 'squ', ':=', 'x', '*', 'x', 'END', ';', 'BEGIN', 'x', ':=', '1', ';', 'WHILE', 'x', '<=', '10', 'DO', 'BEGIN', 'CALL', 'square', ';', 'x', ':=', 'x', '+', '1', ';', 'END', 'END', '.']
It went well!
Next time, I would like to make a syntax tree (AST) in earnest and do code generation.
Finally, I will post the source of the parser that I have implemented so far.
pl0_parser.py
from pyparsing import CaselessKeyword, MatchFirst, Word, alphas, alphanums, Forward, Optional, oneOf, ZeroOrMore, Regex
# 1. reserved keyword
(CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD) = map(CaselessKeyword,
"CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD".replace(",", "").split())
keyword = MatchFirst((CONST, VAR, PROCEDURE, CALL, BEGIN, END, IF, THEN, WHILE, DO, ODD))
# 2. identifier
ident = ~keyword + Word(alphas, alphanums + "_")
# 3. expression
number = Regex(r"\d+(\.\d*)?([eE][+-]?\d+)?")
term = Forward()
factor = Forward()
expression = Optional(oneOf("+ -")) + term + ZeroOrMore(oneOf("+ -") + term)
term << (factor + ZeroOrMore(oneOf("* /") + factor))
factor << (ident | number | "(" + expression + ")")
# 4. condition
condition = ODD + expression | expression + oneOf('= # < <= > >=') + expression
# 5. assignment
assign_statement = ident + ":=" + expression
# 6. call
call_statement = CALL + ident
# 7. if-then
statement = Forward()
if_statement = IF + condition + THEN + statement
# 8. while-do
while_statement = WHILE + condition + DO + statement
# 9. statement
statement << Optional(assign_statement
| call_statement
| BEGIN + statement + ZeroOrMore(";" + statement) + END
| if_statement
| while_statement
)
# 10. const
const = CONST + ident + "=" + number + ZeroOrMore("," + ident + "=" + number) + ";"
# 11. var
var = VAR + ident + ZeroOrMore("," + ident) + ";"
# 12. procedure
block = Forward()
procedure = PROCEDURE + ident + ";" + block + ";"
# 13. block
block << Optional(const) + Optional(var) + ZeroOrMore(procedure) + statement
# 14. program
program = block + "."
if __name__ == '__main__':
import sys
with open(sys.argv[1], 'r') as fp:
txt = fp.read()
print program.parseString(txt)
Recommended Posts