Yesterday (day 6) was seiketkm's "[I developed a Robophone app that came from the future](http://qiita.com/seiketkm/items/46992f933294a7668dba)". This is the day 7 article of the Tech-Circle Hands on Advent Calendar 2016.
This time, I would like to create an original language using PLY (lex + yacc), which is a Python lexical analysis / parsing library.
Speaking of original languages, TrumpScript, a programming language inspired by Donald Trump, was released a while ago. https://github.com/samshadwell/TrumpScript
TrumpScript has the following features.
...and so on. As you can see, it is a witty language that faithfully reproduces Donald Trump.
So this time, as a counterpart to TrumpScript, I'm going to create "[PPAPScript](https://github.com/sakaro01/PPAPScript.git)".
The specifications that I came up with are like this.
Before implementing PPAPScript, I will explain ply, the library used this time. ply is a Python library that implements lex and yacc in Python and bundles them as a module.
ply can be installed with pip. It also supports Python 3.
$ pip install ply
From here, I will explain the minimum usage in lex.py and yacc.py.
This is an explanation of lex.py, which is responsible for lexical analysis.
import ply.lex as lex
tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
)
There are two ways to define a token's regular expression rule: as a string variable or as a function. In either case, the variable or function is named t_ followed by the token name.
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
When defining a rule as a function, write the regular expression as the function's docstring. The function always receives a LexToken object, which is the token that matched. In the following example, the matched token's value is converted to int.
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t
The special variable t_ignore lists characters to skip. In the example below, spaces and tabs are skipped.
t_ignore = ' \t'
A rule whose name starts with t_ignore_ matches text and then discards it. The example below uses t_ignore_COMMENT to discard comments.
t_ignore_COMMENT = r'\#.*'
The t_error function is called if no lexical rule matches.
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    # skip() takes the number of characters to skip, not the token
    t.lexer.skip(1)
Build the lexer with lex.lex(). This completes the preparation for lexical analysis.
lex.lex()
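To make the lexer rules above concrete, here is a minimal sketch that applies the same patterns using only the standard `re` module. It is not how you would use ply itself, and `TOKEN_SPEC`, `MASTER` and `tokenize` are hypothetical names introduced just for this illustration.

```python
import re

# Same patterns as the t_ rules above, tried in this order.
# COMMENT and SKIP play the roles of t_ignore_COMMENT and t_ignore.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("TIMES", r"\*"),
    ("DIVIDE", r"/"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("COMMENT", r"\#.*"),
    ("SKIP", r"[ \t]+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (type, value) pairs, discarding blanks and comments."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            # Same idea as t_error: report the character and skip it.
            print("Illegal character '%s'" % text[pos])
            pos += 1
            continue
        pos = m.end()
        kind = m.lastgroup
        if kind in ("SKIP", "COMMENT"):
            continue
        # Same conversion as t_NUMBER: numbers become int values.
        value = int(m.group()) if kind == "NUMBER" else m.group()
        tokens.append((kind, value))
    return tokens


print(tokenize("3 + 4 * (2 - 1)  # comment"))
```

The printed list contains only the meaningful tokens; the blanks and the trailing comment are dropped, and each NUMBER carries an int value.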
This is a description of yacc.py, which is responsible for parsing.
import ply.yacc as yacc
The following example defines a syntax rule for addition.
def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]
def p_expression_minus(p):
    'expression : expression MINUS term'
    # nonterminal : nonterminal terminal nonterminal
def p_expression_minus(p):
    'expression : expression MINUS term'
    #   p[0]     :    p[1]     p[2]  p[3]
    p[0] = p[1] - p[3]
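Because a grammar action is an ordinary Python function, the index mapping can be checked without running the parser. The sketch below passes a plain list in place of the object that ply supplies; that list is a stand-in assumed for illustration only, and in real use ply fills `p[1]` onward for you.

```python
def p_expression_minus(p):
    'expression : expression MINUS term'
    #   p[0]     :    p[1]     p[2]  p[3]
    p[0] = p[1] - p[3]


# Stand-in for ply's production object: slot 0 receives the result,
# slots 1..3 hold the values of the right-hand-side symbols.
p = [None, 5, '-', 3]
p_expression_minus(p)
print(p[0])

# The grammar rule is simply the function's docstring.
print(p_expression_minus.__doc__)
```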
def p_statement_assign(p):
    """statement : NAME EQUALS expression"""
    names[p[1]] = p[3]
def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]
Similar syntax rules can be grouped together, as shown below.
def p_expression_binop(p):
    """expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression"""
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]
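Writing every operator as `expression OP expression` makes the grammar ambiguous, so yacc reports shift/reduce conflicts unless it is told how the operators bind. ply reads this from a module-level `precedence` table, listed from lowest to highest priority. A sketch for the four operators used here:

```python
# Lowest precedence first; operators on the same line share a level.
# 'left' means 1 - 2 - 3 is parsed as (1 - 2) - 3.
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)
```

With this table, `1 + 2 * 3` is grouped as `1 + (2 * 3)` because TIMES sits on a later line than PLUS.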
As with lex, the p_error function is called when no syntax rule matches.
def p_error(p):
    print("Syntax error in input")
Create a parser object with yacc.yacc() and parse with parser.parse(). Pass the string you want to parse as an argument.
parser = yacc.yacc()
parser.parse(data)
The implementation will be created based on the README Example in the ply repository. https://github.com/dabeaz/ply/blob/master/README.md
The started flag is managed in the function that calls yacc.parse().
# The started flag is set to True by the "PPAP" command
has_started = False
def parse(data, debug=0):
    global has_started
    if data == "PPAP":
        has_started = True
        print("Started PPAPScript!")
        return
    if has_started:
        return yacc.parse(data, debug=debug)
    else:
        print('PPAPScript run by "PPAP" command.')
        return
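The behavior of this wrapper can be tried without a grammar. In the sketch below, `fake_yacc_parse` is a hypothetical stand-in for `yacc.parse()` that only echoes its input, so the example shows the flag logic and nothing else.

```python
has_started = False


def fake_yacc_parse(data, debug=0):
    """Hypothetical stand-in for yacc.parse(); it only echoes its input."""
    return "parsed: " + data


def parse(data, debug=0):
    global has_started
    if data == "PPAP":
        has_started = True
        print("Started PPAPScript!")
        return None
    if has_started:
        return fake_yacc_parse(data, debug=debug)
    print('PPAPScript run by "PPAP" command.')
    return None


first = parse("Ah! 1")   # refused: the flag is still False
parse("PPAP")            # flips the flag
second = parse("Ah! 1")  # now forwarded to the parser
print(first, second)
```

Input is refused until the exact line `PPAP` has been entered once; after that every line goes to the parser.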
The variable rule (t_NAME) would otherwise match "PPAP" as a name, so add a negative lookahead to its regular expression so that it does not match "PPAP".
def t_NAME(t):
    r"""(?!PPAP)[a-zA-Z_][a-zA-Z0-9_]*"""
    return t
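The effect of the `(?!PPAP)` lookahead can be checked with the standard `re` module alone. This is a small sketch outside of ply; `NAME` is just a local variable holding the same pattern.

```python
import re

NAME = re.compile(r"(?!PPAP)[a-zA-Z_][a-zA-Z0-9_]*")

print(NAME.match("PPAP"))         # None: the lookahead rejects it
print(NAME.match("pen").group())  # pen
print(NAME.match("PPAPpen"))      # None: any name starting with PPAP is rejected
```

Note that the lookahead only looks at the start of the name, so every identifier beginning with `PPAP` is excluded, not just the exact word.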
You could restrict variable names with the lex regular expression alone, but to print a dedicated error message, check the name with the re module inside the rule.
import re
def t_NAME(t):
    r"""(?!PPAP)[a-zA-Z_][a-zA-Z0-9_]*"""
    pattern = re.compile(r'^(apple|pineapple|pen)+', re.IGNORECASE)
    if pattern.match(t.value):
        return t
    else:
        print("This variable name can't be used '%s'.\n "
              "Variable can use 'apple', 'pineapple', 'pen'." % t.value)
        # Returning nothing discards the token; the lexer has already moved past it.
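To see which names this check accepts, the pattern can be tried on its own with `re`. The sample names below are made up for illustration.

```python
import re

pattern = re.compile(r'^(apple|pineapple|pen)+', re.IGNORECASE)

for name in ["apple", "PineapplePen", "pen_1", "banana"]:
    print(name, bool(pattern.match(name)))
```

Because `match()` only anchors at the start, a name such as `pen_1` passes as long as it begins with one of the three words. If a stricter rule were wanted, `re.fullmatch` would be one option, though that is my assumption and not something the original code does.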
Define this rule as a function to give it priority in lexical analysis (lex tries function rules in the order they are defined). In this case, it must be defined before t_NAME.
def t_DECLARE(t):
    r"""I_have_(an|a)"""
    return t
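Why the order matters can be sketched with plain `re`, on the assumption that rule order behaves like the order of alternatives in one combined regular expression. The variable names here are hypothetical and the two compiled patterns differ only in which rule comes first.

```python
import re

DECLARE = r"(?P<DECLARE>I_have_(an|a))"
NAME = r"(?P<NAME>(?!PPAP)[a-zA-Z_][a-zA-Z0-9_]*)"

declare_first = re.compile(DECLARE + "|" + NAME)
name_first = re.compile(NAME + "|" + DECLARE)

text = "I_have_an apple"
print(declare_first.match(text).lastgroup)  # DECLARE
print(name_first.match(text).lastgroup)     # NAME
```

Since `I_have_an` is also a valid identifier, the NAME rule would swallow it if it were tried first, which is why t_DECLARE has to come earlier.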
For this part, both the lex and yacc definitions are ordinary ones.
def t_PRINT(t):
    r"""Ah!"""
    return t
def p_statement_print_expr(p):
    """statement : PRINT expression"""
    print(p[2])
The finished PPAPScript is published in the following repository, so clone it.
$ git clone https://github.com/sakaro01/PPAPScript.git
Install ply.
$ pip install -r requirements.txt
Execute PPAPScript.
$ python ppapscript.py
Let's play interactively. (Currently only interactive)
The next Tech-Circle Hands on Advent Calendar 2016 article will be written by Koga Yuta, who joined the company at the same time as I did. It will probably be about robots. It may be interesting to apply this article to create an original robot command language.