Yesterday (day 6) was seiketkm's "[I developed a Robophone app that came from the future](http://qiita.com/seiketkm/items/46992f933294a7668dba)". This is the day 7 article of the Tech-Circle Hands on Advent Calendar 2016.

This time, I would like to create an original language using PLY (lex + yacc), which is a Python lexical analysis / parsing library.

Speaking of original languages, TrumpScript, a programming language inspired by Donald Trump, was released a while ago. https://github.com/samshadwell/TrumpScript

TrumpScript has the following features.

Floating point types cannot be used, only integers. America doesn't do half-hearted things.
Numbers must be greater than 1 million. Anything smaller is insignificant.
import cannot be used. All code must be made in the USA.

And so on. It is a language full of wit that faithfully reproduces Mr. Donald Trump.

Therefore, this time, to rival TrumpScript, I'm going to create "[PPAPScript](https://github.com/sakaro01/PPAPScript.git)".

PPAPScript specification

Always start the program with "PPAP"
Variable names may only be combinations of "pen", "pineapple" and "apple" (case is ignored)
When declaring a variable, always prefix it with "I_have_a" or "I_have_an" (example: I_have_a pen = 10)
The output function is "Ah!" (the groan heard when the items are combined?) (example: Ah! Apple + pen)
The usual four arithmetic operations
The usual comments (# apple + pen)

The specifications that I came up with are like this.

What is ply

Before implementing PPAPScript, I will explain the ply used this time. ply is a Python library that implements lex and yacc in Python and puts them together as a module.

lex: A tool for lexical analysis
yacc: A tool for parsing

Installation

ply can be installed with pip. It also supports Python 3.

$ pip install ply

From here, I will explain the minimum usage in lex.py and yacc.py.

Explanation of lex.py

This is an explanation of lex.py, which is responsible for lexical analysis.

1. Import lex.

import ply.lex as lex

2. Define the names of the tokens you want to recognize as a tuple in a variable called "tokens".

tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
)

3. Define the lexical analysis rules as regular expressions.

There are two ways to define a rule. In both, the variable or function must be named t_ followed by the token name.

Definition of simple lexical analysis rules

t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

When an action is needed during lexical analysis

Write the regular expression as the documentation string on the first line of the function. The function always receives a LexToken object as its argument, which represents the matched token. In the following example, the matched token value is converted to int.

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

4. Skip unnecessary characters.

The special variable t_ignore lists characters to skip. In the example below, spaces and tabs are skipped.

t_ignore = ' \t'

5. Define rules for discarding tokens.

A rule whose name starts with t_ignore_ discards whatever it matches. For example, t_ignore_COMMENT can define the regular expression for comments.

t_ignore_COMMENT = r'\#.*'

6. Define error handling.

The t_error function is called if no lexical match is found.

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    # skip() takes the number of characters to skip, not the token
    t.lexer.skip(1)

7. Build.

Build the lexer with lex.lex(). This completes the preparation for lexical analysis.

lex.lex()
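
To make the steps above more concrete, here is a minimal stdlib-only sketch of what these rules amount to. It does not use ply and is not how ply is implemented internally; the names TOKEN_SPEC, MASTER and tokenize are made up for this illustration. It reuses the same regular expressions, discards whitespace and comments, and skips one character on an illegal input.

```python
import re

# Token rules mirroring the definitions above (names are illustrative only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("TIMES", r"\*"),
    ("DIVIDE", r"/"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("COMMENT", r"\#.*"),  # discarded, in the spirit of t_ignore_COMMENT
    ("SKIP", r"[ \t]+"),   # discarded, in the spirit of t_ignore
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (type, value) pairs; NUMBER values become int."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            # Same idea as t_error: report the character and skip it.
            print("Illegal character '%s'" % text[pos])
            pos += 1
            continue
        pos = m.end()
        kind = m.lastgroup
        if kind in ("COMMENT", "SKIP"):
            continue
        value = int(m.group()) if kind == "NUMBER" else m.group()
        tokens.append((kind, value))
    return tokens


print(tokenize("3 + 4 * (2 - 1)  # apple + pen"))
```

With ply itself, the equivalent result would come from feeding the input to the built lexer and iterating over its tokens.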

Explanation of yacc.py

This is a description of yacc.py, which is responsible for parsing.

1. Import yacc.

import ply.yacc as yacc

Note that the tokens defined for lex must already be loaded at this point, because yacc needs the tokens list.

2. Write the parsing rule.

The following example defines a syntax rule for addition.

def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]

The rules for writing these definitions are as follows.

Function names must start with p_.
The syntax rule is written on the first line of the function as a documentation string.
A syntax rule has the form "non-terminal symbol : sequence of non-terminal and terminal symbols".

def p_expression_minus(p):
    'expression : expression MINUS term'
    # non-terminal : non-terminal terminal non-terminal

The values of the symbols in the syntax rule are passed to the function as a sequence in its argument. The indexes correspond to the symbols in order from the left.
Assigning to p[0] sets the value of the left-hand non-terminal symbol, which is how results are passed up toward the start symbol.

def p_expression_minus(p):
    'expression : expression MINUS term'
    #  p[0]         p[1]     p[2] p[3]
 
    p[0] = p[1] - p[3]
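
The index mapping can be checked without running a parser at all. The following is a small sketch that calls the rule function by hand, with a plain list standing in for the object ply would pass; the list is an assumption made for illustration, not ply's actual type.

```python
def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]


# Stand-in for the argument: slot 0 receives the result,
# slots 1..3 hold the values of "expression", "MINUS" and "term".
p = [None, 10, '-', 4]
p_expression_minus(p)
print(p[0])
```

In a real parse, ply fills slots 1 to 3 itself and reads slot 0 back after the function returns.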

The non-terminal symbol of the function defined first is the start symbol. In the following cases, statement becomes the start symbol.

def p_statement_assign(p):
    """statement : NAME EQUALS expression"""
    names[p[1]] = p[3]


def p_expression_minus(p):
    'expression : expression MINUS term'
 
    p[0] = p[1] - p[3]

3. Combine the syntax rules.

Similar syntax rules can be grouped together, as shown below.

def p_expression_binop(p):
    """expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression"""
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]
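
As a variation on the grouped rule, the if/elif chain could be replaced with a lookup table. This is only a sketch of one possible design, not taken from the ply README; the OPS dictionary is a name introduced here, and the function is exercised by hand with plain lists rather than through a parser.

```python
import operator

# Maps the operator token's value to the function that implements it.
OPS = {
    '+': operator.add,
    '-': operator.sub,
    '*': operator.mul,
    '/': operator.truediv,
}


def p_expression_binop(p):
    """expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression"""
    p[0] = OPS[p[2]](p[1], p[3])


# Call the rule by hand; each list stands in for the parser's argument.
for symbols in ([None, 6, '+', 3], [None, 6, '-', 3],
                [None, 6, '*', 3], [None, 6, '/', 3]):
    p_expression_binop(symbols)
    print(symbols[2], symbols[0])
```

A table keeps the rule body to one line as operators are added, at the cost of a less explicit function.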

4. Define error handling.

As with lex, define an error function. p_error is called when the input matches no syntax rule.

def p_error(p):
    print("Syntax error in input")

5. Parse.

Create a parser object with yacc.yacc() and run it with parser.parse(), passing the string you want to parse as the argument.

parser = yacc.yacc()
parser.parse(data)

Implementing PPAPScript

The implementation is based on the example in the README of the ply repository. https://github.com/dabeaz/ply/blob/master/README.md

Start the program with "PPAP"

A flag is managed in the function that calls yacc.parse().

# The started flag becomes True once the "PPAP" command is entered
has_started = False

def parse(data, debug=0):
    global has_started
    if data == "PPAP":
        has_started = True
        print("Started PPAPScript!")
        return

    if has_started:
        return yacc.parse(data, debug=debug)
    else:
        print('PPAPScript run by "PPAP" command.')
        return
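
The same gating idea can be tried in isolation, without ply. The following sketch keeps the flag on an object instead of in a global; the Session class and its evaluate argument are hypothetical names for this illustration, with evaluate standing in for the real call to yacc.parse. It returns the messages instead of printing them so the behavior is easy to check.

```python
class Session:
    """Refuses input until the "PPAP" command has been entered."""

    def __init__(self, evaluate):
        self.evaluate = evaluate  # hypothetical stand-in for yacc.parse
        self.has_started = False

    def parse(self, data):
        if data == "PPAP":
            self.has_started = True
            return "Started PPAPScript!"
        if not self.has_started:
            return 'PPAPScript run by "PPAP" command.'
        return self.evaluate(data)


session = Session(evaluate=lambda line: "parsed: " + line)
print(session.parse("Ah! pen"))  # rejected, not started yet
print(session.parse("PPAP"))     # starts the session
print(session.parse("Ah! pen"))  # now handed to the evaluator
```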

Because the lexical rule for variable names (t_NAME) would otherwise match "PPAP", add a negative lookahead so that its regular expression does not match "PPAP".

def t_NAME(t):
    r"""(?!PPAP)[a-zA-Z_][a-zA-Z0-9_]*"""
    return t
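
The effect of the (?!PPAP) lookahead can be checked with the re module alone. This is a small sketch using the same regular expression; the sample strings are made up for illustration.

```python
import re

NAME = re.compile(r"(?!PPAP)[a-zA-Z_][a-zA-Z0-9_]*")

for text in ["pen", "PPAP", "PPAPpen", "myPPAP"]:
    m = NAME.match(text)
    print(text, "->", m.group() if m else None)
```

Note that the lookahead only inspects the start of the match, so any name beginning with "PPAP" is rejected, while a name that merely contains it later is still accepted.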

Only the combination of "pen", "pineapple" and "apple" can be used (case is ignored)

You could restrict variable names in the lex regular expression itself, but to issue a dedicated error message, the check is done separately with the re module.

import re

def t_NAME(t):
    r"""(?!PPAP)[a-zA-Z_][a-zA-Z0-9_]*"""
    pattern = re.compile(r'^(apple|pineapple|pen)+', re.IGNORECASE)
    if pattern.match(t.value):
        return t
    else:
        print("This variable name can't be used '%s'.\n "
              "Variable can use 'apple', 'pineapple', 'pen'." % t.value)
        # Returning nothing discards the token; the lexer has already moved past it.
        return None
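
The name check can be explored on its own with the re module. In this sketch, is_allowed uses the same pattern and match() call as above, and is_allowed_strict is a hypothetical stricter variant using fullmatch(), shown only for comparison.

```python
import re

pattern = re.compile(r'^(apple|pineapple|pen)+', re.IGNORECASE)


def is_allowed(name):
    """Same check as in t_NAME: only the start of the name is tested."""
    return pattern.match(name) is not None


def is_allowed_strict(name):
    """Hypothetical variant: the whole name must consist of the three words."""
    return pattern.fullmatch(name) is not None


for name in ["ApplePen", "PINEAPPLE", "banana", "pen_1"]:
    print(name, is_allowed(name), is_allowed_strict(name))
```

Because match() only anchors at the start, a name such as pen_1 passes the original check. Whether that is desirable depends on how strictly the specification is meant to be read.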

Always prefix a variable declaration with "I_have_a" or "I_have_an"

This token is defined as a function so that it takes priority in lexical analysis (lex tries function rules in the order they are defined). In this case, it must be defined before t_NAME.

def t_DECLARE(t):
    r"""I_have_(an|a)"""
    return t
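
Why the order matters can be illustrated with a plain regular expression alternation, where the first alternative that matches wins. This stdlib-only sketch is an analogy for ply's ordering of function rules, not its actual implementation; the helper first_token is a name introduced here, and the inner group is written as non-capturing (?:an|a) to keep the group names simple.

```python
import re

DECLARE = r"(?P<DECLARE>I_have_(?:an|a))"
NAME = r"(?P<NAME>[a-zA-Z_][a-zA-Z0-9_]*)"

declare_first = re.compile(DECLARE + "|" + NAME)
name_first = re.compile(NAME + "|" + DECLARE)


def first_token(regex, text):
    """Return (rule name, matched text) for the token at the start of text."""
    m = regex.match(text)
    return (m.lastgroup, m.group())


print(first_token(declare_first, "I_have_a pen"))
print(first_token(name_first, "I_have_a pen"))
print(first_token(declare_first, "I_have_an apple"))
```

With the NAME rule first, the prefix is swallowed as an ordinary name, which is the situation the definition order is meant to avoid.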

The output function is "Ah!"

Both the lex rule and the yacc rule are ordinary definitions.

def t_PRINT(t):
    r"""Ah!"""
    return t

def p_statement_print_expr(p):
    """statement : PRINT expression"""
    print(p[2])

Executing PPAPScript

The finished product is published in the following repository, so clone it first.

$ git clone https://github.com/sakaro01/PPAPScript.git

Install ply.

$ pip install -r requirements.txt

Execute PPAPScript.

$ python ppapscript.py

Let's play interactively. (Currently only interactive)

Summary

By using ply, I was able to create an original language easily.
It was a little disappointing that I couldn't come up with as many specifications as TrumpScript has. (If you think of some interesting specs, Issues and Pull Requests are welcome.)

Next time

The next Tech-Circle Hands on Advent Calendar 2016 article will be by Koga Yuta, who joined the company at the same time as I did. It will probably be about robots. It may be interesting to apply this article to create an original robot command language.

References

BppLOG: The programming language "TrumpScript" inspired by Donald Trump is too messy http://tkybpp.hatenablog.com/entry/2016/07/26/150000
GitHub: ply https://github.com/dabeaz/ply
TYPEA.INFO: Python PLY http://typea.info/tips/wiki.cgi?page=Python+PLY
OKI software: Algol 60 processing system made with PLY (Python Lex-Yacc) http://www.oki-osk.jp/esc/ply-algol/

I tried to make an original language "PPAP Script" that imaged PPAP (Pen Pineapple Appo Pen) with Python