Introduction to regular expression processing system

Introduction

This is my second post, @reonyanarticle.

I was late, but I managed to finish writing it, so I will post it.

This time I will write about regular expressions. Regular expressions and natural language processing are inseparably related. There is no time to list when talking about when to use awk or grep to delete unnecessary words that exist in the sentence or to replace them. I wanted to write about such familiar regular expressions, but there are many articles about how to write regular expressions, so it's not interesting to write them. So this time, I'd like to talk a little about how regular expressions check for matching.

As a reminder, I don't talk about regular expression operators, and I use math words a lot. If you are interested, please check it out. In addition, undefined words will appear frequently, but I will gradually improve them.

Regular expression definition

In the book I referred to, "Regular expressions are a kind of" expression ". ] Is written. A regular expression is an "expression" that represents a particular set of strings.

For example, if you use only the letters a and b and only one a appears, there are a, abbbbbb, bbbbab, ab, etc., but "only one a appears" There are an infinite number of "characters". I can't write everything. However, if you use a regular expression, you can write it with one expression, b * ab *. (I recognize that it is a comprehension in a set.)

Difference in processing system

Now that we've briefly touched on the definition of regular expressions, let's look at how regular expression matching is verified. There are two types of processing systems, DFA type and VM type.

--DFA type: Converts a regular expression to a deterministic finite automaton (DFA) for matching. There is a processing system of GNU grep. --VM type: Converts a regular expression to bytecode for matching. There are processing systems for most scripting languages such as Python and JavaScript.

The actual mechanism is as follows, referring to the figure in the bibliography. DFAエンジン.png

(This time, I will not explain the mechanism of VM type matching. Please read the reference and check it!)

Matching is confirmed by constructing a DFA corresponding to a regular expression and checking whether the character string given to the DFA passes or does not pass. This time, I'm going to use Python to create a DFA-type (simple) processing system! To that end, we will start with the definition of DFA that came out earlier.

DFA definitions and examples

Definition of DFA

Deterministic finite automaton quintuplet $ {\ mathcal A} = \ langle Q, \ Sigma, \ delta, s, F \ rangle $. here,

-$ Q $ is a finite set called a state -$ \ Sigma $ is a finite set of characters -$ \ delta: Q \ times \ Sigma \ rightarrow Q $ is a function called a transition function that receives a state and a character and returns the state. -$ s \ in Q $ is called the initial state -$ F \ subset Q $ is called the accepted state.

DFA example

I don't think you can tell what you're saying from the above definition alone, so let's touch on an example. Since it is difficult to handle many characters from the beginning, let's use DFA to write a regular expression called b * ab * that only one a appears, using only a and b as usable characters. ..

Let $ \ Sigma = \ {a, b \} $ and $ Q = \ {0, 1, 2 \} $. Also, set $ s = 0 $ and $ F = \ {1 \} $, and define the transition function as follows.

\delta(0, a) = 1
\delta(0, b) = 0
\delta(1, a) = 2
\delta(1, b) = 1
\delta(2,a) = 2
\delta(2,b) = 2

Now that we have defined a quintuplet, the DFA, $ {\ mathcal A} $, where only one a appears, is as follows.

automata1.dot.png

Did you somehow understand DFA by touching on the definition and example of DFA? If you don't understand, I will write a reference, so you may find out more by reading it.

Matching using DFA

Now that we have defined the DFA, let's define that the DFA accepts strings and see if it actually matches.

Definition of DFA accepting strings

Let DFA be $ {\ mathcal A} = \ langle Q, \ Sigma, \ delta, I, F \ rangle $ and the string be $ w = w_0 \ cdots w_n $$ (w_i \ in \ Sigma) $. At this time, when the state sequence $ \ {r_0, \ ldots, r_n \} \ subset Q $ that satisfies the following three conditions exists, $ {\ mathcal A} $ is said to accept $ w $.

r_0 = s
When $ i <n $, $ \ delta (r_i, w_i) = r_ {i + 1} $
r_n \in F

Example of DFA accepting a string

I don't think it's easy to understand the definition of DFA accepting strings, so let's look at a concrete example. As before, I will touch on the regular expression b * ab *, in which only one a appears. For this regular expression, consider the string ab and the string bbbaba.

For the string ab, consider the state string $ \ {0,1 \} \ subset Q $. Then, $ \ delta (0, a) = 1 $, $ \ delta (1, b) = 1 $, and $ 0 = s $, which satisfies the condition of accepting $ 1 \ in F $. Therefore $ {\ mathcal A} $ accepts ab. This is a DFA match.

Next, regarding bbbaba, no matter what state column is fetched, it will not be accepted by $ {\ mathcal A} $. Consider $ \ {0, 0, 0, 1, 1, 2 \} \ subset Q $, which simply transitions bbbaba. $ \ Delta (0, b) = 0 $, $ \ delta (0, b) = 0 $, $ \ delta (0, b) = 0 $, $ \ delta (0, a) = 1 $, $ \ delta (1, b) = 1 $, $ \ delta (1, a) = 2 $, and $ 2 \ notin F $, so it is not satisfied. So bbbaba will not be accepted. However, the substrings bbbab and bbba are accepted. The subcolumns of the above state column, $ \ {0, 0, 0, 1, 1 \} $, $ \ {0, 0, 0, 1 \} $, each satisfy the conditions. Because. You can use this to define the shortest match as "the smallest string that DFA accepts as a substring of a string".

Now, though concrete, I was able to generate a DFA from the regular expression and use the DFA to see if the string matches the regular expression!

Implementation using Python

Now let's actually implement it using Python. DFA cannot be created without giving a concrete regular expression, so let's use b * ab * again. Actually, it would be nice if DFA could be implemented from scratch, but due to various reasons of the author, I use a module called automata. If you do your best, you can make your own, so please do your best! (I think it's a pain to make a DFA module.)

from automata.fa.dfa import DFA

dfa = DFA(
    states={'0', '1', '2'},
    input_symbols={'a', 'b'},
    transitions={
        '0': {'a': '1', 'b': '0'},
        '1': {'a': '2', 'b': '1'},
        '2': {'a': '2', 'b': '2'}
    },
    initial_state='0',
    final_states={'1'}
)


def is_accepted(words: str):
    if dfa.accepts_input(words):
        print('It will be accepted.')
    else:
        print('Not accepted.')


def check_min_matching(words: str):
    i = 0
    while i < len(words):
        if dfa.accepts_input(words[:(i+1)]):
            print(f'{words[:(i+1)]}Has been accepted.')
            break
        else:
            i = i + 1
    else:
        print(f'{words}Was not accepted.')

if __name__ == "__main__":
    is_accepted('ab')
    is_accepted('bbaba')
    check_min_matching('ab')
    check_min_matching('bbbaba')
    check_min_matching('bbb')

When I run it, the following output result is returned.

It will be accepted.
Not accepted.
a has been accepted.
bbba has been accepted.
bbb was not accepted.

You have successfully expressed b * ab * in DFA! I'll do my best to show you what I'm actually doing, so please be patient.

Finally

This time, only for b * ab *, I was able to create a processing system. I think that you can know more about regular expressions if you know through DFA how you actually check the matching of regular expressions.

By the way, can all the regular expressions be written in DFA for some readers? I think some people think that. Actually, such a theorem exists.

Kleene's theorem

For any language $ L $ on the finite character set $ \ Sigma $, the following three propositions are equivalent.

Language $ L $ can be represented by a regular expression
Language $ L $ can be recognized by automaton

In other words, anything that can be written in a regular expression can be accepted by an automaton, and a language that can be accepted by an automaton can be written in a regular expression. Because of this theorem, we can see that DFA can be created for operations other than *. It's amazing! (Insufficient ability to convey great things ...) Please try to make your own DFA for other operations and regular expressions with more characters! It is fun!

There are lots of interesting stories about theorems, DFA and regular expressions that I couldn't cover in this article! Please take a look at the references.

~~ Tomorrow ~~ Before I published the article on the 24th, @jyami wrote Story of making shell-like with go! Thank you for your hard work!

References

The books I read when learning formal languages in the olden days are:

-Basics of Theory of Computation 1. Automata and Language

Finite Automata

The books I used to study regular expressions this time are as follows. I referred to the figures and explanations on the way. It was a lot of fun to read, so please read it.

-Introduction to Regular Expression Technology-Latest Engine Implementation and Theoretical Background

As an aside, one of the authors of this book, Ryoma Shinya, has posted a lot of slides related to regular expressions and automata on the net, so it's interesting to see them too!