Read the Python-Markdown source: How to create a parser

Purpose

I wanted to learn how a Markdown parser library is implemented (rather abruptly, I admit). Java's DOM parser is said to have an excellent design (design pattern), but I started by looking for a library in Python, the language I am used to.

For reference, this is the Java one: https://www.tutorialspoint.com/java_xml/java_dom_parse_document.htm

This time I will read the source of the following library.

Python-Markdown https://github.com/Python-Markdown

It converts Markdown to HTML. To do that it has to parse the Markdown internally, so let's see how it is designed!

If you find anything wrong with my understanding, please point it out!

Note

The core functionality appears to live under Python-Markdown/markdown/markdown/. To keep things short, files under this directory are referred to by file name only, in the form sample.py.

The source code quoted below is mostly excerpted down to the relevant parts, with some lines omitted and my own notes added as comments.

Spoilers for the main part first

The bottom line: processors that each detect one particular kind of element are applied to the text block by block. A block is a piece of the source text obtained by splitting it on "\n\n". Take this input, for example:

<b>The tag is incomplete, but in bold

Not in bold</b>

When a blank line is inserted like this (that is, "\n\n" appears), the scope of the element is cut off there.

In this example, the text is first split into

["<b>The tag is incomplete, but in bold", "Not in bold</b>"]

Then the processors that detect each element are applied in order to "<b>The tag is incomplete, but in bold", and after that the parser moves on to the next block. I think that is the overall flow.
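
The splitting step can be tried on its own. This is only a minimal sketch of the blank-line split described above, using plain str.split and none of the library's classes:

```python
text = "<b>The tag is incomplete, but in bold\n\nNot in bold</b>"

# A blank line shows up as "\n\n" in the raw text, so splitting on it
# yields one string per block.
blocks = text.split("\n\n")

for index, block in enumerate(blocks):
    print(index, repr(block))
```

The opening <b> and the closing </b> end up in different blocks, which is why a processor that looks at one block at a time cannot pair them.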

Process flow

At the heart of this library's user-facing interface are the Markdown class below and its convert method.

core.py


class Markdown:

    # This is where Markdown is converted to HTML
    def convert(self, source):
        # source: the Markdown text
        ...

There is a comment on the convert method that, paraphrased, describes the following steps.

  1. The preprocessors transform the input text.
  2. The high-level structural elements of the text preprocessed in step 1 are parsed into an ElementTree.
  3. The tree processors process the ElementTree. For example, one of them runs the inline patterns to find inline elements.
  4. Some postprocessors work on the serialized version of the ElementTree.
  5. The result is written to a string.

The original English comment may well be easier to read.
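
To make the five steps concrete, here is a toy pipeline with the same shape. It is only a sketch: toy_convert and everything inside it are hypothetical stand-ins written for this article, not the library's code, and each stage does something deliberately trivial.

```python
import xml.etree.ElementTree as etree


def toy_convert(source):
    # 1. "Preprocessors": operate on a list of lines (here: strip trailing spaces).
    lines = [line.rstrip() for line in source.split("\n")]

    # 2. "Block parser": turn blank-line separated blocks into an element tree.
    root = etree.Element("div")
    for block in "\n".join(lines).split("\n\n"):
        if block.strip():
            paragraph = etree.SubElement(root, "p")
            paragraph.text = block.strip()

    # 3. "Tree processors": modify the finished tree (here: newline after each <p>).
    for paragraph in root.iter("p"):
        paragraph.tail = "\n"

    # 4. Serialize, then "postprocess" the text (here: drop the wrapper tag).
    if len(root) == 0:
        return ""
    serialized = etree.tostring(root, encoding="unicode")
    body = serialized[len("<div>"):-len("</div>")]

    # 5. Return the result as a string.
    return body.strip()


html = toy_convert("Hello  \n\nWorld")
print(html)
```

The point is the division of labor: lines first, then a tree, then text again.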

Step 1 preprocessors

core.py



class Markdown:

    def convert(self, source):
        # ... (omitted)
        self.lines = source.split("\n")
        for prep in self.preprocessors:
            self.lines = prep.run(self.lines)

First, the source is split into lines and passed through the preprocessors. The preprocessors are built by the following function.

preprocessors.py



def build_preprocessors(md, **kwargs):
    """ Build the default set of preprocessors used by Markdown. """
    preprocessors = util.Registry()
    preprocessors.register(NormalizeWhitespace(md), 'normalize_whitespace', 30)
    preprocessors.register(HtmlBlockPreprocessor(md), 'html_block', 20)
    preprocessors.register(ReferencePreprocessor(md), 'reference', 10)
    return preprocessors

The preprocessors are registered with priorities and run in descending order of priority: NormalizeWhitespace > HtmlBlockPreprocessor > ReferencePreprocessor.
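
The idea of a priority-ordered registry can be sketched in a few lines. ToyRegistry below is a hypothetical stand-in for util.Registry that only mimics the register(item, name, priority) call seen above. The two preprocessor functions are also made up for illustration.

```python
class ToyRegistry:
    """Hypothetical stand-in: iterating yields items by descending priority."""

    def __init__(self):
        self._items = []  # list of (priority, name, item)

    def register(self, item, name, priority):
        self._items.append((priority, name, item))
        self._items.sort(key=lambda entry: entry[0], reverse=True)

    def __iter__(self):
        return (item for _, _, item in self._items)


def strip_trailing(lines):
    return [line.rstrip() for line in lines]


def expand_tabs(lines):
    return [line.expandtabs(4) for line in lines]


registry = ToyRegistry()
registry.register(expand_tabs, "expand_tabs", 10)
registry.register(strip_trailing, "strip_trailing", 30)

lines = ["a\tb  ", "c"]
order = []
for prep in registry:
    order.append(prep.__name__)
    lines = prep(lines)

print(order)  # ['strip_trailing', 'expand_tabs']
```

Registration order does not matter here. Only the priority number decides which preprocessor sees the lines first.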

As the names suggest:

- NormalizeWhitespace: normalizes whitespace and newline characters (about 10 lines of implementation)
- HtmlBlockPreprocessor: analyzes HTML elements (250 lines...!)
- ReferencePreprocessor: finds link definitions (the reference format that can carry a title) and registers them in the dictionary Markdown.references (30 lines)

I will skip the internals of HtmlBlockPreprocessor. Similar element-detection logic should be waiting in steps 2 and 3 anyway...

(Digression) As in ReferencePreprocessor, a parser often has to decide whether the next line still belongs to the current element. When I wrote a parser myself (as a study), I tracked the line number by hand:

line_num = 0
while line_num < len(lines):
    line = lines[line_num]
    line_num += 1
    # <processing>
    if includes_next_line:  # placeholder: the element continues on the next line
        line = lines[line_num]
        line_num += 1

ReferencePreprocessor, however, simply uses pop.

while lines:
    line = lines.pop(0)

Right, I should have done it that way...

(End of digression)
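
As a small illustration of the pop-based style, here is a sketch that merges a line with the one after it. The continuation rule (a trailing backslash) and the function name join_continuations are made up for this example. They are not how ReferencePreprocessor decides anything.

```python
def join_continuations(lines):
    """Merge each line ending in a backslash with the line that follows it."""
    lines = list(lines)  # work on a copy, because pop() consumes the list
    result = []
    while lines:
        line = lines.pop(0)
        # Look ahead: keep consuming lines while the current one asks for more.
        while line.endswith("\\") and lines:
            line = line[:-1] + lines.pop(0)
        result.append(line)
    return result


print(join_continuations(["a\\", "b", "c"]))  # ['ab', 'c']
```

Because the list itself shrinks, there is no index to keep in sync, and "is there a next line?" is just a truthiness check on the list.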

Step 2 Parsing into the ElementTree

This process is the following part.

core.py



class Markdown:

    def convert(self, source):
        # ... (omitted)
        # Parse the high-level elements.
        root = self.parser.parseDocument(self.lines).getroot()

Here, self.parser is the BlockParser obtained by the following function.

blockprocessors.py



def build_block_parser(md, **kwargs):
    """ Build the default block parser used by Markdown. """
    parser = BlockParser(md)
    parser.blockprocessors.register(EmptyBlockProcessor(parser), 'empty', 100)
    parser.blockprocessors.register(ListIndentProcessor(parser), 'indent', 90)
    parser.blockprocessors.register(CodeBlockProcessor(parser), 'code', 80)
    parser.blockprocessors.register(HashHeaderProcessor(parser), 'hashheader', 70)
    parser.blockprocessors.register(SetextHeaderProcessor(parser), 'setextheader', 60)
    parser.blockprocessors.register(HRProcessor(parser), 'hr', 50)
    parser.blockprocessors.register(OListProcessor(parser), 'olist', 40)
    parser.blockprocessors.register(UListProcessor(parser), 'ulist', 30)
    parser.blockprocessors.register(BlockQuoteProcessor(parser), 'quote', 20)
    parser.blockprocessors.register(ParagraphProcessor(parser), 'paragraph', 10)
    return parser

The processors here are also registered together with their priorities. Here is BlockParser.parseDocument().

blockparser.py



class BlockParser:

    def __init__(self, md):
        self.blockprocessors = util.Registry()
        self.state = State()
        self.md = md

    #Create an Element Tree
    def parseDocument(self, lines):
        self.root = etree.Element(self.md.doc_tag)
        self.parseChunk(self.root, '\n'.join(lines))
        return etree.ElementTree(self.root)

    def parseChunk(self, parent, text):
        self.parseBlocks(parent, text.split('\n\n'))

    def parseBlocks(self, parent, blocks):
        while blocks:
            for processor in self.blockprocessors:
                if processor.test(parent, blocks[0]):
                    if processor.run(parent, blocks) is not False:
                        break

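In short, in the line

core.py


root = self.parser.parseDocument(self.lines).getroot()

each BlockProcessor is given its chance to process the blocks: for the block at the front of the list, the first processor whose test succeeds (and whose run does not return False) handles it, and this repeats until no blocks remain.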
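
The test/run dispatch can be tried in isolation. The sketch below reuses the loop shape of parseBlocks, but the two processor classes (RuleProcessor, ParagraphFallback) and the function parse_blocks are hypothetical, simplified stand-ins and not the library's classes.

```python
import xml.etree.ElementTree as etree


class RuleProcessor:
    """Hypothetical: a block made only of three or more dashes becomes <hr>."""

    def test(self, parent, block):
        stripped = block.strip()
        return len(stripped) >= 3 and set(stripped) == {"-"}

    def run(self, parent, blocks):
        blocks.pop(0)
        etree.SubElement(parent, "hr")


class ParagraphFallback:
    """Hypothetical catch-all: accepts any block and wraps it in <p>."""

    def test(self, parent, block):
        return True

    def run(self, parent, blocks):
        paragraph = etree.SubElement(parent, "p")
        paragraph.text = blocks.pop(0).strip()


def parse_blocks(parent, blocks, processors):
    # Same loop shape as parseBlocks: the first processor that accepts
    # the front block consumes it, then the search starts over.
    while blocks:
        for processor in processors:
            if processor.test(parent, blocks[0]):
                if processor.run(parent, blocks) is not False:
                    break


root = etree.Element("div")
parse_blocks(root, "intro\n\n---\n\noutro".split("\n\n"),
             [RuleProcessor(), ParagraphFallback()])
tags = [child.tag for child in root]
print(tags)  # ['p', 'hr', 'p']
```

Each run is responsible for popping what it consumes. The catch-all at the lowest priority guarantees that the while loop always makes progress.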

For example, the processor that handles hash headers of the form "# header" is defined as follows:

blockprocessors.py



class HashHeaderProcessor(BlockProcessor):
    """ Process Hash Headers. """

    RE = re.compile(r'(?:^|\n)(?P<level>#{1,6})(?P<header>(?:\\.|[^\\])*?)#*(?:\n|$)')

    def test(self, parent, block):
        return bool(self.RE.search(block))

    def run(self, parent, blocks):
        block = blocks.pop(0)
        m = self.RE.search(block)
        if m:
            # --- from here ---
            before = block[:m.start()]
            after = block[m.end():]
            if before:
                # Recursively process only the part before the header
                self.parser.parseBlocks(parent, [before])
            h = etree.SubElement(parent, 'h%d' % len(m.group('level')))
            h.text = m.group('header').strip()
            if after:
                # Put the rest back at the front of blocks so it is processed next
                blocks.insert(0, after)
            # --- to here is the core ---
        else:
            logger.warn("We've got a problem header: %r" % block)
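
To see what the before / header / after split produces, the regular expression can be run by hand on a single block. This sketch uses only the standard re module with the pattern copied from the excerpt above. The sample block is made up.

```python
import re

RE = re.compile(r'(?:^|\n)(?P<level>#{1,6})(?P<header>(?:\\.|[^\\])*?)#*(?:\n|$)')

block = "intro text\n## Title ##\nbody text"
m = RE.search(block)

before = block[:m.start()]           # text in front of the header line
after = block[m.end():]              # text behind the header line
level = len(m.group('level'))        # number of leading '#' characters
header = m.group('header').strip()   # header text without closing '#'s

print(repr(before), level, repr(header), repr(after))
# 'intro text' 2 'Title' 'body text'
```

In the processor, before is parsed recursively first, then the h2 element is appended, and after goes back to the front of blocks. This keeps the output in document order.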

Let's also look at the processor for blockquotes in the "> text" format.

For a blockquote, two things have to be considered:

- Blocks that continue over multiple lines are treated as a single block.
- The contents of the block must themselves be parsed.

blockprocessors.py



class BlockQuoteProcessor(BlockProcessor):

    RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')

    def test(self, parent, block):
        return bool(self.RE.search(block))

    def run(self, parent, blocks):
        block = blocks.pop(0)
        m = self.RE.search(block)
        if m:
            before = block[:m.start()]
            # Same as in HashHeaderProcessor
            self.parser.parseBlocks(parent, [before])
            # Remove the ">" at the beginning of each line
            block = '\n'.join(
                [self.clean(line) for line in block[m.start():].split('\n')]
            )
        # Is this a continuation of a blockquote, or the start of a new one?
        sibling = self.lastChild(parent)
        if sibling is not None and sibling.tag == "blockquote":
            quote = sibling
        else:
            quote = etree.SubElement(parent, 'blockquote')
        self.parser.state.set('blockquote')
        # Parse the contents of the quote. The parent is now the current blockquote (quote)
        self.parser.parseChunk(quote, block)
        self.parser.state.reset()

By the way, "sibling" means a brother or sister.
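
The two ideas in this processor, stripping the ">" prefix and reusing the previous blockquote if there is one, can be sketched as follows. The helpers clean and add_quote are simplified stand-ins written for this article. In particular, add_quote just puts the text in a single p element where the real processor parses the contents recursively with parseChunk.

```python
import re
import xml.etree.ElementTree as etree

RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')


def clean(line):
    """Remove a leading '>' (and one optional space) from a line."""
    m = RE.match(line)
    if line.strip() == ">":
        return ""
    return m.group(2) if m else line


def add_quote(parent, block):
    text = "\n".join(clean(line) for line in block.split("\n"))
    # Reuse the last child if it is already a blockquote (the "sibling" check).
    last = parent[-1] if len(parent) else None
    if last is not None and last.tag == "blockquote":
        quote = last
    else:
        quote = etree.SubElement(parent, "blockquote")
    # Simplification: no recursive parsing, just one <p> per block.
    paragraph = etree.SubElement(quote, "p")
    paragraph.text = text
    return quote


root = etree.Element("div")
add_quote(root, "> first line\n> second line")
add_quote(root, "> continued after a blank line")

print(etree.tostring(root, encoding="unicode"))
```

Even though the two quoted blocks were separated by a blank line in the source, they end up as two paragraphs inside one blockquote element.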

We have now looked at two processor classes, but where are the parsing results stored? The etree.SubElement(parent, <tagname>) calls look suspicious.

First of all, etree is the xml.etree.ElementTree module from the Python standard library. With etree.SubElement(parent, <tagname>), a child element is added under BlockParser().root (an etree.Element).

As processing proceeds, the results accumulate in BlockParser().root.
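
How results accumulate in a tree can be seen with the standard library alone. A minimal sketch, with made-up tag contents:

```python
import xml.etree.ElementTree as etree

root = etree.Element("div")

# Each SubElement call appends a new child to the given parent.
heading = etree.SubElement(root, "h1")
heading.text = "Title"

quote = etree.SubElement(root, "blockquote")
paragraph = etree.SubElement(quote, "p")
paragraph.text = "quoted text"

html = etree.tostring(root, encoding="unicode")
print(html)
# <div><h1>Title</h1><blockquote><p>quoted text</p></blockquote></div>
```

Because every processor receives the parent element, nesting falls out naturally: passing quote as the parent puts the new element inside the blockquote.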

Step 3 tree processor

As in the earlier steps, the tree processors are now applied.

treeprocessors.py



def build_treeprocessors(md, **kwargs):
    """ Build the default treeprocessors for Markdown. """
    treeprocessors = util.Registry()
    treeprocessors.register(InlineProcessor(md), 'inline', 20)
    treeprocessors.register(PrettifyTreeprocessor(md), 'prettify', 10)
    return treeprocessors

- InlineProcessor: processes inline elements
- PrettifyTreeprocessor: handles newline characters and the like
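
The article does not go into the tree processors, so the following is only a rough guess at the general idea: walk the finished tree and turn inline markup found in element text into child elements. toy_inline is hypothetical and handles just the first **...** in each element. The library's InlineProcessor is far more involved.

```python
import re
import xml.etree.ElementTree as etree

STRONG = re.compile(r'\*\*(.+?)\*\*')


def toy_inline(root):
    # Take a snapshot of the elements first, because the tree is modified below.
    for element in list(root.iter()):
        text = element.text or ""
        m = STRONG.search(text)
        if m is None:
            continue
        # Split "before **bold** after" into text / <strong> / tail.
        element.text = text[:m.start()]
        strong = etree.Element("strong")
        strong.text = m.group(1)
        strong.tail = text[m.end():]
        element.insert(0, strong)


root = etree.Element("div")
paragraph = etree.SubElement(root, "p")
paragraph.text = "some **bold** text"

toy_inline(root)
result = etree.tostring(root, encoding="unicode")
print(result)  # <div><p>some <strong>bold</strong> text</p></div>
```

The block parser only has to decide that something is a paragraph. What is inside the paragraph is left to this later pass.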

At the end

And... that's it!? Still, I hope this gave a rough picture of how the library is designed. If there is some other part you are curious about, it is best to read it yourself...

I'm a little tired.

Thank you for reading to the end!
