Python's ElementTree XML API provides several parsing methods. This article focuses on speed and compares them.
The environment used for the measurement is as follows.
Python has an ElementTree XML API for working with XML.
It offers four main parsing methods:

- fromstring
- XMLParser
- XMLPullParser
- iterparse

Which one to use normally depends on requirements such as memory efficiency and the need to avoid blocking. Here I ignore those considerations and focus solely on the processing time for huge XML.
I use the dump files of the English Wiktionary. The dump data is distributed compressed with bzip2, and I process it as-is, without decompressing the whole file first (it expands to about 6 GB).
The dump file has the following tag structure.
<mediawiki>
  <siteinfo> ⋯ </siteinfo>
  <page> ⋯ </page>
  <page> ⋯ </page>
  ⋮
  <page> ⋯ </page>
</mediawiki>
Because the whole file is so huge, it is compressed as multiple bzip2 streams (multistream) so that parts of it can be extracted and processed independently.
siteinfo | page × 100 | page × 100 | ⋯ |
This time we process the whole file while fetching the streams one by one. The second and subsequent streams each contain 100 <page> tags.
<page> ⋯ </page>
<page> ⋯ </page>
⋮
<page> ⋯ </page>
Each <page> tag has the following structure.
<page>
  <title>Title</title>
  <ns>Namespace</ns>
  <id>ID</id>
  <revision>
    <id>Revision ID</id>
    <parentid>Revision ID before update</parentid>
    <timestamp>Update date and time</timestamp>
    <contributor>
      <username>Editor</username>
      <id>Editor ID</id>
    </contributor>
    <comment>Comment at the time of editing</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text bytes="length" xml:space="preserve">Article content</text>
    <sha1>Hash value</sha1>
  </revision>
</page>
Of these, the contents of the <text> tag are extracted and the number of lines is counted. This is part of the survey conducted for the following articles.
The scripts used in this article are contained in the following repositories:
getpages, which extracts <id> and <text> from each <page>, is implemented in each of these ways. The part that counts the lines using getpages is shared.
common part
# Shared imports (used by the getpages implementations below)
import bz2
import io
import xml.etree.ElementTree as ET

# target: path to the bzip2 dump file; slen: compressed length of each stream (see below)
lines = 0
with open(target, "rb") as f:
    f.seek(slen[0])                      # skip the first stream (siteinfo)
    for length in slen[1:-1]:
        for id, text in getpages(f.read(length)):
            for line in text():
                lines += 1
print(f"lines: {lines:,}")
slen holds the length of each stream, measured in advance.
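As a reference for how such per-stream lengths could be obtained, here is a minimal sketch of my own (not taken from the article's repository): bz2.BZ2Decompressor stops at the end of each bzip2 stream, and its unused_data attribute reveals how many compressed bytes were left over.

```python
import bz2

# Sketch (not from the article's repository): measure the compressed length
# of every bzip2 stream in a multistream file.
def stream_lengths(path, chunk=1 << 20):
    lengths = []
    decomp = bz2.BZ2Decompressor()
    current = 0                               # compressed bytes consumed so far by the current stream
    with open(path, "rb") as f:
        while (buf := f.read(chunk)):
            while buf:
                decomp.decompress(buf)
                if decomp.eof:                # one bzip2 stream ended inside this buffer
                    used = len(buf) - len(decomp.unused_data)
                    lengths.append(current + used)
                    buf = decomp.unused_data  # leftover bytes belong to the next stream
                    decomp = bz2.BZ2Decompressor()
                    current = 0
                else:                         # the whole buffer belonged to the current stream
                    current += len(buf)
                    buf = b""
    return lengths
```

The common part above then seeks past the first entry (the siteinfo stream) and reads each following stream with f.read(length).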
fromstring
fromstring builds a tree from XML passed as a string; operations such as searching for tags and retrieving their contents are then performed on the tree.
Code | Processing time |
---|---|
python/research/countlines-text-xml.py | 5m50.826s |
def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    # wrap the fragment in a temporary root tag so it is valid XML
    pages = ET.fromstring(f"<pages>{xml}</pages>")
    for page in pages:
        if int(page.find("ns").text) == 0:        # 0 = normal page
            id = int(page.find("id").text)        # the <id> directly under <page>
            with io.StringIO(page.find("revision/text").text) as t:
                def text():
                    while (line := t.readline()):
                        yield line
                yield id, text
Since XML requires a root tag, the fragment is temporarily wrapped in <pages>; removing it causes an error. <ns> indicates the page type; 0 means a normal page, so only those are kept.
There are several kinds of <id>. page.find("id") matches only the <id> directly under <page>. Because <text> sits inside <revision>, it is addressed as page.find("revision/text"). This way of specifying paths is called XPath.
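As a tiny self-contained illustration (my own example, not from the article): a plain tag name only matches direct children, while a path like "revision/id" descends into a child element.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<page><id>1</id><revision><id>42</id>"
    "<text>line1\nline2</text></revision></page>"
)
print(page.find("id").text)             # '1'  -- the <id> directly under <page>
print(page.find("revision/id").text)    # '42' -- the revision's <id>
print(page.find("revision/text").text)  # 'line1\nline2'
```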
XMLParser
XMLParser is the parser that fromstring uses internally to build the tree. Using it directly should be faster, because building the tree can be skipped and the input only has to be parsed.
It is what is called a **push-type** parser, essentially the same approach as SAX: you prepare a class with methods that handle events such as tag start and tag end.
Code | Processing time |
---|---|
python/research/countlines-text-xmlparser.py | 6m46.163s |
class XMLTarget:
    def __init__(self):
        self._ns = 0
        self._id = 0
        self._data = []
        self.pages = []

    def start(self, tag, attrib):
        self._data = []

    def data(self, data):
        self._data.append(data)

    def end(self, tag):
        if tag == "ns":
            self._ns = int(self._data[0])
            self._id = 0
        elif self._id == 0 and tag == "id":
            self._id = int(self._data[0])
        elif self._ns == 0 and tag == "text":
            # reassemble the chunks delivered to data() into lines
            text = []
            cur = ""
            for d in self._data:
                if d == "\n":
                    text.append(cur)
                    cur = ""
                else:
                    cur += d
            if cur: text.append(cur)
            self.pages.append((self._id, text))
        self._data = []

def getpages(bz2data):
    target = XMLTarget()
    parser = ET.XMLParser(target=target)
    parser.feed("<pages>")
    parser.feed(bz2.decompress(bz2data).decode("utf-8"))
    return target.pages
data receives the content of a tag. Because it arrives split at long lines and at line breaks, the chunks are reassembled into lines.
The documentation shows sample code similar to the following.
>>> class MaxDepth:                     # The target object of the parser
...     maxDepth = 0
...     depth = 0
...     def start(self, tag, attrib):   # Called for each opening tag.
...         self.depth += 1
...         if self.depth > self.maxDepth:
...             self.maxDepth = self.depth
...     def end(self, tag):             # Called for each closing tag.
...         self.depth -= 1
...     def data(self, data):
...         pass            # We do not need to do anything with data.
...     def close(self):    # Called when all data has been parsed.
...         return self.maxDepth
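A quick usage sketch of my own (not from the article) showing how such a target is driven: the target's methods are called while feeding data, and parser.close() returns whatever the target's close() returns.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLParser(target=MaxDepth())   # MaxDepth as defined above
parser.feed("<a><b><c/></b><b/></a>")
print(parser.close())                      # 3 -- the deepest nesting reached
```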
The variables maxDepth and depth defined directly under the class are class variables, shared by all instances. In this code, assignments such as += and = create instance variables that shadow the class variables, so the class variables themselves are never updated and stay at 0. The story is different, however, when a class variable is a list that gets mutated with a method such as append. I got caught by this and at first could not get my code to behave as expected.
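A minimal illustration of that pitfall (my own example, not from the article): augmented assignment creates an instance attribute that shadows the class variable, while append mutates the single list object shared by every instance.

```python
class Demo:
    count = 0        # class variable (shared by all instances)
    items = []       # class variable holding a mutable list (also shared!)

    def update(self):
        self.count += 1          # creates an instance attribute; Demo.count stays 0
        self.items.append("x")   # mutates the shared list in place

a, b = Demo(), Demo()
a.update()
print(a.count, b.count, Demo.count)   # 1 0 0
print(a.items is b.items, b.items)    # True ['x'] -- b sees a's append
```

This is one reason XMLTarget above initializes all of its state in __init__ instead of using class variables.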
XMLPullParser
It is what is called a **pull-type** parser; in Java, the pull-style counterpart of SAX is known as StAX.
Unlike the push type there is no need to prepare a class: the processing can be written with loops and conditional branches, which keeps the code simple.
[Reference] Comparison between XmlReader and SAX Reader | Microsoft Docs
Code | Processing time |
---|---|
python/research/countlines-text-xmlpull.py | 6m4.553s |
def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    parser = ET.XMLPullParser()
    parser.feed("<pages>")                    # temporary root tag
    parser.feed(xml)
    ns, id = 0, 0
    for ev, el in parser.read_events():
        if el.tag == "ns":
            ns = int(el.text)
            id = 0                            # reset so the next <id> is the page ID
        elif id == 0 and el.tag == "id":
            id = int(el.text)
        elif ns == 0 and el.tag == "text":
            with io.StringIO(el.text) as t:
                def text():
                    while (line := t.readline()):
                        yield line
                yield id, text
Since there are several kinds of <id>, id is reset to 0 when <ns> is seen and checked so that only the <id> immediately after it, the page ID directly under <page>, is picked up.
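To see the API in isolation, here is a minimal sketch of my own (not from the article): feed() accepts data in arbitrary pieces, and read_events() yields the events that have become available so far.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(["start", "end"])   # without arguments, only "end" events are reported
parser.feed("<pages><page><id>1</id>")        # data can be fed in pieces
parser.feed("</page></pages>")
for event, elem in parser.read_events():
    print(event, elem.tag)
# start pages, start page, start id, end id, end page, end pages
```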
iterparse
It is also a pull-type parser. The documentation explains the difference from XMLPullParser as follows.
XMLPullParser is so flexible that it may be inconvenient for users who want to use it in a simpler way. If you don't mind your application blocking on reading XML data but would still like to have incremental parsing capabilities, take a look at iterparse(). It can be useful when you're reading a large XML document and don't want to hold it wholly in memory.
iterparse() returns an iterator, and the parse advances each time you call next() on it. The structure of the code is almost the same as with XMLPullParser.
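A minimal standalone sketch of my own (not from the article) of that behaviour: each next() on the iterator pulls one (event, element) pair, and by default only end events are reported.

```python
import io
import xml.etree.ElementTree as ET

it = ET.iterparse(io.StringIO("<pages><page><id>1</id></page></pages>"))
event, elem = next(it)                 # pull a single (event, element) pair
print(event, elem.tag, elem.text)      # end id 1
for event, elem in it:                 # remaining events
    print(event, elem.tag)             # end page, end pages
```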
Code | Processing time |
---|---|
python/research/countlines-text-xmliter.py | 6m29.298s |
def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    ns, id = 0, 0
    for ev, el in ET.iterparse(io.StringIO(f"<pages>{xml}</pages>")):
        if el.tag == "ns":
            ns = int(el.text)
            id = 0
        elif id == 0 and el.tag == "id":
            id = int(el.text)
        elif ns == 0 and el.tag == "text":
            with io.StringIO(el.text) as t:
                def text():
                    while (line := t.readline()):
                        yield line
                yield id, text
Of everything tried, the tree-building fromstring was the fastest. The differences are not large, though, so unless you are dealing with data of this scale (about 6 GB) it will hardly matter.
method | code | processing time |
---|---|---|
fromstring | python/research/countlines-text-xml.py | 5m50.826s |
XMLParser | python/research/countlines-text-xmlparser.py | 6m46.163s |
XMLPullParser | python/research/countlines-text-xmlpull.py | 6m04.553s |
iterparse | python/research/countlines-text-xmliter.py | 6m29.298s |
For reference, in the .NET Framework trees are likewise built on top of the pull-type XmlReader, and using XmlReader directly without building a tree is faster.
method | code | processing time | Remarks |
---|---|---|---|
XmlDocument | fsharp/research/countlines-text-split-doc.fsx | 6m21.588s | There is a tree |
XmlReader | fsharp/research/countlines-text-split-reader.fsx | 3m17.916s | No tree |