Python's ElementTree XML API provides several parsing methods. This article focuses on speed and compares them.
The environment used for the measurement is as follows.
Python has an ElementTree XML API for working with XML.
It offers four main parsing methods:

- fromstring
- XMLParser
- XMLPullParser
- iterparse

Which one to use normally depends on requirements such as memory efficiency and the need to avoid blocking. Here I ignore those considerations and focus solely on the processing time for huge XML.
I use the dump files of the English Wiktionary. The dump data is distributed compressed with bzip2, and I process it as-is, without decompressing the whole file first (it expands to about 6 GB).
The dump file has the following tag structure.
<mediawiki>
  <siteinfo> ⋯ </siteinfo>
  <page> ⋯ </page>
  <page> ⋯ </page>
  ⋮
  <page> ⋯ </page>
</mediawiki>
Because the whole file is so huge, it is compressed as multiple bzip2 streams (multistream) so that parts of it can be extracted and processed independently.
siteinfo | page × 100 | page × 100 | ⋯ |
This time we process the whole file while fetching the streams one by one. The second and subsequent streams each contain 100 <page> tags.
<page> ⋯ </page>
<page> ⋯ </page>
⋮
<page> ⋯ </page>
Each <page> tag has the following structure.
<page>
  <title>Title</title>
  <ns>Namespace</ns>
  <id>ID</id>
  <revision>
    <id>Revision ID</id>
    <parentid>Revision ID before update</parentid>
    <timestamp>Update date and time</timestamp>
    <contributor>
      <username>Editor</username>
      <id>Editor ID</id>
    </contributor>
    <comment>Comment at the time of editing</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text bytes="length" xml:space="preserve">Article content</text>
    <sha1>Hash value</sha1>
  </revision>
</page>
Of these, the contents of the <text> tag are extracted and the number of lines is counted. This is part of the survey conducted for the following articles.
The scripts used in this article are contained in the following repositories:
getpages, which extracts <id> and <text> from each <page>, is implemented in each of these ways. The part that counts the lines using getpages is shared.
common part
# Shared imports (used by the getpages implementations below)
import bz2
import io
import xml.etree.ElementTree as ET

# target: path to the bzip2 dump file; slen: compressed length of each stream (see below)
lines = 0
with open(target, "rb") as f:
    f.seek(slen[0])                      # skip the first stream (siteinfo)
    for length in slen[1:-1]:
        for id, text in getpages(f.read(length)):
            for line in text():
                lines += 1
print(f"lines: {lines:,}")
slen holds the length of each stream, measured in advance.
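As a reference for how such per-stream lengths could be obtained, here is a minimal sketch of my own (not taken from the article's repository): bz2.BZ2Decompressor stops at the end of each bzip2 stream, and its unused_data attribute reveals how many compressed bytes were left over.

```python
import bz2

# Sketch (not from the article's repository): measure the compressed length
# of every bzip2 stream in a multistream file.
def stream_lengths(path, chunk=1 << 20):
    lengths = []
    decomp = bz2.BZ2Decompressor()
    current = 0                               # compressed bytes consumed so far by the current stream
    with open(path, "rb") as f:
        while (buf := f.read(chunk)):
            while buf:
                decomp.decompress(buf)
                if decomp.eof:                # one bzip2 stream ended inside this buffer
                    used = len(buf) - len(decomp.unused_data)
                    lengths.append(current + used)
                    buf = decomp.unused_data  # leftover bytes belong to the next stream
                    decomp = bz2.BZ2Decompressor()
                    current = 0
                else:                         # the whole buffer belonged to the current stream
                    current += len(buf)
                    buf = b""
    return lengths
```

The common part above then seeks past the first entry (the siteinfo stream) and reads each following stream with f.read(length).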
fromstring
fromstring builds a tree from XML passed as a string; operations such as searching for tags and retrieving their contents are then performed on the tree.
Code | Processing time |
---|---|
python/research/countlines-text-xml.py | 5m50.826s |
def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    # wrap the fragment in a temporary root tag so it is valid XML
    pages = ET.fromstring(f"<pages>{xml}</pages>")
    for page in pages:
        if int(page.find("ns").text) == 0:        # 0 = normal page
            id = int(page.find("id").text)        # the <id> directly under <page>
            with io.StringIO(page.find("revision/text").text) as t:
                def text():
                    while (line := t.readline()):
                        yield line
                yield id, text
Since XML requires a root tag, the fragment is temporarily wrapped in <pages>; removing it causes an error. <ns> indicates the page type; 0 means a normal page, so only those are kept.
There are several kinds of <id>. page.find("id") matches only the <id> directly under <page>. Because <text> sits inside <revision>, it is addressed as page.find("revision/text"). This way of specifying paths is called XPath.
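As a tiny self-contained illustration (my own example, not from the article): a plain tag name only matches direct children, while a path like "revision/id" descends into a child element.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<page><id>1</id><revision><id>42</id>"
    "<text>line1\nline2</text></revision></page>"
)
print(page.find("id").text)             # '1'  -- the <id> directly under <page>
print(page.find("revision/id").text)    # '42' -- the revision's <id>
print(page.find("revision/text").text)  # 'line1\nline2'
```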
XMLParser
XMLParser is the parser that fromstring uses internally to build the tree. Using it directly should be faster, because building the tree can be skipped and the input only has to be parsed.
It is what is called a **push-type** parser, essentially the same approach as SAX: you prepare a class with methods that handle events such as tag start and tag end.
Code | Processing time |
---|---|
python/research/countlines-text-xmlparser.py | 6m46.163s |
class XMLTarget:
    def __init__(self):
        self._ns = 0
        self._id = 0
        self._data = []
        self.pages = []

    def start(self, tag, attrib):
        self._data = []

    def data(self, data):
        self._data.append(data)

    def end(self, tag):
        if tag == "ns":
            self._ns = int(self._data[0])
            self._id = 0
        elif self._id == 0 and tag == "id":
            self._id = int(self._data[0])
        elif self._ns == 0 and tag == "text":
            # reassemble the chunks delivered to data() into lines
            text = []
            cur = ""
            for d in self._data:
                if d == "\n":
                    text.append(cur)
                    cur = ""
                else:
                    cur += d
            if cur: text.append(cur)
            self.pages.append((self._id, text))
        self._data = []

def getpages(bz2data):
    target = XMLTarget()
    parser = ET.XMLParser(target=target)
    parser.feed("<pages>")
    parser.feed(bz2.decompress(bz2data).decode("utf-8"))
    return target.pages
data receives the content of a tag. Because it arrives split at long lines and at line breaks, the chunks are reassembled into lines.
The documentation shows sample code similar to the following.
>>> class MaxDepth:                     # The target object of the parser
...     maxDepth = 0
...     depth = 0
...     def start(self, tag, attrib):   # Called for each opening tag.
...         self.depth += 1
...         if self.depth > self.maxDepth:
...             self.maxDepth = self.depth
...     def end(self, tag):             # Called for each closing tag.
...         self.depth -= 1
...     def data(self, data):
...         pass            # We do not need to do anything with data.
...     def close(self):    # Called when all data has been parsed.
...         return self.maxDepth
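A quick usage sketch of my own (not from the article) showing how such a target is driven: the target's methods are called while feeding data, and parser.close() returns whatever the target's close() returns.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLParser(target=MaxDepth())   # MaxDepth as defined above
parser.feed("<a><b><c/></b><b/></a>")
print(parser.close())                      # 3 -- the deepest nesting reached
```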
The variables maxDepth and depth defined directly under the class are class variables, shared by all instances. In this code, assignments such as += and = create instance variables that shadow the class variables, so the class variables themselves are never updated and stay at 0. The story is different, however, when a class variable is a list that gets mutated with a method such as append. I got caught by this and at first could not get my code to behave as expected.
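A minimal illustration of that pitfall (my own example, not from the article): augmented assignment creates an instance attribute that shadows the class variable, while append mutates the single list object shared by every instance.

```python
class Demo:
    count = 0        # class variable (shared by all instances)
    items = []       # class variable holding a mutable list (also shared!)

    def update(self):
        self.count += 1          # creates an instance attribute; Demo.count stays 0
        self.items.append("x")   # mutates the shared list in place

a, b = Demo(), Demo()
a.update()
print(a.count, b.count, Demo.count)   # 1 0 0
print(a.items is b.items, b.items)    # True ['x'] -- b sees a's append
```

This is one reason XMLTarget above initializes all of its state in __init__ instead of using class variables.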
XMLPullParser
It is what is called a **pull-type** parser; in Java, the pull-style counterpart of SAX is known as StAX.
Unlike the push type there is no need to prepare a class: the processing can be written with loops and conditional branches, which keeps the code simple.
[Reference] Comparison between XmlReader and SAX Reader | Microsoft Docs
Code | Processing time |
---|---|
python/research/countlines-text-xmlpull.py | 6m4.553s |
def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    parser = ET.XMLPullParser()
    parser.feed("<pages>")                    # temporary root tag
    parser.feed(xml)
    ns, id = 0, 0
    for ev, el in parser.read_events():
        if el.tag == "ns":
            ns = int(el.text)
            id = 0                            # reset so the next <id> is the page ID
        elif id == 0 and el.tag == "id":
            id = int(el.text)
        elif ns == 0 and el.tag == "text":
            with io.StringIO(el.text) as t:
                def text():
                    while (line := t.readline()):
                        yield line
                yield id, text
Since there are several kinds of <id>, id is reset to 0 when <ns> is seen and checked so that only the <id> immediately after it, the page ID directly under <page>, is picked up.
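To see the API in isolation, here is a minimal sketch of my own (not from the article): feed() accepts data in arbitrary pieces, and read_events() yields the events that have become available so far.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(["start", "end"])   # without arguments, only "end" events are reported
parser.feed("<pages><page><id>1</id>")        # data can be fed in pieces
parser.feed("</page></pages>")
for event, elem in parser.read_events():
    print(event, elem.tag)
# start pages, start page, start id, end id, end page, end pages
```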
iterparse
It is also a pull-type parser. The documentation explains the difference from XMLPullParser as follows.
XMLPullParser is so flexible that it may be inconvenient for users who want to use it in a simpler way. If you don't mind your application blocking on reading XML data but would still like to have incremental parsing capabilities, take a look at iterparse(). It can be useful when you're reading a large XML document and don't want to hold it wholly in memory.
iterparse() returns an iterator, and the parse advances each time you call next() on it. The structure of the code is almost the same as with XMLPullParser.
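A minimal standalone sketch of my own (not from the article) of that behaviour: each next() on the iterator pulls one (event, element) pair, and by default only end events are reported.

```python
import io
import xml.etree.ElementTree as ET

it = ET.iterparse(io.StringIO("<pages><page><id>1</id></page></pages>"))
event, elem = next(it)                 # pull a single (event, element) pair
print(event, elem.tag, elem.text)      # end id 1
for event, elem in it:                 # remaining events
    print(event, elem.tag)             # end page, end pages
```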
Code | Processing time |
---|---|
python/research/countlines-text-xmliter.py | 6m29.298s |
def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    ns, id = 0, 0
    for ev, el in ET.iterparse(io.StringIO(f"<pages>{xml}</pages>")):
        if el.tag == "ns":
            ns = int(el.text)
            id = 0
        elif id == 0 and el.tag == "id":
            id = int(el.text)
        elif ns == 0 and el.tag == "text":
            with io.StringIO(el.text) as t:
                def text():
                    while (line := t.readline()):
                        yield line
                yield id, text
Of everything tried, the tree-building fromstring was the fastest. The differences are not large, though, so unless you are dealing with data of this scale (about 6 GB) it will hardly matter.
method | code | processing time |
---|---|---|
fromstring | python/research/countlines-text-xml.py | 5m50.826s |
XMLParser | python/research/countlines-text-xmlparser.py | 6m46.163s |
XMLPullParser | python/research/countlines-text-xmlpull.py | 6m04.553s |
iterparse | python/research/countlines-text-xmliter.py | 6m29.298s |
For reference, in the .NET Framework trees are likewise built on top of the pull-type XmlReader, and using XmlReader directly without building a tree is faster.
method | code | processing time | Remarks |
---|---|---|---|
XmlDocument | fsharp/research/countlines-text-split-doc.fsx | 6m21.588s | There is a tree |
XmlReader | fsharp/research/countlines-text-split-reader.fsx | 3m17.916s | No tree |