Parse broken XML with lxml

lxml parses XML with libxml2, which is strict about well-formedness, whereas BeautifulSoup uses a lenient, regular-expression-based parser, so it can parse broken XML like the example below.

If you want lxml for its speed but occasionally need BeautifulSoup's leniency, you do not have to choose between them: lxml provides an interface to BeautifulSoup, lxml.html.soupparser.

Suppose the file hoge contains broken XML like the following, with two root elements instead of one.

<piyo>bar</piyo>
<piyo>hoge</piyo>
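
To reproduce the sessions in this post, the input above can be saved to a file first. This is only a minimal sketch: the filename hoge is the one opened in the sessions, and the variable name broken_xml is made up for illustration.

```python
from pathlib import Path

# Two sibling <piyo> elements with no common root: not well-formed XML.
broken_xml = "<piyo>bar</piyo>\n<piyo>hoge</piyo>\n"

# 'hoge' is the filename that the IPython sessions open.
Path("hoge").write_text(broken_xml, encoding="utf-8")
```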

Result with the strict parser (etree.fromstring):

In [1]: from lxml import etree

In [2]: with open('hoge') as f:
   ...:     xml = etree.fromstring(f.read())
   ...:
  File "<string>", line unknown
XMLSyntaxError: Extra content at the end of the document, line 2, column 1

Result with the BeautifulSoup-based parser (lxml.html.soupparser):

In [3]: from lxml.html.soupparser import fromstring

In [4]: with open('hoge') as f:
   ...:     xml=fromstring(f.read())
   ...:

In [5]: for piyo in xml.findall('piyo'): print(piyo.text.strip())
bar
hoge
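
The two sessions can be combined into a single fallback: try the fast, strict libxml2 parser first, and hand the text to soupparser only when it raises XMLSyntaxError. The following is a rough sketch rather than a definitive implementation. The helper name parse_lenient is hypothetical, and it assumes BeautifulSoup is installed, since lxml.html.soupparser depends on it.

```python
from lxml import etree
from lxml.html import soupparser  # needs BeautifulSoup installed


def parse_lenient(text):
    """Hypothetical helper: strict parse first, BeautifulSoup fallback."""
    try:
        # Fast path: libxml2 accepts only well-formed XML.
        return etree.fromstring(text)
    except etree.XMLSyntaxError:
        # Slow path: soupparser builds an lxml tree from BeautifulSoup's parse.
        return soupparser.fromstring(text)


broken = "<piyo>bar</piyo>\n<piyo>hoge</piyo>"
root = parse_lenient(broken)
texts = [piyo.text.strip() for piyo in root.findall("piyo")]
print(texts)
```

Note that the two paths do not return the same tree shape. On the strict path the root is the document's own root element, while soupparser wraps loose fragments in a synthetic html root, so code that walks the result should be prepared for both.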

reference

http://lxml.de/elementsoup.html
