Suppose you have several XML files at hand and want to compare their contents. It is hard for a human to compare multiple XML files directly, so I will convert them into a CSV file with one row per XML file. If you open the CSV file in Excel or a similar tool, the data becomes much easier to compare and examine. I implemented the conversion to CSV in Python — please forgive that the code is not especially polished.
--OS: Windows10 (I haven't tried it, but it can be executed on other OS) --XML database: BaseX 9.3.2
-[Addition little by little] From installation of XML database "BaseX" to benchmarking
First, load the XML documents at hand into BaseX. I wrote and ran the following Python code, referring to the sample code linked here.
from pathlib import Path
import os
import pprint
from BaseXClient import BaseXClient

# Open a session against the local BaseX server.
session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')
try:
    # XML files to be stored in the DB, in file-modification-time order.
    xml_directory = Path("C:\\") / "xml_data"
    list_xml_path = sorted(xml_directory.glob("*.xml"), key=os.path.getmtime)
    print("XML file to load:")
    pprint.pprint(list_xml_path)

    # Open the existing target database.
    session.execute("open testdb")
    print(session.info())

    # Read each XML file and add it to the DB under its file name.
    for path in list_xml_path:
        with open(path, mode='r', encoding="utf-8") as fi:
            str_xml = fi.read()
        session.add(path.name, str_xml)
        print(session.info())

    # Display the contents of the DB.
    print("\n" + session.execute("xquery /"))
    print("Completed normally\n")
finally:
    # Always close the session, even if loading failed part-way.
    if session:
        session.close()
To supplement the explanation of the code,
--Load the XML document into the existing database "testdb". Please refer to this article for how to create a DB. --The location of the XML file is "C: \ xml_data". --For the XML file, download sample-1.xml to sample-8.xml of this site (thank you), and the contents are "encoding". I rewrote it to "UTF-8" and saved it again in UTF-8 format. --XML files are loaded in chronological order with file timestamps.
When I ran the code, the XML document was loaded with the following display:
XML file to load:
[WindowsPath('C:/xml_data/sample-1.xml'),
WindowsPath('C:/xml_data/sample-2.xml'),
WindowsPath('C:/xml_data/sample-3.xml'),
WindowsPath('C:/xml_data/sample-4.xml'),
WindowsPath('C:/xml_data/sample-5.xml'),
WindowsPath('C:/xml_data/sample-7.xml'),
WindowsPath('C:/xml_data/sample-6.xml'),
WindowsPath('C:/xml_data/sample-8.xml')]
Database 'testdb' was opened in 1.48 ms.
Resource(s) added in 3.72 ms.
Resource(s) added in 1.93 ms.
Resource(s) added in 8.89 ms.
Resource(s) added in 1.91 ms.
Resource(s) added in 2.05 ms.
Resource(s) added in 2.05 ms.
Resource(s) added in 1.93 ms.
(Omitted)
Completed normally
Next, query BaseX and output a CSV file with one row per XML document. I wrote the following Python code and ran it.
import pprint
from BaseXClient import BaseXClient
# Application-specific exception class used by this script.
class MyException(Exception):
    """Raised when the loaded XML documents break this script's assumptions."""
def read_xpath_root(session):
    """Return the XPath of the single root element shared by all documents.

    Queries BaseX for ``fn:path()`` of every document root, deduplicates the
    results, verifies that all documents share one root element name, and
    returns that path with its positional predicate zero-padded
    ("[1]" -> "[01]") and a trailing "/" appended, ready to be used as a
    prefix when processing the text-node XPaths.

    Raises:
        MyException: if more than one kind of root element was found
            (this script only supports a DB whose documents share one root).
    """
    set_xpath_root = set()
    # fn:path() yields one path string per document root element.
    query = '''\
for $root in /*
return fn:path($root)
'''
    query_obj = session.query(query)
    try:
        query_obj.execute()
        # Collect into a set: identical roots across documents collapse to one.
        for typecode, item in query_obj.iter():
            set_xpath_root.add(item)
        print("root element XPath:")
        pprint.pprint(set_xpath_root)
    finally:
        # Release the server-side query even if iteration fails.
        query_obj.close()
    # Only one kind of root element is allowed.
    if len(set_xpath_root) != 1:
        msg = f"assert len(set_xpath_root)<{len(set_xpath_root)}> == 1"
        raise MyException(msg)
    # Zero-pad the predicate so the prefix matches the padded text XPaths.
    return set_xpath_root.pop().replace("[1]", "[01]") + '/'
def read_xpath_text(session):
    """Collect every distinct text-node XPath present in the DB.

    Queries BaseX for ``fn:path()`` of every text node in every document,
    zero-pads each single-digit positional predicate ("[1]" -> "[01]", ...,
    "[9]" -> "[09]") so that lexicographic sorting matches numeric order up
    to position 99, and returns the padded paths as a sorted list.  The
    result is used both as the CSV column names and to assemble the XQuery.

    Returns:
        list[str]: sorted, zero-padded text-node XPaths.
    """
    set_xpath_text = set()
    query = '''\
for $text in //text()
return fn:path($text)
'''
    query_obj = session.query(query)
    try:
        query_obj.execute()
        # Deduplicate: the same XPath usually occurs in several documents.
        for typecode, item in query_obj.iter():
            set_xpath_text.add(item)
    finally:
        query_obj.close()
    # Zero-pad single-digit predicates so "[2]" sorts before "[10]".
    # (A loop over the digits replaces the nine copy-pasted .replace calls;
    # multi-digit predicates such as "[10]" are intentionally untouched.)
    list_xpath_text = []
    for xpath_text in set_xpath_text:
        for digit in "123456789":
            xpath_text = xpath_text.replace(f"[{digit}]", f"[0{digit}]")
        list_xpath_text.append(xpath_text)
    list_xpath_text.sort()
    print("XPath of text element (all types):")
    pprint.pprint(list_xpath_text)
    return list_xpath_text
# Open a session against the local BaseX server.
session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')
try:
    # Open the target database.
    session.execute("open testdb")
    print(session.info())

    # Read the XPath of the (single) root element.
    xpath_root = read_xpath_root(session)
    # Collect all distinct text-node XPaths (these become the CSV columns).
    list_xpath_text = read_xpath_text(session)

    # Assemble a query that returns one <ROW> element (one CSV line) per
    # document; the first cell is the XML file name extracted from the
    # document's base URI (stripping the "db-name/" prefix).
    csv_header = "input_path"
    query = f'''\
for $root in /*
let $base_uri := fn:base-uri($root)
order by $base_uri
return <ROW>"{{fn:substring($base_uri, fn:string-length(db:name($root)) + 3)}}"\
'''
    # One loop iteration per CSV column.
    for xpath_text in list_xpath_text:
        # Derive the CSV header cell from the padded XPath.
        csv_header += ',' \
            + xpath_text \
            .replace(xpath_root, "") \
            .replace("/text()[01]", "") \
            .replace("Q{}", "")
        # Derive the XQuery expression for this column; a missing path in a
        # document simply yields an empty string in that CSV cell.
        query += ',"{' + xpath_text.replace(xpath_root, "$root/") + '}"'
    query += "</ROW>"

    # Run the assembled query.
    query_obj = session.query(query)
    query_obj.execute()

    # Output file base name, derived from the root element name.
    basename = xpath_root \
        .replace("Q{}", "") \
        .replace("[01]", "") \
        .replace('/', "")

    # Dump the generated query for debugging.
    # encoding="utf-8" is passed explicitly: on Windows the default is the
    # ANSI code page (e.g. cp932), which can fail or corrupt non-ASCII text.
    with open(basename + "_xquery.txt", 'w', encoding="utf-8") as fo:
        fo.write(query)
        fo.write('\n')

    # Write the CSV file.
    with open(basename + ".csv", 'w', encoding="utf-8") as fo:
        # Header row.
        fo.write(csv_header)
        fo.write('\n')
        # One data row per XML document.
        for typecode, item in query_obj.iter():
            fo.write(item.replace("<ROW>", "").replace("</ROW>", '\n'))
    query_obj.close()
    print("Completed normally\n")
finally:
    # Always close the session.
    if session:
        session.close()
To supplement the explanation of the code,
--This code requires that all XML documents in the DB have the same root element name. --The CSV file name is automatically generated based on the root element name. --Extract the text element from the XML document and use it as CSV data. --Set the XPath of the text element as the CSV column name. --As a rough procedure, first collect all types of XPath of text elements from all XML documents (implemented by read_xpath_text function). --Next, use that XPath to assemble an XQuery statement and output the query result to CSV. --Since XML is semi-structured data, there may be XML documents that do not have the XPath included in the query. If XPath does not exist in the XML document, an empty string will be output to CSV as its column data. --Only the first column of CSV (column name = input_path) is output by extracting the XML file name from the result of the db: name function of XQuery, not from the text element of XML.
When I ran the code, the following was displayed and the CSV file "manyosyu.csv" was output.
Database 'testdb' was opened in 1.12 ms.
root element XPath:
{'/Q{}manyosyu[1]'}
XPath of text element (all types):
['/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}image[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}mean[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}mkana[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}pno[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}poet[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[01]/Q{}yomi[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}image[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}mean[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}mkana[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}pno[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}poet[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[02]/Q{}yomi[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}image[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}mean[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}mkana[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}pno[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}poet[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[03]/Q{}yomi[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}image[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}mean[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}mean[01]/text()[02]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}mkana[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}pno[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}poet[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[04]/Q{}yomi[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}image[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}mean[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}mkana[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}pno[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}poet[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[05]/Q{}yomi[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}image[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}mean[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}mkana[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}pno[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}poet[01]/text()[01]',
'/Q{}manyosyu[01]/Q{}volume[01]/Q{}poem[06]/Q{}yomi[01]/text()[01]']
Completed normally
The following query statement debug file is also output.
manyosyu_xquery.txt
for $root in /*
let $base_uri := fn:base-uri($root)
order by $base_uri
return <ROW>"{fn:substring($base_uri, fn:string-length(db:name($root)) + 3)}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[01]/Q{}yomi[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[02]/Q{}yomi[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[03]/Q{}yomi[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}mean[01]/text()[02]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[04]/Q{}yomi[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[05]/Q{}yomi[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}image[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poe
m[06]/Q{}mean[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}mkana[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}pno[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}poet[01]/text()[01]}","{$root/Q{}volume[01]/Q{}poem[06]/Q{}yomi[01]/text()[01]}"</ROW>
Let's open the generated "manyosyu.csv" in Excel.
A CSV line was generated for each XML file loaded into the DB.
Recommended Posts