[JAVA] Notes on parsing the RDF/XML data model with a SAX parser

Implementing a SAX parser for RDF/XML (that is, writing the event handler) turned out to be hard. Because it is hard, use a library wherever you can. I think rdf4j covers most needs.

This is a memo for my future self (for when I have forgotten the details), so it is an overview mixed with personal impressions. The actual source code is not included. Treat it as a set of hints for reimplementing or deciphering an implementation.

Intro

What is RDF

https://ja.wikipedia.org/wiki/Resource_Description_Framework

--A data structure represented as a directed graph.
--A directed edge is called a Predicate, its start point the Subject, and its end point the Object.
--The graph is represented as a set of edges, each written as a triple <Subject, Predicate, Object>.
--There are two kinds of vertices: "node elements" and "property elements" (here meaning a property value, i.e. a literal). The naming is confusing, so in this article a node in the graph-theory sense is called a "vertex", and "node" always means "node element".
--The Subject is always a node element.
--The Object can be a node element or a property element.
--The data is therefore represented as a set of (node element) × (arc) × (node element or property element).
--A node element has an ID.
--The ID of a node element may or may not be given explicitly. A node without an explicit ID is called a blank node ("unnamed node" may be easier for Java programmers to understand).
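As a minimal sketch of this data model in Java, a triple can be held as three strings plus a flag saying whether the Object is a node or a plain value. The class names `Triple` and `BlankNodeIds` are hypothetical (not taken from rdf4j or any other library), and giving blank nodes a generated `_:bN` ID is just one possible convention.

```java
/** Hypothetical value type: one edge <Subject, Predicate, Object> of the graph. */
final class Triple {
    final String subject;       // ID of a node element (explicit, or generated for a blank node)
    final String predicate;     // name of the directed edge
    final String object;        // ID of a node element, or a property value
    final boolean objectIsNode; // true: object is a node element; false: it is a property value

    Triple(String subject, String predicate, String object, boolean objectIsNode) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
        this.objectIsNode = objectIsNode;
    }

    /** Renders an N-Triples-like line (no escaping; for illustration only). */
    @Override
    public String toString() {
        String o = objectIsNode ? term(object) : "\"" + object + "\"";
        return term(subject) + " <" + predicate + "> " + o + " .";
    }

    // Blank node IDs are printed bare, named nodes in angle brackets.
    private static String term(String id) {
        return id.startsWith("_:") ? id : "<" + id + ">";
    }
}

/** Hypothetical helper that hands out IDs for blank (unnamed) nodes. */
final class BlankNodeIds {
    private int next = 0;

    String fresh() {
        return "_:b" + next++;
    }
}
```

With this shape, a whole RDF document is simply a collection of `Triple` objects, regardless of how the XML happened to be nested.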

RDF/XML

--The set of (node element) × (arc) × (node element or property element) described above is expressed in XML.
--Basically, it is represented by the following two forms.

Object is a node element.xml


<rdf:Description rdf:about="Subject node ID">
  <Predicate>
    <rdf:Description rdf:about="Object node ID">
    </rdf:Description>
  </Predicate>
</rdf:Description>

Object is a property element.xml


<rdf:Description rdf:about="Subject node ID">
  <Predicate>
Object property value
  </Predicate>
</rdf:Description>

--However, there are many abbreviation rules, which makes it hard to parse.
--(The abbreviations presumably exist so that an RDF model can be written out in a form that is easy for humans to read, and ease of writing a parser for it seems to have been a low priority. It looks like a specification that is not friendly to programmers. * This is the impression of a low-skill programmer.)

Problems when handling it in a program

--Because the data has a graph structure, it cannot be traversed like a DOM.
--With tree-shaped XML, when you want "the Piyo of the Fuga in Hoge" (for example, the phone number of Mr. B in department A), you can fetch it with a path such as Hoge/Fuga/Piyo (/department[@department_name='A']/employee[@name='B']/phone_number/text()). RDF/XML is not laid out as such a tree, so you have to go step by step: from Hoge's ID fetch Fuga's ID, then from Fuga's ID fetch Piyo's value. If the amount of data is large and you do not build an index (sorted tree or hash table) keyed by node element ID, every lookup becomes a full scan and is slow.
--Documents that are not equivalent as XML can still be equivalent as RDF (in the structure they express).
--The many abbreviations make it hard.
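The ID-by-ID lookup described above can be sketched with a hash table keyed by the subject's ID. This is only an illustration under assumptions: the class `TripleIndex`, its method names, and the IDs used in the usage note are all hypothetical, and everything is kept in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory index: subject ID -> predicate -> list of objects. */
final class TripleIndex {
    private final Map<String, Map<String, List<String>>> bySubject = new HashMap<>();

    /** Registers one (subject, predicate, object) triple. */
    void add(String subject, String predicate, String object) {
        bySubject.computeIfAbsent(subject, k -> new HashMap<>())
                 .computeIfAbsent(predicate, k -> new ArrayList<>())
                 .add(object);
    }

    /** Hash lookup of all objects for one subject and predicate (no full scan). */
    List<String> objects(String subject, String predicate) {
        return bySubject
                .getOrDefault(subject, Collections.<String, List<String>>emptyMap())
                .getOrDefault(predicate, Collections.<String>emptyList());
    }

    /** Follows a chain of predicates from a start node: Hoge => Fuga => Piyo. */
    List<String> follow(String start, String... predicates) {
        List<String> current = Collections.singletonList(start);
        for (String p : predicates) {
            List<String> next = new ArrayList<>();
            for (String s : current) {
                next.addAll(objects(s, p)); // each object becomes the next subject
            }
            current = next;
        }
        return current;
    }
}
```

For example, after `add("dept:A", "employee", "emp:B")` and `add("emp:B", "phone", "1234")`, calling `follow("dept:A", "employee", "phone")` walks the two hops through the index instead of scanning every triple.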

Therefore, using the XML as it is is inconvenient, and it has to be rewritten as a set of (node element) × (arc) × (node element or property element).

Points when writing with a SAX parser

For that reason, I read the document from the beginning with a SAX parser and write each (node element) × (arc) × (node element or property element) out to external storage (a file or DB) as soon as its values are confirmed. The key points when implementing the SAX parser (event handler) are as follows.

--The parser needs to hold a state and make state transitions according to the mode.
--The modes are rdf:parseType="Collection", rdf:parseType="Literal", rdf:parseType="Resource", plus two patterns when no parseType is specified.
--The two patterns when nothing is specified are: the subject has been read (the next element is parsed as the Predicate), and the predicate has been read (the next element is the Object). In other words, the state is split by whether the next element is a property element or a node element.

There are six states, {Root, S1, S2, Collection, Resource, Literal} (S1: the predicate has not been parsed yet; S2: the predicate has been parsed). The state changes every time an element is opened, and the current state is pushed onto a stack at each transition. When an element is closed, the state is popped off the stack and restored. The state transitions when an element is opened are as follows.

Root       (+ any element)                                               |-> S1
S1         (+ element with rdf:parseType="Collection|Resource|Literal")  |-> Collection|Resource|Literal
S1         (+ element that includes rdf:resource= (node element))        |-> S1
S1         (+ element that does not include rdf:resource)                |-> S2
S2         (+ any element)                                               |-> S1
Resource   (+ element with rdf:parseType="Collection|Resource|Literal")  |-> Collection|Resource|Literal
Resource   (+ element that includes rdf:resource= (node element))        |-> S1
Resource   (+ element that does not include rdf:resource)                |-> S2
Collection (+ any element)                                               |-> S2
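The transitions listed above could be encoded roughly as follows. This is a sketch, not the original implementation: the names `ParseState` and `StateMachine` are hypothetical, the Literal state is assumed to stay in Literal because the list gives no transition out of it, and any rdf:parseType value other than "Collection" or "Resource" is assumed to be handled like "Literal".

```java
import java.util.ArrayDeque;
import java.util.Deque;

enum ParseState { ROOT, S1, S2, COLLECTION, RESOURCE, LITERAL }

/** Hypothetical state holder: push on element open, pop on element close. */
final class StateMachine {
    private final Deque<ParseState> stack = new ArrayDeque<>();
    private ParseState current = ParseState.ROOT;

    ParseState current() {
        return current;
    }

    /**
     * Call when an element is opened.
     * parseType: value of rdf:parseType, or null if absent.
     * hasResource: whether the element carries rdf:resource.
     */
    ParseState open(String parseType, boolean hasResource) {
        stack.push(current); // remember the state to return to on close
        current = next(current, parseType, hasResource);
        return current;
    }

    /** Call when an element is closed: restores the state saved by open(). */
    ParseState close() {
        current = stack.pop();
        return current;
    }

    static ParseState next(ParseState state, String parseType, boolean hasResource) {
        switch (state) {
            case ROOT:
            case S2:
                return ParseState.S1;
            case COLLECTION:
                return ParseState.S2;
            case S1:
            case RESOURCE:
                if (parseType != null) {
                    return forParseType(parseType);
                }
                return hasResource ? ParseState.S1 : ParseState.S2;
            default:
                // LITERAL: no transition is listed; assumed to stay in LITERAL.
                return ParseState.LITERAL;
        }
    }

    private static ParseState forParseType(String parseType) {
        switch (parseType) {
            case "Collection": return ParseState.COLLECTION;
            case "Resource":   return ParseState.RESOURCE;
            default:           return ParseState.LITERAL; // assumption: anything else is treated like "Literal"
        }
    }
}
```

In a SAX handler, `open` would be called from `startElement` and `close` from `endElement`, so the state always matches the element currently being read.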

After that, put the statement being read (the Subject, or Subject × Predicate) onto the stack together with the state above, and push and pop as you read. That way you can parse the document properly with SAX.
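To make the stack idea concrete, here is a deliberately small sketch of such a handler using only the JDK's SAX API. It is not the implementation this memo is about: the class `MiniRdfHandler` is hypothetical, and it only covers the two unabbreviated forms plus rdf:resource. It ignores rdf:parseType, rdf:ID, rdf:nodeID, typed node elements, property attributes, and literal escaping, and it trims whitespace from values. The node/property flag on each stack frame plays roughly the role of S1 and S2.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: emits N-Triples-like lines for the unabbreviated forms only. */
final class MiniRdfHandler extends DefaultHandler {
    private static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** One open element: a node element (Subject) or a property element (Predicate). */
    private static final class Frame {
        final boolean isNode;
        final String subject;   // node ID; for a property frame, the Subject it belongs to
        final String predicate; // null for node frames
        boolean objectDone;     // property frames: has the Object already been emitted?

        Frame(boolean isNode, String subject, String predicate) {
            this.isNode = isNode;
            this.subject = subject;
            this.predicate = predicate;
        }
    }

    final List<String> triples = new ArrayList<>();
    private final Deque<Frame> stack = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private int blankCount = 0;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (RDF.equals(uri) && "RDF".equals(local)) return; // wrapper element, no triple
        Frame top = stack.peek();
        if (top == null || !top.isNode) {
            // Node element: ID from rdf:about, otherwise a generated blank node ID.
            String id = atts.getValue(RDF, "about");
            if (id == null) id = "_:b" + blankCount++;
            if (top != null) { // nested in a property element, so it is that triple's Object
                emit(top.subject, top.predicate, term(id));
                top.objectDone = true;
            }
            stack.push(new Frame(true, id, null));
        } else {
            // Property element: its name is the Predicate for the enclosing Subject.
            Frame p = new Frame(false, top.subject, uri + local);
            String res = atts.getValue(RDF, "resource");
            if (res != null) {
                emit(p.subject, p.predicate, term(res));
                p.objectDone = true;
            }
            stack.push(p);
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (RDF.equals(uri) && "RDF".equals(local)) return;
        Frame f = stack.pop();
        if (!f.isNode && !f.objectDone) {
            // No nested node and no rdf:resource: the text content is the property value.
            emit(f.subject, f.predicate, "\"" + text.toString().trim() + "\"");
        }
        text.setLength(0);
    }

    private void emit(String s, String p, String o) {
        triples.add(term(s) + " <" + p + "> " + o + " .");
    }

    private static String term(String id) {
        return (id.startsWith("_:") || id.startsWith("<")) ? id : "<" + id + ">";
    }

    /** Parses an RDF/XML string and returns the emitted lines. */
    static List<String> parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to resolve rdf:about / rdf:resource
            MiniRdfHandler handler = new MiniRdfHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.triples;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Each triple is emitted as soon as its Object is known, so in a real implementation `emit` could write to a file or DB instead of collecting lines in memory. For anything beyond this restricted subset, a library such as rdf4j is the safer choice.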
