Implemented XPath 1.0 parser in Ruby

This article is the 16th day article of Dwango Advent Calendar 2020. I like the Advent calendar too much, so I will write the last day in addition to this. The content is thin because it is a hurry, but if you are interested, please contact us.

Introduction

The XPath 1.0 parser introduced in this article was developed for the HTML parser "gammo" to be introduced on the final day of Dwango Advent Calendar 2020.

Since the content of the final day will be specialized for HTML parser, this article will introduce the story of implementing XPath 1.0 as a mechanism for traversing the DOM tree built by gammo.

XPath 1.0

XPath 1.0 was published at the same time as XSLT 1.0, which was recommended in 1999, and has a simple specification with very few features compared to the current latest version, 3.1. I chose XPath 1.0 simply because it seemed very simple and easy to implement, and I thought it was enough to traverse the DOM tree built by gammo. Be done.

About implementation

As mentioned earlier, the XPath specification itself isn't that big. Furthermore, since it aims to follow XPath 1.0 rather than the latest specifications, the volume related to its implementation is also small.

The detailed XPath 1.0 grammar is defined by w3c, and in principle, AST can be built by implementing according to this content.

Since gammo is implemented by pure Ruby without using native extensions, we took a similar approach for XPath 1.0. In addition, instead of using third-party gems, parsing is performed using racc, which comes standard with the runtime library, and strscan is used for lexical analysis. I decided.

In contrast to HTML, XPath can be expressed by (E) BNF, and the syntax is defined by EBNF in the W3C standard specification. Gammo :: XPath :: Parser corresponds to this specification. Based on the Abstract Syntax Tree constructed in this way, each node of XPath AST (eg Path, for the element that becomes the context node when traversing using XPath (the element node that becomes the root of the target to execute XPath) Axes, Predicate, Function) are evaluated in order, and candidate elements are searched / added or filtered. After going through these processes, the final set of remaining nodes is returned as the evaluation result of the given XPath.

How Traverse works

Since gammo is implemented according to the XPath 1.0 specification in principle, we will touch on how XPath works here.

Without the omission, XPath basically builds a set of nodes with the following syntax:

/axis::Namespace:Node test[predicate]/~

In XPath, the search direction is first determined by the axis. This means the direction seen from the current context node, such as child or ancestor (ancestor). Then, the search is started in that direction, and the set is assembled by filtering the candidate element groups by the element names and expressions specified in namespace: node test. For the part that is [predicate], it is possible to filter the set assembled by the above procedure using an additional complicated formula.

In XPath, it is also possible to track HTML documents in order from the left, using / as a delimiter like URI.

Taking these into consideration, for example, for the following HTML, only those with foo specified in the class attribute value of the li element are extracted.

doc = Gammo.new(<<-EOS).parse
<!DOCTYPE html>
<html>
<head>
<title>hello world</title>
<meta charset="utf8">
</head>
<body>
<ul>
<li class="foo">a</li>
<li class="bar">b</li>
<li class="foo">c</li>
<li class="bar">d</li>
<li class="foo">e</li>
</ul>
</body>
</html>
EOS

p doc.xpath('descendant-or-self::li[@class="foo"]').map(&:inner_text)
  #=> ["a", "c", "e"]

About functions

I didn't know it existed until I implemented it, but it seems that XPath has the concept of a function. Many functions are defined, such as functions that handle character strings, numerical values, and boolean values, and functions that handle node information.

Here, as an example, we will introduce a function that returns whether an element has a specific character string as inner text.

doc = Gammo.new(<<-EOS).parse
<!DOCTYPE html>
<html>
<head>
<title>hello world</title>
<meta charset="utf8">
</head>
<body>
</body>
</html>
EOS

#Using the contains function, the text node under the title element
# `hello`Test whether it contains the string.
p doc.xpath('contains(//title/text(), "hel")', result_type: Gammo::XPath::BOOLEAN_TYPE) #=> true

In this way, it is possible to use a function in an XPath expression.

However, since there aren't many situations that I want to use personally, the response is inadequate. If you are interested in Gammo or XPath functions, I would appreciate it if you could send me a patch while looking at the Correspondence Table.

in conclusion

This article was written as a subset of the last day's article, but I hope you'll get an idea of ​​the XPath 1.0 overview and implementation in Ruby.

Recommended Posts

Implemented XPath 1.0 parser in Ruby
Implemented "Floyd Cycle Detection Method" in Ruby
Class in Ruby
Heavy in Ruby! ??
About eval in Ruby
Output triangle in Ruby
Variable type in ruby
Fast popcount in Ruby
ABC177 --solving E in Ruby
Validate JWT token in Ruby
Math Girls Secret Note 104th implemented in Ruby and C
Read design patterns in Ruby
Write class inheritance in Ruby
Update Ruby in Unicorn environment
Integer unified into Integer in Ruby 2.4
[Ruby] Exception handling in functions
Use ruby variables in javascript.
Multiplication in a Ruby array
About regular expressions in Ruby
Birthday attack calculation in Ruby
Judgment of fractions in Ruby
Find Roman numerals in Ruby
Try using gRPC in Ruby
I wrote a C parser (like) using PEG in Ruby
[Ruby] Find numbers in arrays
NCk mod p in Ruby
Chinese Remainder Theorem in Ruby
Sorting hashes in a Ruby array
How to find May'n in XPath
Basics of sending Gmail in Ruby
How to iterate infinitely in Ruby
Try to implement Yubaba in Ruby
Implementation of ls command in Ruby
Achieve 3-digit delimited display in Ruby
Encoding when getting in Windows + Ruby
Run GraphQL Ruby resolver in parallel
Ruby on Rails Japanese-English support i18n
[Ruby] Extracting double hash in array
[Ruby] then keyword and case in
How to install Bootstrap in Ruby
String output method memo in Ruby
Implement a gRPC client in Ruby
Write keys and values in Ruby
[Super Introduction] About Symbols in Ruby
Hanachan in Ruby (non-destructive array manipulation)
Manipulating data in GCS during Ruby
Is there no type in Ruby?
Try file locking in Ruby directory
[Ruby] undefined method `dark?'occurs in rqr_code
openssl version information in ruby OPENSSL_VERSION
Ruby methods often used in Rails
Make Ruby segfault in two lines