In the previous article, I briefly introduced the basic concepts of XPath. This time, I will explain how to specify and retrieve data from a web page (HTML) using XPath, that is, how to write XPath.
In the HTML sample below, you can see that the text is surrounded by symbols in angle brackets "< >", such as <html> and </html>. These angle-bracket symbols are called tags.
**<tag name>Content goes here...</tag name>**
The first tag is called the "start tag" and the last tag is called the "end tag". Everything from the start tag through the end tag is called an element.
The part displayed in red in the HTML below is the tag. (It is displayed in blue in Firefox and purple in Chrome.)
Below is a summary of the tags you often see in HTML. See this article for more details!
**The most common way to write XPath is to list the tags separated by a slash "/".**
For example, if you want to get "Harry Potter" from this HTML, you can specify "html tag -> body tag -> h1 tag" in order from the top of the tree structure. Write it as follows.
/html/body/h1
You can also use "//" to skip the intermediate part of the path.
//h1
If a path matches more than one tag, you can select the Nth one. In this example, "7,631 yen" is in the second "span" inside the "div", so write it as follows.
//div/span[2]
In general form, XPath written with tags (elements) looks like this.
**//tag name/tag name**
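As a quick check, the three tag-based paths above can be tried in Python. This is only a minimal sketch: the HTML string is a hypothetical stand-in for the sample page (the author and tax/postage texts are placeholders), and it uses the standard library's `xml.etree.ElementTree`, which understands only a subset of XPath and expects paths relative to the element being searched. That is why `/html/body/h1` is written as `./body/h1` when searching from the `<html>` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, reconstructed from the names used in the text.
# "(author)" and "(tax and postage)" are placeholders, not real data.
SAMPLE = """
<html>
  <body>
    <h1 id="booksTitle">Harry Potter</h1>
    <div>
      <span class="author notFaded">(author)</span>
      <span>7,631 yen</span>
      <span class="tax_postage">(tax and postage)</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())  # root is the <html> element

# /html/body/h1 : walk down from the top of the tree, one tag per step.
title_full_path = root.find("./body/h1").text

# //h1 : skip the intermediate tags.
title_short_path = root.find(".//h1").text

# //div/span[2] : the second <span> directly inside a <div>.
price = root.find(".//div/span[2]").text

print(title_full_path, title_short_path, price)
```

A full XPath 1.0 engine (for example in a browser's developer tools) accepts the original absolute forms directly.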
An attribute is written inside a tag and gives more detailed information about that tag. By adding attributes to tags, you can specify how the element behaves and add specific instructions. Attributes usually appear in the form **id="booksTitle"**. A tag can also have multiple attributes.
**<tag name attribute name="attribute value">**
The most common attributes include href, title, style, src, id, and class. Please see this article for details!
**In XPath, attributes are written with the "@" symbol.**
For example, if you want to get "Harry Potter", write XPath as follows.
//h1[@id="booksTitle"]
In general form, XPath written with an attribute looks like this:
**//tag name[@attribute name="attribute value"]**
If you want to get all elements with the same attribute, whatever their tag name, write:
**//*[@attribute name="attribute value"]**
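The attribute form can be tried the same way. The sketch below assumes a hypothetical copy of the sample page (the placeholder texts are not real data) and uses Python's standard `xml.etree.ElementTree`, whose limited XPath subset does include the `[@attribute='value']` predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; "(author)" and "(tax and postage)" are placeholders.
SAMPLE = """
<html>
  <body>
    <h1 id="booksTitle">Harry Potter</h1>
    <div>
      <span class="author notFaded">(author)</span>
      <span>7,631 yen</span>
      <span class="tax_postage">(tax and postage)</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

# //h1[@id="booksTitle"] : an <h1> whose id attribute equals "booksTitle".
title = root.find(".//h1[@id='booksTitle']").text

# //*[@id="booksTitle"] : any element, whatever its tag, with that id.
tags_with_id = [elem.tag for elem in root.findall(".//*[@id='booksTitle']")]

# The attribute value must match exactly, including the space in the class.
author_span = root.find(".//span[@class='author notFaded']")

print(title, tags_with_id, author_span.get("class"))
```

Note that `@class="author notFaded"` compares the whole attribute string, so an element with only `class="author"` would not match.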
The text is enclosed in tags as shown below.
**<tag name>Text goes here...</tag name>**
Retrieving data from a web page usually means retrieving content or text within the page, so you can also specify the text you want directly.
**In XPath, text is written as "text()".**
For example, if you want to get "Harry Potter", specify it by its text and write it as follows.
//h1[text()="Harry Potter"]
In general form, XPath written with text looks like this:
**//tag name[text()="text to get"]**
If you want to get all elements with the same text, whatever their tag name, write:
**//*[text()="text to get"]**
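Selecting by text can be sketched in Python too, with one caveat: the standard library's `xml.etree.ElementTree` does not implement `text()`. The closest predicate it supports is `[.='...']`, which compares the element's complete text content, so the sketch below uses that as a stand-in rather than the exact expression from the article. The HTML string is again a hypothetical copy of the sample page with placeholder texts.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; "(author)" and "(tax and postage)" are placeholders.
SAMPLE = """
<html>
  <body>
    <h1 id="booksTitle">Harry Potter</h1>
    <div>
      <span class="author notFaded">(author)</span>
      <span>7,631 yen</span>
      <span class="tax_postage">(tax and postage)</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

# Stand-in for //h1[text()="Harry Potter"]:
# [.='...'] matches when the element's full text equals the string.
h1_by_text = root.find(".//h1[.='Harry Potter']")

# Stand-in for //*[text()="7,631 yen"]: any tag with exactly that text.
tags_by_text = [elem.tag for elem in root.findall(".//*[.='7,631 yen']")]

print(h1_by_text.get("id"), tags_by_text)
```

In a full XPath 1.0 engine the article's `text()` form works as written. Either way the comparison is exact, so extra whitespace or different casing in the page will prevent a match.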
In the HTML tree structure, all elements have a parent-child / sibling relationship.
Elements that contain one or more other elements are called parent elements, and the elements they contain are called child elements. A child element has exactly one parent and sits between the parent's start and end tags. Elements with the same parent are called sibling elements.
Let's also look at a concrete example.
The sample below is centered on the [body] element: the [body] element is the parent of the [h1] and [div] elements, and the [h1] and [div] elements are children of the [body] element. It is an example of selecting elements that have parent-child and sibling relationships and changing the style of each.
The [h1] element and the [div] element are sibling elements because they have the same parent [body] element.
Also, since the [div] element is the parent of the two [span] elements, the two [span] elements are descendants of the [body] element.
You can get elements that have a parent-child or sibling relationship by using the current element as the base point. For example, to get "7,631 yen", you can write any of the following, depending on which relationship you use.
**As a child element of the [div] element**
//div/span[2]
**As a descendant element of the [body] element**
//body//span[2]
**As a sibling element of the [span class="author notFaded"] element**
//span[@class="author notFaded"]/following-sibling::span[1]
**As a sibling element of the [span class="tax_postage"] element**
//span[@class="tax_postage"]/preceding-sibling::span[1]
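The four expressions above should all land on the same element, which can be checked with a small Python sketch. The HTML string is a hypothetical copy of the sample page with placeholder texts. Python's standard `xml.etree.ElementTree` handles the child and descendant paths but has no sibling axes, so `following_siblings` and `preceding_siblings` below are hypothetical helpers that emulate what `following-sibling::` and `preceding-sibling::` select.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; "(author)" and "(tax and postage)" are placeholders.
SAMPLE = """
<html>
  <body>
    <h1 id="booksTitle">Harry Potter</h1>
    <div>
      <span class="author notFaded">(author)</span>
      <span>7,631 yen</span>
      <span class="tax_postage">(tax and postage)</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

# ElementTree elements do not know their parent, so build a lookup once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def following_siblings(elem, tag):
    """Siblings named `tag` after `elem`, nearest first."""
    siblings = list(parent_of[elem])
    position = siblings.index(elem)
    return [s for s in siblings[position + 1:] if s.tag == tag]


def preceding_siblings(elem, tag):
    """Siblings named `tag` before `elem`, nearest first."""
    siblings = list(parent_of[elem])
    position = siblings.index(elem)
    return [s for s in reversed(siblings[:position]) if s.tag == tag]


# //div/span[2] : child of <div>
as_child = root.find(".//div/span[2]").text

# //body//span[2] : descendant of <body>
as_descendant = root.find("./body//span[2]").text

# //span[@class="author notFaded"]/following-sibling::span[1]
author = root.find(".//span[@class='author notFaded']")
after_author = following_siblings(author, "span")[0].text

# //span[@class="tax_postage"]/preceding-sibling::span[1]
tax = root.find(".//span[@class='tax_postage']")
before_tax = preceding_siblings(tax, "span")[0].text

print(as_child, as_descendant, after_author, before_tax)
```

Both helpers return the nearest sibling first, which mirrors how `[1]` counts outward from the base element on each axis.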
Two axes, "following-sibling::" and "preceding-sibling::", are often used to specify sibling tags.
- **"following-sibling::" selects the sibling elements that come after the specified element**
- **"preceding-sibling::" selects the sibling elements that come before the specified element**
"following-sibling::" is very useful when selecting table elements. For example, consider the following HTML sample.
When this HTML is rendered as a page, it appears as a table like the one below.
In this example, we want to get the store name "12345". However, there are multiple [td] elements, so **//td[1]** cannot be used to single it out. Also, if you want to get tables with the same structure from multiple pages at once, it is recommended to use "following-sibling::" with the fixed text "store name" as the base point. Write it as follows.
**//th[text()="store name"]/following-sibling::td[1]**
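The idea of anchoring on a fixed label can be sketched in Python as well. The table below is hypothetical: only the "store name" label and the value "12345" come from the text, and the second row is a placeholder. Because Python's standard `xml.etree.ElementTree` does not implement `following-sibling::`, the sketch swaps in a row predicate, `tr[th='store name']/td[1]`, which reaches the same cell as long as the label and the value sit in the same row.

```python
import xml.etree.ElementTree as ET

# Hypothetical table; only "store name" and "12345" come from the article.
TABLE = """
<table>
  <tr><th>store name</th><td>12345</td></tr>
  <tr><th>(other label)</th><td>(other value)</td></tr>
</table>
"""

table = ET.fromstring(TABLE.strip())

# //td[1] alone is ambiguous: it matches the first <td> of every row.
first_cells = [td.text for td in table.findall(".//td[1]")]

# Anchor on the fixed label instead: pick the row whose <th> reads
# "store name", then take the first <td> in that row.
store_name = table.find(".//tr[th='store name']/td[1]").text

print(first_cells, store_name)
```

Since the lookup depends on the label rather than on the row position, it should keep working on other pages where the same row appears at a different place in the table.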
In general form, XPath written with element relationships looks like this. If the syntax above matches more than one element, you can select the Nth one by adding **[N]**.
How was it? These are the most commonly used ways of writing XPath. Please give them a try. Next time, I will introduce the functions that are often used in XPath. Stay tuned!
Original article: https://helpcenter.octoparse.jp/hc/ja/articles/360013122059