There are two ways to automatically retrieve data from a website. One is to write a web crawler in a programming language such as Python, and the other is to use a web scraping tool such as Octoparse. Either way, XPath plays an important role. If you know how to write XPath, you can extract data more accurately and efficiently.
So, in this XPath series, I would like to cover everything from the basic concepts of XPath to how to write and apply it.
This article briefly introduces the basic concepts of XPath.
XPath (XML Path Language) is a concise syntax (language) for specifying elements and attribute values in an XML / HTML document, which has a tree structure. Since web pages are usually written in HTML, XPath is often used to get information from web pages. When viewing a web page in a browser (Chrome, Firefox, etc.), you can easily open the corresponding HTML document by pressing F12.
Let's take a look at how XPath works specifically. The image below is part of an HTML document.
HTML has different levels, like a tree structure. In this example, level 1 is **bookstore** and level 2 is **book**. **title, author, year, price** are all level 3.
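The levels described above can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an illustration: the sample document and its values are made up to match the element names mentioned in this article, and the helper `levels` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the values are invented for illustration.
DOC = """<bookstore>
  <book>
    <title>Sample Title</title>
    <author>Sample Author</author>
    <year>2020</year>
    <price>10.00</price>
  </book>
</bookstore>"""


def levels(element, depth=1):
    """Return (depth, tag) pairs for the element and all its descendants."""
    pairs = [(depth, element.tag)]
    for child in element:
        pairs.extend(levels(child, depth + 1))
    return pairs


root = ET.fromstring(DOC)
for depth, tag in levels(root):
    # Indent each tag by its level to make the tree shape visible.
    print("  " * (depth - 1) + f"level {depth}: {tag}")
```

Running it prints bookstore at level 1, book at level 2, and the four remaining elements at level 3.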
Text enclosed in angle brackets (such as <bookstore>) is called a tag. An element is usually written as a start tag, its content, and an end tag:
**<○○> (start tag) Content goes here ... </○○> (end tag)**
XPath describes the hierarchy as steps separated by slashes "/", much like a URL, and lets you specify another node starting from a reference node. In this example, if you search for the element "author", the XPath would be:
/bookstore/book/author
To better understand how it works, think about how you find a specific file on your computer.
To find the file named "author", the correct file path is **\bookstore\book\author**.
Just as every file on your computer has its own path, so does an element on a web page. The path is described in XPath.
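As a rough sketch, the path /bookstore/book/author can be tried out in Python. The standard `xml.etree.ElementTree` module supports only a limited subset of XPath and evaluates paths relative to an element, so the hypothetical helper `find_absolute` below compares the first step with the root tag and searches the remaining steps from the root. The sample document is invented, and the helper only handles simple slash-separated paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the values are invented for illustration.
DOC = """<bookstore>
  <book>
    <title>Sample Title</title>
    <author>Sample Author</author>
    <year>2020</year>
    <price>10.00</price>
  </book>
</bookstore>"""


def find_absolute(root, xpath):
    """Evaluate a simple absolute path such as /bookstore/book/author.

    ElementTree paths are relative to an element, so the first step is
    compared with the root tag and the remaining steps are searched from it.
    """
    steps = xpath.strip("/").split("/")
    if steps[0] != root.tag:
        return []  # the path does not start at this document's root
    if len(steps) == 1:
        return [root]
    return root.findall("./" + "/".join(steps[1:]))


root = ET.fromstring(DOC)
authors = find_absolute(root, "/bookstore/book/author")
print([a.text for a in authors])
```

A path whose first step does not match the root element, such as /library/book/author, returns an empty list, just as a wrong folder name leads to no file.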
An XPath that starts at the root element (the top element of the document) and goes through all the elements inside to the target element is called an absolute XPath.
**Example: /html/body/div/div/div/div/div/div/div/div/div/span/span/span…**
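To illustrate how an absolute XPath follows every element from the root down to the target, here is a minimal sketch that builds such a path by walking the tree. The tiny document and the helper `absolute_path` are hypothetical, and the result contains tag names only, without the position indexes (such as [1]) that browser tools often add.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened page structure.
DOC = "<html><body><div><div><span>target</span></div></div></body></html>"


def absolute_path(root, target):
    """Return the chain of tags from root down to target, or None."""
    if root is target:
        return "/" + root.tag
    for child in root:
        sub = absolute_path(child, target)
        if sub is not None:
            # Prepend the current tag to the path found below it.
            return "/" + root.tag + sub
    return None


root = ET.fromstring(DOC)
span = root.find(".//span")
print(absolute_path(root, span))  # /html/body/div/div/span
```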
Absolute XPaths can be long and hard to read. To simplify one, you can use "//" to skip the intermediate steps of the path (the result is also known as a short XPath).
For example:
**Absolute XPath: /bookstore/book/author**
**Short XPath: //author**
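The difference between the two forms can be sketched in Python with the standard `xml.etree.ElementTree` module. Note the assumptions: the sample document is invented, and because ElementTree evaluates paths relative to an element, the paths are written from the root element with a leading "." (`./book/author` and `.//author`) rather than exactly as in the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with two books.
DOC = """<bookstore>
  <book>
    <title>First Sample</title>
    <author>Author A</author>
  </book>
  <book>
    <title>Second Sample</title>
    <author>Author B</author>
  </book>
</bookstore>"""

root = ET.fromstring(DOC)

# Step-by-step path, starting from the root element <bookstore>.
by_full_path = root.findall("./book/author")

# "//" skips the intermediate steps: any <author> below the root.
by_short_path = root.findall(".//author")

print([a.text for a in by_full_path])
print([a.text for a in by_short_path])
```

In this document both searches find the same two elements. Keep in mind that "//" matches the element at any depth, so in a larger page a short XPath may also match elements that the step-by-step path would not.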
Open the page in Chrome, right-click, and choose [Inspect] to open the developer tools. In the HTML shown on the Elements tab, right-click the element and select [Copy] -> [Copy XPath] from the menu to copy the element's XPath to the clipboard.
With the Elements tab displayed, press "Ctrl + F" to show the search field. When you enter the XPath, the matching element should be highlighted.
You can also add a Chrome extension called "XPath Helper". Enter an XPath and it shows the matching results.
In older versions of Firefox, you can use the "Firebug" extension. ([How to install the Firebug & FireXPath extension](https://helpcenter.octoparse.jp/hc/ja/articles/360015765193-Firebug-FireXPath%E6%8B%A1%E5%BC%B5%E6%A9%9F%E8%83%BD%E3%82%92%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB%E3%81%99%E3%82%8B%E6%96%B9%E6%B3%95))
Open a web page in Firefox ➡ Click the Firebug button ➡ Click an element in the page ➡ The XPath of that element is displayed.
The above is the basic concept of XPath. Next time, I'll show you how to write XPath, so please look forward to it!
Original article: https://helpcenter.octoparse.jp/hc/ja/articles/360015765513