[RUBY] Summary of CSS selectors that can be used with Nokogiri

What is Nokogiri?

Nokogiri is a Ruby library for parsing HTML and XML, and it is commonly used for web scraping. Web scraping is a technique for automatically extracting the data and text you want from web pages. It works by examining the HTML structure of a page and specifying the tags and attributes that identify the desired data.

Basic usage

The basic way to write the scraping code is as follows.

scraping.rb


require 'open-uri' # Library for opening the URL of the page you want to scrape
require 'nokogiri' # The Nokogiri library

# Open the URL (the string below is a placeholder; replace it with the URL of the target page)
url = URI.open('https://Target page url')

# Read the HTML of the page at that URL and parse it into a Nokogiri::HTML::Document object
doc = Nokogiri::HTML(url)

By calling methods on this doc object, you can extract just the parts of the HTML that you need.

css and at_css methods

Suppose you want to scrape a page like this:

<html>
  <head><title>Favorite movie</title></head>
  <body>
    <h3>movies</h3>
    <div class="movies">
      <div>
        <h3 class="title">In the sky of the show shank</h3>
        <p>Human drama</p>
        <p id="year">1994</p>
      </div>
    </div>
    <div class="movies">
      <div>
        <h3 class="title">Star Wars</h3>
        <p>SF</p>
        <p id="year">1977</p>
      </div>
    </div>
  </body>
</html>

The css method is available on the Nokogiri::HTML::Document class. Given a CSS selector as its argument, it extracts every element that matches. The result is an array-like collection containing all matching elements (strictly speaking, a Nokogiri::XML::NodeSet, not an Array). If no element matches, an empty NodeSet is returned.

doc.css("h3") # Select by tag
# >> [<h3>movies</h3>, <h3 class="title">In the sky of the show shank</h3>, <h3 class="title">Star Wars</h3>]

doc.css(".title") # Select by class
# >> [<h3 class="title">In the sky of the show shank</h3>, <h3 class="title">Star Wars</h3>]

doc.css("#year") # Select by id
# >> [<p id="year">1994</p>, <p id="year">1977</p>]

doc.css("h1") # Select an element that does not exist
# >> []

There is also a similar method, at_css, that takes a CSS selector in the same way. Unlike css, it returns only the first matching element, even when several elements match. Also unlike css, it returns nil when no element matches.

doc.at_css("h3") # Select by tag
# >> <h3>movies</h3>

doc.at_css(".title") # Select by class
# >> <h3 class="title">In the sky of the show shank</h3>

doc.at_css("#year") # Select by id
# >> <p id="year">1994</p>

doc.at_css("h1") # Select an element that does not exist
# >> nil

How to specify a complex css selector

Depending on the HTML structure of the page, a single tag, class, or id may not be enough to pin down the data you want. In that case, you can combine selectors as follows.

doc.css(".movies h3") # Separated by a space: extracts all h3 tags anywhere under elements with the movies class
# >> [<h3 class="title">In the sky of the show shank</h3>, <h3 class="title">Star Wars</h3>]

doc.css(".movies > h3") # Separated by ">": extracts only h3 tags that are direct children of elements with the movies class. On this page each h3 sits inside an inner div, so nothing matches
# >> []

doc.css(".movies > div > h3") # Spelling out the inner div with ">" matches the titles
# >> [<h3 class="title">In the sky of the show shank</h3>, <h3 class="title">Star Wars</h3>]

doc.css("h3 + p") # Separated by "+": extracts the p element that immediately follows an h3 tag as its next sibling
# >> [<p>Human drama</p>, <p>SF</p>]

doc.css("h3 ~ p") # Separated by "~": extracts every p element that follows an h3 tag as a later sibling
# >> [<p>Human drama</p>, <p id="year">1994</p>, <p>SF</p>, <p id="year">1977</p>]

You can also select elements by the text they contain.

doc.css("h3:contains('Star Wars')") # Adding :contains('string') matches elements whose text includes that string
# >> [<h3 class="title">Star Wars</h3>]

Finally

Although not covered this time, each element you obtain is a Nokogiri::XML::Element object, from which you can extract the text or get the URL specified in an a tag. Because the right CSS selector depends on the structure of the target HTML and on the data you want, it is hard to cover every pattern in one article. Try combining the selector forms introduced here to extract the data you need.
