Getting article information by scraping with Ruby

Why do you need scraping?

Have you ever thought it would be interesting to build your own news site? If you collect the information you want in one place, you can check everything there just by opening that site, without visiting multiple sites. Scraping makes this possible.

Table of contents

・ What is scraping?
・ Environment for scraping
・ Basic code
・ Element specification
・ How to get image url by scraping

What is scraping?

Scraping means collecting data and processing it into a form that is easy to use. With scraping, information can be collected automatically from multiple web pages. For example, you can pull articles from multiple sites, the way a news app does. However, scraping can raise copyright issues, so please use it at your own risk.

Environment for scraping

I am using the following environment:

・ Rails 6.0.3.4
・ Nokogiri (gem)

What is nokogiri

Nokogiri is a Ruby library for parsing HTML and XML, and it is popular with people who do scraping.

Install nokogiri

If you don't have nokogiri installed:

① Add the following line to the Gemfile:

gem 'nokogiri'

② Run the following in the terminal:

bundle install

Once that is done, let's get into the scraping code!

Basic code

The goal of this code is to get the title of an article on a website and display it in the terminal.

The scraping target is url = 'https://news.yahoo.co.jp/pickup/6379353'.

The code that displays the title is puts doc.title.

test.rb


# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'

# URL of the scraping target
url = 'https://news.yahoo.co.jp/pickup/6379353'

charset = nil
# URI.open is used because Kernel#open no longer accepts URLs as of Ruby 3.0
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end

# Parse the HTML to create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Show the title
puts doc.title

Let's run it right away:

ruby test.rb

If "GoTo Travel to stop all over the country --Yahoo! News" is displayed, it worked.

Element specification

Now a question: how were we able to get only the text "GoTo Travel to stop all over the country --Yahoo! News"? To find out, let's look at the actual web page. Opening https://news.yahoo.co.jp/pickup/6379353 shows the article page.

On this page, press option + command + I (or right-click → Inspect) to open the developer tools, which show how the web page is built. There, press command + F to enter search mode, where you can search for tags, class names, and so on. Type title. The search highlights <title>GoTo Travel to stop all over the country --Yahoo! News</title>. The code puts doc.title displays the content of this title tag.

Scraping image url

① Find the url you want to get

So far we have only fetched the title. That alone feels a little lacking, doesn't it? Now let's get the URL of the image in this article.

If you look for this image's URL in the developer tools, you will find it inside a picture tag.

Within it, the value of the srcset attribute on the source tag is the URL of the image:

https://news-pctr.c.yimg.jp/uUzvQ3lML_bkIqyakc1vFlbRKZtM9u5XWE0uy3m1LJuztN6ELHcFKk9pTEfyITR4BzJ1biS2jSO6TBCdnPY064ZSbL8zBcwbVjqsaTANu9SaNctFdKhJXbJzQWo0hYbEH_Nc43w2vFAKuJpoajK2cMY3ybCkqvM3BoAeliLf8Bc5nGoluBfd0XLKWfTEJiQD1KfkFZJjXIF8gad270yeWdbnmatomDwSEZdIj6OnYYUxsvn-CTzFydWJAvjFMDBP

Let's get this.

② Specify url in doc.css

To get the URL, we use doc.css, which lets you search the parsed document with CSS selectors. For this image, the parent is <div class="sc-bCCsHx bGGhSC">, its child is <picture>, and that element's child is <source type="image/webp" srcset="https://news-p~~~">. So writing doc.css("div.bGGhSC > picture > source[1]") narrows the search down to roughly where the image URL is. At this point, however, the result is not yet just the image URL. Let's use binding.pry to check what information it contains.

③ Narrow down the information with binding.pry

The result currently includes information other than the URL, so we will use binding.pry to narrow it down to just what we want. If you don't have pry installed, run gem install pry-byebug in your terminal, then add require 'pry' to the code. binding.pry is now available.

test.rb


# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'
# Load pry so that binding.pry can be used
require 'pry'

# URL of the scraping target
url = 'https://news.yahoo.co.jp/pickup/6379353'

charset = nil
# URI.open is used because Kernel#open no longer accepts URLs as of Ruby 3.0
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end

# Parse the HTML to create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Pause here and open a pry console to inspect doc
binding.pry
# Show the elements matched by the selector
puts doc.css("div.bGGhSC > picture > source[1]")

Let's run it now. Execution stops at binding.pry and a console opens.

Enter doc.css("div.bGGhSC > picture > source[1]").

The output includes information other than the URL, so next enter doc.css("div.bGGhSC > picture > source[1]").first. .first returns the first element of the array-like result, which narrows the previous output down to its first item.

From here, narrow further with doc.css("div.bGGhSC > picture > source[1]").first.attributes.

Then narrow down with doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].

Finally, narrow down with doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value. Now the result is just the image URL. In this way, binding.pry lets you check the output and narrow it down to exactly the range you want.

④ Execution of test

Finally, put doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value into the code and check that the URL of the image is output.

test.rb


# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'
# Load pry so that binding.pry can be used
require 'pry'

# URL of the scraping target
url = 'https://news.yahoo.co.jp/pickup/6379353'

charset = nil
# URI.open is used because Kernel#open no longer accepts URLs as of Ruby 3.0
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end

# Parse the HTML to create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Execution pauses here; type exit in the pry console to continue
binding.pry
# Show the image URL
puts doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value

Run this, and if the following URL is output, it worked:

https://news-pctr.c.yimg.jp/uUzvQ3lML_bkIqyakc1vFlbRKZtM9u5XWE0uy3m1LJuztN6ELHcFKk9pTEfyITR4BzJ1biS2jSO6TBCdnPY064ZSbL8zBcwbVjqsaTANu9SaNctFdKhJXbJzQWo0hYbEH_Nc43w2vFAKuJpoajK2cMY3ybCkqvM3BoAeliLf8Bc5nGoluBfd0XLKWfTEJiQD1KfkFZJjXIF8gad270yeWdbnmatomDwSEZdIj6OnYYUxsvn-CTzFydWJAvjFMDBP
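
Here the srcset value is a single URL, but in general a srcset attribute may list several comma-separated candidates, each a URL optionally followed by a descriptor such as 2x. The sketch below picks the first URL from such a value. first_srcset_url is a hypothetical helper name, and the simple comma split assumes the URLs themselves contain no commas.

```ruby
# Returns the first URL in a srcset string, or nil if the string has no candidates.
# Hypothetical helper for illustration; assumes URLs do not contain commas.
def first_srcset_url(srcset)
  first_candidate = srcset.split(',').map(&:strip).reject(&:empty?).first
  # Each candidate is "URL" or "URL descriptor"; keep only the URL part
  first_candidate&.split(/\s+/)&.first
end

puts first_srcset_url('https://example.com/a.webp 1x, https://example.com/b.webp 2x')
puts first_srcset_url('https://example.com/a.webp')
```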

test.rb (final version, without binding.pry)



# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'

# URL of the scraping target
url = 'https://news.yahoo.co.jp/pickup/6379353'

charset = nil
# URI.open is used because Kernel#open no longer accepts URLs as of Ruby 3.0
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end

# Parse the HTML to create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Show the title
puts doc.title
# Show the image URL
puts doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value

Now you can output the article title and the image URL. This covered only the introductory part of scraping, but I hope it helps someone! If you know an easier way to get this information, please let me know in the comments! That concludes this explanation of scraping. Thank you for reading!
