Why do you need scraping?

Have you ever wanted to build your own news site? If you gather the information you care about in one place, you can read it all there instead of checking several sites. Scraping makes this possible.

・ What is scraping?
・ Environment for scraping
・ Basic code
・ Element specification
・ How to get image url by scraping

What is scraping?

Scraping means "collecting data and processing it so that it is easy to use." With scraping, you can automatically collect information from multiple web pages. For example, you could pull articles from several sites the way a news app does. However, scraping can raise copyright issues, so use it at your own risk.

Environment for scraping

My environment is:
・ Rails 6.0.3.4
・ Nokogiri (gem)

What is nokogiri

nokogiri is a library popular among people who do scraping. Its main features are:

・ It parses the structure of HTML and XML into a form that makes it easy to pick out specific elements.
・ It can extract elements using XPath and CSS selectors.

Install nokogiri

If you don't have nokogiri installed:

① Add the following line to your Gemfile:

gem 'nokogiri'

② Run the following in the terminal:

bundle install

Once that's done, let's get into the scraping code!

Basic code

The goal of this code is to get the title of an article on a website and display it in the terminal.

The URL to scrape is url = 'https://news.yahoo.co.jp/pickup/6379353'.

The code that displays the title is puts doc.title.

`test.rb`


# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'

# URL to scrape
url = 'https://news.yahoo.co.jp/pickup/6379353'

charset = nil
# URI.open is used because Kernel#open no longer opens URLs on Ruby 3.0+
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end

# Parse the HTML to create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Show the title
puts doc.title

Run it with ruby test.rb. If GoTo Travel to stop all over the country - Yahoo! News is displayed, it worked.

Element specification

Now, a question: how were we able to get just the text GoTo Travel to stop all over the country - Yahoo! News? To find out, let's check the actual web page. Open https://news.yahoo.co.jp/pickup/6379353 and the article page is displayed.

On that page, press Option + Command + I (or right-click → Inspect) to open the developer tools. The developer tools show how the web page is built. From there, press Command + F to enter search mode, where you can search for tags, class names, and so on. Type title, and you will find <title>GoTo Travel to stop all over the country - Yahoo! News</title>. The code puts doc.title displays the contents of this title tag.

Scraping image url

① Find the url you want to get

So far we have only retrieved the title. That alone is a bit unsatisfying, right? Now let's get the URL of the image in this article.

If you look for the image in the developer tools, you will find its URL inside the picture tag.

More specifically, the value of the srcset attribute on the source tag inside it is the URL of the image: https://news-pctr.c.yimg.jp/uUzvQ3lML_bkIqyakc1vFlbRKZtM9u5XWE0uy3m1LJuztN6ELHcFKk9pTEfyITR4BzJ1biS2jSO6TBCdnPY064ZSbL8zBcwbVjqsaTANu9SaNctFdKhJXbJzQWo0hYbEH_Nc43w2vFAKuJpoajK2cMY3ybCkqvM3BoAeliLf8Bc5nGoluBfd0XLKWfTEJiQD1KfkFZJjXIF8gad270yeWdbnmatomDwSEZdIj6OnYYUxsvn-CTzFydWJAvjFMDBP Let's get this.

② Specify url in doc.css

To get the URL, we use doc.css, which lets you search the parsed data with CSS selectors. In the page, <div class="sc-bCCsHx bGGhSC"> is the parent, its child is <picture>, and that element's child is <source type="image/webp" srcset="https://news-p~~~. So by writing doc.css("div.bGGhSC > picture > source[1]") you can narrow down the location of the image URL to some extent. At this point, however, the result is not yet just the image URL. Use binding.pry to see what information it contains.

③ Narrow down the information with binding.pry

The result currently contains information other than the URL, so we will use binding.pry to narrow it down to just what we want. If pry is not installed, run gem install pry-byebug in your terminal, then add require 'pry' to the code. binding.pry is now available.

`test.rb`


# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'
# Load pry so that binding.pry can be used
require 'pry'

# URL to scrape
url = 'https://news.yahoo.co.jp/pickup/6379353'

charset = nil
# URI.open is used because Kernel#open no longer opens URLs on Ruby 3.0+
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end

# Parse the HTML to create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Pause here so that doc can be inspected interactively
binding.pry
# Show the matched elements
puts doc.css("div.bGGhSC > picture > source[1]")

Now run it. Execution stops at binding.pry and a prompt opens.

Enter doc.css("div.bGGhSC > picture > source[1]").

Information other than the URL is also output, so next enter doc.css("div.bGGhSC > picture > source[1]").first. .first returns the first item of the array-like result, narrowing the previous output down to just that one element.

From here, narrow further with doc.css("div.bGGhSC > picture > source[1]").first.attributes.

Then narrow down with doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].

Finally, narrow down with doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value. Now the result is just the image URL. In this way, binding.pry lets you check the output and narrow the information down to exactly the range you want.

④ Execution of test

Finally, run the script with doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value and check that the URL of the image is output.

`test.rb`


# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'
# Load pry so that binding.pry can be used
require 'pry'

# URL to scrape
url = 'https://news.yahoo.co.jp/pickup/6379353'

charset = nil
# URI.open is used because Kernel#open no longer opens URLs on Ruby 3.0+
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end

# Parse the HTML to create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Pause here to inspect doc; type exit at the prompt to continue
binding.pry
# Show the image URL
puts doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value

Run it, and if https://news-pctr.c.yimg.jp/uUzvQ3lML_bkIqyakc1vFlbRKZtM9u5XWE0uy3m1LJuztN6ELHcFKk9pTEfyITR4BzJ1biS2jSO6TBCdnPY064ZSbL8zBcwbVjqsaTANu9SaNctFdKhJXbJzQWo0hYbEH_Nc43w2vFAKuJpoajK2cMY3ybCkqvM3BoAeliLf8Bc5nGoluBfd0XLKWfTEJiQD1KfkFZJjXIF8gad270yeWdbnmatomDwSEZdIj6OnYYUxsvn-CTzFydWJAvjFMDBP is output, it worked.

`test.rb`


# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'
require 'pry'

# URL to scrape
url = 'https://news.yahoo.co.jp/pickup/6379353'

charset = nil
# URI.open is used because Kernel#open no longer opens URLs on Ruby 3.0+
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end

# Parse the HTML to create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Show the title
puts doc.title
# Show the image URL
puts doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value

Now you can output the article title and image URL. This only covers the basics of scraping, but I hope it helps someone! If you know an easier way to get this information, please let me know in the comments! That's the end of this explanation of scraping. Thank you for reading!
