Have you ever thought it would be interesting to build your own news site? If you collect the information you care about in one place, you can read all of it just by opening that site, without having to check several sites. Scraping makes this possible.
・ What is scraping?
・ Environment for scraping
・ Basic code
・ Element specification
・ How to get an image URL by scraping
Scraping means "collecting data and processing it so that it is easy to use." With scraping, information can be collected automatically from multiple web pages. For example, you can pull articles from several sites, the way a news app does. However, scraping can raise copyright issues, so please use it at your own risk.
My environment is as follows.
・ Rails 6.0.3.4
・ gem: Nokogiri
Nokogiri is a library that is popular among people who do scraping. Its main features are:
・ It parses HTML and XML and turns them into a structure that makes it easy to pick out specific elements.
・ It lets you extract elements using XPath and CSS selectors.
If you don't have Nokogiri installed:
① Add the following to your Gemfile
gem 'nokogiri'
② Run the following in the terminal
bundle install
Once that's done, let's get straight into the scraping code!
The goal of this code is to get the title of an article on a website and display it in the terminal.
The URL of the page to scrape is
url = 'https://news.yahoo.co.jp/pickup/6379353'
The code that displays the title is
puts doc.title
ruby test.rb
# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'
# URL of the page to scrape
url = 'https://news.yahoo.co.jp/pickup/6379353'
charset = nil
# Use URI.open; Kernel#open no longer opens URLs on Ruby 3.0 and later
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end
# Parse the HTML and create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Show the title
puts doc.title
Let's run it right away.
ruby test.rb
If the following is displayed, it worked.
GoTo Travel to stop all over the country --Yahoo! News
Now, here is a question. How were we able to get only the text
GoTo Travel to stop all over the country --Yahoo! News
?
To find out, let's check the actual web page.
https://news.yahoo.co.jp/pickup/6379353
Opening this URL shows the article page.
On that page, press option + command + I (or right-click → Inspect) to enter inspect mode (the browser's developer tools).
Inspect mode shows you how the web page is put together.
With it open, press command + F to enter search mode.
In search mode, you can search for tags, class names, and so on.
So type title.
This takes you to the following line:
<title> GoTo Travel to stop all over the country --Yahoo! News </title>
In other words, puts doc.title displays the contents of the title tag.
So far we have only fetched the title. That alone is a bit unsatisfying, isn't it? Next, let's get the URL of the image in this article.
If you look for the image in inspect mode, you will find its URL inside the picture tag.
More precisely, the value of the srcset attribute of the source tag inside it is the image URL.
https://news-pctr.c.yimg.jp/uUzvQ3lML_bkIqyakc1vFlbRKZtM9u5XWE0uy3m1LJuztN6ELHcFKk9pTEfyITR4BzJ1biS2jSO6TBCdnPY064ZSbL8zBcwbVjqsaTANu9SaNctFdKhJXbJzQWo0hYbEH_Nc43w2vFAKuJpoajK2cMY3ybCkqvM3BoAeliLf8Bc5nGoluBfd0XLKWfTEJiQD1KfkFZJjXIF8gad270yeWdbnmatomDwSEZdIj6OnYYUxsvn-CTzFydWJAvjFMDBP
Let's get this.
To get the URL, we use doc.css.
The css method lets you search the fetched data using CSS selectors.
The image URL sits in the following structure:
<div class="sc-bCCsHx bGGhSC">
is the parent,
its child is <picture>,
and the child of that is <source type="image/webp" srcset="https://news-p ~~~
So, by writing
doc.css("div.bGGhSC > picture > source[1]")
you can narrow down roughly where the image URL is.
At this point, however, the result still contains more than just the image URL.
Use binding.pry to see what information is included.
Since the result currently contains information other than the URL, we will use binding.pry to narrow it down to just the part we want.
If you don't have pry installed, run the following in your terminal
gem install pry-byebug
Then add
require 'pry'
to the code.
binding.pry is now available.
ruby test.rb
# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'
# Load pry so that binding.pry can be used
require 'pry'
# URL of the page to scrape
url = 'https://news.yahoo.co.jp/pickup/6379353'
charset = nil
# Use URI.open; Kernel#open no longer opens URLs on Ruby 3.0 and later
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end
# Parse the HTML and create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Pause here and open an interactive console
binding.pry
# Show the image URL
puts doc.css("div.bGGhSC > picture > source[1]")
Now let's run it.
When the script stops at binding.pry, enter
doc.css("div.bGGhSC > picture > source[1]")
The output contains information other than the URL, so next enter
doc.css("div.bGGhSC > picture > source[1]").first
.first returns the first item of the array-like result, so this narrows the previous output down to its first element.
From here, narrow it down further with
doc.css("div.bGGhSC > picture > source[1]").first.attributes
then with
doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"]
and finally with
doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value
Now the result has been narrowed down to just the image URL.
In this way, binding.pry lets you check the output and narrow it down step by step to the part you want.
Finally, run
doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value
and check that the image URL is output.
ruby test.rb
# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'
# Load pry so that binding.pry can be used
require 'pry'
# URL of the page to scrape
url = 'https://news.yahoo.co.jp/pickup/6379353'
charset = nil
# Use URI.open; Kernel#open no longer opens URLs on Ruby 3.0 and later
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end
# Parse the HTML and create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Pause here and open an interactive console
binding.pry
# Show the image URL
puts doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value
Run this, and if
https://news-pctr.c.yimg.jp/uUzvQ3lML_bkIqyakc1vFlbRKZtM9u5XWE0uy3m1LJuztN6ELHcFKk9pTEfyITR4BzJ1biS2jSO6TBCdnPY064ZSbL8zBcwbVjqsaTANu9SaNctFdKhJXbJzQWo0hYbEH_Nc43w2vFAKuJpoajK2cMY3ybCkqvM3BoAeliLf8Bc5nGoluBfd0XLKWfTEJiQD1KfkFZJjXIF8gad270yeWdbnmatomDwSEZdIj6OnYYUxsvn-CTzFydWJAvjFMDBP
is output, it worked.
ruby test.rb
# Load the library used to access the URL
require 'open-uri'
# Load the Nokogiri library
require 'nokogiri'
require 'pry'
# URL of the page to scrape
url = 'https://news.yahoo.co.jp/pickup/6379353'
charset = nil
# Use URI.open; Kernel#open no longer opens URLs on Ruby 3.0 and later
html = URI.open(url) do |f|
  charset = f.charset # Get the character encoding
  f.read # Read the HTML; the block's return value is assigned to html
end
# Parse the HTML and create a document object
doc = Nokogiri::HTML.parse(html, nil, charset)
# Show the title
puts doc.title
# Show the image URL
puts doc.css("div.bGGhSC > picture > source[1]").first.attributes["srcset"].value
Now you can output the article title and the image URL. This only covers the basics of scraping, but I hope it helps someone! Also, if you know an easier way to get this information, please let me know in the comments! That's the end of this explanation of scraping. Thank you for reading!