[Ruby] Scraping Yahoo News with Nokogiri → Save as CSV

2 minute read

I used to use Python's BeautifulSoup for scraping, but I learned it can also be done in Ruby with a library called Nokogiri, so I gave it a try.

First, here are the finished code and the end result.

scraping.rb



require "nokogiri"
require "open-uri"
require "csv"

require "byebug"


url_base = "https://news.yahoo.co.jp/"

def get_categories(url)
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  categories = doc.css(".yjnHeader_sub_cat li a")
  categories.map do |category|
    category[:href]
  end
end

@cat_list = get_categories(url_base)
@infos = []


@cat_list.each do |cat|
  url = url_base + cat
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  titles = doc.css(".topicsListItem a")
  i = 1
  titles.each do |title|
    @infos << [i, title.text]
    i += 1
  end
end

CSV.open("result.csv", "w") do |csv|
  @infos.each do |info|
    csv << info
    puts "-------------------------------"
    puts info
  end
end

Screenshot (94).png

Each part is explained below.

Read file

require "nokogiri"
require "open-uri"
require "csv"

require "byebug"

Nokogiri and open-uri handle the scraping, and csv is used to save the results as a CSV file. (byebug is only included for debugging.)

Nokogiri is a Ruby library that parses HTML and XML and extracts elements with selectors. Since selectors can be written in XPath as well as CSS, even pages with complicated structures can be scraped smoothly.

Scraping destination page structure

Screenshot (88).png Screenshot (89).png

This time, we will acquire the title of each topic and finally put them together in a CSV file.

The topic pages appear to be linked from the a tags inside the li elements under the yjnHeader_sub_cat class.

url_base = "https://news.yahoo.co.jp/"

def get_categories(url)
  # Fetch the HTML of the given URL
  html = URI.open(url)
  # Parse the fetched HTML code
  doc = Nokogiri::HTML.parse(html)
  # Use a CSS selector to get all the a tags linking to the categories seen above
  categories = doc.css(".yjnHeader_sub_cat li a")
  categories.map do |category|
    # Extract the href (link URL) from each acquired a tag
    category[:href]
  end
end

# Collect the acquired links as @cat_list
@cat_list = get_categories(url_base)

Get topic title

Use the links obtained earlier to get the title of each topic.


@infos = []

@cat_list.each do |cat|
  # The topic page URL is the base URL plus the acquired path
  url = url_base + cat
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  titles = doc.css(".topicsListItem a")
  i = 1
  titles.each do |title|
    # Store the topic number and title as a pair for the CSV output
    @infos << [i, title.text]
    i += 1
  end
end
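As an aside, the manual counter above can be replaced with Ruby's each.with_index, which accepts a starting offset. A minimal sketch with hypothetical titles:

```ruby
titles = ["Topic A", "Topic B", "Topic C"]

# with_index(1) numbers the items starting from 1 instead of 0
infos = titles.each.with_index(1).map { |title, i| [i, title] }
# infos => [[1, "Topic A"], [2, "Topic B"], [3, "Topic C"]]
```

This removes the mutable counter variable and keeps the numbering logic in one place.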

Combine the acquired titles into a CSV

Finally, save the collected titles to a CSV file.


# Create a new "result.csv" with the CSV library
CSV.open("result.csv", "w") do |csv|
  @infos.each do |info|
    # Append each item to the CSV and print it as a log
    csv << info
    puts "-------------------------------"
    puts info
  end
end

Screenshot (90).png

Countermeasure for garbled characters

However, if the file is left as is, the text will probably be garbled when the CSV is opened, so save it again with a BOM. (Ideally the BOM should be written while saving the CSV, but since that did not work well for me, I handled it this way.)
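For reference, the BOM can also be written directly from Ruby when saving, which would avoid the Notepad step entirely. A sketch with hypothetical rows:

```ruby
require "csv"

rows = [[1, "Topic A"], [2, "Topic B"]]

File.open("result.csv", "w:UTF-8") do |f|
  # Write the UTF-8 BOM first so spreadsheet apps detect the encoding
  f.write "\uFEFF"
  # Array#to_csv (added by the csv library) serializes each row
  rows.each { |row| f.write row.to_csv }
end
```

The resulting file starts with the bytes EF BB BF, which is what Excel and similar tools look for when deciding the encoding.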

Open "result.csv" with Notepad and choose Save As.

Screenshot (92).png

Screenshot (93).png

At this point, select UTF-8 (with BOM) as the encoding and save again.

Screenshot (94).png

If you open the CSV again, the garbled characters are gone.

Finally

I am sure there are still many things I have missed, so I would appreciate any comments or corrections.