[Ruby] Scraping Yahoo News with Nokogiri → Save as CSV

2 minute read

I used to use Python's BeautifulSoup for scraping, but I learned it can also be done in Ruby with a library called Nokogiri, so I gave it a try.

First, here are the finished code and the end result.

scraping.rb



require "nokogiri"
require "open-uri"
require "csv"

require "byebug"


url_base = "https://news.yahoo.co.jp/"

def get_categories(url)
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  categories = doc.css(".yjnHeader_sub_cat li a")
  categories.map do |category|
    category[:href]
  end
end

@cat_list = get_categories(url_base)
@infos = []


@cat_list.each do |cat|
  url = url_base + cat
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  titles = doc.css(".topicsListItem a")
  i = 1
  titles.each do |title|
    @infos << [i, title.text]
    i += 1
  end
end

CSV.open("result.csv", "w") do |csv|
  @infos.each do |info|
    csv << info
    puts "-------------------------------"
    puts info
  end
end

Screenshot (94).png

Each part is explained below.

Read file

require "nokogiri"
require "open-uri"
require "csv"

require "byebug"

Nokogiri and open-uri handle the scraping, and csv is used to save the results as a CSV file. (byebug is only included for debugging.)

Nokogiri is a Ruby library that parses HTML and XML and extracts elements with selectors. Since selectors can be written in XPath as well as CSS, even pages with complicated structures can be scraped smoothly.

Scraping destination page structure

Screenshot (88).png Screenshot (89).png

This time, we will acquire the title of each topic and finally put them together in a CSV file.

The topic pages appear to be linked from the a tags inside the li elements under the yjnHeader_sub_cat class.

url_base = "https://news.yahoo.co.jp/"

def get_categories(url)
  # Fetch the HTML of the given URL
  html = URI.open(url)
  # Parse the fetched HTML code
  doc = Nokogiri::HTML.parse(html)
  # Use a CSS selector to get all the a tags linking to the categories seen above
  categories = doc.css(".yjnHeader_sub_cat li a")
  categories.map do |category|
    # Extract the href (link URL) from each acquired a tag
    category[:href]
  end
end

# Collect the acquired links as @cat_list
@cat_list = get_categories(url_base)

Get topic title

Use the links obtained earlier to get the title of each topic.


@infos = []

@cat_list.each do |cat|
  # The topic page URL is the base URL plus the acquired path
  url = url_base + cat
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  titles = doc.css(".topicsListItem a")
  i = 1
  titles.each do |title|
    # Store the topic number and title as a pair for the CSV output
    @infos << [i, title.text]
    i += 1
  end
end
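As an aside, the manual counter above can be replaced with Ruby's each.with_index, which accepts a starting offset. A minimal sketch with hypothetical titles:

```ruby
titles = ["Topic A", "Topic B", "Topic C"]

# with_index(1) numbers the items starting from 1 instead of 0
infos = titles.each.with_index(1).map { |title, i| [i, title] }
# infos => [[1, "Topic A"], [2, "Topic B"], [3, "Topic C"]]
```

This removes the mutable counter variable and keeps the numbering logic in one place.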

Combine the acquired titles into a CSV

Finally, save the collected titles to a CSV file.


# Create a new "result.csv" with the CSV library
CSV.open("result.csv", "w") do |csv|
  @infos.each do |info|
    # Append each item to the CSV and print it as a log
    csv << info
    puts "-------------------------------"
    puts info
  end
end

Screenshot (90).png

Countermeasure for garbled characters

However, if the file is left as is, the text will probably be garbled when the CSV is opened, so save it again with a BOM. (Ideally the BOM should be written while saving the CSV, but since that did not work well for me, I handled it this way.)
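For reference, the BOM can also be written directly from Ruby when saving, which would avoid the Notepad step entirely. A sketch with hypothetical rows:

```ruby
require "csv"

rows = [[1, "Topic A"], [2, "Topic B"]]

File.open("result.csv", "w:UTF-8") do |f|
  # Write the UTF-8 BOM first so spreadsheet apps detect the encoding
  f.write "\uFEFF"
  # Array#to_csv (added by the csv library) serializes each row
  rows.each { |row| f.write row.to_csv }
end
```

The resulting file starts with the bytes EF BB BF, which is what Excel and similar tools look for when deciding the encoding.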

Open "result.csv" with Notepad and choose Save As.

Screenshot (92).png

Screenshot (93).png

At this point, select UTF-8 (with BOM) as the encoding and save again.

Screenshot (94).png

If you open the CSV again, the garbled characters are gone.

Finally

I am sure there are still many things I have missed, so I would appreciate any comments or corrections.