While developing the system, I needed to pull information from other web pages, so I built a crawler.
What I wanted to do was fetch information from other web pages and save it in the DB of the app I'm currently building. "Getting information" from a page... that must be scraping, right? I dove in enthusiastically...
And scraping was indeed what I was after. But it turned out that to collect the information in the first place, I had to build something called a crawler.
Scraping is extracting data by parsing the HTML of a web page. A crawler, on the other hand, digs through all the links in a web page to reach the information you want; the act itself is called crawling. And of course, while a crawler hunts for links, it is also scraping: it parses the HTML tags to obtain each link destination.

Reference: http://tech.feedforce.jp/anemone_crawler.html

In short:
● Scraping is enough when the information is gathered on a single page.
● A crawler patrols the whole website.

So the flow becomes: "create a crawler → write the scraping code inside it → write the code that saves the results to the DB".
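To make the difference concrete, here is a minimal sketch of each (a sketch only: `example.com` and the `h1` selector are placeholders, not the actual site):

```ruby
require 'nokogiri'
require 'anemone'
require 'open-uri'

# Scraping: parse one page's HTML and pull data out of it
doc = Nokogiri::HTML(URI.parse('https://example.com/page').open)
puts doc.css('h1').text # placeholder selector

# Crawling: let Anemone follow the page's links and visit them for us
Anemone.crawl('https://example.com/', depth_limit: 1, delay: 1) do |anemone|
  anemone.on_every_page { |page| puts page.url }
end
```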
# Ruby gems used
Enable nokogiri and anemone (plus pry for debugging).
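Assuming the app is managed with Bundler (the `Office.save(validate: false)` call later suggests Rails/ActiveRecord), the gems go in the Gemfile and `bundle install` makes them available. A sketch, versions omitted:

```ruby
# Gemfile (sketch; versions omitted)
gem 'nokogiri' # HTML parsing for the scraping part
gem 'anemone'  # the crawler
gem 'pry'      # handy for debugging mid-crawl
```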
```ruby
require 'nokogiri'
require 'anemone'
require 'open-uri' # needed for URI.parse(url).open below
require 'pry'

# URL that serves as the starting point of the crawl
URL = 'https://********/********'.freeze

area_urls = []
prefecture_urls = []
city_urls = []
# Crawl the starting page and collect the area page links.
# depth_limit: 0 fetches only the start page; focus_crawl is used here
# just to filter page.links in place (keep_if) before harvesting them.
Anemone.crawl(URL, depth_limit: 0, delay: 1) do |anemone|
  anemone.focus_crawl do |page|
    page.links.keep_if do |link|
      link.to_s.match(%r{*********/[0-9]{1,2}})
    end
    page.links.each do |link|
      area_urls << link
    end
  end
end

area_urls.each do |area|
  Anemone.crawl(area, depth_limit: 0, delay: 1) do |anemone|
    anemone.focus_crawl do |page|
      page.links.keep_if do |link|
        link.to_s.match(%r{**********/[0-9]{1,2}/[0-9]{5}})
      end
      page.links.each do |link|
        prefecture_urls << link
      end
    end
  end
end

# Pattern for the office detail pages to scrape. Hoisted out of the loop:
# re-assigning a constant on every iteration would trigger
# "already initialized constant" warnings.
PATTERN = %r{**********/[0-9]{1,2}/[0-9]{5}/[0-9]}.freeze

prefecture_urls.each do |prefecture|
  Anemone.crawl(prefecture, depth_limit: 1, delay: 1, skip_query_strings: true) do |anemone|
    anemone.focus_crawl do |page|
      page.links.keep_if do |link|
        link.to_s.match(%r{**********/[0-9]{1,2}/[0-9]{5}/[0-9]})
      end
      page.links.each do |link|
        city_urls << link
      end
    end

    anemone.on_pages_like(PATTERN) do |page|
      url = page.url.to_s # already a String, so no further conversion needed
      html = URI.parse(url).open # open-uri fetches the page body

      # Scraping starts here
      doc = Nokogiri::HTML.parse(html, nil, 'UTF-8')
      # Pick out each element with an XPath expression
      name = doc.xpath('/html/body/div[4]/div/div[2]/div[1]/h1').text
      pos = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[3]/td/text()[1]').text
      post = pos.strip
      postcode = post[/[0-9]{7}/] # String#[] returns the matched string (String#match would return MatchData)
      add = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[3]/td/text()[2]').text
      address = add.strip
      tel = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[4]/td').text
      fax = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[5]/td').text
      staff_number = doc.xpath('/html/body/div[4]/div/div[2]/table[4]/tbody/tr[1]/td/p').text
      company = doc.xpath('/html/body/div[4]/div/div[2]/table[5]/tbody/tr[2]/td').text
      office_url = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[6]/td/a').text
      # Extract the 5-digit city number from the URL with a regular expression
      if url =~ %r{/(\d{5})(/|$)}
        city_number = Regexp.last_match(1)
        p Regexp.last_match(1) # debug output
      end
      # Build a record from the scraped values and save it, skipping
      # validation (the assumed Office model is sketched after this script)
      office = Office.new(name: name,
                          postcode: postcode,
                          tel: tel,
                          fax: fax,
                          address: address,
                          staff_number: staff_number,
                          company: company,
                          url: office_url,
                          city_number: city_number)
      office.save(validate: false)
    end
  end
end
```
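For reference, the script assumes `Office` is an ActiveRecord model whose columns match the scraped attributes. A migration along these lines would fit; this is a sketch inferred from the attribute names, not the actual schema:

```ruby
# Sketch only; the Rails version in the brackets is an assumption
class CreateOffices < ActiveRecord::Migration[6.0]
  def change
    create_table :offices do |t|
      t.string :name
      t.string :postcode
      t.string :tel
      t.string :fax
      t.string :address
      t.string :staff_number
      t.string :company
      t.string :url
      t.string :city_number
      t.timestamps
    end
  end
end
```

Note that `save(validate: false)` writes the row even when some scraped fields come back blank, which keeps the crawl from aborting on incomplete pages at the cost of letting unvalidated data into the table.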