[Ruby] I made a crawler with anemone and nokogiri.

While developing an app, I needed to pull information from other web pages, so I built a crawler.

First thought

What I wanted to do was fetch information from other web pages and save it in the DB of the app I'm currently building. "Getting information" sounded like scraping to me, so I started out full of enthusiasm ...

What I found when I looked into it

What I was trying to do was indeed scraping. However, it turned out that to reach that information I first had to build something called a crawler.

What is a crawler?

A crawler is a program that follows the links in a web page, and then the links in the pages it reaches, until it arrives at the desired information. The act itself is called crawling. To find the links to follow, the crawler scrapes each page as well: it parses the HTML tags and extracts the link destinations.

http://tech.feedforce.jp/anemone_crawler.html
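
As a rough illustration of the idea (not how anemone is implemented), here is a minimal breadth-first crawl over a tiny in-memory "site". The `site` hash, the `example.com` URLs, and the helper names `extract_links` and `crawl` are all made up for this sketch, and links are pulled out with a simple regular expression instead of a real HTML parser.

```ruby
require 'uri'
require 'set'

# Hypothetical in-memory site: URL => HTML body.
# A real crawler would fetch each page over HTTP instead.
site = {
  'https://example.com/' => '<a href="/areas/1">Area 1</a> <a href="/areas/2">Area 2</a>',
  'https://example.com/areas/1' => '<a href="/areas/1/10001">City</a> <a href="/">Top</a>',
  'https://example.com/areas/2' => '<a href="/about">About</a>',
  'https://example.com/areas/1/10001' => '<h1>Office list</h1>',
  'https://example.com/about' => '<p>About us</p>'
}

# Scrape the href values out of one page and resolve them against the page URL.
def extract_links(base_url, html)
  html.scan(/href="([^"]+)"/).flatten.map { |href| URI.join(base_url, href).to_s }
end

# Breadth-first crawl from start_url, following links at most depth_limit hops.
def crawl(site, start_url, depth_limit:)
  visited = Set.new
  queue = [[start_url, 0]]
  until queue.empty?
    url, depth = queue.shift
    next if visited.include?(url) || !site.key?(url)

    visited << url
    next if depth >= depth_limit # do not follow links beyond the limit

    extract_links(url, site[url]).each { |link| queue << [link, depth + 1] }
  end
  visited.to_a
end

shallow = crawl(site, 'https://example.com/', depth_limit: 1)
deep = crawl(site, 'https://example.com/', depth_limit: 2)
puts shallow
```

With `depth_limit: 1` only the top page and the pages it links to directly are visited. Raising the limit lets the crawl reach pages further down.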

What is scraping?

Scraping is the extraction of data by parsing the HTML of a web page.
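
To make this concrete, here is a small sketch that pulls three values out of a page by XPath. So that it runs without installing any gems, it uses REXML from Ruby's standard distribution as a stand-in for nokogiri, which means the sample HTML has to be well-formed. Real pages usually are not, which is one reason to use nokogiri in practice. The page content and field names are invented for the example.

```ruby
require 'rexml/document'

# Hypothetical, well-formed page used only for this illustration.
html = <<~HTML
  <html>
    <body>
      <h1>Sample Office</h1>
      <table>
        <tr><th>TEL</th><td>000-0000-0000</td></tr>
        <tr><th>FAX</th><td>000-0000-0001</td></tr>
      </table>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# Each XPath addresses one element by its position in the document tree.
name = REXML::XPath.first(doc, '/html/body/h1').text
tel  = REXML::XPath.first(doc, '/html/body/table/tr[1]/td').text
fax  = REXML::XPath.first(doc, '/html/body/table/tr[2]/td').text

office = { name: name, tel: tel, fax: fax }
puts office
```

Positional XPaths like these are easy to copy from a browser's developer tools, but they break as soon as the page layout changes, so it is worth checking the extracted values after each run.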


In short:
● Scraping can be used when the information is gathered on one page.
● A crawler patrols the whole website.

So the flow became clear: create a crawler → write the scraping code inside it → write the code that saves the results to the DB.

Completed source code

# Load the gems: nokogiri for scraping, anemone for crawling
require 'nokogiri'
require 'anemone'
require 'open-uri' # provides URI#open, which is used below to fetch each page
require 'pry'      # debugging only
# Starting URL of the crawl
URL = 'https://********/********'.freeze

area_urls = []
prefecture_urls = []
city_urls = []
# Crawl the site, starting by collecting the area page URLs from the top page
Anemone.crawl(URL, depth_limit: 0, delay: 1) do |anemone|
  anemone.focus_crawl do |page|
    page.links.keep_if do |link|
      link.to_s.match(%r{*********/[0-9]{1,2}})
    end
    page.links.each do |link|
      area_urls << link
    end
  end
end

area_urls.each do |area|
  Anemone.crawl(area, depth_limit: 0, delay: 1) do |anemone|
    anemone.focus_crawl do |page|
      page.links.keep_if do |link|
        link.to_s.match(%r{**********/[0-9]{1,2}/[0-9]{5}})
      end
      page.links.each do |link|
        prefecture_urls << link
      end
    end
  end
end

prefecture_urls.each do |prefecture|
  Anemone.crawl(prefecture, depth_limit: 1, delay: 1, skip_query_strings: true) do |anemone|
    anemone.focus_crawl do |page|
      page.links.keep_if do |link|
        link.to_s.match(%r{**********/[0-9]{1,2}/[0-9]{5}/[0-9]})
      end
      page.links.each do |link|
        city_urls << link
      end
    end

    # Pattern for the detail pages to scrape. It is a local variable because
    # assigning a constant here would warn on every pass through the loop.
    pattern = %r[**********/[0-9]{1,2}/[0-9]{5}/[0-9]]

    anemone.on_pages_like(pattern) do |page|
      url = page.url.to_s

      html = URI.parse(url).open
      # Scraping starts here
      doc = Nokogiri::HTML.parse(html, nil, 'UTF-8')
      # Locate each value by XPath
      name = doc.xpath('/html/body/div[4]/div/div[2]/div[1]/h1').text
      pos = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[3]/td/text()[1]').text
      # String#[] returns the 7-digit match as a String (or nil if absent)
      postcode = pos.strip[/[0-9]{7}/]
      add = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[3]/td/text()[2]').text
      address = add.strip
      tel = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[4]/td').text
      fax = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[5]/td').text
      staff_number = doc.xpath('/html/body/div[4]/div/div[2]/table[4]/tbody/tr[1]/td/p').text
      company = doc.xpath('/html/body/div[4]/div/div[2]/table[5]/tbody/tr[2]/td').text
      office_url = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[6]/td/a').text
      # Extract the 5-digit number from the URL with a regular expression
      city_number = url[%r{/(\d{5})(/|$)}, 1]
      p city_number
      # Build an Office record from the scraped values and save it to the DB, skipping validation
      office = Office.new(name: name,
                          postcode: postcode,
                          tel: tel,
                          fax: fax,
                          address: address,
                          staff_number: staff_number,
                          company: company,
                          url: office_url,
                          city_number: city_number)
      office.save(validate: false)
    end
  end
end
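
The clean-up applied to the scraped text can be tried on its own. The sketch below uses made-up raw values (the real page's text is not shown in this article) to show how `strip` and regular expressions turn raw text nodes and a URL into the values that get saved.

```ruby
# Hypothetical raw values, shaped like text nodes taken from a page.
raw_postcode = "\n  Postcode 1234567 \n"
raw_address  = "\n  Sample City 1-2-3  \n"
page_url     = 'https://example.com/13/13101/5'

# String#[] with a regexp returns the matched substring (or nil),
# whereas String#match returns a MatchData object.
postcode = raw_postcode.strip[/[0-9]{7}/]
address  = raw_address.strip

# Passing 1 as the second argument returns capture group 1:
# the 5-digit path segment of the URL.
city_number = page_url[%r{/(\d{5})(/|$)}, 1]

puts postcode, address, city_number
```

Because `String#[]` returns `nil` when nothing matches, a page without a postcode ends up with an empty column instead of raising an error, which fits a crawl that saves with `validate: false`.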



