While developing the system, I needed to pull information from other web pages, so I built a crawler.
What I wanted to do was fetch information from other web pages and save it in the DB of the app I'm currently building. "Getting information" from a page... that must be scraping, right? I dove in enthusiastically...
And scraping was indeed what I was after. But it turned out that to collect the information in the first place, I had to build something called a crawler.
Scraping is extracting data by parsing the HTML of a web page. A crawler, on the other hand, digs through all the links in a web page to reach the information you want; the act itself is called crawling. And of course, while a crawler hunts for links, it is also scraping: it parses the HTML tags to obtain each link destination.

Reference: http://tech.feedforce.jp/anemone_crawler.html

In short:
● Scraping is enough when the information is gathered on a single page.
● A crawler patrols the whole website.

So the flow becomes: "create a crawler → write the scraping code inside it → write the code that saves the results to the DB".
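To make the difference concrete, here is a minimal sketch of each (a sketch only: `example.com` and the `h1` selector are placeholders, not the actual site):

```ruby
require 'nokogiri'
require 'anemone'
require 'open-uri'

# Scraping: parse one page's HTML and pull data out of it
doc = Nokogiri::HTML(URI.parse('https://example.com/page').open)
puts doc.css('h1').text # placeholder selector

# Crawling: let Anemone follow the page's links and visit them for us
Anemone.crawl('https://example.com/', depth_limit: 1, delay: 1) do |anemone|
  anemone.on_every_page { |page| puts page.url }
end
```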
# Ruby gems used
Enable nokogiri and anemone (plus pry for debugging).
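Assuming the app is managed with Bundler (the `Office.save(validate: false)` call later suggests Rails/ActiveRecord), the gems go in the Gemfile and `bundle install` makes them available. A sketch, versions omitted:

```ruby
# Gemfile (sketch; versions omitted)
gem 'nokogiri' # HTML parsing for the scraping part
gem 'anemone'  # the crawler
gem 'pry'      # handy for debugging mid-crawl
```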
```ruby
require 'nokogiri'
require 'anemone'
require 'open-uri' # needed for URI.parse(url).open below
require 'pry'

# URL that serves as the starting point of the crawl
URL = 'https://********/********'.freeze

area_urls = []
prefecture_urls = []
city_urls = []
# Crawl the starting page and collect the area page links.
# depth_limit: 0 fetches only the start page; focus_crawl is used here
# just to filter page.links in place (keep_if) before harvesting them.
Anemone.crawl(URL, depth_limit: 0, delay: 1) do |anemone|
  anemone.focus_crawl do |page|
    page.links.keep_if do |link|
      link.to_s.match(%r{*********/[0-9]{1,2}})
    end
    page.links.each do |link|
      area_urls << link
    end
  end
end

area_urls.each do |area|
  Anemone.crawl(area, depth_limit: 0, delay: 1) do |anemone|
    anemone.focus_crawl do |page|
      page.links.keep_if do |link|
        link.to_s.match(%r{**********/[0-9]{1,2}/[0-9]{5}})
      end
      page.links.each do |link|
        prefecture_urls << link
      end
    end
  end
end

# Pattern for the office detail pages to scrape. Hoisted out of the loop:
# re-assigning a constant on every iteration would trigger
# "already initialized constant" warnings.
PATTERN = %r{**********/[0-9]{1,2}/[0-9]{5}/[0-9]}.freeze

prefecture_urls.each do |prefecture|
  Anemone.crawl(prefecture, depth_limit: 1, delay: 1, skip_query_strings: true) do |anemone|
    anemone.focus_crawl do |page|
      page.links.keep_if do |link|
        link.to_s.match(%r{**********/[0-9]{1,2}/[0-9]{5}/[0-9]})
      end
      page.links.each do |link|
        city_urls << link
      end
    end

    anemone.on_pages_like(PATTERN) do |page|
      url = page.url.to_s # already a String, so no further conversion needed
      html = URI.parse(url).open # open-uri fetches the page body

      # Scraping starts here
      doc = Nokogiri::HTML.parse(html, nil, 'UTF-8')
      # Pick out each element with an XPath expression
      name = doc.xpath('/html/body/div[4]/div/div[2]/div[1]/h1').text
      pos = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[3]/td/text()[1]').text
      post = pos.strip
      postcode = post[/[0-9]{7}/] # String#[] returns the matched string (String#match would return MatchData)
      add = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[3]/td/text()[2]').text
      address = add.strip
      tel = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[4]/td').text
      fax = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[5]/td').text
      staff_number = doc.xpath('/html/body/div[4]/div/div[2]/table[4]/tbody/tr[1]/td/p').text
      company = doc.xpath('/html/body/div[4]/div/div[2]/table[5]/tbody/tr[2]/td').text
      office_url = doc.xpath('/html/body/div[4]/div/div[2]/table[1]/tbody/tr[6]/td/a').text
      # Extract the 5-digit city number from the URL with a regular expression
      if url =~ %r{/(\d{5})(/|$)}
        city_number = Regexp.last_match(1)
        p Regexp.last_match(1) # debug output
      end
      # Build a record from the scraped values and save it, skipping
      # validation (the assumed Office model is sketched after this script)
      office = Office.new(name: name,
                          postcode: postcode,
                          tel: tel,
                          fax: fax,
                          address: address,
                          staff_number: staff_number,
                          company: company,
                          url: office_url,
                          city_number: city_number)
      office.save(validate: false)
    end
  end
end
```
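For reference, the script assumes `Office` is an ActiveRecord model whose columns match the scraped attributes. A migration along these lines would fit; this is a sketch inferred from the attribute names, not the actual schema:

```ruby
# Sketch only; the Rails version in the brackets is an assumption
class CreateOffices < ActiveRecord::Migration[6.0]
  def change
    create_table :offices do |t|
      t.string :name
      t.string :postcode
      t.string :tel
      t.string :fax
      t.string :address
      t.string :staff_number
      t.string :company
      t.string :url
      t.string :city_number
      t.timestamps
    end
  end
end
```

Note that `save(validate: false)` writes the row even when some scraped fields come back blank, which keeps the crawl from aborting on incomplete pages at the cost of letting unvalidated data into the table.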