Web scraping means fetching the HTML of a website, then extracting specific data from it and formatting that data.
This time, I will introduce one library each for Python and Ruby.
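To make "extract and format" concrete, here is a minimal sketch using Beautiful Soup (introduced in the next section). The HTML string, the table contents, and the helper names `extract_rows` and `format_rows` are all made up for illustration; a real page would be downloaded first.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
HTML = """
<html><body>
<table>
  <tr><td>apple</td><td>120</td></tr>
  <tr><td>banana</td><td>80</td></tr>
</table>
</body></html>
"""


def extract_rows(html):
    """Extract step: return each table row as a list of cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")
    ]


def format_rows(rows):
    """Format step: turn the rows into comma-separated lines."""
    return "\n".join(",".join(row) for row in rows)


rows = extract_rows(HTML)
print(format_rows(rows))
```

Keeping extraction and formatting in separate functions is just one possible layout; it makes it easy to swap the output format later.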
Python: BeautifulSoup4
Beautiful Soup is quite useful in Python.
pip install beautifulsoup4
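from urllib.request import urlopen  # Python 3; urllib2 exists only in Python 2
from bs4 import BeautifulSoup
html = urlopen("http://example.com")
# => Of course, you can also read from a file.
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning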
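# Lots of useful methods!
soup.find_all('td')
soup.find("head").find("title")
soup.title.find_parents()  # every ancestor of <title>
soup.title.find_parent()   # the direct parent, <head>
soup.descendants           # a generator (an attribute, not a method) over all nested nodes
# You can also rename tags, change attribute values, and add or delete them!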
tag = soup.a
tag.string = "New link text."
tag
# => <a href="">New link text.</a>
soup = BeautifulSoup("<a>Foo</a>", "html.parser")
soup.a.append("Bar")
soup.a
# => <a>FooBar</a>
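The snippets above only change a tag's text, so here is a small sketch of the other edits mentioned: renaming a tag, changing and adding attribute values, and deleting a tag. The HTML string and attribute values are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration.
soup = BeautifulSoup(
    '<div><b>Title</b><a href="/old">Link</a><span>noise</span></div>',
    "html.parser",
)

# Rename a tag: <b> becomes <strong>.
soup.b.name = "strong"

# Change an existing attribute value and add a new attribute.
soup.a["href"] = "/new"
soup.a["title"] = "Updated"

# Delete a tag together with its contents.
soup.span.decompose()

print(soup)
```

`decompose()` destroys the tag entirely; if the removed tag is still needed afterwards, `extract()` removes it from the tree and returns it instead.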
I had never used Python before, but it was a lot of fun.
Ruby: Nokogiri
gem install nokogiri
Or, if you manage gems with Bundler, add it to your Gemfile and run bundle:
source 'https://rubygems.org'
gem 'nokogiri'
bundle
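require 'open-uri'
require 'nokogiri'

charset = nil
# Use URI.open: on Ruby 3.0+, Kernel#open no longer accepts URLs
html = URI.open("http://example.com") do |f|
  charset = f.charset  # remember the charset reported by the server
  f.read
end
doc = Nokogiri::HTML.parse(html, nil, charset)
doc.title
# Print the text of every h2 and h3 heading
doc.xpath('//h2 | //h3').each do |heading|
  puts heading.content
end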
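# Reading from a local file works the same way
html = File.read('data.html', encoding: 'utf-8')
doc = Nokogiri::HTML.parse(html)
# A block passed to parse receives the parse options, not the document,
# so query the returned doc instead
doc.xpath('//td').each do |td|
  pp td.content
end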
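For comparison with the Nokogiri code above, here is a rough Python sketch of the same three lookups (title, `h2`/`h3` headings, and `td` cells). Beautiful Soup has no XPath support, so `find_all` with a list of tag names stands in for the `//h2 | //h3` expression. The HTML string is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page content used only for this comparison.
HTML = """
<html><head><title>Sample</title></head><body>
<h2>First</h2><p>text</p><h3>Second</h3>
<table><tr><td>a</td><td>b</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Counterpart of doc.title
title = soup.title.string

# Counterpart of doc.xpath('//h2 | //h3'): pass a list of tag names
headings = [h.get_text() for h in soup.find_all(["h2", "h3"])]

# Counterpart of doc.xpath('//td')
cells = [td.get_text() for td in soup.find_all("td")]

print(title)
print(headings)
print(cells)
```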
Personally, I liked Ruby after all.
References
Scraping with Python and Beautiful Soup (Qiita): http://qiita.com/itkr/items/513318a9b5b92bd56185
Beautiful Soup 4.2.0 documentation, Japanese translation (kondou.com, last updated 2013-11-19): http://kondou.com/BS4/#
Ruby scraping with Nokogiri [Tutorial for beginners] (Sake, Tears, Ruby, Rails): http://morizyun.github.io/blog/ruby-nokogiri-scraping-tutorial/