I decided to try scraping, so I looked into how to do it in Node, Ruby, and Python. The task in each case is the same: fetch the title of google.co.jp.
Starting with Node: fetch the page with request, then parse it and search for elements with cheerio's jQuery-like selectors.
First install both modules from the terminal.
$ npm install request cheerio
Create a file and implement it.
scrape.js
var request = require('request'),
    cheerio = require('cheerio');

var url = 'http://google.co.jp';

request(url, function (error, response, body) {
    if (!error && response.statusCode === 200) {
        var $ = cheerio.load(body),        // parse the fetched HTML
            title = $('title').text();     // find the title with a jQuery-like selector
        console.log(title);
    }
});
Try to run it.
$ node scrape.js
Google
cheerio implements not just element lookup but also some of jQuery's methods, such as $.addClass and $.append, so it seems like a good fit for cases where you want to manipulate the DOM.
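For example, a minimal sketch of that kind of manipulation (the markup here is made up for illustration):

var cheerio = require('cheerio');

var $ = cheerio.load('<ul><li>one</li><li>two</li></ul>');
$('li').addClass('item');          // add a class to every li
$('ul').append('<li>three</li>');  // append a new element
console.log($.html());             // serialize the modified DOM back to HTML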
Next, Ruby. When I searched around, Nokogiri came up first; it seems to be the de facto standard. Fetch the page with open-uri and parse it with Nokogiri.
$ gem install nokogiri
open-uri ships with the standard library, so only Nokogiri needs to be installed. Create a file.
scrape.rb
require 'open-uri'
require 'nokogiri'

url = 'http://www.google.co.jp/'
html = open(url)                  # fetch the page with open-uri
doc = Nokogiri::HTML.parse(html)  # parse the HTML
puts doc.css('title').text        # find the title with a CSS selector
The object returned by HTML.parse can be searched with XPath, CSS selectors, or both (a quick sketch after the run output shows the two side by side). CSS selectors are easy and nice.
Try to run it.
$ ruby scrape.rb
Google
It gets the job done very quickly.
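As promised, a quick sketch showing that the same title lookup works with either style of query:

# CSS selector and XPath both find the same element
puts doc.css('title').text
puts doc.xpath('//title').text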
Next, Python. I found Scrapy first, but it's a somewhat larger library, so I tried the more casual BeautifulSoup instead. The standard library has HTMLParser, but BeautifulSoup seems to take care of a lot more for you.
The installation didn't work with pip, so I installed it with easy_install.
$ easy_install BeautifulSoup
The flow is: fetch the page with urllib, then parse it with BeautifulSoup.
scrape.py
import urllib
import BeautifulSoup

url = 'http://www.google.co.jp/'
html = urllib.urlopen(url).read()         # fetch the page
soup = BeautifulSoup.BeautifulSoup(html)  # parse the HTML
print soup.find('title').string           # find the title element
Try to run it.
$ python scrape.py
Google
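For comparison, here is roughly what the same extraction looks like with the standard HTMLParser mentioned earlier (a sketch; the TitleParser class is mine, and it is noticeably more work than BeautifulSoup):

import urllib
from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed(urllib.urlopen('http://www.google.co.jp/').read())
print parser.title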
Scrapy seems to be a framework that bundles crawling and scraping. I tried it out for a bit, so here are my notes.
$ pip install scrapy
Take a quick look at the tutorial in the documentation. First, create a project template with scrapy.
$ scrapy startproject hello
Create a file directly under spiders and write the crawling and scraping logic.
hello/hello/spiders/scrape.py
from scrapy.spider import Spider
from scrapy.selector import Selector

class HelloSpider(Spider):
    name = "hello"                             # spider name used by `scrapy crawl`
    allowed_domains = ["google.co.jp"]
    start_urls = ["http://www.google.co.jp/"]

    def parse(self, response):
        sel = Selector(response)
        title = sel.css('title::text').extract()  # CSS selector with ::text
        print title
You can use either XPath or CSS selectors to get the elements.
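For example, the title lookup in parse above could be written with XPath instead (a sketch using the same Selector object):

title = sel.xpath('//title/text()').extract()

So, try running the spider from the terminal.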
$ scrapy crawl hello
Output:
[u'Google']
Since a crawler comes as part of the package, it looks like a good choice when you want to build something more substantial.
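As a rough sketch of what that could look like, the spider above might be extended to follow links (hypothetical code against the same old-style Scrapy API; the spider name is made up):

import urlparse

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider

class CrawlTitlesSpider(Spider):
    name = "crawl_titles"                      # hypothetical spider name
    allowed_domains = ["google.co.jp"]         # keeps the crawl on-site
    start_urls = ["http://www.google.co.jp/"]

    def parse(self, response):
        sel = Selector(response)
        print sel.css('title::text').extract()
        # yield a new Request for every link; Scrapy schedules and crawls them
        for href in sel.xpath('//a/@href').extract():
            yield Request(urlparse.urljoin(response.url, href), callback=self.parse)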