I decided to try scraping, so I looked into how to do it in Node, Ruby, and Python. The task in each case is the same: fetch the title of google.co.jp.
Starting with Node: fetch the page with request, then parse it and search for elements with cheerio's jQuery-like selectors.
First install both modules from the terminal.
$ npm install request cheerio
Create a file and implement it.
scrape.js
var request = require('request'),
    cheerio = require('cheerio');

var url = 'http://google.co.jp';

request(url, function (error, response, body) {
    if (!error && response.statusCode === 200) {
        var $ = cheerio.load(body),        // parse the fetched HTML
            title = $('title').text();     // find the title with a jQuery-like selector
        console.log(title);
    }
});
Try to run it.
$ node scrape.js
Google
cheerio implements not just element lookup but also some of jQuery's methods, such as $.addClass and $.append, so it seems like a good fit for cases where you want to manipulate the DOM.
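For example, a minimal sketch of that kind of manipulation (the markup here is made up for illustration):

var cheerio = require('cheerio');

var $ = cheerio.load('<ul><li>one</li><li>two</li></ul>');
$('li').addClass('item');          // add a class to every li
$('ul').append('<li>three</li>');  // append a new element
console.log($.html());             // serialize the modified DOM back to HTML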
Next, Ruby. When I searched around, Nokogiri came up first; it seems to be the de facto standard. Fetch the page with open-uri and parse it with Nokogiri.
$ gem install nokogiri
open-uri ships with the standard library, so only Nokogiri needs to be installed. Create a file.
scrape.rb
require 'open-uri'
require 'nokogiri'

url = 'http://www.google.co.jp/'
html = open(url)                  # fetch the page with open-uri
doc = Nokogiri::HTML.parse(html)  # parse the HTML
puts doc.css('title').text        # find the title with a CSS selector
The object returned by HTML.parse can be searched with XPath, CSS selectors, or both (a quick sketch after the run output shows the two side by side). CSS selectors are easy and nice.
Try to run it.
$ ruby scrape.rb
Google
It gets the job done very quickly.
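As promised, a quick sketch showing that the same title lookup works with either style of query:

# CSS selector and XPath both find the same element
puts doc.css('title').text
puts doc.xpath('//title').text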
Next, Python. I found Scrapy first, but it's a somewhat larger library, so I tried the more casual BeautifulSoup instead. The standard library has HTMLParser, but BeautifulSoup seems to take care of a lot more for you.
The installation didn't work with pip, so I installed it with easy_install.
$ easy_install BeautifulSoup
The flow is: fetch the page with urllib, then parse it with BeautifulSoup.
scrape.py
import urllib
import BeautifulSoup

url = 'http://www.google.co.jp/'
html = urllib.urlopen(url).read()         # fetch the page
soup = BeautifulSoup.BeautifulSoup(html)  # parse the HTML
print soup.find('title').string           # find the title element
Try to run it.
$ python scrape.py
Google
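For comparison, here is roughly what the same extraction looks like with the standard HTMLParser mentioned earlier (a sketch; the TitleParser class is mine, and it is noticeably more work than BeautifulSoup):

import urllib
from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed(urllib.urlopen('http://www.google.co.jp/').read())
print parser.title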
Scrapy seems to be a framework that bundles crawling and scraping. I tried it out for a bit, so here are my notes.
$ pip install scrapy
Take a quick look at the tutorial in the documentation. First, create a project template with scrapy.
$ scrapy startproject hello
Create a file directly under spiders and write the crawling and scraping logic.
hello/hello/spiders/scrape.py
from scrapy.spider import Spider
from scrapy.selector import Selector

class HelloSpider(Spider):
    name = "hello"                             # spider name used by `scrapy crawl`
    allowed_domains = ["google.co.jp"]
    start_urls = ["http://www.google.co.jp/"]

    def parse(self, response):
        sel = Selector(response)
        title = sel.css('title::text').extract()  # CSS selector with ::text
        print title
You can use either XPath or CSS selectors to get the elements.
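For example, the title lookup in parse above could be written with XPath instead (a sketch using the same Selector object):

title = sel.xpath('//title/text()').extract()

So, try running the spider from the terminal.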
$ scrapy crawl hello
Output:
[u'Google']
Since a crawler comes as part of the package, it looks like a good choice when you want to build something more substantial.
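As a rough sketch of what that could look like, the spider above might be extended to follow links (hypothetical code against the same old-style Scrapy API; the spider name is made up):

import urlparse

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider

class CrawlTitlesSpider(Spider):
    name = "crawl_titles"                      # hypothetical spider name
    allowed_domains = ["google.co.jp"]         # keeps the crawl on-site
    start_urls = ["http://www.google.co.jp/"]

    def parse(self, response):
        sel = Selector(response)
        print sel.css('title::text').extract()
        # yield a new Request for every link; Scrapy schedules and crawls them
        for href in sel.xpath('//a/@href').extract():
            yield Request(urlparse.urljoin(response.url, href), callback=self.parse)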