Introduction to Scrapy (2)

Introduction

In the previous article, as a first step in Scrapy, I created a simple Spider and tried the URL extraction process. Scrapy can not only get web pages, but also get Web API results, download images, and more. This time, let's create a Spider that calls the Web API and saves the result.

Creating a Spider

Create a Spider to get station information based on the list of zip codes. To get the zip code, use the API (http://geoapi.heartrails.com/api.html) provided by HeartRails. The Spider looks like this:

`get_station_spider.py`


# -*- encoding:utf-8 -*-

import json

from scrapy import Spider
from scrapy.http import Request


class GetStationSpider(Spider):
    name = "get_station_spider"
    allowed_domains = ["express.heartrails.com"]
    end_point = "http://geoapi.heartrails.com/api/json?method=getStations&postal=%s"

    custom_settings = {
        "DOWNLOAD_DELAY": 1.5,
    }

    #List of zip codes (originally obtained from DB etc.)
    postal_list = [
        1080072,
        1050013,
        1350063,
        1020072,
        9012206,
    ]

    #This method is called when Spider is started. Make a request to call the API.
    def start_requests(self):
        for postal in self.postal_list:
            url = self.end_point % postal
            yield Request(url, self.parse)

    #A method called after the download is complete. Extract information from the response and return it in dictionary format
    def parse(self, response):
        response = json.loads(response.body)
        result = response['response']['station'][0]

        yield {
            'postal': result["postal"],
            'name': result["name"],
            'line': result["line"],
            'latitude': result["y"],
            'longitude': result["x"],
            'prev': result['prev'],
            'next': result['next'],
        }

Run

Just like last time, use the commands that come with Scrapy to crawl. Use the -o option to print the crawl results to stations.json.

scrapy runspider get_station_spider.py -o stations.json

result

[
  {
    "prev": "\u767d\u91d1\u53f0",
    "name": "\u767d\u91d1\u9ad8\u8f2a",
    "longitude": 139.734286,
    "next": "\u4e09\u7530",
    "latitude": 35.643147,
    "line": "\u90fd\u55b6\u4e09\u7530\u7dda",
    "postal": "1080072"
  },
  {
    "prev": null,
    "name": "\u30e2\u30ce\u30ec\u30fc\u30eb\u6d5c\u677e\u753a",
    "longitude": 139.75667,
    "next": "\u5929\u738b\u6d32\u30a2\u30a4\u30eb",
    "latitude": 35.655746,
    "line": "\u6771\u4eac\u30e2\u30ce\u30ec\u30fc\u30eb\u7fbd\u7530\u7dda",
    "postal": "1050013"
  },
  {
    "prev": "\u9752\u6d77",
    "name": "\u56fd\u969b\u5c55\u793a\u5834\u6b63\u9580",
    "longitude": 139.7913,
    "next": "\u6709\u660e",
    "latitude": 35.630212,
    "line": "\u65b0\u4ea4\u901a\u3086\u308a\u304b\u3082\u3081",
    "postal": "1350063"
  },
  {
    "prev": "\u795e\u697d\u5742",
    "name": "\u98ef\u7530\u6a4b",
    "longitude": 139.746657,
    "next": "\u4e5d\u6bb5\u4e0b",
    "latitude": 35.701332,
    "line": "\u6771\u4eac\u30e1\u30c8\u30ed\u6771\u897f\u7dda",
    "postal": "1020072"
  },
  {
    "prev": "\u5e02\u7acb\u75c5\u9662\u524d",
    "name": "\u5100\u4fdd",
    "longitude": 127.719295,
    "next": "\u9996\u91cc",
    "latitude": 26.224491,
    "line": "\u6c96\u7e04\u3086\u3044\u30ec\u30fc\u30eb",
    "postal": "9030821"
  }
]

At the end

With Scrapy, it's very easy to write everything from calling a Web API to saving the execution results. Developers only need to create classes and functions that are called from the framework side, so it is possible to concentrate on the more essential parts. Next time, I will cover the image file download process. looking forward to!