Introduction to Scrapy (2)

Introduction

Introduction to Scrapy (1)

In the previous article, as a first step with Scrapy, I created a simple Spider and tried extracting URLs. Scrapy can do more than fetch web pages: it can also call Web APIs, download images, and so on. This time, let's create a Spider that calls a Web API and saves the result.

Creating a Spider

Create a Spider that looks up station information for a list of postal codes. The station data comes from the API provided by HeartRails (http://geoapi.heartrails.com/api.html). The Spider looks like this:

get_station_spider.py


# -*- encoding:utf-8 -*-

import json

from scrapy import Spider
from scrapy.http import Request


class GetStationSpider(Spider):
    name = "get_station_spider"
    # Must match the host used in end_point, otherwise requests may be filtered as off-site.
    allowed_domains = ["geoapi.heartrails.com"]
    end_point = "http://geoapi.heartrails.com/api/json?method=getStations&postal=%s"

    custom_settings = {
        "DOWNLOAD_DELAY": 1.5,
    }

    # List of postal codes (in practice these would come from a DB etc.)
    postal_list = [
        1080072,
        1050013,
        1350063,
        1020072,
        9012206,
    ]

    # Called once when the Spider starts. Yields one API request per postal code.
    def start_requests(self):
        for postal in self.postal_list:
            url = self.end_point % postal
            yield Request(url, self.parse)

    # Called after each download completes. Extracts the first station
    # from the JSON response and yields it as a dictionary.
    def parse(self, response):
        data = json.loads(response.body)
        result = data['response']['station'][0]

        yield {
            'postal': result["postal"],
            'name': result["name"],
            'line': result["line"],
            'latitude': result["y"],
            'longitude': result["x"],
            'prev': result['prev'],
            'next': result['next'],
        }
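
The parse method above assumes that every response contains at least one station. As a minimal sketch of a more defensive variant, the two helpers below build the request URL and extract the first station without raising when the list is missing. The names build_url and extract_station are hypothetical, and the shape of a failed response is an assumption that has not been checked against the API documentation.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the Spider above, without the query string.
API_URL = "http://geoapi.heartrails.com/api/json"


def build_url(postal):
    """Build the getStations URL for one postal code (hypothetical helper).

    Postal codes stored as integers lose a leading zero, so the value
    is padded back to seven digits before it is put into the query.
    """
    query = urlencode({"method": "getStations", "postal": str(postal).zfill(7)})
    return f"{API_URL}?{query}"


def extract_station(body):
    """Return the first station as a dict, or None if no station is listed.

    Assumption: a failed lookup produces a payload without a non-empty
    response.station list. The real error format is not verified here.
    """
    payload = json.loads(body)
    stations = (payload.get("response") or {}).get("station") or []
    if not stations:
        return None
    first = stations[0]
    return {
        "postal": first.get("postal"),
        "name": first.get("name"),
        "line": first.get("line"),
        "latitude": first.get("y"),
        "longitude": first.get("x"),
        "prev": first.get("prev"),
        "next": first.get("next"),
    }
```

Inside the Spider, parse could then call extract_station(response.body) and yield the item only when it is not None, so a single postal code without stations does not produce an error in the crawl log.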

Run

Just like last time, use the command that comes with Scrapy to run the crawl. The -o option writes the crawl results to stations.json.

scrapy runspider get_station_spider.py -o stations.json

Result

[
  {
    "prev": "\u767d\u91d1\u53f0",
    "name": "\u767d\u91d1\u9ad8\u8f2a",
    "longitude": 139.734286,
    "next": "\u4e09\u7530",
    "latitude": 35.643147,
    "line": "\u90fd\u55b6\u4e09\u7530\u7dda",
    "postal": "1080072"
  },
  {
    "prev": null,
    "name": "\u30e2\u30ce\u30ec\u30fc\u30eb\u6d5c\u677e\u753a",
    "longitude": 139.75667,
    "next": "\u5929\u738b\u6d32\u30a2\u30a4\u30eb",
    "latitude": 35.655746,
    "line": "\u6771\u4eac\u30e2\u30ce\u30ec\u30fc\u30eb\u7fbd\u7530\u7dda",
    "postal": "1050013"
  },
  {
    "prev": "\u9752\u6d77",
    "name": "\u56fd\u969b\u5c55\u793a\u5834\u6b63\u9580",
    "longitude": 139.7913,
    "next": "\u6709\u660e",
    "latitude": 35.630212,
    "line": "\u65b0\u4ea4\u901a\u3086\u308a\u304b\u3082\u3081",
    "postal": "1350063"
  },
  {
    "prev": "\u795e\u697d\u5742",
    "name": "\u98ef\u7530\u6a4b",
    "longitude": 139.746657,
    "next": "\u4e5d\u6bb5\u4e0b",
    "latitude": 35.701332,
    "line": "\u6771\u4eac\u30e1\u30c8\u30ed\u6771\u897f\u7dda",
    "postal": "1020072"
  },
  {
    "prev": "\u5e02\u7acb\u75c5\u9662\u524d",
    "name": "\u5100\u4fdd",
    "longitude": 127.719295,
    "next": "\u9996\u91cc",
    "latitude": 26.224491,
    "line": "\u6c96\u7e04\u3086\u3044\u30ec\u30fc\u30eb",
    "postal": "9030821"
  }
]
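
The \uXXXX sequences in this output are ordinary JSON escapes for the Japanese station and line names, so no information is lost. The short sketch below uses only the standard json module and one record copied from the output to show that loading the file restores the original text. If readable output in the file itself is preferred, Scrapy's FEED_EXPORT_ENCODING setting can be set to "utf-8", for example in custom_settings. That setting is not part of the Spider shown in this article.

```python
import json

# One record copied from stations.json above, still in escaped form.
# The raw string keeps the backslashes exactly as they appear in the file.
raw = r'{"name": "\u767d\u91d1\u9ad8\u8f2a", "line": "\u90fd\u55b6\u4e09\u7530\u7dda"}'

# json.loads turns each \uXXXX escape back into the real character.
record = json.loads(raw)

# ensure_ascii=False keeps non-ASCII characters unescaped when dumping again.
readable = json.dumps(record, ensure_ascii=False)

print(record["name"])  # prints 白金高輪
print(readable)
```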

Conclusion

With Scrapy, it is easy to write everything from calling a Web API to saving the results. Developers only need to write the classes and methods that the framework calls, so they can concentrate on the essential parts of the task. Next time, I will cover downloading image files. Stay tuned!
