Purpose

The National Land Numerical Information download service provides data managed by the Ministry of Land, Infrastructure, Transport and Tourism. In this article, I use its railway data to plot coordinates on Google Maps.

demo: http://needtec.sakura.ne.jp/railway_location/railway

Git: https://github.com/mima3/railway_location

About data

Railway data can be downloaded from the following page.

**National Land Numerical Information: Railway data** http://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-N02-v2_2.html

For the XML format used in the downloaded file, refer to the product specification: http://nlftp.mlit.go.jp/ksj/gml/product_spec/KS-PS-N02-v2_1.pdf

In short, the railway data describes the shape of each line and information about each station. The important elements are:

・ gml:Curve: curve (coordinate) information

・ ksj:RailroadSection: railway section information

・ ksj:Station: station information

Coordinates are stored in Curve elements. The location element of each RailroadSection and Station holds a link to the corresponding Curve.
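As a rough illustration of that link, the sketch below parses a tiny hand-written sample and follows a Station's location reference to its Curve. The sample XML is a heavily simplified assumption rather than an excerpt from a real N02 file, and the ids and coordinates are made up. It uses the standard library's xml.etree.ElementTree only so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

NS = {
    'ksj': 'http://nlftp.mlit.go.jp/ksj/schemas/ksj-app',
    'gml': 'http://www.opengis.net/gml/3.2',
    'xlink': 'http://www.w3.org/1999/xlink',
}

# Hand-written, simplified sample (hypothetical ids and coordinates).
SAMPLE = '''<ksj:Dataset
    xmlns:ksj="http://nlftp.mlit.go.jp/ksj/schemas/ksj-app"
    xmlns:gml="http://www.opengis.net/gml/3.2"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <gml:Curve gml:id="cv_stn1">
    <gml:segments><gml:LineStringSegment><gml:posList>
      35.681 139.767
      35.682 139.768
    </gml:posList></gml:LineStringSegment></gml:segments>
  </gml:Curve>
  <ksj:Station gml:id="eb03_1">
    <ksj:location xlink:href="#cv_stn1"/>
    <ksj:stationName>Sample</ksj:stationName>
  </ksj:Station>
</ksj:Dataset>'''

root = ET.fromstring(SAMPLE)
GML_ID = '{%s}id' % NS['gml']
HREF = '{%s}href' % NS['xlink']

# Map each Curve id to its list of (lat, lng) pairs.
curves = {}
for curve in root.iterfind('gml:Curve', NS):
    points = []
    for pos_list in curve.iterfind('.//gml:posList', NS):
        for line in pos_list.text.strip().splitlines():
            lat, lng = line.split()
            points.append((float(lat), float(lng)))
    curves[curve.get(GML_ID)] = points

# Follow the Station's location link: drop the leading '#' to get the Curve id.
station = root.find('ksj:Station', NS)
curve_id = station.find('ksj:location', NS).get(HREF)[1:]
station_points = curves[curve_id]
```

The same lookup works for RailroadSection, whose location element points at a Curve in the same way.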

Storage in database

Because a large amount of data is awkward to handle directly as XML, I first load it into a relational database.

The XML file is large. If the whole file is loaded into memory and parsed at once, memory usage rises sharply and processing may fail. Therefore, use lxml.etree.iterparse to process the elements sequentially.
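The sequential pattern can be sketched as follows. This is a minimal example under stated assumptions: it uses the standard library's xml.etree.ElementTree.iterparse instead of lxml so that it runs anywhere, the function name iter_curve_ids and the two-element sample are hypothetical, and the tag is checked by hand because the standard library version does not take a tag argument.

```python
import io
import xml.etree.ElementTree as ET

GML = 'http://www.opengis.net/gml/3.2'


def iter_curve_ids(source):
    """Yield the gml:id of each Curve while the file is still being read."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == '{%s}Curve' % GML:
            yield elem.get('{%s}id' % GML)
            # Drop the element's children, text and attributes once handled.
            elem.clear()


# Tiny made-up sample standing in for a large N02-XX.xml file.
sample = io.BytesIO(b'''<root xmlns:gml="http://www.opengis.net/gml/3.2">
  <gml:Curve gml:id="cv_1"/>
  <gml:Curve gml:id="cv_2"/>
</root>''')

curve_ids = list(iter_curve_ids(sample))
```

Each element is handled as soon as its end tag is seen, so the caller never needs the complete tree at once.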

However, parsing N02-XX.xml with lxml.etree.iterparse raises an error. This is because the XML contains a line like this:

 xmlns:schemaLocation="http://nlftp.mlit.go.jp/ksj/schemas/ksj-app KsjAppSchema-N02-v2_0.xsd">

lxml treats the URI specified here as invalid and reports an error. To avoid this, specify recover=True when parsing the XML with lxml. http://stackoverflow.com/questions/18692965/how-do-i-skip-validating-the-uri-in-lxml

**Workaround:**

        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://www.opengis.net/gml/3.2}Curve',
            recover=True
        )

iterparse accepts this argument from lxml 3.4.1 onward, so specify the version when installing lxml.

easy_install lxml==3.4.1

Based on the above, the process of importing railway data XML into the database is as follows.

`railway_db.py`


# -*- coding: utf-8 -*-
import sqlite3
import sys
import os
# easy_install lxml==3.4.1
from lxml import etree
from peewee import *

database_proxy = Proxy()
database = None


class BaseModel(Model):
    """
    Base class for models
    """
    class Meta:
        database = database_proxy


class Curve(BaseModel):
    """
    Curve information model
    """
    curve_id = CharField(index=True, unique=False)
    lat = DoubleField()
    lng = DoubleField()


class RailRoadSection(BaseModel):
    """
    Railway section information model
    """
    gml_id = CharField(primary_key=True)
    # A foreign key must reference a primary key or a unique column.
    # Several Curve rows share one curve_id, so location cannot be a foreign key.
    location = CharField(index=True)
    railway_type = IntegerField()
    service_provider_type = IntegerField()
    railway_line_name = CharField(index=True)
    operation_company = CharField(index=True)


class Station(BaseModel):
    """
    Station information model
    """
    gml_id = CharField(primary_key=True)
    # A foreign key must reference a primary key or a unique column.
    # Several Curve rows share one curve_id, so location cannot be a foreign key.
    location = CharField(index=True)
    railway_type = IntegerField()
    service_provider_type = IntegerField()
    railway_line_name = CharField(index=True)
    operation_company = CharField(index=True)
    station_name = CharField(index=True)
    railroad_section = ForeignKeyField(
        db_column='railroad_section_id',
        rel_model=RailRoadSection,
        to_field='gml_id',
        index=True
    )


def setup(path):
    """
    Database setup
    @param path database path
    """
    global database
    database = SqliteDatabase(path)
    database_proxy.initialize(database)
    # safe=True skips tables that already exist
    database.create_tables([Curve, RailRoadSection, Station], safe=True)


def import_railway(xml):
    """
    Import line and station information from the National Land Numerical
    Information file N02-XX.xml
    TODO:
    The foreign key lookup during import is inefficient
    @param xml XML path
    """
    commit_cnt = 2000  # flush the INSERT buffer every commit_cnt rows
    namespaces = {
        'ksj': 'http://nlftp.mlit.go.jp/ksj/schemas/ksj-app',
        'gml': 'http://www.opengis.net/gml/3.2',
        'xlink': 'http://www.w3.org/1999/xlink',
        'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
    }

    with database.transaction():
        insert_buff = []
        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://www.opengis.net/gml/3.2}Curve',
            recover=True
        )
        for event, curve in context:
            curveId = curve.get('{http://www.opengis.net/gml/3.2}id')
            print (curveId)
            posLists = curve.xpath('.//gml:posList', namespaces=namespaces)
            for posList in posLists:
                points = posList.text.split("\n")
                for point in points:
                    pt = point.strip().split(' ')
                    if len(pt) != 2:
                        continue
                    insert_buff.append({
                        'curve_id': curveId,
                        'lat': float(pt[0]),
                        'lng': float(pt[1])
                    })
                    if len(insert_buff) >= commit_cnt:
                        Curve.insert_many(insert_buff).execute()
                        insert_buff = []
        if len(insert_buff):
            Curve.insert_many(insert_buff).execute()
        insert_buff = []
        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}RailroadSection',
            recover=True
        )
        for event, railroad in context:
            railroadSectionId = railroad.get(
                '{http://www.opengis.net/gml/3.2}id'
            )
            locationId = railroad.find(
                'ksj:location',
                namespaces=namespaces
            ).get('{http://www.w3.org/1999/xlink}href')[1:]
            railwayType = railroad.find(
                'ksj:railwayType', namespaces=namespaces
            ).text
            serviceProviderType = railroad.find(
                'ksj:serviceProviderType',
                namespaces=namespaces
            ).text
            railwayLineName = railroad.find(
                'ksj:railwayLineName',
                namespaces=namespaces
            ).text
            operationCompany = railroad.find(
                'ksj:operationCompany',
                namespaces=namespaces
            ).text
            insert_buff.append({
                'gml_id': railroadSectionId,
                'location': locationId,
                'railway_type': railwayType,
                'service_provider_type': serviceProviderType,
                'railway_line_name': railwayLineName,
                'operation_company': operationCompany
            })
            print (railroadSectionId)
            if len(insert_buff) >= commit_cnt:
                RailRoadSection.insert_many(insert_buff).execute()
                insert_buff = []
        if len(insert_buff):
            RailRoadSection.insert_many(insert_buff).execute()

        insert_buff = []
        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Station',
            recover=True
        )
        for event, railroad in context:
            stationId = railroad.get('{http://www.opengis.net/gml/3.2}id')
            locationId = railroad.find(
                'ksj:location', namespaces=namespaces
            ).get('{http://www.w3.org/1999/xlink}href')[1:]
            railwayType = railroad.find(
                'ksj:railwayType',
                namespaces=namespaces
            ).text
            serviceProviderType = railroad.find(
                'ksj:serviceProviderType',
                namespaces=namespaces
            ).text
            railwayLineName = railroad.find(
                'ksj:railwayLineName',
                namespaces=namespaces
            ).text
            operationCompany = railroad.find(
                'ksj:operationCompany',
                namespaces=namespaces
            ).text
            stationName = railroad.find(
                'ksj:stationName',
                namespaces=namespaces
            ).text
            railroadSection = railroad.find(
                'ksj:railroadSection',
                namespaces=namespaces
            ).get('{http://www.w3.org/1999/xlink}href')[1:]
            print (stationId)
            insert_buff.append({
                'gml_id': stationId,
                'location': locationId,
                'railway_type': railwayType,
                'service_provider_type': serviceProviderType,
                'railway_line_name': railwayLineName,
                'operation_company': operationCompany,
                'station_name': stationName,
                'railroad_section': RailRoadSection.get(
                    RailRoadSection.gml_id == railroadSection
                )
            })
            if len(insert_buff) >= commit_cnt:
                Station.insert_many(insert_buff).execute()
                insert_buff = []
        if len(insert_buff):
            Station.insert_many(insert_buff).execute()

Once the data is stored in the database, the rest is straightforward.
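For instance, the coordinates needed to draw one line could be fetched with a join like the one below. This is only a sketch: it builds a small in-memory SQLite database with made-up rows, the helper name line_points is hypothetical, and the table names curve and railroadsection assume peewee's default naming (the lowercased class name) with only the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Reduced copies of the tables, filled with made-up rows for illustration.
conn.executescript('''
CREATE TABLE curve (id INTEGER PRIMARY KEY, curve_id TEXT, lat REAL, lng REAL);
CREATE TABLE railroadsection (
    gml_id TEXT PRIMARY KEY, location TEXT,
    railway_line_name TEXT, operation_company TEXT);
INSERT INTO curve (curve_id, lat, lng) VALUES
    ('cv_1', 35.0, 139.0), ('cv_1', 35.1, 139.1), ('cv_2', 36.0, 140.0);
INSERT INTO railroadsection VALUES
    ('rs_1', 'cv_1', 'Line A', 'Company X'),
    ('rs_2', 'cv_2', 'Line B', 'Company X');
''')


def line_points(conn, company, line_name):
    """Return (curve_id, lat, lng) rows for every section of one line."""
    sql = '''
        SELECT c.curve_id, c.lat, c.lng
        FROM railroadsection AS r
        JOIN curve AS c ON c.curve_id = r.location
        WHERE r.operation_company = ? AND r.railway_line_name = ?
        ORDER BY c.curve_id, c.id
    '''
    return conn.execute(sql, (company, line_name)).fetchall()


points = line_points(conn, 'Company X', 'Line A')
```

Ordering by the autoincrement id keeps the points of each curve in insertion order, which matters when they are drawn as a polyline.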

Precautions for use

Here are some points I noticed while working with the National Land Numerical Information railway data.

・ A line cannot be identified by its name alone. For example, "Line 1" may belong to "Yokohama City" or to "Chiba Monorail". Therefore, you need to filter by both the operating company and the line name.

・ Names may differ from the ones in everyday use. JR East is recorded as "East Japan Railway Company", and Tokyo Metro is likewise recorded under its formal company name.

・ Lines may be defined differently from everyday usage. For example, an ordinary route map includes "Tokyo" in the "Chuo Line". In the National Land Numerical Information, however, "Tokyo" is not part of the "Chuo Line"; the "Tokyo"-"Kanda" section is treated as the "Tohoku Line". This seems to be because trains between Tokyo Station and Kanda Station run on dedicated tracks laid along the Tohoku Main Line.
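The first point above can be checked mechanically by listing line names that appear under more than one operating company. The rows below are made up for illustration (including "Line 3"); in practice they would come from the operation_company and railway_line_name columns.

```python
from collections import defaultdict

# Made-up (operation_company, railway_line_name) pairs.
rows = [
    ('Yokohama City', 'Line 1'),
    ('Chiba Monorail', 'Line 1'),
    ('Yokohama City', 'Line 3'),
]

# Collect the set of companies that use each line name.
companies_by_line = defaultdict(set)
for company, line in rows:
    companies_by_line[line].add(company)

# A line name is ambiguous when more than one company uses it.
ambiguous = sorted(
    line for line, companies in companies_by_line.items()
    if len(companies) > 1
)
```

Any name that ends up in ambiguous has to be combined with the operating company before it can be used as a filter.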
