The National Land Numerical Information download service provides data managed by Japan's Ministry of Land, Infrastructure, Transport and Tourism. In this post, I use its railway data to plot coordinates on Google Maps.
demo: http://needtec.sakura.ne.jp/railway_location/railway
GIT: https://github.com/mima3/railway_location
Railway data can be downloaded from the following page.
**National Land Numerical Information: Railway data** http://nlftp.mlit.go.jp/ksj/gml/datalist/KsjTmplt-N02-v2_2.html
For the XML structure of the downloaded file, refer to the product specification: http://nlftp.mlit.go.jp/ksj/gml/product_spec/KS-PS-N02-v2_1.pdf
Simply put, the railway data contains the shapes of the lines and information about the stations. The important elements are:
・ gml:Curve: curve information
・ ksj:RailroadSection: railway section information
・ ksj:Station: station information
Coordinates are stored in Curve. RailroadSection and Station each point to their Curve through a link stored in their location element.
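To make the relationship concrete, here is a minimal sketch that follows a station's location link to its Curve and reads the coordinates. The XML fragment is hand-written and simplified for illustration (the IDs, station name, and coordinates are made up, and a real N02 file has more elements), and the standard library's xml.etree.ElementTree is used only so the snippet runs without extra packages.

```python
import xml.etree.ElementTree as ET

NS = {
    'ksj': 'http://nlftp.mlit.go.jp/ksj/schemas/ksj-app',
    'gml': 'http://www.opengis.net/gml/3.2',
    'xlink': 'http://www.w3.org/1999/xlink',
}
GML_ID = '{%s}id' % NS['gml']
XLINK_HREF = '{%s}href' % NS['xlink']

# Hypothetical, simplified fragment: IDs, name and coordinates are made up.
SAMPLE = """
<ksj:Dataset xmlns:ksj="http://nlftp.mlit.go.jp/ksj/schemas/ksj-app"
             xmlns:gml="http://www.opengis.net/gml/3.2"
             xmlns:xlink="http://www.w3.org/1999/xlink">
  <gml:Curve gml:id="cv_1">
    <gml:segments>
      <gml:LineStringSegment>
        <gml:posList>
          35.0 139.0
          35.1 139.1
        </gml:posList>
      </gml:LineStringSegment>
    </gml:segments>
  </gml:Curve>
  <ksj:Station gml:id="st_1">
    <ksj:location xlink:href="#cv_1"/>
    <ksj:stationName>Example</ksj:stationName>
  </ksj:Station>
</ksj:Dataset>
"""


def curve_points(root, curve_id):
    """Return the (lat, lng) pairs of the gml:Curve with the given gml:id."""
    for curve in root.iter('{%s}Curve' % NS['gml']):
        if curve.get(GML_ID) != curve_id:
            continue
        points = []
        for pos_list in curve.iter('{%s}posList' % NS['gml']):
            for line in pos_list.text.splitlines():
                pt = line.split()
                if len(pt) == 2:
                    points.append((float(pt[0]), float(pt[1])))
        return points
    return []


root = ET.fromstring(SAMPLE.strip())
station = root.find('ksj:Station', NS)
# The href is "#cv_1"; dropping the leading "#" gives the Curve's gml:id.
curve_id = station.find('ksj:location', NS).get(XLINK_HREF)[1:]
station_points = curve_points(root, curve_id)
```

Scanning every Curve for each lookup is only workable on a tiny sample like this one.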
Because a large amount of data is hard to handle as XML, I first load it into a relational database.
The XML file is large, so reading the whole file into memory before parsing it drives memory usage up sharply and the import may not finish. Use lxml.etree.iterparse to process the file sequentially instead.
However, parsing N02-XX.xml with lxml.etree.iterparse raises an error, because the XML contains a line like this:
xmlns:schemaLocation="http://nlftp.mlit.go.jp/ksj/schemas/ksj-app KsjAppSchema-N02-v2_0.xsd">
lxml treats the URI given here as invalid and raises an error. To get around this, pass recover=True when parsing the XML with lxml. http://stackoverflow.com/questions/18692965/how-do-i-skip-validating-the-uri-in-lxml
**Workaround:**
context = etree.iterparse(
    xml,
    events=('end',),
    tag='{http://www.opengis.net/gml/3.2}Curve',
    recover=True
)
iterparse accepts this argument from lxml 3.4.1 onward, so specify the version when installing lxml.
easy_install lxml==3.4.1
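The sequential-processing idea itself can be sketched as follows. This is a minimal illustration using the standard library's xml.etree.ElementTree.iterparse rather than lxml, so it runs without extra packages. The standard-library version has no tag or recover arguments, so the tag is filtered by hand. The miniature document is a made-up stand-in for N02-XX.xml. Calling clear() on each handled element is a common companion to iterparse; it is an addition here and is not part of the code in this post.

```python
import io
import xml.etree.ElementTree as ET

GML_NS = 'http://www.opengis.net/gml/3.2'
CURVE_TAG = '{%s}Curve' % GML_NS


def iter_curve_ids(source):
    """Yield the gml:id of each Curve as soon as its end tag is parsed."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag != CURVE_TAG:
            continue
        yield elem.get('{%s}id' % GML_NS)
        # Drop the element's children, text and attributes once it has been
        # handled, so that processed curves do not pile up in memory.
        elem.clear()


# Hypothetical miniature document standing in for N02-XX.xml.
sample = io.BytesIO(
    b'<root xmlns:gml="http://www.opengis.net/gml/3.2">'
    b'<gml:Curve gml:id="cv_1"/><gml:Curve gml:id="cv_2"/>'
    b'</root>'
)
curve_ids = list(iter_curve_ids(sample))
```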
Based on the above, the code that imports the railway data XML into the database is as follows.
railway_db.py
# -*- coding: utf-8 -*-
import sqlite3
import sys
import os
# easy_install lxml==3.4.1
from lxml import etree
from peewee import *
database_proxy = Proxy()
database = None


class BaseModel(Model):
    """
    Base class for all models
    """
    class Meta:
        database = database_proxy


class Curve(BaseModel):
    """
    Curve information model (one row per coordinate pair)
    """
    curve_id = CharField(index=True, unique=False)
    lat = DoubleField()
    lng = DoubleField()


class RailRoadSection(BaseModel):
    """
    Railway section information model
    """
    gml_id = CharField(primary_key=True)
    # A foreign key must reference a primary key or a unique column.
    # Curve.curve_id repeats across many rows, so location is kept as a
    # plain indexed column instead of a foreign key.
    location = CharField(index=True)
    railway_type = IntegerField()
    service_provider_type = IntegerField()
    railway_line_name = CharField(index=True)
    operation_company = CharField(index=True)


class Station(BaseModel):
    """
    Station information model
    """
    gml_id = CharField(primary_key=True)
    # A foreign key must reference a primary key or a unique column.
    # Curve.curve_id repeats across many rows, so location is kept as a
    # plain indexed column instead of a foreign key.
    location = CharField(index=True)
    railway_type = IntegerField()
    service_provider_type = IntegerField()
    railway_line_name = CharField(index=True)
    operation_company = CharField(index=True)
    station_name = CharField(index=True)
    railroad_section = ForeignKeyField(
        db_column='railroad_section_id',
        rel_model=RailRoadSection,
        to_field='gml_id',
        index=True
    )


def setup(path):
    """
    Database setup
    @param path database path
    """
    global database
    database = SqliteDatabase(path)
    database_proxy.initialize(database)
    database.create_tables([Curve, RailRoadSection, Station], True)


def import_railway(xml):
    """
    Import line and station information from the National Land Numerical
    Information file N02-XX.xml
    TODO:
      The foreign key lookup during the station import is inefficient
    @param xml XML path
    """
    commit_cnt = 2000  # Number of rows to buffer before each INSERT
    namespaces = {
        'ksj': 'http://nlftp.mlit.go.jp/ksj/schemas/ksj-app',
        'gml': 'http://www.opengis.net/gml/3.2',
        'xlink': 'http://www.w3.org/1999/xlink',
        'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
    }

    with database.transaction():
        # Pass 1: curves (one row per coordinate pair)
        insert_buff = []
        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://www.opengis.net/gml/3.2}Curve',
            recover=True
        )
        for event, curve in context:
            curveId = curve.get('{http://www.opengis.net/gml/3.2}id')
            print(curveId)
            posLists = curve.xpath('.//gml:posList', namespaces=namespaces)
            for posList in posLists:
                points = posList.text.split("\n")
                for point in points:
                    pt = point.strip().split(' ')
                    if len(pt) != 2:
                        continue
                    insert_buff.append({
                        'curve_id': curveId,
                        'lat': float(pt[0]),
                        'lng': float(pt[1])
                    })
                    if len(insert_buff) >= commit_cnt:
                        Curve.insert_many(insert_buff).execute()
                        insert_buff = []
        if len(insert_buff):
            Curve.insert_many(insert_buff).execute()

        # Pass 2: railway sections
        insert_buff = []
        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}RailroadSection',
            recover=True
        )
        for event, railroad in context:
            railroadSectionId = railroad.get(
                '{http://www.opengis.net/gml/3.2}id'
            )
            # The href looks like "#<curve id>"; drop the leading "#"
            locationId = railroad.find(
                'ksj:location',
                namespaces=namespaces
            ).get('{http://www.w3.org/1999/xlink}href')[1:]
            railwayType = railroad.find(
                'ksj:railwayType', namespaces=namespaces
            ).text
            serviceProviderType = railroad.find(
                'ksj:serviceProviderType',
                namespaces=namespaces
            ).text
            railwayLineName = railroad.find(
                'ksj:railwayLineName',
                namespaces=namespaces
            ).text
            operationCompany = railroad.find(
                'ksj:operationCompany',
                namespaces=namespaces
            ).text
            insert_buff.append({
                'gml_id': railroadSectionId,
                'location': locationId,
                'railway_type': railwayType,
                'service_provider_type': serviceProviderType,
                'railway_line_name': railwayLineName,
                'operation_company': operationCompany
            })
            print(railroadSectionId)
            if len(insert_buff) >= commit_cnt:
                RailRoadSection.insert_many(insert_buff).execute()
                insert_buff = []
        if len(insert_buff):
            RailRoadSection.insert_many(insert_buff).execute()

        # Pass 3: stations (each one refers to a railway section)
        insert_buff = []
        context = etree.iterparse(
            xml,
            events=('end',),
            tag='{http://nlftp.mlit.go.jp/ksj/schemas/ksj-app}Station',
            recover=True
        )
        for event, railroad in context:
            stationId = railroad.get('{http://www.opengis.net/gml/3.2}id')
            locationId = railroad.find(
                'ksj:location', namespaces=namespaces
            ).get('{http://www.w3.org/1999/xlink}href')[1:]
            railwayType = railroad.find(
                'ksj:railwayType',
                namespaces=namespaces
            ).text
            serviceProviderType = railroad.find(
                'ksj:serviceProviderType',
                namespaces=namespaces
            ).text
            railwayLineName = railroad.find(
                'ksj:railwayLineName',
                namespaces=namespaces
            ).text
            operationCompany = railroad.find(
                'ksj:operationCompany',
                namespaces=namespaces
            ).text
            stationName = railroad.find(
                'ksj:stationName',
                namespaces=namespaces
            ).text
            railroadSection = railroad.find(
                'ksj:railroadSection',
                namespaces=namespaces
            ).get('{http://www.w3.org/1999/xlink}href')[1:]
            print(stationId)
            insert_buff.append({
                'gml_id': stationId,
                'location': locationId,
                'railway_type': railwayType,
                'service_provider_type': serviceProviderType,
                'railway_line_name': railwayLineName,
                'operation_company': operationCompany,
                'station_name': stationName,
                # One SELECT per station; this is the inefficiency noted
                # in the TODO above
                'railroad_section': RailRoadSection.get(
                    RailRoadSection.gml_id == railroadSection
                )
            })
            if len(insert_buff) >= commit_cnt:
                Station.insert_many(insert_buff).execute()
                insert_buff = []
        if len(insert_buff):
            Station.insert_many(insert_buff).execute()
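The import repeats the same buffer-and-flush pattern three times. One way to factor it out is sketched below. This is only an illustration: it uses the standard library's sqlite3 module instead of peewee so that it runs on its own, and the helper name insert_in_batches, the table definition, and the sample rows are all hypothetical.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=2000):
    """Insert (curve_id, lat, lng) rows in batches; return the batch count."""
    sql = 'INSERT INTO curve (curve_id, lat, lng) VALUES (?, ?, ?)'
    buff = []
    batches = 0
    for row in rows:
        buff.append(row)
        if len(buff) >= batch_size:
            conn.executemany(sql, buff)
            buff = []
            batches += 1
    if buff:  # flush whatever is left over after the loop
        conn.executemany(sql, buff)
        batches += 1
    return batches


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE curve (curve_id TEXT, lat REAL, lng REAL)')
with conn:  # one transaction around all batches, committed on exit
    n_batches = insert_in_batches(
        conn, (('cv_1', 35.0 + i, 139.0) for i in range(5)), batch_size=2
    )
row_count = conn.execute('SELECT COUNT(*) FROM curve').fetchone()[0]
```

Because the rows are consumed from a generator, only one batch is held in memory at a time.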
Once the data is stored in the database, the rest is easy.
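For example, the coordinates of one line can be fetched with a single join. The sketch below uses the standard library's sqlite3 module against an in-memory database. The table names (curve, railroadsection) assume peewee's default of lowercasing the model class names, the tables are trimmed to the columns needed here, and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE curve (id INTEGER PRIMARY KEY, curve_id TEXT, lat REAL, lng REAL);
CREATE TABLE railroadsection (
    gml_id TEXT PRIMARY KEY, location TEXT,
    railway_line_name TEXT, operation_company TEXT
);
""")
# Made-up rows for illustration only.
conn.executemany(
    'INSERT INTO curve (curve_id, lat, lng) VALUES (?, ?, ?)',
    [('cv_1', 35.0, 139.0), ('cv_1', 35.1, 139.1), ('cv_2', 36.0, 140.0)],
)
conn.executemany(
    'INSERT INTO railroadsection VALUES (?, ?, ?, ?)',
    [('rs_1', 'cv_1', 'Line A', 'Company X'),
     ('rs_2', 'cv_2', 'Line B', 'Company X')],
)


def line_points(conn, company, line_name):
    """Return (curve_id, lat, lng) rows for one line, ordered by curve and
    then by row id (insertion order within a curve)."""
    sql = """
        SELECT c.curve_id, c.lat, c.lng
        FROM railroadsection AS r
        JOIN curve AS c ON c.curve_id = r.location
        WHERE r.operation_company = ? AND r.railway_line_name = ?
        ORDER BY c.curve_id, c.id
    """
    return conn.execute(sql, (company, line_name)).fetchall()


points = line_points(conn, 'Company X', 'Line A')
```

Each returned row is one vertex of a section's polyline, which is the shape a map overlay needs.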
Here are the points I noticed while handling the National Land Numerical Information railway data.
・ The line name alone is not enough to narrow down the data. For example, "Line 1" may belong to Yokohama City or to Chiba Monorail. You need to filter by both operating company and line name.
・ Names may differ from the ones in everyday use. For example, JR East is recorded under its formal name, East Japan Railway Company, and Tokyo Metro under its formal Japanese name (東京地下鉄) instead of the familiar brand name.
・ Lines may differ from the ones in everyday use. For example, an ordinary route map includes "Tokyo" in the "Chuo Line", but in the National Land Numerical Information "Tokyo" is not part of the "Chuo Line". The "Tokyo"-"Kanda" section is treated as the "Tohoku Line". This seems to be because, between Tokyo Station and Kanda Station, trains run on dedicated tracks that belong to the Tohoku Main Line.
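The first point can be checked directly in the database. The sketch below lists line names that are shared by more than one operating company. It uses sqlite3 with an in-memory table; the table name assumes peewee's default naming, and the rows are invented placeholders, not taken from the real data.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE railroadsection ('
    'gml_id TEXT PRIMARY KEY, railway_line_name TEXT, operation_company TEXT)'
)
# Placeholder rows; the names are invented, not taken from the real data.
conn.executemany(
    'INSERT INTO railroadsection VALUES (?, ?, ?)',
    [('rs_1', 'Line 1', 'Operator A'),
     ('rs_2', 'Line 1', 'Operator B'),
     ('rs_3', 'Line 2', 'Operator A')],
)

# Line names that are used by more than one operating company.
ambiguous = conn.execute("""
    SELECT railway_line_name, COUNT(DISTINCT operation_company)
    FROM railroadsection
    GROUP BY railway_line_name
    HAVING COUNT(DISTINCT operation_company) > 1
    ORDER BY railway_line_name
""").fetchall()
```

Any name that appears in this result has to be combined with the operating company to identify a single line.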