Background

I was building a Twitter bot and needed train operation information for it. After thinking it over, I wondered whether I could pull operation information for **only specific routes** from somewhere for free. I concluded that scraping Yahoo's route information pages was the best solution.

APIs and sites that deliver operation status

After some research, these are the services that currently deliver route operation information:

JSON of railway delay information (returns, in JSON format, an aggregated list of routes with delays and similar disruptions)
Ekispert API (looks excellent, but it is paid)
Yahoo route operation status (the most familiar one, but there is no API!)

So if you want to do it for free, the choices are the JSON of railway delay information or the [Yahoo route operation status](https://transit.yahoo.co.jp/traininfo/top). The former returns delayed-route information for the whole country, and on top of that the information is only updated every 10 minutes. Even 10 minutes is fatal during the morning rush hour.
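As a rough sketch of why the first option is awkward for a single route: because that feed lists every delayed route nationwide, the bot would have to filter the list itself. The example below assumes the response has already been parsed into a list of dicts and that each entry carries the route name under a `name` key. Both the key names and the sample entries are assumptions for illustration, not the service's documented schema.

```python
def is_route_delayed(entries, route_name):
    """Return True if route_name appears in the list of delayed routes."""
    return any(entry.get("name") == route_name for entry in entries)


# Hypothetical sample of a parsed response: only delayed routes are listed.
sample_entries = [
    {"name": "Tokaido Line", "company": "JR East"},
    {"name": "Utsunomiya Line", "company": "JR East"},
]

print(is_route_delayed(sample_entries, "Tokaido Line"))   # True
print(is_route_delayed(sample_entries, "Yamanote Line"))  # False
```

Even with a filter like this, the result can be no fresher than the feed's own update interval.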

The latter, Yahoo's route information, is **actually powered by Ekispert under the hood**... lol (surprisingly little known). It also has operation information for each individual route, and it is accurate and close to real time! So: **if there is no API, scrape it!**

Scraping Yahoo route operation information

I had never done any scraping at all, so this was my first attempt. Apparently **Beautiful Soup** is the tool to use... Let's have a taste of this beautiful soup!

What is Beautiful Soup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. Apparently you can **search the fetched HTML for elements using a parser**. I don't know the details yet, so let's just try it!
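To make "searching HTML with a parser" a little more concrete, here is a minimal sketch using a made-up HTML snippet (the list and its `line` class are invented for this example). It parses a string with Python's built-in `html.parser` and looks elements up by tag name and class.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML fragment to search.
html = """
<ul>
  <li class="line">Tokaido Line</li>
  <li class="line">Utsunomiya Line</li>
</ul>
"""

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element (or None if nothing matches).
first_item = soup.find("li", class_="line")
# find_all returns every matching element.
all_items = soup.find_all("li", class_="line")

print(first_item.get_text())
print([item.get_text() for item in all_items])
```

Because `find` returns `None` when nothing matches, its result can be used directly in an `if` condition.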

Let's scrape!

The page to scrape this time is Yahoo's operation status page for the Tokaido Line. First, let's take a look at the page's HTML. In the HTML during a delay, the dd tag has the class trouble:

<div id="mdServiceStatus">
    <dl>
    <dt><span class="icnAlertLarge">[!]</span>Train delay</dt>
    <dd class="trouble">
        <p>Due to an inspection on the Utsunomiya Line, some down-line trains (bound for Atami) are delayed.<span>(Posted at 09:25 on December 14)</span></p>
    </dd>
    </dl>
</div><!--/#mdServiceStatus-->

In normal operation, on the other hand, the dd tag has the class normal:

<div class="elmServiceStatus">
    <dl>
    <dt><span class="icnNormalLarge">[○]</span>Normal operation</dt>
    <dd class="normal">
        <p>Currently, there is no information regarding accidents and delays.</p>
    </dd>
    </dl>
</div>

In other words:

During a delay: **dd tag with class trouble**
Normal operation: **dd tag with class normal**

**Based on this dd tag, it looks possible to tell whether the line is delayed or running normally!** Let's try it right away!
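The rule above can be sketched as a small function that takes HTML text and reports which case it saw. The function name `classify_status` and the shortened sample snippets are made up for this illustration, and the extra "unknown" result is an added assumption to cover pages where neither class is present (for example, if the layout changes).

```python
from bs4 import BeautifulSoup


def classify_status(html):
    """Return 'delayed', 'normal', or 'unknown' based on the dd tag's class."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.find("dd", class_="trouble"):
        return "delayed"
    if soup.find("dd", class_="normal"):
        return "normal"
    # Neither class was found: the page may not be what we expected.
    return "unknown"


# Shortened stand-ins for the two HTML fragments shown above.
delayed_html = '<dl><dd class="trouble"><p>Some trains are delayed.</p></dd></dl>'
normal_html = '<dl><dd class="normal"><p>No delays.</p></dd></dl>'

print(classify_status(delayed_html))         # delayed
print(classify_status(normal_html))          # normal
print(classify_status("<p>error page</p>"))  # unknown
```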

Installation

First, install Beautiful Soup with pip. The code below also uses Requests, so install it as well if you don't have it.

$ pip install beautifulsoup4 requests

Code

import requests
from bs4 import BeautifulSoup

# URL of the operation status page for the Tokaido Line
ToukaidouLine_URL = 'https://transit.yahoo.co.jp/traininfo/detail/27/0/'

# Fetch the web page with Requests
ToukaidouLine_Requests = requests.get(ToukaidouLine_URL)
# Stop on HTTP errors so an error page is not mistaken for normal operation
ToukaidouLine_Requests.raise_for_status()

# Parse the web page with BeautifulSoup
ToukaidouLine_Soup = BeautifulSoup(ToukaidouLine_Requests.text, 'html.parser')

# Use .find to look for a dd tag with the trouble class
if ToukaidouLine_Soup.find('dd', class_='trouble'):
    message = 'Tokaido Line is delayed'
else:
    message = 'The Tokaido Line is in normal operation'

print(message)

Execution result

Tokaido Line is delayed
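Since the original goal was operation information for only specific routes, the same check could be wrapped in functions and run over a small table of routes. This is only a sketch: `ROUTES`, `fetch_html`, `route_message`, and `report_all` are names made up here, the 10-second timeout is an arbitrary choice, and the only URL taken from this article is the Tokaido Line one. Other routes would need their own detail-page URLs.

```python
import requests
from bs4 import BeautifulSoup

# Only the Tokaido Line URL comes from the article; add other routes as needed.
ROUTES = {
    "Tokaido Line": "https://transit.yahoo.co.jp/traininfo/detail/27/0/",
}


def fetch_html(url):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def route_message(name, html):
    """Build a status message for one route from its page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.find("dd", class_="trouble"):
        return f"{name} is delayed"
    return f"{name} is in normal operation"


def report_all(routes=ROUTES, fetch=fetch_html):
    """Return one message per route; fetch can be swapped out for offline tests."""
    return [route_message(name, fetch(url)) for name, url in routes.items()]
```

Calling `report_all()` performs the real HTTP requests, while passing a stand-in such as `fetch=lambda url: '<dd class="normal"></dd>'` lets the logic be tried without touching the network.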

Impressions

Isn't it really easy? Pulling information out of a web page feels like a dream come true! Postscript: just between us, when I showed this to someone at Yahoo, I was told "That's a gray zone."

I want to get the operation information of yahoo route