Continuing from the day before yesterday, I wondered whether I could do something with the data released by Shimane Prefecture. Rainfall data turns out to be published for a wide area, so I tried visualizing it.
[Shimane Prefecture] Daily rainfall data (for 40 days)
First, there is the catalog page.
https://shimane-opendata.jp/db/organization/main
Inside the catalog there is a "rainfall data" page.
https://shimane-opendata.jp/db/dataset/010009
Rainfall data recorded every 10 minutes appears to be published as one CSV file per day. For example, to download the data for June 30, you access the following URL.
https://shimane-opendata.jp/db/dataset/010009/resource/1a8248dd-cd5e-4985-b01f-6ac79fe72140
July 1st ...
https://shimane-opendata.jp/db/dataset/010009/resource/0c9ba4db-b8eb-4b90-8e38-10abf0fd01ee
Huh? The URLs differ completely from day to day.
Furthermore, the CSV URL is ...
https://shimane-opendata.jp/storage/download/1ddaef55-cc94-490c-bd3f-7efeec17fcf9/uryo_10min_20200701.csv
Yes, it's hard to use!
So, let's work through the visualization in the following steps.
By the way, this time too, we will use Colaboratory.
First, get the URLs of the daily resource pages with the following script.
```python
import requests
from bs4 import BeautifulSoup

urlBase = "https://shimane-opendata.jp"
urlName = urlBase + "/db/dataset/010009"

def get_tag_from_html(urlName, tag):
    url = requests.get(urlName)
    soup = BeautifulSoup(url.content, "html.parser")
    return soup.find_all(tag)

def get_page_urls_from_catalogpage(urlName):
    urlNames = []
    elems = get_tag_from_html(urlName, "a")
    for elem in elems:
        classes = elem.get("class")
        # The links to the daily resource pages carry the "heading" class
        if classes and classes[0] == "heading":
            href = elem.get("href")
            if href and href.find("resource") > 0:
                urlNames.append(urlBase + href)
    return urlNames

urlNames = get_page_urls_from_catalogpage(urlName)
print(urlNames)
```
Next, get the CSV URL from each resource page with the following script.
```python
def get_csv_urls_from_url(urlName):
    urlNames = []
    elems = get_tag_from_html(urlName, "a")
    for elem in elems:
        href = elem.get("href")
        # Collect only the links that point at CSV files
        if href and href.find(".csv") > 0:
            urlNames.append(href)
    # Each resource page links to a single CSV, so the first hit is enough
    return urlNames[0]

urls = []
for urlName in urlNames:
    urls.append(get_csv_urls_from_url(urlName))

print(urls)
```
Now read the data directly from the URLs obtained above. The catalog mixes 10-minute and hourly CSVs, so only the 10-minute files are targeted here. Note that the character encoding is Shift JIS, and the first two rows contain information other than data, so they are excluded.
```python
import pandas as pd

df = pd.DataFrame()

for url in urls:
    # Only the 10-minute files; skip the hourly ones
    if url.find("10min") > 0:
        df = pd.concat([df, pd.read_csv(url, encoding="Shift_JIS").iloc[2:]])

df.shape
```
```python
df.info()
```
You can get the column information by executing the above.
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2880 entries, 2 to 145
Columns: 345 entries, Observatory to Unnamed: 344
dtypes: object(345)
memory usage: 7.6+ MB
```
... there are as many as 345 columns.
If you look at the downloaded data in Excel, you can see that each observatory has a 10-minute rainfall column and a cumulative rainfall column, and the cumulative rainfall columns are blank, so I decided to exclude them.
By the way, the explanation of cumulative rainfall is as follows.
Cumulative rainfall is the total amount of rain from the start of a rainfall event to its end. The start of rainfall is defined as the point where the reading rises from 0.0 mm to 0.5 mm or more, and the end is defined as the point where more than 6 hours have passed without any rainfall being recorded; the cumulative rainfall is reset at the end of each event.
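As an aside, that reset logic can be sketched in pandas. The following is only a minimal illustration with made-up 10-minute data, not the prefecture's actual computation; the threshold of 36 dry intervals corresponds to the 6-hour cutoff.

```python
import pandas as pd

# Hypothetical 10-minute rainfall series (made-up values for illustration)
rain = pd.Series(
    [0.0, 0.5, 1.5, 0.0, 0.5, 0.0],
    index=pd.date_range("2020-07-01 00:00", periods=6, freq="10min"),
)

# Consecutive dry 10-minute intervals since rain was last observed
dry_run = (rain == 0).astype(int).groupby((rain > 0).cumsum()).cumsum()

# A new rainfall event starts once more than 6 hours (36 intervals) pass dry
event_id = (dry_run > 36).cumsum()

# Cumulative rainfall within each event; it resets when the event ends
cumulative = rain.groupby(event_id).cumsum()
print(cumulative)
```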
Since every column's dtype is object, the numeric data is apparently stored as strings ...
Also, a look inside shows that the strings "Uncollected", "Missing", and "Maintenance" appear in the data. Those have to be replaced before the columns can be converted to real values. The date-and-time data is also stored as strings, so it has to be converted to datetime values as well.
So, execute the following script.
```python
# Drop the blank-header cumulative-rainfall columns: pandas labels them
# "Unnamed: N", which the substring "name" matches
for col in df.columns:
    if col.find("name") > 0:
        df.pop(col)

# The first column holds the observation date and time; use it as the index
df.index = pd.to_datetime(df["Observatory"])
df = df.sort_index()

# Replace the non-numeric markers with a sentinel value
df = df.replace('Uncollected', '-1')
df = df.replace('Missing', '-1')
df = df.replace('Maintenance', '-1')

# Convert the remaining rainfall columns from strings to floats
cols = df.columns[1:]

for col in cols:
    df[col] = df[col].astype("float")
```
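One possible refinement, sketched here as a variation rather than the original approach: coerce the marker strings to NaN with pd.to_numeric instead of substituting -1, so that missing readings do not show up as negative rainfall in the plots.

```python
# Alternative sketch: treat "Uncollected" / "Missing" / "Maintenance"
# as missing values (NaN) rather than -1
for col in cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")
```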
Set up the environment so that Japanese labels are displayed correctly, and then try drawing the graphs.
```python
!pip install japanize_matplotlib

import matplotlib.pyplot as plt
import japanize_matplotlib
import seaborn as sns

sns.set(font="IPAexGothic")

# Plot the first five observatories over the full period
df[cols[:5]].plot(figsize=(15, 5))
plt.show()

# Zoom in on the period from July 12 onward
df["2020-07-12":][cols[:5]].plot(figsize=(15, 5))
plt.show()
```
You can see the rainfall of the last few days at a glance.
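As a further sketch, assuming the datetime index built above, the 10-minute readings could also be resampled into hourly totals for a smoother view:

```python
# Aggregate the 10-minute values into hourly totals (assumes the datetime
# index built earlier); min_count=1 keeps all-NaN hours as NaN
hourly = df[cols[:5]].resample("1H").sum(min_count=1)
hourly.plot(figsize=(15, 5))
plt.show()
```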
Well, what are we going to do now?