Continuing from the day before yesterday, I wondered whether I could do something with the data released by Shimane Prefecture. Rainfall data turns out to be published for a wide area, so I tried visualizing it.
[Shimane Prefecture] Daily rainfall data (for 40 days)
First, there is the catalog page.
https://shimane-opendata.jp/db/organization/main
Inside the catalog there is a "rainfall data" page.
https://shimane-opendata.jp/db/dataset/010009
Rainfall data recorded every 10 minutes appears to be published as one CSV file per day. For example, to download the data for June 30, you access the following URL.
https://shimane-opendata.jp/db/dataset/010009/resource/1a8248dd-cd5e-4985-b01f-6ac79fe72140
July 1st ...
https://shimane-opendata.jp/db/dataset/010009/resource/0c9ba4db-b8eb-4b90-8e38-10abf0fd01ee
Huh? The URLs differ completely from day to day.
Furthermore, the CSV URL is ...
https://shimane-opendata.jp/storage/download/1ddaef55-cc94-490c-bd3f-7efeec17fcf9/uryo_10min_20200701.csv
Yes, it's hard to use!
So, let's work through the visualization in the following steps.
By the way, this time too, we will use Colaboratory.
First, get the URLs of the daily resource pages with the following script.
```python
import requests
from bs4 import BeautifulSoup

urlBase = "https://shimane-opendata.jp"
urlName = urlBase + "/db/dataset/010009"

def get_tag_from_html(urlName, tag):
    url = requests.get(urlName)
    soup = BeautifulSoup(url.content, "html.parser")
    return soup.find_all(tag)

def get_page_urls_from_catalogpage(urlName):
    urlNames = []
    elems = get_tag_from_html(urlName, "a")
    for elem in elems:
        classes = elem.get("class")
        # The links to the daily resource pages carry the "heading" class
        if classes and classes[0] == "heading":
            href = elem.get("href")
            if href and href.find("resource") > 0:
                urlNames.append(urlBase + href)
    return urlNames

urlNames = get_page_urls_from_catalogpage(urlName)
print(urlNames)
```
Next, get the CSV URL from each resource page with the following script.
```python
def get_csv_urls_from_url(urlName):
    urlNames = []
    elems = get_tag_from_html(urlName, "a")
    for elem in elems:
        href = elem.get("href")
        # Collect only the links that point at CSV files
        if href and href.find(".csv") > 0:
            urlNames.append(href)
    # Each resource page links to a single CSV, so the first hit is enough
    return urlNames[0]

urls = []
for urlName in urlNames:
    urls.append(get_csv_urls_from_url(urlName))

print(urls)
```
Now read the data directly from the URLs obtained above. The catalog mixes 10-minute and hourly CSVs, so only the 10-minute files are targeted here. Note that the character encoding is Shift JIS, and the first two rows contain information other than data, so they are excluded.
```python
import pandas as pd

df = pd.DataFrame()

for url in urls:
    # Only the 10-minute files; skip the hourly ones
    if url.find("10min") > 0:
        df = pd.concat([df, pd.read_csv(url, encoding="Shift_JIS").iloc[2:]])

df.shape
```
```python
df.info()
```
You can get the column information by executing the above.
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2880 entries, 2 to 145
Columns: 345 entries, Observatory to Unnamed: 344
dtypes: object(345)
memory usage: 7.6+ MB
```
... there are as many as 345 columns.
If you look at the downloaded data in Excel, you can see that each observatory has a 10-minute rainfall column and a cumulative rainfall column, and the cumulative rainfall columns are blank, so I decided to exclude them.
By the way, the explanation of cumulative rainfall is as follows.
Cumulative rainfall is the total amount of rain from the start of a rainfall event to its end. The start of rainfall is defined as the point where the reading rises from 0.0 mm to 0.5 mm or more, and the end is defined as the point where more than 6 hours have passed without any rainfall being recorded; the cumulative rainfall is reset at the end of each event.
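As an aside, that reset logic can be sketched in pandas. The following is only a minimal illustration with made-up 10-minute data, not the prefecture's actual computation; the threshold of 36 dry intervals corresponds to the 6-hour cutoff.

```python
import pandas as pd

# Hypothetical 10-minute rainfall series (made-up values for illustration)
rain = pd.Series(
    [0.0, 0.5, 1.5, 0.0, 0.5, 0.0],
    index=pd.date_range("2020-07-01 00:00", periods=6, freq="10min"),
)

# Consecutive dry 10-minute intervals since rain was last observed
dry_run = (rain == 0).astype(int).groupby((rain > 0).cumsum()).cumsum()

# A new rainfall event starts once more than 6 hours (36 intervals) pass dry
event_id = (dry_run > 36).cumsum()

# Cumulative rainfall within each event; it resets when the event ends
cumulative = rain.groupby(event_id).cumsum()
print(cumulative)
```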
Since every column's dtype is object, the numeric data is apparently stored as strings ...
Also, a look inside shows that the strings "Uncollected", "Missing", and "Maintenance" appear in the data. Those have to be replaced before the columns can be converted to real values. The date-and-time data is also stored as strings, so it has to be converted to datetime values as well.
So, execute the following script.
```python
# Drop the blank-header cumulative-rainfall columns: pandas labels them
# "Unnamed: N", which the substring "name" matches
for col in df.columns:
    if col.find("name") > 0:
        df.pop(col)

# The first column holds the observation date and time; use it as the index
df.index = pd.to_datetime(df["Observatory"])
df = df.sort_index()

# Replace the non-numeric markers with a sentinel value
df = df.replace('Uncollected', '-1')
df = df.replace('Missing', '-1')
df = df.replace('Maintenance', '-1')

# Convert the remaining rainfall columns from strings to floats
cols = df.columns[1:]

for col in cols:
    df[col] = df[col].astype("float")
```
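One possible refinement, sketched here as a variation rather than the original approach: coerce the marker strings to NaN with pd.to_numeric instead of substituting -1, so that missing readings do not show up as negative rainfall in the plots.

```python
# Alternative sketch: treat "Uncollected" / "Missing" / "Maintenance"
# as missing values (NaN) rather than -1
for col in cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")
```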
Set up the environment so that Japanese labels are displayed correctly, and then try drawing the graphs.
```python
!pip install japanize_matplotlib

import matplotlib.pyplot as plt
import japanize_matplotlib
import seaborn as sns

sns.set(font="IPAexGothic")

# Plot the first five observatories over the full period
df[cols[:5]].plot(figsize=(15, 5))
plt.show()

# Zoom in on the period from July 12 onward
df["2020-07-12":][cols[:5]].plot(figsize=(15, 5))
plt.show()
```
You can see the rainfall of the last few days at a glance.
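As a further sketch, assuming the datetime index built above, the 10-minute readings could also be resampled into hourly totals for a smoother view:

```python
# Aggregate the 10-minute values into hourly totals (assumes the datetime
# index built earlier); min_count=1 keeps all-NaN hours as NaN
hourly = df[cols[:5]].resample("1H").sum(min_count=1)
hourly.plot(figsize=(15, 5))
plt.show()
```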
Well, what are we going to do now?