While looking for good material to use as an example of working with open data, I found that water level data is published on the Data City Sabae site, so I tried machine learning with it.
http://data.city.sabae.lg.jp/top_page/
On the "Open Data" page of the above site, the "Disaster Prevention" group lists the following entry.
Water level data(Sabae City, Fukui Prefecture)
Rontegawa drainage pump station[CSV]
It is the data of the water level gauge in Sabae city. Water level unit:cm data:1000 cases
The listing says there are 1,000 records, but I was able to get somewhat more than that, so I will use everything I retrieved.
In addition, past weather data can be downloaded from the Japan Meteorological Agency, so I also downloaded the precipitation data for nearby Fukui City.
http://www.data.jma.go.jp/gmd/risk/obsdl/index.php
Use Jupyter Notebook to load the following libraries.
python
from ipywidgets import FloatProgress
from IPython.display import display
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import datetime
python
filename = "sparql.csv"
df = pd.read_csv(filename, header=None)
Let's display it as a graph.
python
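# The CSV is ordered newest first, so reverse it into chronological order
tmp = []
for i in range(len(df)):
    pos = len(df) - 1 - i
    tmp.append(df.iloc[pos, 2])  # column 2 holds the water level (.ix was removed from pandas)
pd.DataFrame({'level': np.array(tmp)}).plot(figsize=(15,5))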
The water level is recorded every 5 minutes, while the Japan Meteorological Agency data is hourly, so I keep only the on-the-hour readings to line up the two time series.
python
#Get data start and end dates
dt1 = datetime.datetime.strptime(df[1][len(df)-1],"%Y-%m-%dT%H:%M:%S+09:00")
dt1 = datetime.datetime(dt1.year,dt1.month,dt1.day,0,0)
dt2 = datetime.datetime.strptime(df[1][0],"%Y-%m-%dT%H:%M:%S+09:00")
print("dt1:",dt1)
print("dt2:",dt2)
#Get the number of days of data
dt = (dt2-dt1).days + 1
#Prepare an array to store hourly data
level = [0] * dt * 24
dt_al = [0] * dt * 24
#Progress bar settings
fp = FloatProgress(min=0, max=len(df))
display(fp)
for i in range(len(df)):
    wk = datetime.datetime.strptime(df[1][len(df)-i-1],"%Y-%m-%dT%H:%M:%S+09:00")
    pos = (wk - dt1).days * 24 + wk.hour
    dt_al[pos] = datetime.datetime(wk.year,wk.month,wk.day,wk.hour,0)
    if wk.minute == 0:
        level[pos] = df[2][len(df)-1-i]
    fp.value = i
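The index arithmetic in the loop above can be checked in isolation. This is a minimal sketch, assuming timestamps in the same `%Y-%m-%dT%H:%M:%S+09:00` format; `hourly_slot` and the start date are hypothetical names and values for illustration, not part of the original notebook.

```python
import datetime

FMT = "%Y-%m-%dT%H:%M:%S+09:00"  # timestamp format used in the water level CSV


def hourly_slot(timestamp, start):
    """Return the hourly slot index of `timestamp`, counted from `start` (a midnight)."""
    wk = datetime.datetime.strptime(timestamp, FMT)
    return (wk - start).days * 24 + wk.hour


# Hypothetical start date: midnight of the first day of data
start = datetime.datetime(2016, 9, 1, 0, 0)

print(hourly_slot("2016-09-01T00:00:00+09:00", start))  # 0
print(hourly_slot("2016-09-01T13:05:00+09:00", start))  # 13
print(hourly_slot("2016-09-03T02:00:00+09:00", start))  # 50
```

Every reading within the same hour maps to the same slot, which is why the loop only stores the value when `wk.minute == 0`.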
Read the data, keeping in mind that the CSV starts with several lines that are not data and that the character encoding is Shift JIS. Then display the loaded data as a graph.
python
filename = "data.csv"
df = pd.read_csv(filename,encoding="SHIFT-JIS",skiprows=4)
df.plot(figsize=(15,5))
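To see what `encoding` and `skiprows` do without downloading anything, here is a small sketch that builds a Shift JIS file in memory. The file contents and column names are made up for illustration and do not reproduce the real Japan Meteorological Agency layout.

```python
import io
import pandas as pd

# Hypothetical miniature file: four preamble lines, then a header row and data
text = (
    "ダウンロードした時刻\n"
    "地点\n"
    "備考\n"
    "品質情報\n"
    "日時,降水量\n"
    "2016/9/1 1:00:00,0.5\n"
    "2016/9/1 2:00:00,1.0\n"
)
raw = io.BytesIO(text.encode("shift_jis"))

# skiprows=4 drops the preamble so that the fifth line becomes the header
df_rain = pd.read_csv(raw, encoding="shift_jis", skiprows=4)
print(df_rain.shape)
print(df_rain.iloc[1, 1])
```

Reading these bytes without a matching `encoding` would be expected to fail or produce garbled column names, since pandas assumes UTF-8 by default.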
To make the data easier to handle, store it in an array and then display it as a graph.
python
# Array preparation
rain = [0]*len(level)
for i in range(len(df)):
    wk = datetime.datetime.strptime(df.iloc[i, 0],"%Y/%m/%d %H:%M:%S")
    if (wk < dt2) and (wk - dt1).days >= 0:
        pos = (wk - dt1).days * 24 + wk.hour
        rain[pos] = df.iloc[i, 1]
# Check the data on the graph (rain is multiplied by 15 only for plotting)
pp = pd.DataFrame({'level': np.array(level), 'rain': np.array(rain)*15})
pp.plot(figsize=(15,5))
There seems to be a lot of missing data ... (sweat)
Looking at the graph, the water level seems to rise after it rains, so let's use the precipitation for the 48 hours up to each point in time as the input and the water level at that time as the target (training label).
python
# Get 48 hours of precipitation in a two-dimensional array
row = len(level)
tmp = np.zeros((row,48))
fp = FloatProgress(min=0, max=row)
display(fp)
for i in range(row):
    for j in range(len(tmp[0])):
        pos = row - 1 - i - j
        # Skip hours before the start of the data;
        # a negative index would wrap around to the end of the list
        if pos >= 0:
            tmp[row-1-i][j] = rain[pos]
    fp.value = i
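The double loop above fills row `r`, column `j` with the precipitation from `j` hours before hour `r`. The same layout can be sketched one column at a time with NumPy slicing. `lag_matrix` is a hypothetical helper name, and in this sketch hours before the start of the data are simply left at 0.

```python
import numpy as np


def lag_matrix(series, n_lags):
    """Row i holds series[i], series[i-1], ..., series[i-n_lags+1]; missing history stays 0."""
    series = np.asarray(series, dtype=float)
    out = np.zeros((len(series), n_lags))
    for j in range(min(n_lags, len(series))):
        # Column j is the series shifted down by j hours
        out[j:, j] = series[:len(series) - j]
    return out


rain_demo = [0.0, 1.0, 2.0, 3.0]  # hypothetical hourly precipitation
m = lag_matrix(rain_demo, 3)
print(m)
```

With the real data this would be called as `lag_matrix(rain, 48)`. It needs 48 slice assignments instead of a Python loop over every cell.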
Hours for which no water level was recorded are of no use, so remove them.
python
# Check the number of missing data
num = 0
for i in range(len(level)):
    if level[i] == 0:
        num += 1
# Preparing for data storage
X = np.empty((0,48))
y = []
for i in range(len(level)):
    if level[i] > 0:
        X = np.append(X, np.array([tmp[i]]), axis=0)
        y.append(level[i])
# Check the data on the graph
pp = pd.DataFrame({'level': np.array(y), 'rain': X[:,0]*20})
pp.plot(figsize=(15,5))
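The filtering step above can also be written with a boolean mask, which avoids calling `np.append` once per row. This is a sketch on made-up arrays; `level_demo` and `features_demo` stand in for `level` and `tmp`.

```python
import numpy as np

level_demo = np.array([0, 120, 0, 135, 140])               # 0 marks a missing reading
features_demo = np.arange(10, dtype=float).reshape(5, 2)   # hypothetical feature rows

mask = level_demo > 0             # True where a water level was recorded
X_demo = features_demo[mask]      # keep only the rows that have a label
y_demo = level_demo[mask]

print(X_demo.shape, y_demo)
print("missing:", int((~mask).sum()))
```

The count of missing readings falls out of the same mask, so the separate counting loop is not needed in this form.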
Looking at the graph, the data is now much cleaner.
Train on the cleaned data and check the score of the predictions.
python
# Load the train/test split helper
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# Divide the labeled data into a training set (X_train, y_train) and a test set (X_test, y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)
# Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Model settings (random forest)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, max_depth=50, random_state=42)
# Learning and prediction
model.fit(X_train, y_train)
result = model.predict(X_test)
result.shape
# Score (mean accuracy, since the model is a classifier)
print(model.score(X_test,y_test))
The result is...
python
0.185628742515
... no!
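One thing to keep in mind when reading this number: for a classifier, `score` returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label exactly. A water level that is off by 1 cm counts as a miss. The following sketch uses made-up values to show how accuracy and the mean absolute error can tell different stories.

```python
import numpy as np

actual = np.array([120, 121, 135, 140, 150])     # hypothetical water levels (cm)
predicted = np.array([120, 122, 134, 140, 149])  # hypothetical predictions (cm)

# Exact-match fraction, which is what a classifier's score() reports
accuracy = np.mean(actual == predicted)
# Average size of the error in cm
mean_abs_error = np.mean(np.abs(actual - predicted))

print(accuracy)        # 0.4
print(mean_abs_error)  # 0.6
```

In this toy case only 40% of the predictions are exact, yet the average error is under 1 cm, so a low accuracy alone does not say how far off the predictions are.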
The score is low, but let's check the result with a graph.
python
pp = pd.DataFrame({'act': np.array(y_test), "pred": np.array(result)})
pp.plot(figsize=(15,5))
... Hmm, not great.
As a small tweak, split the data chronologically instead of randomly: the first 80% is used for training and the remaining 20% for prediction, as shown below.
python
num = int(len(X) * 0.8)
print(len(X), num, len(X)-num)
X_train = X[:num]
X_test = X[num:]
y_train = y[:num]
y_test = y[num:]
... Oh! That looks a bit better (^-^)
As for what this result could be used for: by continuously predicting the water level from precipitation, it might be possible to detect a sudden rise in water level and issue an evacuation warning.
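A possible reason to prefer a chronological split with this kind of input, offered as an assumption rather than something measured in this article: each row is a 48-hour window, so neighbouring rows are almost identical, and a random split can place near-copies of a test row in the training set. The sketch below uses made-up precipitation values to show the overlap.

```python
import numpy as np

rng = np.random.default_rng(0)
rain_demo = rng.random(200)  # hypothetical hourly precipitation

# Same layout as the article: row for hour i holds rain[i], rain[i-1], ..., rain[i-47]
windows = np.array([rain_demo[i - 47:i + 1][::-1] for i in range(47, 200)])

# Neighbouring rows are the same window shifted by one hour: 47 of 48 values are shared
shared = np.array_equal(windows[1][1:], windows[0][:-1])
print(windows.shape, shared)
```

With a chronological split, only the rows right at the boundary between training and test data overlap in this way.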
With that in mind, I hope more local governments will release such data.
What should I do next?
I have since improved the accuracy with a different learning method from the one in this article and was able to predict the water level one hour ahead, so I wrote a follow-up. If you are interested, please also see the following article.
Using open data from Data City Sabae to predict water level gauge values by machine learning Part 2