Introduction

While looking for good material to demonstrate the use of open data, I found that water level data is published on the Data City Sabae site, so I tried applying machine learning to it.

http://data.city.sabae.lg.jp/top_page/

Download data

On the "Open Data" page of the above site, the "Disaster Prevention" group lists the following entry.

Water level data (Sabae City, Fukui Prefecture)
Rontegawa drainage pump station [CSV]
Data from the water level gauge in Sabae City. Water level unit: cm. Data: 1,000 records

The page says there are 1,000 records by default, but I was able to obtain somewhat more than that, so I will use the larger set.

In addition, past weather data can be downloaded from the Japan Meteorological Agency, so download the precipitation data for nearby Fukui City.

http://www.data.jma.go.jp/gmd/risk/obsdl/index.php

Loading the library

Use Jupyter Notebook to load the following libraries.

`python`


from ipywidgets import FloatProgress
from IPython.display import display

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
import datetime

Reading water level data

`python`


filename = "sparql.csv"
df = pd.read_csv(filename, header=None)

Let's display it as a graph.

`python`


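#The CSV lists the newest reading first, so reverse it into chronological order
tmp = []
for i in range(len(df)):
    pos = len(df) - 1 - i
    tmp.append(df.iloc[pos, 2])  #Column 2 holds the water level (df.ix has been removed from pandas)

pd.DataFrame({'level': np.array(tmp)}).plot(figsize=(15,5))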

The water level is recorded every 5 minutes, so the data is reduced to hourly values (the reading taken on the hour) to line up with the time series from the Japan Meteorological Agency.

`python`


#Get data start and end dates
dt1 = datetime.datetime.strptime(df[1][len(df)-1],"%Y-%m-%dT%H:%M:%S+09:00")
dt1 = datetime.datetime(dt1.year,dt1.month,dt1.day,0,0)
dt2 = datetime.datetime.strptime(df[1][0],"%Y-%m-%dT%H:%M:%S+09:00")

print("dt1:",dt1)
print("dt2:",dt2)

#Get the number of days of data
dt = (dt2-dt1).days + 1

#Prepare an array to store hourly data
level = [0] * dt * 24
dt_al = [0] * dt * 24

#Progress bar settings
fp = FloatProgress(min=0, max=len(df))
display(fp)

for i in range(len(df)):
    wk = datetime.datetime.strptime(df[1][len(df)-i-1],"%Y-%m-%dT%H:%M:%S+09:00")
    pos = (wk - dt1).days * 24 + wk.hour
    dt_al[pos] = datetime.datetime(wk.year,wk.month,wk.day,wk.hour,0)

    if wk.minute == 0:
        level[pos] = df[2][len(df)-1-i]
    
    fp.value = i
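The index arithmetic used above can be checked in isolation. The following is a minimal sketch, with a hypothetical helper name `hour_slot` and made-up timestamps, of how a reading's timestamp maps to its position in the hourly array when the start date has been truncated to midnight.

```python
import datetime

def hour_slot(ts, start):
    """Return the index of the hourly slot for ts, counted from start (midnight)."""
    return (ts - start).days * 24 + ts.hour

# Made-up timestamps in the same format as the water level CSV
fmt = "%Y-%m-%dT%H:%M:%S+09:00"
start = datetime.datetime(2016, 10, 1, 0, 0)
on_the_hour = datetime.datetime.strptime("2016-10-02T03:00:00+09:00", fmt)
off_the_hour = datetime.datetime.strptime("2016-10-02T03:55:00+09:00", fmt)

print(hour_slot(on_the_hour, start))   # 27
print(hour_slot(off_the_hour, start))  # 27
```

All readings within the same hour land in the same slot, which is why the loop above stores a level only when `wk.minute == 0`.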

Reading precipitation data

Note two things when reading the file: the CSV contains rows that are not part of the measurements (skipped here with skiprows), and its character encoding is Shift JIS. Also try displaying the loaded data as a graph.

`python`


filename = "data.csv"
df = pd.read_csv(filename,encoding="SHIFT-JIS",skiprows=4)
df.plot(figsize=(15,5))
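To see what `encoding` and `skiprows` do without the real file, here is a small sketch that reads an in-memory stand-in. The file layout and column names below are made up for illustration and are not the actual layout of the Japan Meteorological Agency download.

```python
import io
import pandas as pd

# Made-up stand-in: four non-data lines, then a header row and two data rows
text = (
    "説明1\n"
    "説明2\n"
    "説明3\n"
    "説明4\n"
    "年月日時,降水量\n"
    "2016/10/1 1:00:00,0.5\n"
    "2016/10/1 2:00:00,1.0\n"
)
raw = io.BytesIO(text.encode("shift_jis"))

# Decode as Shift JIS and skip the four leading lines
sample = pd.read_csv(raw, encoding="shift_jis", skiprows=4)
print(sample.shape)  # (2, 2)
```

Without the `encoding` argument, pandas would try to decode the bytes as UTF-8 and fail on the Japanese text.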

Store water level and precipitation data in the same format array

To make the data easier to handle, store it in an array and then display it as a graph.

`python`


#Array preparation
rain = [0]*len(level)

for i in range(len(df)):
    #Column 0 is the timestamp, column 1 the precipitation (df.ix has been removed from pandas)
    wk = datetime.datetime.strptime(df.iloc[i, 0],"%Y/%m/%d %H:%M:%S")
    if (wk < dt2) and (wk - dt1).days >= 0:
        pos = (wk - dt1).days * 24 + wk.hour
        rain[pos] = df.iloc[i, 1]

#Check the data on the graph
pp = pd.DataFrame({'level': np.array(level), 'rain': np.array(rain)*15})
pp.plot(figsize=(15,5))

There seems to be a lot of missing data ... (sweat)

Examination of learning data

Looking at the graph, the water level seems to rise after it rains, so let's use the hourly precipitation over the preceding 48 hours, up to and including the current hour, as the input and the water level at that hour as the training target.

`python`


#Get 48 hours of precipitation in a two-dimensional array
row = len(level)
tmp = np.zeros((row,48))

fp = FloatProgress(min=0, max=row)
display(fp)

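for i in range(row):
    for j in range(len(tmp[0])):
        pos = row - 1 - i - j
        #Skip hours before the start of the data; a negative index would wrap around to the end of the list
        if pos >= 0:
            tmp[row-1-i][j] = rain[pos]
    fp.value = i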
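The layout of this array is easier to see on a small case. The sketch below builds the same kind of lag matrix with NumPy slicing, using a hypothetical helper `lag_matrix`, a window of 3 instead of 48, and made-up precipitation values. In this sketch, hours before the start of the data are left at 0.

```python
import numpy as np

def lag_matrix(values, window):
    """Row i holds values[i], values[i-1], ..., values[i-window+1]."""
    values = np.asarray(values, dtype=float)
    out = np.zeros((len(values), window))
    for j in range(min(window, len(values))):
        # Column j is the series shifted back by j hours
        out[j:, j] = values[:len(values) - j]
    return out

rain_demo = [0.0, 1.5, 0.0, 2.0, 0.5]  # made-up hourly precipitation
print(lag_matrix(rain_demo, 3))
```

Row 3 of the result is `[2.0, 0.0, 1.5]`: the precipitation at hour 3, then at hours 2 and 1.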

Trimming missing data

Hours for which no water level was obtained are of no use for training, so remove them.

`python`


#Check the number of missing data
num = 0
for i in range(len(level)):
    if level[i] == 0:
        num += 1

#Preparing for data storage
X = np.empty((0,48))
y = []

for i in range(len(level)):
    if level[i] > 0:
        X = np.append(X, np.array([tmp[i]]), axis=0)
        y.append(level[i])

#Check the data on the graph
pp = pd.DataFrame({'level': np.array(y), 'rain': X[:,0]*20})
pp.plot(figsize=(15,5))
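The same filtering can also be written with a boolean mask. This is a sketch on made-up stand-in arrays (`level_demo` and `features_demo` are hypothetical names), and it follows the convention used above that a level of 0 means "no reading".

```python
import numpy as np

# Made-up stand-ins for level (hourly water levels) and tmp (lagged rain features)
level_demo = np.array([0, 12, 0, 15, 14])
features_demo = np.arange(10, dtype=float).reshape(5, 2)

mask = level_demo > 0          # True where a water level was recorded
X_demo = features_demo[mask]   # keep only the matching feature rows
y_demo = level_demo[mask]

print(X_demo.shape, y_demo.tolist())  # (3, 2) [12, 15, 14]
```

`np.append` returns a new copy of the array on every call, so the mask version may be noticeably faster on long series.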

Looking at the graph, the data is now much cleaner.

Machine learning

Learn from the cleaned data and check the score of the predicted result.

`python`


#Load the data-splitting function (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

#Split the labeled data into a training set (X_train, y_train) and a test set (X_test, y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

#Standardization (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#Model settings (random forest)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, max_depth=50, random_state=42)

#Learning and prediction
model.fit(X_train, y_train)
result = model.predict(X_test)
result.shape

#Score
print(model.score(X_test,y_test))
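The scaler above is fit on the training set only and then applied to both sets. As a rough illustration of that step, the sketch below reproduces the same arithmetic with NumPy on made-up numbers; the variable names are for illustration only.

```python
import numpy as np

# Made-up training and test features
train = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
test = np.array([[2.0, 3.0]])

# Statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)
std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # the test set reuses the training statistics

print(test_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.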

Result is...

`python`


0.185628742515

... no!
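For context, `score` on a scikit-learn classifier returns the mean accuracy, that is, the fraction of test samples whose predicted class equals the true one. Since each distinct water level is treated as a class here, a prediction that is off by even 1 cm counts as a miss. The sketch below uses made-up numbers to show how exact-match accuracy can look poor even when the predictions are close.

```python
import numpy as np

# Made-up measured and predicted water levels (cm)
actual = np.array([12, 13, 15, 20, 18])
predicted = np.array([12, 14, 15, 18, 17])

accuracy = np.mean(actual == predicted)               # share of exact matches
mean_abs_error = np.mean(np.abs(actual - predicted))  # average miss in cm

print(accuracy, mean_abs_error)  # 0.4 0.8
```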

Verification of results

The score is low, but let's check the result with a graph.

`python`


pp = pd.DataFrame({'act': np.array(y_test), "pred": np.array(result)})
pp.plot(figsize=(15,5))

... Hmm, not great.

As a small tweak, split the data chronologically instead of at random, so that the model is trained on the earlier 80% and predicts the later 20%, as shown below.

`python`


num = int(len(X) * 0.8)
print(len(X), num, len(X)-num)

X_train = X[:num]
X_test = X[num:]
y_train = y[:num]
y_test = y[num:]

... Oh! That looks a little better (^-^)

Thinking about what could be done with this result: by continuously predicting the water level from precipitation, it might be possible to detect a sudden rise in the water level and issue an evacuation warning.
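As a rough sketch of that idea, the function below flags the points where a series of predicted levels jumps by more than a given amount between consecutive steps. The helper name `find_sudden_rises`, the 15 cm threshold, and the predicted values are all made up for illustration and are not real warning criteria.

```python
def find_sudden_rises(levels, max_rise_cm):
    """Return indices where the level rose by more than max_rise_cm since the previous step."""
    return [
        i for i in range(1, len(levels))
        if levels[i] - levels[i - 1] > max_rise_cm
    ]

predicted_levels = [30, 31, 33, 52, 75, 76]  # made-up hourly predictions (cm)
print(find_sudden_rises(predicted_levels, max_rise_cm=15))  # [3, 4]
```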

With that in mind, I hope more local governments will release such data.

What should I do next?

Postscript

I improved the accuracy with a learning method different from the one in this article and was able to predict the water level one hour ahead, so I wrote a follow-up. If you are interested, please also see the following article.

Using open data from Data City Sabae to predict water level gauge values by machine learning Part 2
