This entry is a sequel to my earlier article, "I tried to predict the presence or absence of snow by machine learning". That time I predicted only the presence or absence of snow (1 or 0); this time I went a little further and tried to predict how the amount of snow changes over time.
To show the results first, they look like this. The horizontal axis is the number of days, and the vertical axis is the amount of snow (cm).
Result 1 (blue is the actual amount of snow, red line is the predicted amount of snow)
Result 2 (blue is the actual amount of snow, red line is the predicted amount of snow)
Please read the following to find out what "Result 1" and "Result 2" are respectively.
Previously, in "I tried to predict the presence or absence of snow by machine learning", I used scikit-learn to predict the presence or absence of snow. This time I got a little greedy and wanted to predict the actual amount of snow (cm) over a certain period, not just whether there was any.
Specifically, I obtain meteorological data such as snow cover, wind speed, and temperature provided by the Japan Meteorological Agency, use the first 7500 days or so for training, predict the change in snow cover for the remaining 2 years (365 x 2 = 730 days), and compare the prediction with the actual change in snow cover.
The training data is the data published by the Japan Meteorological Agency. For how to obtain it, please refer to the earlier article, "I tried to predict the presence or absence of snow by machine learning".
The obtained CSV data looks like this. The target was Tonami City in Toyama, which has a lot of snow.
data_2013_2015.csv
Download time: 2016/03/20 20:31:19
,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami
Date and time,temperature(℃),temperature(℃),temperature(℃),Snow cover(cm),Snow cover(cm),Snow cover(cm),wind speed(m/s),wind speed(m/s),wind speed(m/s),wind speed(m/s),wind speed(m/s),Precipitation(mm),Precipitation(mm),Precipitation(mm)
,,,,,,,,,Wind direction,Wind direction,,,,
,,quality information,Homogeneous number,,quality information,Homogeneous number,,quality information,,quality information,Homogeneous number,,quality information,Homogeneous number
2013/2/1 1:00:00,-3.3,8,1,3,8,1,0.4,8,West,8,1,0.0,8,1
2013/2/1 2:00:00,-3.7,8,1,3,8,1,0.3,8,North,8,1,0.0,8,1
2013/2/1 3:00:00,-4.0,8,1,3,8,1,0.2,8,Quiet,8,1,0.0,8,1
2013/2/1 4:00:00,-4.8,8,1,3,8,1,0.9,8,South-southeast,8,1,0.0,8,1
...
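As a rough sketch of how one data row of this CSV maps to values, the snippet below picks out the measured columns by position. The column positions follow the sample rows above; the `parse_row` helper and its key names are made up for illustration, and the header lines are deliberately left out of the sample string.

```python
import csv
import io

# Two data rows copied from the sample above (header lines omitted on purpose).
SAMPLE = """2013/2/1 1:00:00,-3.3,8,1,3,8,1,0.4,8,West,8,1,0.0,8,1
2013/2/1 2:00:00,-3.7,8,1,3,8,1,0.3,8,North,8,1,0.0,8,1
"""


def parse_row(row):
    """Pick the measured values out of one data row by column position.

    Hypothetical helper: the key names are illustrative only.
    """
    return {
        "datetime": row[0],
        "temperature": float(row[1]),     # degrees Celsius
        "snow_cover": int(row[4]),        # cm
        "wind_speed": float(row[7]),      # m/s
        "precipitation": float(row[12]),  # mm
    }


records = [parse_row(row) for row in csv.reader(io.StringIO(SAMPLE))]
print(records[0])
```

The other columns (quality information, homogeneous number, wind direction) are simply skipped here, which matches how the script further down only reads columns 0, 1, 4, 7 and 12.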
The idea is probably the standard one for this kind of prediction: train the model on several kinds of peripheral data paired with the resulting amount of snow, then give the trained model only the peripheral data and get back a predicted amount of snow. This is so-called "supervised learning".
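In this case, the following data was used as peripheral data: the temperature and wind speed on the day, the previous day's snow cover, and the temperature and wind speed for each of the previous three days.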
Expressed as an image, it looks like this.
[temperature, wind speed, yesterday's snow cover, temperature 1 day ago, temperature 2 days ago, temperature 3 days ago, wind speed 1 day ago, wind speed 2 days ago, wind speed 3 days ago] → snow cover on the day
[temperature, wind speed, yesterday's snow cover, temperature 1 day ago, temperature 2 days ago, temperature 3 days ago, wind speed 1 day ago, wind speed 2 days ago, wind speed 3 days ago] → snow cover on the day
[temperature, wind speed, yesterday's snow cover, temperature 1 day ago, temperature 2 days ago, temperature 3 days ago, wind speed 1 day ago, wind speed 2 days ago, wind speed 3 days ago] → snow cover on the day
....
[temperature, wind speed, yesterday's snow cover, temperature 1 day ago, temperature 2 days ago, temperature 3 days ago, wind speed 1 day ago, wind speed 2 days ago, wind speed 3 days ago] → snow cover on the day
Then, using the model trained this way, give it only the peripheral data and get the predicted value:
[temperature, wind speed, yesterday's snow cover, temperature 1 day ago, temperature 2 days ago, temperature 3 days ago, wind speed 1 day ago, wind speed 2 days ago, wind speed 3 days ago] → (predicted snow cover on the day)
That is how I did it. Basically, the values given are for the forecast target date itself; only yesterday's snow cover comes from the day before the target date. Of all the inputs, it also seemed to have the largest influence. Well, when you think about it, that is only natural.
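To make the shape of these feature vectors concrete, here is a minimal sketch that builds them from three daily series. It is not the code used in this article: `build_samples` is a hypothetical helper, and the five days of values are made up for illustration.

```python
def build_samples(temps, winds, snows, lags=3):
    """Build (features, target) pairs from daily series.

    features = [temperature, wind speed, snow cover of the previous day,
                temperature 1..lags days ago, wind speed 1..lags days ago]
    target   = snow cover on the day
    """
    features, targets = [], []
    # The first `lags` days have no complete history, so start after them.
    for day in range(lags, len(snows)):
        sample = [temps[day], winds[day], snows[day - 1]]
        sample += [temps[day - k] for k in range(1, lags + 1)]
        sample += [winds[day - k] for k in range(1, lags + 1)]
        features.append(sample)
        targets.append(snows[day])
    return features, targets


# Made-up values for five consecutive days, purely for illustration.
temps = [1, 0, -1, -2, -3]
winds = [0.5, 0.4, 0.3, 0.2, 0.1]
snows = [0, 2, 5, 9, 14]

X, y = build_samples(temps, winds, snows)
print(X[0], "->", y[0])
```

With three days of history required, five days of input yield only two usable samples, which is the same reason the script below starts emitting training rows only once its three-day buffers are full.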
As I wrote at the beginning, I will use the data for about 7500 days from the data obtained from the Japan Meteorological Agency for learning, predict the change in snow cover for the remaining 2 years, and compare it with the actual change in snow cover.
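The split described here is chronological rather than random: the newest two years are held back and everything older is used for training. A minimal sketch of that idea follows; `split_by_time` is a hypothetical helper, and the record count of 8230 is invented so that the training part comes out at 7500.

```python
def split_by_time(samples, targets, holdout=365 * 2):
    """Keep the last `holdout` records for evaluation and train on the rest."""
    cut = len(samples) - holdout
    if cut <= 0:
        raise ValueError("not enough records for the requested hold-out")
    return (samples[:cut], targets[:cut]), (samples[cut:], targets[cut:])


# Dummy data: 8230 records (an invented count), one feature each.
samples = [[d] for d in range(8230)]
targets = list(range(8230))

(train_x, train_y), (test_x, test_y) = split_by_time(samples, targets)
print(len(train_x), len(test_x))
```

Keeping the order intact matters here because each sample contains values from the preceding days; shuffling before splitting would let information from the evaluation period leak into training.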
The actual code looks like this:
snow_forecaster.py
import csv
import numpy as np
from matplotlib import pyplot
from sklearn import linear_model


class SnowForecast:

    def __init__(self):
        u"""Initialize each instance variable"""
        self.model = None  # Trained model
        self.data = []  # Array of training data
        self.target = []  # Array of actual snow cover
        self.predicts = []  # Array of predicted snow cover
        self.reals = []  # Array of actual snow cover
        self.day_counts = []  # Array of days elapsed since the start date
        self.date_list = []
        self.record_count = 0

    def load_csv(self):
        u"""Read a CSV file for learning"""
        with open("sample_data/data.csv", "r") as f:
            reader = csv.reader(f)
            accumulation_yesterday0 = 0
            date_yesterday = ""
            temp_3days = []
            wind_speed_3days = []
            for row in reader:
                # Skip rows without a snow cover value
                if row[4] == "":
                    continue
                daytime = row[0]  # "yyyy/mm/dd HH:MM:SS"
                date = daytime.split(" ")[0]  # "yyyy/mm/dd"
                temp = int(float(row[1]))  # Temperature. Has a slight effect
                wind_speed = float(row[7])  # Wind speed. Has a slight effect
                precipitation = float(row[12])  # Precipitation. No effect (unused)
                accumulation = int(row[4])  # Snow cover. Yesterday's value has a big effect
                if len(wind_speed_3days) == 3:
                    # Training data
                    # [temperature, wind speed, yesterday's snow cover,
                    #  temperature 1/2/3 days ago, wind speed 1/2/3 days ago]
                    sample = [temp, wind_speed, accumulation_yesterday0]
                    sample.extend(temp_3days)
                    sample.extend(wind_speed_3days)
                    self.data.append(sample)
                    self.target.append(accumulation)
                if date_yesterday != date:
                    # The date changed: update the history kept for later rows
                    accumulation_yesterday0 = accumulation
                    self.date_list.append(date)
                    wind_speed_3days.insert(0, wind_speed)
                    if len(wind_speed_3days) > 3:
                        wind_speed_3days.pop()
                    temp_3days.insert(0, temp)
                    if len(temp_3days) > 3:
                        temp_3days.pop()
                date_yesterday = date
        self.record_count = len(self.data)
        return self.data

    def train(self):
        u"""Generate a learning model. Use roughly the first 7500 days of the original data for training"""
        x = self.data
        y = self.target
        print(len(x))
        # Of ElasticNetCV, LassoCV and RidgeCV, ElasticNetCV gave the smallest error, so use it
        model = linear_model.ElasticNetCV(fit_intercept=True)
        model.fit(x[0:self.training_data_count()], y[0:self.training_data_count()])
        self.model = model

    def predict(self):
        u"""Predict using the learning model. Forecast the last two years"""
        x = self.data
        y = self.target
        model = self.model
        for i, xi in enumerate(x):
            real_val = y[i]
            # Records in the training period are not predicted; store 0 as a placeholder
            if i < self.training_data_count() + 1:
                self.predicts.append(0)
                self.reals.append(real_val)
                self.day_counts.append(i)
                continue
            predict_val = int(model.predict([xi])[0])
            # If the predicted snow cover is below 0, set it to 0
            if predict_val < 0:
                predict_val = 0
            self.predicts.append(predict_val)
            self.reals.append(real_val)
            self.day_counts.append(i)

    def show_graph(self):
        u"""Compare predicted and measured values with a graph"""
        start = self.predict_start_num()
        pyplot.plot(self.day_counts[start:], self.reals[start:], "b")
        pyplot.plot(self.day_counts[start:], self.predicts[start:], "r")
        pyplot.show()

    def check(self):
        u"""Measure the error between the predicted and the measured values"""
        start = self.predict_start_num()
        p = np.array(self.predicts[start:])
        e = p - np.array(self.reals[start:])
        # RMSE over the held-out prediction period (not a cross-validation score)
        rmse = np.sqrt(np.mean(e * e))
        print("RMSE (hold-out): {}".format(rmse))

    def training_data_count(self):
        u"""Leave the last two years and use the data before that as training data. Returns the number of training records"""
        return self.record_count - 365 * 2

    def predict_start_num(self):
        u"""The last two years are predicted and compared with the measured values. Returns the index where prediction starts"""
        return self.training_data_count() + 1


if __name__ == "__main__":
    forecaster = SnowForecast()
    forecaster.load_csv()
    forecaster.train()
    forecaster.predict()
    forecaster.check()
    forecaster.show_graph()
As in the previous article, the most tedious part was building the training data from the raw data. Still, it is easy enough because it is Python.
So, the execution result is as follows (blue is the actual amount of snow, red line is the predicted amount of snow). This is the first "result 1" shown.
The prediction looks fairly plausible.
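Besides eyeballing the graph, the script's `check` method prints a root-mean-square error over the two-year prediction period. As a small stand-alone illustration of that measure (the `rmse` helper and the numbers are made up, not results from this article):

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up snow cover values in cm, purely for illustration.
predicted = [0, 3, 10, 12]
actual = [0, 5, 9, 14]
print(rmse(predicted, actual))  # 1.5
```

Because the errors are squared before averaging, a few days with a large miss weigh more heavily than many days with a small one, and the result is in the same unit as the data (cm).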
At this point, a doubt suddenly occurred to me: "But I am predicting by giving the model the previous day's snow cover, so if I actually tried to use this for future forecasts, I could only predict tomorrow's snow cover...?"
Well, yes, I know. By that logic the same goes for temperature and wind speed. But, you see, those are available from weather forecasts... ahem, ahem.
So I immediately modified the code accordingly.
The training part of the model is unchanged. Of the data given when predicting, yesterday's snow cover is replaced with the value the model itself predicted for the previous day, instead of the actual measurement.
The code is as follows. Only the predict function has changed.
snow_forecaster.py
import csv
import numpy as np
from matplotlib import pyplot
from sklearn import linear_model


class SnowForecast:

    def __init__(self):
        u"""Initialize each instance variable"""
        self.model = None  # Trained model
        self.data = []  # Array of training data
        self.target = []  # Array of actual snow cover
        self.predicts = []  # Array of predicted snow cover
        self.reals = []  # Array of actual snow cover
        self.day_counts = []  # Array of days elapsed since the start date
        self.date_list = []
        self.record_count = 0

    def load_csv(self):
        u"""Read a CSV file for learning"""
        with open("sample_data/data.csv", "r") as f:
            reader = csv.reader(f)
            accumulation_yesterday0 = 0
            date_yesterday = ""
            temp_3days = []
            wind_speed_3days = []
            for row in reader:
                # Skip rows without a snow cover value
                if row[4] == "":
                    continue
                daytime = row[0]  # "yyyy/mm/dd HH:MM:SS"
                date = daytime.split(" ")[0]  # "yyyy/mm/dd"
                temp = int(float(row[1]))  # Temperature. Has a slight effect
                wind_speed = float(row[7])  # Wind speed. Has a slight effect
                precipitation = float(row[12])  # Precipitation. No effect (unused)
                accumulation = int(row[4])  # Snow cover. Yesterday's value has a big effect
                if len(wind_speed_3days) == 3:
                    # Training data
                    # [temperature, wind speed, yesterday's snow cover,
                    #  temperature 1/2/3 days ago, wind speed 1/2/3 days ago]
                    sample = [temp, wind_speed, accumulation_yesterday0]
                    sample.extend(temp_3days)
                    sample.extend(wind_speed_3days)
                    self.data.append(sample)
                    self.target.append(accumulation)
                if date_yesterday != date:
                    # The date changed: update the history kept for later rows
                    accumulation_yesterday0 = accumulation
                    self.date_list.append(date)
                    wind_speed_3days.insert(0, wind_speed)
                    if len(wind_speed_3days) > 3:
                        wind_speed_3days.pop()
                    temp_3days.insert(0, temp)
                    if len(temp_3days) > 3:
                        temp_3days.pop()
                date_yesterday = date
        self.record_count = len(self.data)
        return self.data

    def train(self):
        u"""Generate a learning model. Use roughly the first 7500 days of the original data for training"""
        x = self.data
        y = self.target
        print(len(x))
        # Of ElasticNetCV, LassoCV and RidgeCV, ElasticNetCV gave the smallest error, so use it
        model = linear_model.ElasticNetCV(fit_intercept=True)
        model.fit(x[0:self.training_data_count()], y[0:self.training_data_count()])
        self.model = model

    def predict(self):
        u"""Predict snow cover using the learning model. Forecast the last two years"""
        x = self.data
        y = self.target
        model = self.model
        yesterday_predict_val = None  # Holds the value predicted for the previous day
        for i, xi in enumerate(x):
            real_val = y[i]
            # Records in the training period are not predicted; store 0 as a placeholder
            if i < self.training_data_count() + 1:
                self.predicts.append(0)
                self.reals.append(real_val)
                self.day_counts.append(i)
                continue
            # Replace yesterday's measured snow cover with yesterday's prediction.
            # Work on a copy so that self.data itself is not modified.
            xi = list(xi)
            if yesterday_predict_val is not None:
                xi[2] = yesterday_predict_val
            predict_val = int(model.predict([xi])[0])
            # If the predicted snow cover is below 0, set it to 0
            if predict_val < 0:
                predict_val = 0
            self.predicts.append(predict_val)
            self.reals.append(real_val)
            self.day_counts.append(i)
            yesterday_predict_val = predict_val

    def show_graph(self):
        u"""Compare predicted and measured values with a graph"""
        start = self.predict_start_num()
        pyplot.plot(self.day_counts[start:], self.reals[start:], "b")
        pyplot.plot(self.day_counts[start:], self.predicts[start:], "r")
        pyplot.show()

    def check(self):
        u"""Measure the error between the predicted and the measured values"""
        start = self.predict_start_num()
        p = np.array(self.predicts[start:])
        e = p - np.array(self.reals[start:])
        # RMSE over the held-out prediction period (not a cross-validation score)
        rmse = np.sqrt(np.mean(e * e))
        print("RMSE (hold-out): {}".format(rmse))

    def training_data_count(self):
        u"""Leave the last two years and use the data before that as training data. Returns the number of training records"""
        return self.record_count - 365 * 2

    def predict_start_num(self):
        u"""The last two years are predicted and compared with the measured values. Returns the index where prediction starts"""
        return self.training_data_count() + 1


if __name__ == "__main__":
    forecaster = SnowForecast()
    forecaster.load_csv()
    forecaster.train()
    forecaster.predict()
    forecaster.check()
    forecaster.show_graph()
The result is as follows (blue is the actual amount of snow, red line is the predicted amount of snow). This is "Result 2" shown at the beginning.
Hmm. As expected, it is less accurate than when yesterday's actual snow cover was given. Still, the shape of the curve is not thrown off all that badly.
I had expected the prediction to fall apart much more, so I think it did reasonably well. However, although I glossed over it earlier, the temperature and wind speed given at prediction time are still the measured values for the day. To forecast over some future period, you would have to use separately predicted values for them, or stop using them altogether, and if you use predicted values the accuracy will drop, more so the further ahead you go. In other words, when you predict from a predicted value, then predict again from that result, and so on, a slight error in an early step grows larger and larger in later steps. That is why the Japan Meteorological Agency works so hard at this.
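The way errors pile up when predictions are fed back in can be seen even in a toy setting. The sketch below is only an illustration under invented assumptions: `one_step_model` is a hypothetical hand-written rule standing in for a trained model, and the temperature and snow values are made up. It compares predicting each day from the measured value of the day before with predicting each day from the model's own previous output.

```python
def one_step_model(prev_snow, temp):
    """Hypothetical stand-in for a trained model (not the one in this article).

    Keeps 90% of the previous snow cover, adds 2 cm below freezing,
    removes 1 cm otherwise, and never goes below 0.
    """
    return max(0.0, 0.9 * prev_snow + (2.0 if temp < 0 else -1.0))


def forecast_with_actuals(actual_snow, temps):
    """Predict each day from the measured snow cover of the day before."""
    return [one_step_model(actual_snow[d - 1], temps[d])
            for d in range(1, len(temps))]


def forecast_recursively(first_snow, temps):
    """Predict each day from the model's own prediction for the day before."""
    preds, prev = [], first_snow
    for d in range(1, len(temps)):
        prev = one_step_model(prev, temps[d])
        preds.append(prev)
    return preds


# Made-up daily values, purely for illustration.
temps = [-1, -2, -1, 0, 1, 2]
actual = [10, 12, 15, 16, 14, 11]

direct = forecast_with_actuals(actual, temps)
recursive = forecast_recursively(actual[0], temps)
for day, (a, p1, p2) in enumerate(zip(actual[1:], direct, recursive), start=1):
    print(day, a, round(p1, 2), round(p2, 2))
```

On the first day both approaches give the same value, since both start from the same measurement; from the second day on, the recursive forecast carries its earlier misses forward, so with these numbers it ends up further from the measured values than the forecast that is corrected by a fresh measurement every day.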