In this blog post, we use deep learning to forecast the sales of the snack bar my mother runs. As the manager, she constantly struggles with spending decisions such as inventory management, staffing, capital investment, and business expansion. I thought that a highly accurate sales forecast could ease that burden as much as possible. It has been a month since I started studying machine learning, and I had been wanting to build a model with real data. I'm a little worried about whether I can pull it off, but **I'd like to take on the challenge while reviewing what I have learned so far!** I also hope this article will be a useful reference for people who are just starting to study machine learning.
There are many types of machine learning algorithms, but the underlying model-building process is the same for all of them. The basic flow of machine learning is as follows, and we will build the model in this post with that flow in mind.
1. Data collection
2. Data preprocessing (remove duplicates and missing data to improve data quality)
3. Train a model on the data using a machine learning technique
4. Test performance with test data
Building a sales forecast model for the snack bar is a time series analysis problem. Time series data is data that changes over time. Time series analysis can also be applied to forecasting company sales and product sales, so it is considered a very important analysis technique in business. In this post, we consider two algorithms that apply deep learning: **RNN (Recurrent Neural Network)** and **LSTM (Long Short-Term Memory)**.
RNN (Recurrent Neural Network)
An RNN is a deep learning algorithm for analyzing time series data. Its defining feature is that the output of the hidden layer at the previous time step is fed back into the hidden layer together with the current input. This lets the RNN pass information forward while preserving the context of the data in the hidden layer, and that property makes it possible to **learn from data that has a notion of time**. (Quoted from https://qiita.com/KojiOhki/items/89cd7b69a8a6239d67ca)
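The self-loop described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the layer sizes, the random weights, and the names `rnn_forward`, `W_x`, `W_h`, and `b` are made up here and are not taken from Keras or from the quoted article.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a minimal RNN over a sequence and return the hidden state at every step."""
    hidden = np.zeros(W_h.shape[0])      # hidden state before the first time step
    states = []
    for x_t in inputs:                   # one iteration per time point
        # The previous hidden state is fed back in together with the current input.
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.array(states)

rng = np.random.default_rng(0)
n_features, n_hidden, n_steps = 1, 4, 6  # hypothetical sizes
W_x = rng.normal(scale=0.5, size=(n_hidden, n_features))
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

sequence = rng.normal(size=(n_steps, n_features))  # toy input, not the sales data
states = rnn_forward(sequence, W_x, W_h, b)
print(states.shape)  # (6, 4): one hidden vector per time step
```

Because `hidden` is reused on every iteration, each state depends on all earlier inputs, which is what "preserving the context" means in practice.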
RNNs make it possible to analyze time series data with deep learning, but their performance is limited. The cause lies in the loop structure: when the error is propagated back through time, the gradient is multiplied by the recurrent weights and by the derivative of the activation function once for every time step. Over many time steps, this repeated multiplication makes the gradient either shrink toward zero (the **vanishing gradient** problem) or grow exponentially (the **exploding gradient** problem). Either way, training becomes difficult, which is why simple RNNs are not well suited to learning long time series. The deep learning model that addresses this drawback is LSTM (Long Short-Term Memory), introduced next.
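To get a feel for why repeated multiplication is a problem, here is a deliberately simplified sketch. It assumes every time step scales the gradient by the same constant factor, which a real network does not do; the function name `backprop_factor` and the factors 0.5 and 1.5 are hypothetical.

```python
import numpy as np

def backprop_factor(per_step_factor, n_steps):
    """Gradient scale after 1..n_steps steps when each step multiplies it
    by the same factor (a simplification of backpropagation through time)."""
    return per_step_factor ** np.arange(1, n_steps + 1)

shrinking = backprop_factor(0.5, 30)  # factor below 1: the gradient vanishes
growing = backprop_factor(1.5, 30)    # factor above 1: the gradient explodes

print(shrinking[-1])  # on the order of 1e-9
print(growing[-1])    # on the order of 1e5
```

After only 30 steps the two cases differ by about fourteen orders of magnitude, so a plain RNN has trouble keeping gradients in a usable range over long sequences.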
LSTM (Long Short-Term Memory)
In the LSTM model, the cells in the hidden layer are replaced with LSTM blocks, which overcomes the RNN's difficulty in learning while retaining long-term memory. The basic components of an LSTM block are as follows.
・CEC (Constant Error Carousel): a unit that stores past data
・Input gate: a gate that adjusts the weight of the input coming from the previous unit
・Output gate: a gate that adjusts the weight of the output passed on from the unit
・Forget gate: a gate that adjusts how much of the CEC's contents, which hold past information, is kept
(Quoted from https://sagantaf.hatenablog.com/entry/2019/06/04/225239)
In an LSTM, the gates described above make it possible to delete or add information depending on the cell state. By adjusting the input and output weights and controlling the data kept in the cell, the vanishing and exploding gradient problems that were the drawback of RNNs are greatly reduced. As a result, LSTM can be applied to the analysis of long time series.
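The way the gates and the CEC interact can be written out as one time step of a standard LSTM cell. This is a minimal NumPy sketch, not the Keras implementation: the function `lstm_step`, the parameter names `W`, `U`, `b`, the stacking order of the gates, and the sizes are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of the four parts
    in the order: input gate, forget gate, output gate, candidate values."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:n])            # input gate: how much new information to write
    f = sigmoid(z[n:2 * n])        # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2 * n:3 * n])    # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * n:4 * n])    # candidate values to write into the cell
    c = f * c_prev + i * g         # cell state (the CEC) is updated additively
    h = o * np.tanh(c)             # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
n_features, n_hidden = 1, 3        # hypothetical sizes
W = rng.normal(scale=0.5, size=(4 * n_hidden, n_features))
U = rng.normal(scale=0.5, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_features)):  # toy sequence of 5 steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h, c)
```

The line `c = f * c_prev + i * g` is the key design point: the cell state is carried forward by a gated addition rather than being squashed through an activation at every step, which is what helps gradients survive over longer spans.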
That covers the theory behind RNN and LSTM. The snack bar sales data consists of 84 monthly records covering 7 years. Since this is a medium- to long-term series, I would like to make predictions with both the RNN and the LSTM model and adopt whichever gives the better result. Now let's actually build the model.
OS: Windows10
Python environment: Jupyter Notebook
Forecast/
|- Forecast.py (Python file)
|- sales_data/
   |- various CSV files
Model construction follows the flow introduced earlier.
1. Data collection
2. Data preprocessing (remove duplicates and missing data to improve data quality)
3. Train a model on the data using a machine learning technique
4. Test performance with test data
First, import the required modules. Write the following code in the execution environment.
Forecast.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
First comes data collection. After some desperate pleading, I obtained the sales data for the snack bar my mother runs. The data was an Excel file summarizing monthly sales for 2013-2019; I tidied up the format and exported it as CSV.
Monthly sales record of the snack bar for 2013-2019.
                        sales
Number of data          8.400000e+01
Average                 7.692972e+05
Standard deviation      1.001658e+05
Minimum value           5.382170e+05
First quartile (25%)    7.006952e+05
Median                  7.594070e+05
Third quartile (75%)    8.311492e+05
Maximum value           1.035008e+06
The average monthly sales for each year are shown below. Sales appear to trend upward overall, although not in every year.
2019: 801197 yen
2018: 822819 yen
2017: 732294 yen
2016: 755799 yen
2015: 771255 yen
2014: 761587 yen
2013: 740128 yen
Averaged by month, sales are highest in December, followed by April. This is probably because many year-end parties are held in December and many parties are held in April at the start of the new year. I hope the model can predict these tendencies.
January: 758305 yen
February: 701562 yen
March: 750777 yen
April: 805094 yen
May: 785633 yen
June: 778146 yen
July: 752226 yen
August: 763773 yen
September: 689561 yen
October: 765723 yen
November: 779661 yen
December: 901100 yen
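The summary table and the per-year and per-month averages above can be produced with pandas. The sketch below uses randomly generated numbers in place of the real CSV files, so its output will not match the figures in this post; the column name `sales` follows the table above, while the variable names are my own.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 84 monthly values from 2013-01 to 2019-12.
idx = pd.date_range("2013-01-01", periods=84, freq="MS")
rng = np.random.default_rng(0)
df = pd.DataFrame({"sales": rng.integers(550_000, 1_000_000, size=84)}, index=idx)

summary = df["sales"].describe()                          # count, mean, std, quartiles
yearly_mean = df["sales"].groupby(df.index.year).mean()   # one average per year
monthly_mean = df["sales"].groupby(df.index.month).mean() # one average per calendar month

print(summary)
print(yearly_mean.round(0))
print(monthly_mean.round(0))
```

Grouping by `df.index.month` pools every January together, every February together, and so on, which is how the December and April peaks show up.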
Trend: the trend is the long-term tendency of the data. For this project, it shows whether the snack bar's sales are increasing or decreasing in the long run.
Periodic fluctuation: in data with periodic fluctuations, the values rise and fall repeatedly as time passes. Periodic fluctuations with a one-year cycle are called seasonal fluctuations. For this project, we found that average sales are high in December and April, when there are many parties, so there may well be a seasonal fluctuation.
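trend_seasonal.py
import statsmodels.api as sm
#Decompose sales into trend and seasonal components with a 12-month period.
#Assumes df_sales_concat has already been built as in Forecast.py.
#Recent statsmodels versions take period= (older versions used freq=).
fig = sm.tsa.seasonal_decompose(df_sales_concat, period=12).plot()
plt.show()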
As expected, sales show an upward trend. There is also a periodic fluctuation, with sales rising in April and December. It would be nice if the machine learning model could predict these patterns.
The autocovariance of a time series is **the covariance between values of the same series at different time points**. The k-th order autocovariance is the covariance between the series and itself shifted by k time points. Viewed as a function of k, this is the autocovariance function; dividing it by the variance (the order-0 autocovariance) gives the **autocorrelation function**, and a **graph of that function is called a correlogram**.
corr.py
import statsmodels.api as sm
#Compute the autocorrelation coefficients up to lag 12.
df_sales_concat_acf = sm.tsa.stattools.acf(df_sales_concat, nlags=12)
print(df_sales_concat_acf)
#Draw the correlogram.
fig = sm.graphics.tsa.plot_acf(df_sales_concat, lags=12)
plt.show()
From the correlogram, we can see that the autocorrelation coefficient rises at k = 12. Because the data records monthly sales, this means that each month's sales are correlated with the sales of the same month one year earlier.
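To make the lag-12 peak less of a black box, the autocorrelation coefficient can also be computed by hand. The sketch below uses the common estimator (lagged cross-products divided by the total sum of squares); I believe this matches the library's default, but treat that as an assumption. The function name `autocorr` and the toy sine series are made up for illustration and are not the sales data.

```python
import numpy as np

def autocorr(x, k):
    """Sample autocorrelation of x at lag k."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    if k == 0:
        return 1.0
    # Pair each value with the value k steps earlier.
    return np.sum(centered[k:] * centered[:-k]) / denom

# Toy monthly series with an exact 12-month cycle (7 years = 84 points).
months = np.arange(84)
toy = 100 + 10 * np.sin(2 * np.pi * months / 12)

acf = [autocorr(toy, k) for k in range(13)]
for k, value in enumerate(acf):
    print(k, round(value, 3))
```

For a series with a 12-month cycle, the coefficient is strongly negative at lag 6 (half a cycle) and strongly positive at lag 12 (a full cycle), which is the same pattern the correlogram of the sales data hints at.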
The data actually used is published in the Google Sheets link below. https://docs.google.com/spreadsheets/d/1-eOPORhaGfSCdXCScSsBsM586yXkt3e_xbOlG2K6zN8/edit?usp=sharing
Forecast.py
#Read the CSV file in DataFrame format.
df_2019 = pd.read_csv('./sales_data/2019_sales.csv')
df_2018 = pd.read_csv('./sales_data/2018_sales.csv')
df_2017 = pd.read_csv('./sales_data/2017_sales.csv')
df_2016 = pd.read_csv('./sales_data/2016_sales.csv')
df_2015 = pd.read_csv('./sales_data/2015_sales.csv')
df_2014 = pd.read_csv('./sales_data/2014_sales.csv')
df_2013 = pd.read_csv('./sales_data/2013_sales.csv')
#Combine the read DataFrames into one DataFrame.
df_sales_concat = pd.concat([df_2013, df_2014, df_2015,df_2016,df_2017,df_2018,df_2019], axis=0)
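#Create a monthly (month-end) DatetimeIndex with 84 entries for the DataFrame.
#Note: newer pandas versions use freq='ME' for month-end; 'M' may emit a deprecation warning there.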
index = pd.date_range("2013-01", "2019-12-31", freq='M')
df_sales_concat.index = index
#Delete unnecessary DataFrame columns.
del df_sales_concat['month']
#Store only the actual sales data used in model building in the dataset variable.
dataset = df_sales_concat.values
dataset = dataset.astype('float32')
Next is data preprocessing. Specifically, we will create a dataset to be used for model construction.
Forecast.py
#Split into training data and test data. The ratio is roughly training:test = 2:1.
train_size = int(len(dataset) * 0.67)
train, test = dataset[0:train_size, :], dataset[train_size:len(dataset), :]
#Data scaling.
#Fit the min-max scaler on the training data only, then apply it to both sets.
scaler = MinMaxScaler(feature_range=(0, 1))
scaler_train = scaler.fit(train)
train_scale = scaler_train.transform(train)
test_scale = scaler_train.transform(test)
#Create the datasets (create_dataset is defined in the full code at the end of this article).
look_back = 1
train_X, train_Y = create_dataset(train_scale, look_back)
test_X, test_Y = create_dataset(test_scale, look_back)
#Create unscaled datasets for evaluation.
train_X_original, train_Y_original = create_dataset(train, look_back)
test_X_original, test_Y_original = create_dataset(test, look_back)
#Reshape the inputs to (samples, time steps, features), as the Keras recurrent layers expect.
train_X = train_X.reshape(train_X.shape[0], train_X.shape[1], 1)
test_X = test_X.reshape(test_X.shape[0], test_X.shape[1], 1)
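It may help to see what this preprocessing does to a tiny series. The sketch below repeats the `create_dataset` function from the full code at the end of this article and applies it to six made-up values. As an assumption for readability, the min-max scaling is written out by hand instead of calling `MinMaxScaler`, and `look_back=2` is used so that the window is visible.

```python
import numpy as np

def create_dataset(dataset, look_back):
    """Turn a (n, 1) series into windows of look_back values (X) and the next value (Y)."""
    data_X, data_Y = [], []
    for i in range(look_back, len(dataset)):
        data_X.append(dataset[i - look_back:i, 0])
        data_Y.append(dataset[i, 0])
    return np.array(data_X), np.array(data_Y)

# Toy series of six values, split 4:2 into training and test parts.
series = np.array([[10.], [20.], [30.], [40.], [50.], [60.]], dtype="float32")
train, test = series[:4], series[4:]

# Min-max scaling using the training minimum and maximum only.
lo, hi = train.min(), train.max()
train_scale = (train - lo) / (hi - lo)
test_scale = (test - lo) / (hi - lo)   # test values above the training maximum exceed 1

X, Y = create_dataset(train_scale, look_back=2)
X = X.reshape(X.shape[0], X.shape[1], 1)  # (samples, time steps, features)

print(X.shape)             # (2, 2, 1)
print(Y)                   # targets: the value right after each window
print(test_scale.ravel())  # both values are greater than 1
```

Two things stand out. Each window loses `look_back` samples from the start of the series, so the already small dataset shrinks further. And because the scaler is fitted on the training data only, test values outside the training range fall outside [0, 1], which matters for data with an upward trend.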
Forecast.py
lstm_model = Sequential()
lstm_model.add(LSTM(64, return_sequences=True, input_shape=(look_back, 1)))
lstm_model.add(LSTM(32))
lstm_model.add(Dense(1))
lstm_model.compile(loss='mean_squared_error', optimizer='adam')
#Train the LSTM model
lstm_model.fit(train_X, train_Y, epochs=100, batch_size=64, verbose=2)
Forecast.py
rnn_model = Sequential()
rnn_model.add(SimpleRNN(64, return_sequences=True, input_shape=(look_back, 1)))
rnn_model.add(SimpleRNN(32))
rnn_model.add(Dense(1))
rnn_model.compile(loss='mean_squared_error', optimizer='adam')
#Train the RNN model
rnn_model.fit(train_X, train_Y, epochs=100, batch_size=64, verbose=2)
This completes training of both the LSTM and the RNN on the dataset. The basic steps are the same for both; only the layer class differs. Next comes the processing for plotting the results on a graph. In the code below the model is written as model, which stands for either lstm_model or rnn_model (for example, set model = lstm_model before running it).
Forecast.py
#Creating forecast data
train_predict = model.predict(train_X)
test_predict = model.predict(test_X)
#Undo the scaling so that the scaled values are converted back to actual sales amounts.
train_predict = scaler_train.inverse_transform(train_predict)
train_Y = scaler_train.inverse_transform([train_Y])
test_predict = scaler_train.inverse_transform(test_predict)
test_Y = scaler_train.inverse_transform([test_Y])
#Calculation of prediction accuracy
train_score = math.sqrt(mean_squared_error(train_Y_original, train_predict[:, 0]))
print(train_score)
print('Train Score: %.2f RMSE' % (train_score))
test_score = math.sqrt(mean_squared_error(test_Y_original, test_predict[:, 0]))
print('Test Score: %.2f RMSE' % (test_score))
#Data shaping for plots
train_predict_plot = np.empty_like(dataset)
train_predict_plot[:, :] = np.nan
train_predict_plot[look_back:len(train_predict)+look_back, :] = train_predict
train_predict_plot = pd.DataFrame({'sales':list(train_predict_plot.reshape(train_predict_plot.shape[0],))})
train_predict_plot.index = index
test_predict_plot = np.empty_like(dataset)
test_predict_plot[:, :] = np.nan
test_predict_plot[len(train_predict)+(look_back*2):len(dataset), :] = test_predict
test_predict_plot = pd.DataFrame({'sales':list(test_predict_plot.reshape(test_predict_plot.shape[0],))})
test_predict_plot.index = index
Next, we plot the actual data together with the predictions.
Forecast.py
#Output the meta information of the graph
plt.title("monthly-sales")
plt.xlabel("time(month)")
plt.ylabel("sales")
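#Plot the data. index is passed as x so the actual values share the date axis with the prediction DataFrames.
plt.plot(index, dataset, label='sales_dataset', c='green')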
plt.plot(train_predict_plot, label='train_data', c='red')
plt.plot(test_predict_plot, label='test_data', c='blue')
#Adjust the y-axis scale
plt.yticks([500000, 600000, 700000, 800000, 900000, 1000000, 1100000])
#Plot the graph
plt.legend()
plt.show()
The resulting graphs for the RNN and the LSTM are described in turn below.
RNN: the output values are completely off. I changed the parameters many times, but the results hardly changed. The simple RNN does not seem to be suitable even for a series of only 84 data points like this one.
LSTM: compared to the RNN forecast, the LSTM manages to follow the sales trend to some extent. In particular, it captures the tendency for sales to rise in April and December. Overall, however, many points deviate significantly from the measured values. The **RMSE**, the measure of model quality used here, is **Train Score: 94750.73 RMSE, Test Score: 115472.92 RMSE**, which are quite large values. The result does not seem to be very good.
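To judge whether an RMSE is "large", it helps to see how it is computed and to compare it with the typical size of the values being predicted. The sketch below writes RMSE out by hand on three made-up values, then divides the test RMSE reported above by the average monthly sales from the summary table. The function name `rmse` and the toy numbers are assumptions for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: the square root of the average squared difference."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy example: errors of 50,000, 50,000 and 100,000 yen.
toy_actual = [700_000, 800_000, 900_000]
toy_predicted = [750_000, 750_000, 800_000]
toy_rmse = rmse(toy_actual, toy_predicted)
print(round(toy_rmse, 2))  # 70710.68

# Figures reported in this post.
mean_sales = 769_297.2     # average monthly sales from the summary table
test_rmse = 115_472.92     # LSTM test score
relative_error = test_rmse / mean_sales
print(round(relative_error, 3))
```

The ratio comes out at roughly 0.15, meaning the typical test error is about 15% of an average month's sales. It is also larger than the standard deviation of the sales themselves (about 100,000 yen), which supports the conclusion that the result is not very good.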
The LSTM was able to produce a more detailed time series forecast even on a dataset where the simple RNN failed, presumably because of vanishing gradients. However, the LSTM forecast only shows the general sales trend, and many values deviate significantly from the actual ones. This is not a highly accurate sales forecast, and it cannot serve as a basis for management decisions.
For both the RNN and the LSTM, I tried various combinations of the parameters look_back, epochs, and batch_size, but the performance did not improve noticeably. The likely cause is **the small amount of data and its high variability.** When I looked into it afterwards, 84 data points turned out to be quite few for time series analysis. I realized that **data is the lifeblood** of building a machine learning model.
Even so, I think this was good material for reviewing my first month of machine learning, and I hope it is useful for other beginners. From here on, I will do my best to raise my level as a machine learning engineer!
Forecast.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
#Creating a dataset
def create_dataset(dataset, look_back):
    data_X, data_Y = [], []
    for i in range(look_back, len(dataset)):
        data_X.append(dataset[i-look_back:i, 0])
        data_Y.append(dataset[i, 0])
    return np.array(data_X), np.array(data_Y)
df_2019 = pd.read_csv('./sales_data/2019_sales.csv')
df_2018 = pd.read_csv('./sales_data/2018_sales.csv')
df_2017 = pd.read_csv('./sales_data/2017_sales.csv')
df_2016 = pd.read_csv('./sales_data/2016_sales.csv')
df_2015 = pd.read_csv('./sales_data/2015_sales.csv')
df_2014 = pd.read_csv('./sales_data/2014_sales.csv')
df_2013 = pd.read_csv('./sales_data/2013_sales.csv')
df_sales_concat = pd.concat([df_2013, df_2014, df_2015,df_2016,df_2017,df_2018,df_2019], axis=0)
index = pd.date_range("2013-01", "2019-12-31", freq='M')
df_sales_concat.index = index
del df_sales_concat['month']
dataset = df_sales_concat.values
dataset = dataset.astype('float32')
#Divide into training data and test data
train_size = int(len(dataset) * 0.67)
train, test = dataset[0:train_size, :], dataset[train_size:len(dataset), :]
#Data scaling
scaler = MinMaxScaler(feature_range=(0, 1))
scaler_train = scaler.fit(train)
train_scale = scaler_train.transform(train)
test_scale = scaler_train.transform(test)
#Data creation
look_back =1
train_X, train_Y = create_dataset(train_scale, look_back)
test_X, test_Y = create_dataset(test_scale, look_back)
#Creating a dataset for evaluation
train_X_original, train_Y_original = create_dataset(train, look_back)
test_X_original, test_Y_original = create_dataset(test, look_back)
#Data formatting
train_X = train_X.reshape(train_X.shape[0], train_X.shape[1], 1)
test_X = test_X.reshape(test_X.shape[0], test_X.shape[1], 1)
#Creating and training LSTM models
model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(look_back, 1)))
model.add(LSTM(32))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(train_X, train_Y, epochs=100, batch_size=64, verbose=2)
#Creating forecast data
train_predict = model.predict(train_X)
test_predict = model.predict(test_X)
#Revert the scaled data.
train_predict = scaler_train.inverse_transform(train_predict)
train_Y = scaler_train.inverse_transform([train_Y])
test_predict = scaler_train.inverse_transform(test_predict)
test_Y = scaler_train.inverse_transform([test_Y])
#Calculation of prediction accuracy
train_score = math.sqrt(mean_squared_error(train_Y_original, train_predict[:, 0]))
print(train_score)
print('Train Score: %.2f RMSE' % (train_score))
test_score = math.sqrt(mean_squared_error(test_Y_original, test_predict[:, 0]))
print('Test Score: %.2f RMSE' % (test_score))
#Data shaping for plots
train_predict_plot = np.empty_like(dataset)
train_predict_plot[:, :] = np.nan
train_predict_plot[look_back:len(train_predict)+look_back, :] = train_predict
train_predict_plot = pd.DataFrame({'sales':list(train_predict_plot.reshape(train_predict_plot.shape[0],))})
train_predict_plot.index = index
test_predict_plot = np.empty_like(dataset)
test_predict_plot[:, :] = np.nan
test_predict_plot[len(train_predict)+(look_back*2):len(dataset), :] = test_predict
test_predict_plot = pd.DataFrame({'sales':list(test_predict_plot.reshape(test_predict_plot.shape[0],))})
test_predict_plot.index = index
#Data plot
plt.title("monthly-sales")
plt.xlabel("time(month)")
plt.ylabel("sales")
plt.plot(index, dataset, label='sales_dataset', c='green')
plt.plot(train_predict_plot, label='train_data', c='red')
plt.plot(test_predict_plot, label='test_data', c='blue')
plt.yticks([500000, 600000, 700000, 800000, 900000, 1000000, 1100000])
plt.legend()
plt.show()