Purpose of the blog

In this blog, we will use deep learning to forecast the sales of the snack bar my mother runs. As the manager, she is always troubled by spending decisions such as inventory management, staffing, capital investment, and business expansion. I thought that a highly accurate sales forecast could ease some of that burden. It has been a month since I started studying machine learning, and I had been wanting to build a model with real data. I'm a little worried about whether I can pull it off, but **I'd like to take on the challenge while reviewing what I've learned so far!** I also hope this will serve as a reference for those who are just starting to study machine learning.

Flow of machine learning model construction

There are many types of machine learning algorithms, but the underlying model-building process is the same for all of them. The simple flow of machine learning is as follows, and we will build the model in this blog with this flow in mind.

1. Data collection
2. Data preprocessing (remove duplicates and missing data to improve data accuracy)
3. Train on the data using machine learning techniques
4. Test performance with test data

About time series data analysis

This snack bar sales forecast is a time series data analysis problem. Time series data is data that changes over time. Because time series analysis can be applied to forecasting company and product sales, it is considered a very important analysis technique in business. In this blog, we will consider two algorithms that apply deep learning: **RNN (Recurrent Neural Network)** and **LSTM (Long Short-Term Memory)**.

RNN (Recurrent Neural Network)

An RNN is a deep learning algorithm for analyzing time series data. Its defining feature is that the hidden (middle) layer loops back on itself: the hidden state from the previous time step is fed in again together with the current input. This lets an RNN carry the context of earlier data forward through the hidden layer, which makes it possible to **learn from data that has a notion of time**. (Quoted from https://qiita.com/KojiOhki/items/89cd7b69a8a6239d67ca)
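As a rough illustration of the self-loop described above, here is a minimal sketch of a one-unit recurrent layer in plain Python. The function name and the weights w_x, w_h, and b are made-up values for illustration, not trained ones, and this is not how Keras implements SimpleRNN internally.

```python
import math

def simple_rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit RNN over a sequence and return every hidden state.

    w_x, w_h and b are illustrative constants, not trained weights.
    """
    h = 0.0  # hidden state before the first time step
    states = []
    for x in inputs:
        # The previous hidden state h is fed back in together with the new input x.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first input is non-zero, yet later hidden states are still non-zero:
# the loop carries information from earlier time steps forward.
states = simple_rnn_forward([1.0, 0.0, 0.0])
print(states)
```

Even though the second and third inputs are zero, the hidden state does not drop to zero, which is the "context" an RNN keeps.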

Disadvantages of RNN

RNNs make it possible to analyze time series data with deep learning, but their performance is not especially high. The cause lies in the loop structure: during training, the gradient is multiplied by the recurrent weights and the derivative of the activation function once for every time step. Over long sequences this repeated multiplication makes the gradient either shrink toward zero (the **vanishing gradient** problem) or grow exponentially (the **exploding gradient** problem). Either way, training becomes difficult, which is why RNNs are not well suited to learning long time series. The deep learning model that addresses this drawback is LSTM (Long Short-Term Memory), introduced next.
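The effect of repeated multiplication is easy to see with plain numbers. The sketch below is only an analogy: it multiplies by a single made-up factor at every step, whereas a real network multiplies by weight matrices and activation derivatives that change from step to step.

```python
def repeated_product(factor, steps):
    """Multiply 1.0 by the same factor `steps` times, as a stand-in for the
    per-step gradient factor in backpropagation through time."""
    value = 1.0
    for _ in range(steps):
        value *= factor
    return value

# A factor slightly below 1 shrinks toward zero; slightly above 1 blows up.
for factor in (0.9, 1.1):
    print(factor, repeated_product(factor, 50))
```

After 50 steps, a factor of 0.9 leaves roughly 0.005 of the original signal, while a factor of 1.1 has grown to more than 100 times its starting value.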

LSTM (Long Short-Term Memory)

In the LSTM model, the cells in the hidden layer are replaced with LSTM blocks, which overcomes the RNN's difficulty in learning while retaining long-term memory. The basic components of an LSTM block are as follows.

・ CEC (Constant Error Carousel): A unit that stores past data
・ Input gate: A gate that adjusts how much of the new input is written to the CEC
・ Output gate: A gate that adjusts how much of the CEC's contents is passed on as output
・ Forget gate: A gate that adjusts how much of the past information held in the CEC is kept

(Quoted from https://sagantaf.hatenablog.com/entry/2019/06/04/225239)

In an LSTM, these gates make it possible to remove or add information depending on the cell state. By adjusting the input and output weights and regulating the data in the cell, LSTM largely mitigates the gradient problems that were the drawback of RNNs. It can therefore be applied to the analysis of long time series.
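To show how the gates interact, here is a minimal sketch of one time step of a single-unit LSTM in plain Python. It follows the commonly described LSTM equations; the function names, the parameter dictionary, and all weight values are hypothetical, and the gate that controls how much of the past cell contents is kept is called the forget gate here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM.

    p maps each gate name to a hypothetical (w_x, w_h, b) triple.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # input gate: how much new information to write
    f = sigmoid(pre("forget"))       # forget gate: how much of the old cell state to keep
    o = sigmoid(pre("output"))       # output gate: how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value to write into the cell
    c = f * c_prev + i * g           # updated cell state (the CEC)
    h = o * math.tanh(c)             # new hidden state / output
    return h, c

# Made-up weights, identical for every gate, just to run the step.
params = {name: (0.5, 0.5, 0.0) for name in ("input", "forget", "output", "candidate")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

The cell state is updated by addition (f * c_prev + i * g) rather than by being squashed through an activation at every step, which is what lets information survive over many time steps when the forget gate stays close to 1.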

That covers the theory of RNN and LSTM. The snack bar's sales data spans 7 years of monthly records, 84 data points in total. Since this is a medium- to long-term series, I would like to make predictions with both the RNN and LSTM models and adopt whichever gives better results. Now let's actually build the model.

Development environment

OS: Windows10
Python environment: Jupyter Notebook

File organization

Forecast-
        |-Forecast.py(python file)
        |-sales_data-
                    |-Various CSV files

Model construction flow

Model construction follows the flow introduced earlier.

1. Data collection
2. Data preprocessing (remove duplicates and missing data to improve data accuracy)
3. Train on the data using machine learning techniques
4. Test performance with test data

0. Required modules

First, import the required modules. Write the following code in the execution environment.

`Forecast.py`


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

1. Data collection

First is data collection. After some desperate pleading, I got the sales data for the snack bar my mother runs. I cleaned up the format of the 2013-2019 Excel data, which summarizes monthly sales, and exported it as CSV.

About collected data

Data description

Monthly sales records for the snack bar, 2013-2019.

Basic statistics

          　　　	sales
Number of data points   8.400000e+01
Mean                    7.692972e+05
Standard deviation      1.001658e+05
Minimum                 5.382170e+05
1st quartile (25%)      7.006952e+05
Median                  7.594070e+05
3rd quartile (75%)      8.311492e+05
Maximum                 1.035008e+06

Average sales by year

Sales seem to trend upward over the years, although not every year: the annual average dips in 2016 and 2017 before jumping in 2018.

2019  :801197 yen
2018  :822819 yen
2017  :732294 yen
2016  :755799 yen
2015  :771255 yen
2014  :761587 yen
2013  :740128 yen

Monthly average sales

Sales were highest in December, followed by April. This is probably because many drinking parties are held at the end of the year and at the start of the new fiscal year. I hope the model can predict these patterns.

January:758305 yen
February:701562 yen
March:750777 yen
April:805094 yen
May:785633 yen
June:778146 yen
July:752226 yen
August:763773 yen
September:689561 yen
October:765723 yen
November:779661 yen
December:901100 yen

Consideration of time-series periodic fluctuations and trends

What is a trend?

A trend is the long-term tendency of the data. Here, it shows whether the snack bar's sales are increasing or decreasing in the long run.

What is periodic fluctuation?

In data with periodic fluctuations, the value repeatedly rises and falls as time passes. In particular, periodic fluctuations with a one-year cycle are called seasonal fluctuations. For this data, we found that average sales are high in December and April, when there are many drinking parties, so there is probably a seasonal fluctuation.

`trend_seasonal.py`


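#Decompose the series into trend, seasonal, and residual components
import statsmodels.api as sm

#period=12 because the data is monthly (older statsmodels versions call this argument freq)
fig = sm.tsa.seasonal_decompose(df_sales_concat, period=12).plot()
plt.show()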

As expected, sales showed an upward trend. There was also a periodic fluctuation, with sales rising in April and December. It would be nice if the machine learning model could predict these patterns.
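To make the idea of a trend concrete, here is a minimal sketch that estimates it with a centered 12-month moving average on a made-up series. It is a simplified, hand-written stand-in and is not meant to reproduce what seasonal_decompose does internally; the function name and all numbers are my own assumptions.

```python
def centered_moving_average(values, period=12):
    """Centered moving average for an even period.

    Returns a list the same length as `values`, with None where the
    window does not fit.
    """
    half = period // 2
    trend = [None] * len(values)
    for t in range(half, len(values) - half):
        window = values[t - half:t + half + 1]
        # The two end points get half weight so each calendar month counts equally.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend

# Synthetic series: a linear trend plus a bump every 12th month (made-up numbers).
seasonal = [-1.0] * 11 + [11.0]  # seasonal pattern that sums to zero over a year
series = [100.0 + t + seasonal[t % 12] for t in range(48)]

trend = centered_moving_average(series)
# Subtracting the trend leaves the seasonal pattern.
detrended = [y - m for y, m in zip(series, trend) if m is not None]
print(trend[6], series[6])
print(detrended[:12])
```

Averaging over a full year cancels the seasonal ups and downs, so what remains is the long-term direction; subtracting it back out exposes the seasonal pattern.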

Time series autocovariance

The autocovariance of a time series is **the covariance of the same time series between different time points**. The k-th order autocovariance is the covariance with the data k time points away. Viewed as a function of k, it is called the autocovariance function, and dividing it by the variance gives the **autocorrelation function**. The graph of the autocorrelation function is called the **correlogram**.

`corr.py`


#Compute the autocorrelation coefficients and draw the correlogram
import statsmodels.api as sm

df_sales_concat_acf = sm.tsa.stattools.acf(df_sales_concat, nlags=12)
print(df_sales_concat_acf)
fig = sm.graphics.tsa.plot_acf(df_sales_concat, lags=12)
plt.show()

From the correlogram, we can see that the autocorrelation coefficient increases at k = 12. Since the data records monthly sales, this means each month's sales are correlated with the sales from one year earlier.
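For reference, the autocorrelation at lag k can also be computed by hand. The sketch below is a plain-Python version on a made-up series that repeats every 4 steps; it is an illustration of the definition and is not guaranteed to match the statsmodels output digit for digit.

```python
def autocorrelation(values, k):
    """Sample autocorrelation at lag k (autocovariance divided by the variance)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    covariance = sum(
        (values[t] - mean) * (values[t - k] - mean) for t in range(k, n)
    ) / n
    return covariance / variance

# Made-up series that repeats every 4 steps.
series = [1.0, 2.0, 3.0, 2.0] * 6
print([round(autocorrelation(series, k), 3) for k in range(5)])
```

The coefficient is strongly negative at lag 2 (half a cycle) and strongly positive at lag 4 (a full cycle), which is the same kind of peak the monthly sales data shows at k = 12.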

The data actually used is published at the Google Sheets link below. https://docs.google.com/spreadsheets/d/1-eOPORhaGfSCdXCScSsBsM586yXkt3e_xbOlG2K6zN8/edit?usp=sharing

`Forecast.py`


#Read the CSV file in DataFrame format.
df_2019 = pd.read_csv('./sales_data/2019_sales.csv')
df_2018 = pd.read_csv('./sales_data/2018_sales.csv')
df_2017 = pd.read_csv('./sales_data/2017_sales.csv')
df_2016 = pd.read_csv('./sales_data/2016_sales.csv')
df_2015 = pd.read_csv('./sales_data/2015_sales.csv')
df_2014 = pd.read_csv('./sales_data/2014_sales.csv')
df_2013 = pd.read_csv('./sales_data/2013_sales.csv')

#Combine the read DataFrames into one DataFrame.
df_sales_concat = pd.concat([df_2013, df_2014, df_2015,df_2016,df_2017,df_2018,df_2019], axis=0)

#Create a monthly date index for the DataFrame.
index = pd.date_range("2013-01", "2019-12-31", freq='M')
df_sales_concat.index = index

#Delete unnecessary DataFrame columns.
del df_sales_concat['month']

#Store only the actual sales data used in model building in the dataset variable.
dataset = df_sales_concat.values
dataset = dataset.astype('float32')

2. Data preprocessing (data set creation)

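Next is data preprocessing. Specifically, we will create the dataset used for model construction. The code below calls a helper function, create_dataset, which is defined in the full listing at the end of this article.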

`Forecast.py`


#Split into training data and test data at a ratio of roughly 2:1 (training:test).
train_size = int(len(dataset) * 0.67)
train, test = dataset[0:train_size, :], dataset[train_size:len(dataset), :]

#Data scaling.
#Fit a scaler on the training data only; it normalizes values to the range 0 to 1.
scaler = MinMaxScaler(feature_range=(0, 1))
scaler_train = scaler.fit(train)
train_scale = scaler_train.transform(train)
test_scale = scaler_train.transform(test)

#Creating a dataset
look_back = 1
train_X, train_Y = create_dataset(train_scale, look_back)
test_X, test_Y = create_dataset(test_scale, look_back)

#Creating an original dataset for evaluation
train_X_original, train_Y_original = create_dataset(train, look_back)
test_X_original, test_Y_original = create_dataset(test, look_back)

#Data formatting
train_X = train_X.reshape(train_X.shape[0], train_X.shape[1], 1)
test_X = test_X.reshape(test_X.shape[0], test_X.shape[1], 1)
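The role of look_back is easier to see on a tiny example. The sketch below uses plain Python lists and a hypothetical helper, make_windows, that follows the same sliding-window idea as create_dataset; the toy values are made up.

```python
def make_windows(series, look_back):
    """Pair each run of `look_back` values with the value that follows it."""
    X, y = [], []
    for i in range(look_back, len(series)):
        X.append(series[i - look_back:i])
        y.append(series[i])
    return X, y

toy = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up values
X, y = make_windows(toy, look_back=2)
print(X)  # inputs: each row holds the 2 previous values
print(y)  # targets: the value right after each row

# Recurrent layers in Keras take input shaped (samples, time steps, features).
# With a single feature, each value becomes a one-element list.
X3 = [[[v] for v in row] for row in X]
print(len(X3), len(X3[0]), len(X3[0][0]))

# Sizes for this article's data: 84 months, 67% for training, look_back = 1.
train_size = int(84 * 0.67)
print(train_size, 84 - train_size)          # months in each split
print(train_size - 1, 84 - train_size - 1)  # samples left after windowing
```

With look_back = 1, as in this article, each prediction is based only on the previous month, and each split loses one sample to windowing.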

3. Building and learning LSTM and RNN models

LSTM model construction

`Forecast.py`


lstm_model = Sequential()
lstm_model.add(LSTM(64, return_sequences=True, input_shape=(look_back, 1)))
lstm_model.add(LSTM(32))
lstm_model.add(Dense(1))
lstm_model.compile(loss='mean_squared_error', optimizer='adam')
#Training
lstm_model.fit(train_X, train_Y, epochs=100, batch_size=64, verbose=2)

RNN model construction

`Forecast.py`


rnn_model = Sequential()
rnn_model.add(SimpleRNN(64, return_sequences=True, input_shape=(look_back, 1)))
rnn_model.add(SimpleRNN(32))
rnn_model.add(Dense(1))
rnn_model.compile(loss='mean_squared_error', optimizer='adam')
#Training
rnn_model.fit(train_X, train_Y, epochs=100, batch_size=64, verbose=2)

This completes the training of the LSTM and RNN models on the dataset. The basic operations are the same for both; only the layer classes differ. Next comes the processing for plotting the results on a graph. In the code from here on, the variable model stands for either lstm_model or rnn_model.

`Forecast.py`


#Creating forecast data
train_predict = model.predict(train_X)
test_predict = model.predict(test_X)

#Undo the scaling so the normalized values become actual sales figures.
train_predict = scaler_train.inverse_transform(train_predict)
train_Y = scaler_train.inverse_transform([train_Y])
test_predict = scaler_train.inverse_transform(test_predict)
test_Y = scaler_train.inverse_transform([test_Y])


#Calculation of prediction accuracy
train_score = math.sqrt(mean_squared_error(train_Y_original, train_predict[:, 0]))
print(train_score)
print('Train Score: %.2f RMSE' % (train_score))
test_score = math.sqrt(mean_squared_error(test_Y_original, test_predict[:, 0]))
print('Test Score: %.2f RMSE' % (test_score))

#Data shaping for plots
train_predict_plot = np.empty_like(dataset)
train_predict_plot[:, :] = np.nan
train_predict_plot[look_back:len(train_predict)+look_back, :] = train_predict
train_predict_plot = pd.DataFrame({'sales':list(train_predict_plot.reshape(train_predict_plot.shape[0],))})
train_predict_plot.index = index
test_predict_plot = np.empty_like(dataset)
test_predict_plot[:, :] = np.nan
test_predict_plot[len(train_predict)+(look_back*2):len(dataset), :] = test_predict
test_predict_plot = pd.DataFrame({'sales':list(test_predict_plot.reshape(test_predict_plot.shape[0],))})
test_predict_plot.index = index

Next, we plot the actual data and the predictions on a graph.

`Forecast.py`


#Output the meta information of the graph
plt.title("monthly-sales")
plt.xlabel("time(month)")
plt.ylabel("sales")
#Plot the data
#Pass the date index explicitly so all three lines share the same x-axis
plt.plot(index, dataset, label='sales_dataset', c='green')
plt.plot(train_predict_plot, label='train_data', c='red')
plt.plot(test_predict_plot, label='test_data', c='blue')
#Adjust the y-axis scale
plt.yticks([500000, 600000, 700000, 800000, 900000, 1000000, 1100000])
#Plot the graph
plt.legend()
plt.show()

4. Result

The resulting graphs for the RNN and the LSTM are shown below.

Forecast by RNN

The output values are completely off. I changed the parameters many times, but the results hardly changed. It appears that a simple RNN is not suitable even for a series of only 84 data points like this one.

Forecast by LSTM

Compared to the RNN forecast, the LSTM manages to follow the sales trend to some extent. In particular, it captures the tendency for sales to rise in April and December. Overall, however, many points deviate significantly from the measured values. The **RMSE**, which serves as the measure of model quality, is also quite large: **Train Score: 94750.73 RMSE, Test Score: 115472.92 RMSE**. The result does not seem very good.
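To put the RMSE values in perspective, here is a minimal sketch of how RMSE is computed, followed by the test RMSE expressed as a share of average monthly sales. The three actual/predicted pairs are made up; the mean of 769297.2 yen is the overall mean from the basic statistics, not the mean of the test period, so the percentage is only a rough guide.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up monthly sales and forecasts, in yen.
actual = [700000.0, 800000.0, 900000.0]
predicted = [750000.0, 780000.0, 820000.0]
print(rmse(actual, predicted))

# Test RMSE reported above, relative to the overall mean monthly sales.
test_rmse = 115472.92
mean_sales = 769297.2
relative = round(test_rmse / mean_sales * 100, 1)
print(relative, "% of average monthly sales")
```

By this rough measure the LSTM's typical test error is about 15% of an average month's sales, which is larger than the standard deviation of the sales data itself (about 100,000 yen).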

Summary

We found that an LSTM can perform a more detailed time series analysis than an RNN, even on a dataset where the RNN failed, possibly because of vanishing gradients. However, the LSTM forecasts only follow the general sales trend, and many values deviate significantly from the actual ones. This is not a highly accurate sales forecast, and it cannot serve as the kind of revenue forecast that management decisions could be based on.

Consideration

For both RNN and LSTM, I tried various combinations of the parameters look_back, epochs, and batch_size, but performance did not improve much. The likely cause is the **small amount of data and its variability**. When I looked into it afterwards, 84 data points appears to be quite small for time series analysis. I realized that **data is the lifeblood** of building a machine learning model.

Still, I think this was good material both for reviewing my month of machine learning and for beginners. From now on, I will keep working to improve as a machine learning engineer!

All the code used this time

`Forecast.py`


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

#Creating a dataset
def create_dataset(dataset, look_back):
    data_X, data_Y = [], []
    for i in range(look_back, len(dataset)):
        data_X.append(dataset[i-look_back:i, 0])
        data_Y.append(dataset[i, 0])
    return np.array(data_X), np.array(data_Y)


df_2019 = pd.read_csv('./sales_data/2019_sales.csv')
df_2018 = pd.read_csv('./sales_data/2018_sales.csv')
df_2017 = pd.read_csv('./sales_data/2017_sales.csv')
df_2016 = pd.read_csv('./sales_data/2016_sales.csv')
df_2015 = pd.read_csv('./sales_data/2015_sales.csv')
df_2014 = pd.read_csv('./sales_data/2014_sales.csv')
df_2013 = pd.read_csv('./sales_data/2013_sales.csv')


df_sales_concat = pd.concat([df_2013, df_2014, df_2015,df_2016,df_2017,df_2018,df_2019], axis=0)

index = pd.date_range("2013-01", "2019-12-31", freq='M')
df_sales_concat.index = index
del df_sales_concat['month']

dataset = df_sales_concat.values
dataset = dataset.astype('float32')

#Divide into training data and test data
train_size = int(len(dataset) * 0.67)
train, test = dataset[0:train_size, :], dataset[train_size:len(dataset), :]

#Data scaling
scaler = MinMaxScaler(feature_range=(0, 1))
scaler_train = scaler.fit(train)
train_scale = scaler_train.transform(train)
test_scale = scaler_train.transform(test)

#Data creation
look_back = 1
train_X, train_Y = create_dataset(train_scale, look_back)
test_X, test_Y = create_dataset(test_scale, look_back)

#Creating a dataset for evaluation
train_X_original, train_Y_original = create_dataset(train, look_back)
test_X_original, test_Y_original = create_dataset(test, look_back)

#Data formatting
train_X = train_X.reshape(train_X.shape[0], train_X.shape[1], 1)
test_X = test_X.reshape(test_X.shape[0], test_X.shape[1], 1)

#Creating and training LSTM models
model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(look_back, 1)))
model.add(LSTM(32))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(train_X, train_Y, epochs=100, batch_size=64, verbose=2)

#Creating forecast data
train_predict = model.predict(train_X)
test_predict = model.predict(test_X)

#Revert the scaled data.
train_predict = scaler_train.inverse_transform(train_predict)
train_Y = scaler_train.inverse_transform([train_Y])
test_predict = scaler_train.inverse_transform(test_predict)
test_Y = scaler_train.inverse_transform([test_Y])


#Calculation of prediction accuracy
train_score = math.sqrt(mean_squared_error(train_Y_original, train_predict[:, 0]))
print(train_score)
print('Train Score: %.2f RMSE' % (train_score))
test_score = math.sqrt(mean_squared_error(test_Y_original, test_predict[:, 0]))
print('Test Score: %.2f RMSE' % (test_score))

#Data shaping for plots
train_predict_plot = np.empty_like(dataset)
train_predict_plot[:, :] = np.nan
train_predict_plot[look_back:len(train_predict)+look_back, :] = train_predict
train_predict_plot = pd.DataFrame({'sales':list(train_predict_plot.reshape(train_predict_plot.shape[0],))})
train_predict_plot.index = index

test_predict_plot = np.empty_like(dataset)
test_predict_plot[:, :] = np.nan
test_predict_plot[len(train_predict)+(look_back*2):len(dataset), :] = test_predict
test_predict_plot = pd.DataFrame({'sales':list(test_predict_plot.reshape(test_predict_plot.shape[0],))})
test_predict_plot.index = index

#Data plot
plt.title("monthly-sales")
plt.xlabel("time(month)")
plt.ylabel("sales")

#Pass the date index explicitly so all three lines share the same x-axis
plt.plot(index, dataset, label='sales_dataset', c='green')
plt.plot(train_predict_plot, label='train_data', c='red')
plt.plot(test_predict_plot, label='test_data', c='blue')

plt.yticks([500000, 600000, 700000, 800000, 900000, 1000000, 1100000])

plt.legend()
plt.show()

Forecasting Snack Sales with Deep Learning
