Anomaly detection of time series data by LSTM (Keras)

Overview

This article performs anomaly detection on time series data using an LSTM.

The implementation uses Keras and follows the ReNom tutorial in Reference [1].

The general flow is as follows.

Step 1: Using normal data, train a model that predicts future values from past values.

Step 2: Using the trained model, make predictions on test data and collect the error vectors. Then fit a normal distribution to the error vectors.

Step 3: Compute the error vectors for the segment where an anomaly appears to have occurred; if an error vector falls in the tail of the normal distribution estimated in Step 2, judge that an anomaly has occurred.

Import required libraries

%matplotlib inline
import keras
import matplotlib.pyplot as plt
import pandas as pd 
import numpy as np 
from sklearn.preprocessing import StandardScaler

Data confirmation and preprocessing

We use the electrocardiogram data known as the qtdb/sel102 ECG dataset.

The data can be obtained at:

http://www.cs.ucr.edu/~eamonn/discords/qtdbsel102.txt

df = pd.read_csv('qtdbsel102.txt', header=None, delimiter='\t')

print(df.shape)
(45000, 3)

This time, we will focus on the data in the third column to detect anomalies.

ecg = df.iloc[:,2].values
ecg = ecg.reshape(len(ecg), -1)
print("length of ECG data:", len(ecg))
length of ECG data: 45000

As preprocessing, we standardize the data to zero mean and unit variance.

scaler = StandardScaler()
std_ecg = scaler.fit_transform(ecg)
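
As a quick check (not part of the original article), we can confirm that the standardized series now has roughly zero mean and unit variance:

print(std_ecg.mean(), std_ecg.std())  # should print approximately 0.0 and 1.0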

Checking the overall trend

Next, we plot the data to check its overall behavior.

plt.style.use('ggplot')
plt.figure(figsize=(15,5))
plt.xlabel('time')
plt.ylabel('ECG\'s value')
plt.plot(np.arange(45000), std_ecg[:45000], color='b', label='ECG')
plt.legend()

plt.show()

[Figure: the full ECG series (45,000 steps)]

The data is periodic overall, but you can see that the waveform breaks down at one point shortly before step 5000.

So let's zoom in on the graph from step 0 to step 5000.

Checking the anomalous segment

plt.style.use('ggplot')
plt.figure(figsize=(15,5))
plt.xlabel('time')
plt.ylabel('ECG\'s value')
plt.plot(np.arange(5000), std_ecg[:5000], color='b', label='ECG')
plt.ylim(-3, 3)
x = np.arange(4200,4400)
y1 = [-3]*len(x)
y2 = [3]*len(x)
plt.fill_between(x, y1, y2, facecolor='g', alpha=.3)
plt.legend()

plt.show()

[Figure: ECG steps 0-5000 with the region from step 4200 to 4400 shaded in green]

You can see that the periodicity breaks down in the region shaded in green.

The specific goal this time is as follows.

- **Detect the green segment by judging it as anomalous.**

Extraction of normal data

To train only on normal data, we extract the data from step 5000 onward.

normal_cycle = std_ecg[5000:]

Generator for data partitioning

We prepare the following generator function to split the data into training, validation, and test sets.

Here, the next 3 values are predicted from the previous 10 values.

Note: delay and step are not really used this time; they mean "predict the data delay steps ahead" and "sample the data every step steps", respectively.

def generator(data, lookback, delay, pred_length, min_index, max_index, shuffle=False,
              batch_size=100, step=1):
    # Yields batches of (samples, targets):
    #   samples[j]: the `lookback` values at indices [row - lookback, row), taken every `step`
    #   targets[j]: the `pred_length` values at indices [row + delay, row + delay + pred_length)
    if max_index is None:
        max_index = len(data) - delay - pred_length - 1
    i = min_index + lookback

    while 1:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index,
                                     size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows), pred_length))

        for j, row in enumerate(rows):
            indices = range(row - lookback, row, step)
            samples[j] = data[indices]
            targets[j] = data[row + delay : row + delay + pred_length].flatten()

        yield samples, targets
lookback = 10
pred_length = 3
step = 1
delay = 1
batch_size = 100

#Training generator
train_gen = generator(normal_cycle, 
                     lookback=lookback,
                     pred_length=pred_length,
                     delay=delay,
                     min_index=0,
                     max_index=20000,
                     shuffle=True,
                     step=step,
                     batch_size=batch_size)

val_gen = generator(normal_cycle, 
                   lookback=lookback,
                   pred_length=pred_length,
                   delay=delay,
                   min_index=20001,
                   max_index=30000,
                   step=step,
                   batch_size=batch_size)


# Number of steps to draw from val_gen in order to cover the entire validation set
val_steps = (30001 - 20001 - lookback) // batch_size

# The remaining data is used as test data
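
As a quick sanity check (not in the original article), we can draw one batch from the training generator and confirm the shapes: each sample is a window of 10 past values and each target holds the next 3 values.

sample_batch, target_batch = next(train_gen)
print(sample_batch.shape)  # (100, 10, 1): batch_size x lookback x num_features
print(target_batch.shape)  # (100, 3): batch_size x pred_length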

Training

Here we use a model with two stacked LSTM layers.

LSTM is a type of neural network well suited to time series data.

Only the last 5 epochs of the training log are shown.

import keras
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop

model = Sequential()
model.add(layers.LSTM(35, return_sequences = True, input_shape=(None,normal_cycle.shape[-1])))
model.add(layers.LSTM(35))
model.add(layers.Dense(pred_length))
          
model.compile(optimizer=RMSprop(), loss="mse")
history = model.fit_generator(train_gen,
                              steps_per_epoch=200,
                              epochs=60,
                              validation_data=val_gen,
                              validation_steps=val_steps)
Epoch 56/60
200/200 [==============================] - 3s 17ms/step - loss: 0.0037 - val_loss: 0.0043
Epoch 57/60
200/200 [==============================] - 3s 17ms/step - loss: 0.0039 - val_loss: 0.0045
Epoch 58/60
200/200 [==============================] - 3s 17ms/step - loss: 0.0036 - val_loss: 0.0047
Epoch 59/60
200/200 [==============================] - 3s 17ms/step - loss: 0.0036 - val_loss: 0.0043
Epoch 60/60
200/200 [==============================] - 3s 17ms/step - loss: 0.0036 - val_loss: 0.0038

The training and validation loss curves are shown below.

loss = history.history["loss"]
val_loss = history.history["val_loss"]

epochs = range(len(loss))

plt.figure(figsize=(10,5))

plt.plot(epochs, loss, "b", label="Training loss")
plt.plot(epochs, val_loss, "r", label="Validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.title("Training and validation loss")
plt.legend(fontsize=20)

plt.show()

[Figure: training and validation loss per epoch]

Evaluation of the model with test data

Prepare the predictions and the corresponding target values on the test data.

test_gen_pred = generator(normal_cycle, 
                    lookback=lookback,
                    pred_length=pred_length,
                    delay=delay,
                    min_index=30001,
                    max_index=None,
                    step=step,
                    batch_size=batch_size)

test_steps = (len(normal_cycle) - 30001 - lookback) // batch_size

test_pred = model.predict_generator(test_gen_pred, steps=test_steps)
test_gen_target = generator(normal_cycle, 
                    lookback=lookback,
                    pred_length=pred_length,
                    delay=delay,
                    min_index=30001,
                    max_index=None,
                    step=step,
                    batch_size=batch_size)

test_target = np.zeros((test_steps * batch_size , pred_length))

for i in range(test_steps):
    test_target[i*batch_size:(i+1)*batch_size] = next(test_gen_target)[1]

The MSE is evaluated below; it confirms that the model fits reasonably well.

from sklearn.metrics import mean_squared_error

print(mean_squared_error(test_pred, test_target))
0.0032039127006786993

Just in case, let's also check the plots.

plt.figure(figsize=(15,5))


plt.plot(range(len(test_pred[0:2000,0])), test_pred[0:2000,0], "r", label="Prediction")
plt.plot(range(len(test_target[0:2000,0])), test_target[0:2000,0], "b", label="Target")
plt.xlabel("time")
plt.ylabel('ECG\'s value')
plt.legend(fontsize=10)

plt.show()

[Figure: prediction vs. target on the first 2,000 test steps]

plt.figure(figsize=(15,5))

plt.plot(range(len(test_pred[:,0])), test_pred[:,0], "r", label="Prediction")
plt.plot(range(len(test_target[:,0])), test_target[:,0], "b", label="Target")
plt.xlabel("time")
plt.ylabel('ECG\'s value')
plt.legend(fontsize=10)

plt.show()

[Figure: prediction vs. target over the full test range]

Normal distribution fitting

Assume that the error $e$ between the prediction and the target follows a normal distribution. Since we predict three values, $e$ is a three-dimensional vector, so we need to consider a three-dimensional normal distribution.

\begin{align*}
    p(e) = N(e| \mu, \Sigma)
\end{align*}

Here $\mu$ is the mean vector and $\Sigma$ is the covariance matrix. In general, a multivariate normal distribution is determined by its mean vector and covariance matrix.

We estimate these parameters from the sampled data. The errors on the test data are available for roughly 10,000 steps:

\begin{align*}
    \{ e_1, e_2, ..., e_{10000} \}
\end{align*}

where each element is

\begin{align*}
    e_i = (e_i ^{(1)}, e_i ^{(2)}, e_i ^{(3)} ).
\end{align*}

Here we use maximum likelihood for parameter estimation. The maximum likelihood estimators for the normal distribution are as follows.

\begin{align*}
    & \hat{\mu} = \frac{1}{N}\sum_{n=1}^{N}e_n \\
    & \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}(e_n-\hat{\mu})(e_n-\hat{\mu})^{\top}
\end{align*}

Note that the maximum likelihood estimator of the covariance matrix is the sample covariance matrix (dividing by $N$), not the unbiased covariance matrix (dividing by $N-1$).

We now compute these estimates from the data.

error =  test_pred - test_target
mean = np.mean(error, axis=0)
print(mean)
cov = np.cov(error, rowvar=False, bias=True)
print(cov)
[0.00562191 0.00312449 0.00928306]
[[0.00149367 0.0016201  0.0013881 ]
 [0.0016201  0.00302257 0.00310156]
 [0.0013881  0.00310156 0.00496796]]

rowvar=False makes np.cov treat each column as a variable (so the sum runs over the rows). bias=True yields the sample (maximum likelihood) covariance matrix; the default, bias=False, yields the unbiased covariance matrix.
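
As a small sanity check (not in the original article), the maximum likelihood formula above can be verified directly against np.cov:

# Compute the ML covariance estimate by hand and compare with np.cov(..., bias=True)
centered = error - error.mean(axis=0)
cov_ml = centered.T @ centered / len(error)
print(np.allclose(cov_ml, cov))  # expected: True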

Mahalanobis distance

The Mahalanobis distance is defined below using the previous estimates.

\begin{align*}
    a(e) = (e-\hat{\mu})^\top \hat{\Sigma}^{-1}(e-\hat{\mu})
\end{align*}

This may be hard to grasp in multiple dimensions, but in one dimension it reduces to the following, where $\hat{\sigma}^2$ is the sample variance.

\begin{align*}
    a(e) = \frac{(e-\hat{\mu})^2}{\hat{\sigma}^2}
\end{align*}

In other words, $a(e)$ measures how far a sample lies from the mean. Under a normal distribution, a large value means a low probability of occurrence, that is, it suggests that an anomalous event has occurred. This allows outliers to be detected.

def Mahalanobis_dist(x, mean, cov):
    # (x - mean)^T  Sigma^{-1}  (x - mean)
    d = np.dot(x - mean, np.linalg.inv(cov))
    d = np.dot(d, (x - mean).T)
    return d

Outlier detection

detection_gen_pred = generator(std_ecg, 
                   lookback=lookback,
                   pred_length=pred_length,
                   delay=delay,
                   min_index=0,
                   max_index=5000,
                   step=step,
                   batch_size=batch_size)

detection_steps = (5000 -lookback) // batch_size

detection_pred = model.predict_generator(detection_gen_pred, steps=detection_steps)
detection_gen_target = generator(std_ecg, 
                   lookback=lookback,
                   pred_length=pred_length,
                   delay=delay,
                   min_index=0,
                   max_index=5000,
                   step=step,
                   batch_size=batch_size)

detection_target = np.zeros((detection_steps * batch_size , pred_length))

for i in range(detection_steps):
    detection_target[i*batch_size:(i+1)*batch_size] = next(detection_gen_target)[1]

Prepare the error vector and find the Mahalanobis distance.

error_detection = detection_pred - detection_target 

m_dist = []

for e in error_detection:
    m_dist.append(Mahalanobis_dist(e, mean, cov))

The result is as follows.

fig, axes = plt.subplots(nrows=2, figsize=(15,10))

axes[0].plot(std_ecg[:5000],color='b',label='original data')
axes[0].set_xlabel('time')
axes[0].set_ylabel('ECG\'s value' )
axes[0].set_xlim(-250, 5250)
axes[0].set_ylim(-3, 3)
x = np.arange(4200,4400)
y1 = [-3]*len(x)
y2 = [3]*len(x)
axes[0].fill_between(x, y1, y2, facecolor='g', alpha=.3)

axes[1].plot(m_dist, color='r',label='Mahalanobis Distance')
axes[1].set_xlabel('time')
axes[1].set_ylabel('Mahalanobis Distance')
axes[1].set_xlim(-250, 5250)
axes[1].set_ylim(0, 1000)
y1 = [0]*len(x)
y2 = [1000]*len(x)
axes[1].fill_between(x, y1, y2, facecolor='g', alpha=.3)

plt.legend(fontsize=15)
plt.show()

[Figure: original ECG (top) and Mahalanobis distance (bottom) for steps 0-5000, anomalous region shaded in green]

You can see that the Mahalanobis distance increases in the green region. In other words, the anomaly can be detected, at least visually.

Threshold setting 1

Above, anomalies were detected by visual judgment, but to detect them automatically we need to set a threshold.

The following facts are known about the Mahalanobis distance.

Suppose that $N$ independent samples $\{ x_i \mid i = 1, 2, ..., N \}$ are drawn from an $M$-dimensional normal distribution $N(x \mid \mu, \Sigma)$, and the sample mean $\hat{\mu}$ and the sample covariance matrix $\hat{\Sigma}$ are computed from them. Then, for a new observation $x^{\prime}$, the following holds.

  1. The statistic $T^2$ defined by $T^2 = \frac{N-M}{(N+1)M} a(x^{\prime})$ follows an F distribution with $(M, N-M)$ degrees of freedom.
  2. When $N \gg M$, $a(x^{\prime})$ approximately follows a chi-square distribution with $M$ degrees of freedom.

See reference [2] for more information.

We take this fact at face value and use it to set the threshold. In this case $M = 3$ and $N = 10000$, so it is safe to treat this as $N \gg M$.
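
As a quick numerical check (a sketch, not part of the original article), we can compare the exact upper 1% threshold implied by fact 1 with the chi-square approximation of fact 2 for $M = 3$, $N = 10000$; the two values should nearly coincide.

from scipy import stats

M, N, alpha = 3, 10000, 0.01

# Fact 1: T^2 = (N - M) / ((N + 1) * M) * a(x') follows F(M, N - M)
f_99 = stats.f.ppf(1 - alpha, M, N - M)
a_99_exact = f_99 * (N + 1) * M / (N - M)

# Fact 2: a(x') approximately follows chi2(M) when N >> M
a_99_approx = stats.chi2.ppf(1 - alpha, M)

print(a_99_exact, a_99_approx)  # both are approximately 11.3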

Hereafter, it is assumed that $a(e)$ follows a chi-square distribution with three degrees of freedom. First, let's look at the chi-square distribution with 3 degrees of freedom.

from scipy import stats
x = np.linspace(0, 50, 2000)

plt.figure(figsize=(15,5))

plt.style.use('ggplot')

plt.plot(x, stats.chi2.pdf(x, 3), "b", label="chi2, 3")
plt.legend(fontsize=15)

plt.show()

[Figure: chi-square probability density with 3 degrees of freedom]

Then, the upper 1% point is calculated.

a_99 = stats.chi2.ppf(0.99, 3)
print(a_99)
11.344866730144373

We adopt this as the threshold: when the Mahalanobis distance exceeds it, the point is judged to be anomalous.
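
For reference (a sketch, not in the original article), the steps flagged as anomalous under this threshold can also be listed directly:

m_dist_arr = np.array(m_dist)
anomaly_idx = np.where(m_dist_arr > a_99)[0]
# Note: index j of m_dist corresponds roughly to step j + lookback of std_ecg
print(anomaly_idx)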

plt.figure(figsize=(15,5))

plt.style.use('ggplot')

plt.plot(m_dist, color='r',label='Mahalanobis Distance')
plt.xlabel('time')
plt.ylabel('Mahalanobis Distance')
plt.xlim(-250, 5250)
plt.ylim(0, 300)

plt.plot(np.linspace(-250, 5250), [a_99] * len(np.linspace(-250, 5250)), 'b')

plt.legend(fontsize=15)

plt.show()

[Figure: Mahalanobis distance with the chi-square threshold shown as a horizontal line]

The result is somewhat underwhelming. It roughly works, but this fixed threshold lacks flexibility.

Threshold setting 2

So let's try another method: finding the optimal threshold from the data itself.

This assumes that the data at hand can be labeled as normal or anomalous, or has already been labeled.

The following content is based on reference [3].

As preparation, we label the data. Here, steps 0 to 5000 are targeted, and steps 4200 to 4400 are labeled as anomalous.

m_dist_th = np.zeros(4910)
m_dist_th[10:4910] = m_dist
TF_label = np.zeros(4910)
TF_label[4200:4400] = 1

Given a threshold, the following quantities are obtained, where each is the number of data points satisfying the corresponding condition.

\begin{align*}
    & TP = (\, Label:positive, \quad Prediction:positive \, )\\
    & FP = (\, Label:negative, \quad Prediction:positive \, )\\
    & FN = (\, Label:positive, \quad Prediction:negative \, )\\
    & TN = (\, Label:negative, \quad Prediction:negative \, )
\end{align*}

In addition, the following indicators will be introduced.

\begin{align*}
    & precision = \frac{TP}{TP + FP} \\
    & recall = \frac{TP}{TP + FN} \\
    & F_{\beta} = (1+\beta^2) \frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall}
\end{align*}

Find the threshold that maximizes $F_{\beta}$. First, define a function that returns $F_{\beta}$.

def f_score(th, TF_label, m_dist_th, beta):
    # Predict "anomalous" (1) where the Mahalanobis distance exceeds the threshold
    TF_pred = np.where(m_dist_th > th, 1, 0)
    TF_pred = 2 * TF_pred

    # label + 2*prediction encodes the four outcomes as 0 (TN), 1 (FN), 2 (FP), 3 (TP)
    PN = TF_label + TF_pred

    TN = np.count_nonzero(PN == 0)
    FN = np.count_nonzero(PN == 1)
    FP = np.count_nonzero(PN == 2)
    TP = np.count_nonzero(PN == 3)

    precision = TP / (TP + FP)
    recall = TP / (TP + FN)

    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

Hereafter we use $\beta = 0.1$. The plot below shows $F_{0.1}$ for thresholds from 0 to 1000.

th_line = np.linspace(0, 1000, 5000)

dx = 0.2 

f_score_graph = []

for i in range(5000):
    f_score_graph.append(f_score(dx * i, TF_label, m_dist_th, 0.1))

plt.figure(figsize=(15,5))
plt.style.use('ggplot')
plt.plot(th_line, f_score_graph, color='g', label = "f_score")
plt.legend(fontsize=15)
plt.show()

[Figure: F_0.1 score as a function of the threshold]

Now find the threshold that maximizes $F_{0.1}$.

from scipy import optimize

def minus_f_score(th, TF_label, m_dist_th, beta):
    return -f_score(th, TF_label, m_dist_th, beta)

ave_m_dist_th = np.average(m_dist_th)

opt_th = optimize.fmin_powell(minus_f_score, ave_m_dist_th, (TF_label, m_dist_th, 0.1))
print(opt_th)
Optimization terminated successfully.
         Current function value: -0.875333
         Iterations: 2
         Function evaluations: 86
309.19475611029304
plt.figure(figsize=(15,5))

plt.style.use('ggplot')

plt.plot(m_dist_th, color='r',label='Mahalanobis Distance')
plt.xlabel('time')
plt.ylabel('Mahalanobis Distance')
plt.xlim(-250, 5250)
plt.ylim(0, 1000)

plt.plot(np.linspace(-250, 5250), [opt_th] * len(np.linspace(-250, 5250)), 'b')

plt.legend(fontsize=15)

plt.show()

[Figure: Mahalanobis distance with the optimized threshold shown as a horizontal line]

This looks much better.
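
As a final comparison (a sketch using the variables defined above), the $F_{0.1}$ score of the optimized threshold can be set against that of the fixed chi-square threshold from the previous section:

print(f_score(a_99, TF_label, m_dist_th, 0.1))    # fixed upper-1% chi-square threshold
print(f_score(opt_th, TF_label, m_dist_th, 0.1))  # optimized threshold, about 0.875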

References

[1] ReNom Tutorial, "Anomaly detection of time series data by LSTM", https://www.renom.jp/ja/notebooks/tutorial/time_series/lstm-anomalydetection/notebook.html.

[2] Takeshi Ide, Introduction to Anomaly Detection by Machine Learning: A Practical Guide with R, Corona Publishing Co., Ltd., 2015.

[3] P. Malhotra, L. Vig, G. Shroff and P. Agarwal, "Long Short Term Memory Networks for Anomaly Detection in Time Series", Proceedings of ESANN, Presses universitaires de Louvain, 2015.
