Continuing from the previous post on the anomaly detection method MSPC, this time I have summarized an anomaly detection method based on sparse structure learning.
Hotelling's $T^2$ and MSPC, both well-known anomaly detection methods, are meant for monitoring data whose mean value does not change. At a manufacturing site, however, we also deal with equipment and operation data whose values change while the relationships between variables (correlations, etc.) are maintained. For such data, one option is anomaly detection by sparse structure learning, which focuses on those relationships between variables.
Outline of method
For details of the method, reading [Anomaly Detection and Change Detection (異常検知と変化検知)](https://www.amazon.co.jp/dp/4061529080) will deepen your understanding. I also studied this book and put the method into practice in my own work.
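Roughly speaking (this is my own summary of the idea, following the book), the normal data are modeled as a sparse Gaussian graphical model: the precision matrix $\Lambda$ is estimated by the graphical lasso,

$$
\hat{\Lambda} = \arg\max_{\Lambda} \left\{ \log\det\Lambda - \mathrm{tr}(S\Lambda) - \rho\,\|\Lambda\|_1 \right\},
$$

where $S$ is the sample covariance matrix of the normal data and $\rho$ controls the sparsity. The anomaly score of each variable $x_i$ is then the KL divergence between the conditional distributions $p(x_i \mid \text{the other variables})$ under the model learned from the normal data and under the model learned from the data being monitored; the score grows when the relationships (partial correlations) around $x_i$ break down.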
This time, I used the Water Treatment Plant dataset from the UCI Machine Learning Repository as the sample. The first 100 rows are used as normal (training) data, and the remaining 279 rows are used as evaluation data to see how far they deviate from the normal state.
The Python code is below.
#Import required libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KernelDensity
from sklearn.covariance import GraphicalLassoCV
# Load the data; the index column is assumed to hold the time stamps
df = pd.read_csv('watertreatment_mod.csv', encoding='shift_jis', header=0, index_col=0)
df.head()

# Time stamps taken from the index (used later for the plots)
TimeIndices = np.array(df.index)

# Sliding-window size and coverage used for the control limit
window_size = 10
cr = 0.99
The input data looks like the above.
Next, we split the data into training data and evaluation data and standardize them.
# Split point
split_point = 100

# Split into training data (train) and evaluation data (test)
train_df = df.iloc[:split_point, :]
test_df = df.iloc[split_point:, :]
# Standardize the data (fit the scaler on the training data only)
sc = StandardScaler()
sc.fit(train_df)
train_df_std = sc.transform(train_df)
test_df_std = sc.transform(test_df)
Next, we estimate the covariance and precision (inverse covariance) matrices with the graphical lasso.
#Covariance estimation
n_samples, n_features = train_df_std.shape
emp_cov = np.dot(train_df_std.T, train_df_std) / n_samples
model = GraphicalLassoCV()
model.fit(train_df_std)
cov_ = model.covariance_
prec_ = model.precision_
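As a quick check (my own addition, not part of the original procedure), the sparsity of the estimated structure can be visualized as a partial-correlation heatmap. This is a minimal sketch that assumes the `prec_` and `n_features` variables from the step above.

# Minimal sketch: visualize the sparse structure estimated by the graphical lasso
d = np.sqrt(np.diag(prec_))
partial_corr = -prec_ / np.outer(d, d)   # partial correlations from the precision matrix
np.fill_diagonal(partial_corr, 1.0)
plt.figure(figsize=(6, 5))
plt.imshow(partial_corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='partial correlation')
plt.title('Partial correlations estimated by GraphicalLassoCV')
plt.show()
# Fraction of off-diagonal entries shrunk to (almost) zero
off_diag = prec_[~np.eye(n_features, dtype=bool)]
print('zero ratio: %.2f' % np.mean(np.isclose(off_diag, 0)))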
Next, define a function to find the KL distance.
def Calc_KL(cov_, prec_, xtest):
    """Calculate the KL distance

    Parameters
    ----------
    cov_ : np.array
        Covariance matrix of the training data
    prec_ : np.array
        Precision matrix of the training data
    xtest : np.array
        Evaluation data (one window)

    Returns
    -------
    d_ab : np.array
        KL distance per variable
    """
    n_samples, n_features = xtest.shape

    d_abp = np.zeros(n_features)
    d_abm = np.zeros(n_features)
    d_ab = np.zeros(n_features)

    # Estimate the sparse structure of the evaluation window
    model_test = GraphicalLassoCV()
    try:
        model_test.fit(xtest)
    except FloatingPointError:
        print("floating error")
        return d_ab

    cov__test = model_test.covariance_
    prec__test = model_test.precision_

    # Calculate the magnitude of correlation breakdown for each variable
    for i in range(n_features):
        # Rotate rows/columns so that variable i comes last
        temp_prec_a = np.r_[prec_[i:n_features, :], prec_[0:i, :]]
        temp_prec_a = np.c_[temp_prec_a[:, i:n_features], temp_prec_a[:, 0:i]]
        temp_prec_b = np.r_[prec__test[i:n_features, :], prec__test[0:i, :]]
        temp_prec_b = np.c_[temp_prec_b[:, i:n_features], temp_prec_b[:, 0:i]]
        temp_cov_a = np.r_[cov_[i:n_features, :], cov_[0:i, :]]
        temp_cov_a = np.c_[temp_cov_a[:, i:n_features], temp_cov_a[:, 0:i]]
        temp_cov_b = np.r_[cov__test[i:n_features, :], cov__test[0:i, :]]
        temp_cov_b = np.c_[temp_cov_b[:, i:n_features], temp_cov_b[:, 0:i]]

        # Block decomposition: training model (a) and evaluation model (b)
        La = temp_prec_a[:-1, :-1]
        la = temp_prec_a[:-1, -1]
        lama = temp_prec_a[-1, -1]
        Wa = temp_cov_a[:-1, :-1]
        wa = temp_cov_a[:-1, -1]
        sigmaa = temp_cov_a[-1, -1]

        Lb = temp_prec_b[:-1, :-1]
        lb = temp_prec_b[:-1, -1]
        lamb = temp_prec_b[-1, -1]
        Wb = temp_cov_b[:-1, :-1]
        wb = temp_cov_b[:-1, -1]
        sigmab = temp_cov_b[-1, -1]

        # Per-variable anomaly score from the two directed KL terms
        d_abp[i] = np.dot(wa, la)+0.5*(np.dot(np.dot(lb, Wb), lb)-np.dot(np.dot(la, Wa), la))+0.5*(np.log(lama/lamb)+sigmaa-sigmab)
        d_abm[i] = np.dot(wb, lb)+0.5*(np.dot(np.dot(la, Wa), la)-np.dot(np.dot(lb, Wb), lb))+0.5*(np.log(lamb/lama)+sigmab-sigmaa)
        d_ab[i] = max(-d_abp[i], -d_abm[i])

    return d_ab
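As a simple sanity check (my own addition), `Calc_KL` can be applied to a single window of the standardized training data; in the normal state the per-variable KL distances should stay small.

# Sanity check: KL distances for the first window of the training data
sample_window = train_df_std[0:window_size, :]
print(np.round(Calc_KL(cov_, prec_, sample_window), 3))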
Next, we prepare a function that calculates the control limit by kernel density estimation.
def cl_limit(x, cr=0.99):
    """Calculate the control limit

    Parameters
    ----------
    x : np.array
        KL distance
    cr : float
        Coverage of the control limit (e.g. 0.99)

    Returns
    -------
    cl : float
        Control limit (boundary point)
    """
    X = x.reshape(np.shape(x)[0], 1)
    bw = (np.max(X) - np.min(X)) / 100
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)
    X_plot = np.linspace(np.min(X), np.max(X), 1000)[:, np.newaxis]
    log_dens = kde.score_samples(X_plot)
    prob = np.exp(log_dens) / np.sum(np.exp(log_dens))

    # Accumulate the normalized density and find where it first exceeds cr
    calprob = np.zeros(np.shape(prob)[0])
    calprob[0] = prob[0]
    for i in range(1, np.shape(prob)[0]):
        calprob[i] = calprob[i-1] + prob[i]
    cl = X_plot[np.min(np.where(calprob > cr))]
    return cl
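For intuition, here is a small self-contained check of `cl_limit` (my own example, using synthetic data): for a sample of absolute standard-normal values, the 99% control limit should land near the upper tail of the sample.

# Quick check of cl_limit on synthetic data (illustration only)
rng = np.random.RandomState(0)
dummy_scores = np.abs(rng.randn(1000))
print(cl_limit(dummy_scores, cr=0.99))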
We cross-validate the training data to calculate the per-variable control limits.
K = 5
cv_data_size = int(np.shape(train_df_std)[0] / K)
n_train_samples = np.shape(train_df_std)[0]
counter = 0

for i in range(K):
    # Split into CV training data and CV evaluation data
    cv_train_data = np.r_[train_df_std[0:i*cv_data_size, ], train_df_std[(i+1)*cv_data_size:, ]]
    if i < K-1:
        cv_test_data = train_df_std[i*cv_data_size:(i+1)*cv_data_size, ]
    else:
        cv_test_data = train_df_std[i*cv_data_size:, ]

    # Estimate the sparse structure on the CV training data
    model_cv = GraphicalLassoCV()
    model_cv.fit(cv_train_data)
    cov__cv = model_cv.covariance_
    prec__cv = model_cv.precision_

    # Slide a window over the CV evaluation data and compute the KL distances
    for n in range(window_size, np.shape(cv_test_data)[0]):
        count = i*cv_data_size + n
        tempX = cv_test_data[n-window_size:n, :]
        d_ab_temp = Calc_KL(cov__cv, prec__cv, tempX)

        if 0 == counter:
            d_ab = d_ab_temp.reshape(1, n_features)
            TimeIndices2 = TimeIndices[count]
        else:
            d_ab = np.r_[d_ab, d_ab_temp.reshape(1, n_features)]
            TimeIndices2 = np.vstack((TimeIndices2, TimeIndices[count]))
        counter = counter + 1

# Boundary between the cross-validation part and the evaluation part (used for plotting)
split_point = np.shape(d_ab)[0]

# Keep only the windows for which the estimation succeeded (non-zero rows)
d_ab_cv = d_ab[np.sum(d_ab, axis=1) != 0, :]

cl = np.zeros([n_features])
for i in range(n_features):
    cl[i] = cl_limit(d_ab_cv[:, i], cr)
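Before moving on, it can be helpful to glance at the per-variable control limits obtained from the cross-validation (my own addition):

# Show the control limit computed for each variable
for i in range(n_features):
    print('variable %d: control limit = %.3f' % (i, cl[i]))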
Finally, we compute the KL distances for the evaluation data and visualize them together with the control limits.
# Compute the KL distances for the evaluation data
n_test_samples = np.shape(test_df_std)[0]

for n in range(window_size, n_test_samples):
    tempX = test_df_std[n-window_size:n, :]
    d_ab_temp = Calc_KL(cov_, prec_, tempX)
    d_ab = np.r_[d_ab, d_ab_temp.reshape(1, n_features)]
    TimeIndices2 = np.vstack((TimeIndices2, TimeIndices[n+n_train_samples]))

# Plot the contribution of each variable together with its control limit
x2 = [0, np.shape(d_ab)[0]]
x3 = [split_point, split_point]
x = range(0, np.shape(TimeIndices2)[0], 20)
NewTimeIndices = np.array(TimeIndices2[x])

plt.figure(figsize=(200, 3))
for i in range(n_features):
    plt.subplot(1, n_features, i+1)
    plt.title('%s Contribution' % (i))
    plt.plot(d_ab[:, i], marker="o")
    plt.xlabel("Time")
    plt.ylabel("Contribution")
    plt.xticks(x, NewTimeIndices, rotation='vertical')
    y2 = [cl[i], cl[i]]
    plt.plot(x2, y2, ls="-", color="r")    # control limit
    y3 = [0, np.nanmax(d_ab[:, i])]
    plt.plot(x3, y3, ls="--", color="b")   # boundary between CV and evaluation data
plt.show()
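As an optional final step (my own sketch, not part of the original article), the contributions can be compared against the control limits to list which variables exceed their limits in the evaluation period:

# Flag evaluation windows in which each variable exceeds its control limit
exceed = d_ab > cl                               # broadcasts (n_windows, n_features) against (n_features,)
for i in range(n_features):
    n_exceed = np.sum(exceed[split_point:, i])   # count only the evaluation part
    if n_exceed > 0:
        print('variable %d exceeds its control limit in %d windows' % (i, n_exceed))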
Thank you for reading to the end. This time, I implemented anomaly detection by sparse structure learning.
If you notice anything that should be corrected, I would appreciate it if you could let me know.