Following on from the previous post on an anomaly detection method (MSPC), this time I have summarized an anomaly detection method based on sparse structure learning.
Hotelling's $T^2$ method and MSPC, both well-known anomaly detection methods, are used to monitor data whose mean does not change. At manufacturing sites, however, we also deal with equipment and operation data whose values change while the relationships between variables are largely maintained. Anomaly detection by sparse structure learning is one method that focuses on those relationships between variables (correlations, etc.).
Outline of method
For details of the method, see [Anomaly Detection and Change Detection](https://www.amazon.co.jp/dp/4061529080); reading it will deepen your understanding. I also read this book, studied it, and put it into practice in my actual work.
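To summarize the idea briefly: from normal data, a sparse precision matrix (inverse covariance) $\Lambda$ is estimated with the graphical lasso, and anomalies are detected as a breakdown of the variable dependencies encoded in $\Lambda$. The standard graphical lasso objective is

$$
\hat{\Lambda} = \arg\max_{\Lambda \succ 0}\left\{\log\det\Lambda - \mathrm{tr}(S\Lambda) - \rho\,\|\Lambda\|_1\right\}
$$

where $S$ is the sample covariance matrix of the normal data and $\rho$ controls the sparsity. In the code below, $\rho$ is chosen by cross-validation via `GraphicalLassoCV`.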
This time, I used the Water Treatment Plant dataset from the UCI Machine Learning Repository as a sample. The first 100 samples are treated as normal (training) data and the remaining 279 as evaluation data, to see how far they deviate from the normal state.
The Python code is below.
#Import required libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
from scipy import stats 
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KernelDensity
from sklearn.covariance import GraphicalLassoCV
df = pd.read_csv('watertreatment_mod.csv', encoding='shift_jis', header=0, index_col=0)
#Time index used later for plotting (assumed to be the index of the CSV)
TimeIndices = np.array(df.index)
df.head()
#Sliding window size for estimating the test-side structure
window_size = 10
#Quantile used for the control limit
cr = 0.99

The input data looks like the above.
Next, we split the data into training data and evaluation data and standardize them.
#Split point
split_point = 100
#Divided into training data (train) and evaluation data (test)
train_df = df.iloc[:split_point,]
test_df = df.iloc[split_point:,]
#Standardize data
sc = StandardScaler()
sc.fit(train_df)
train_df_std = sc.transform(train_df)
test_df_std = sc.transform(test_df)
Next, we estimate the covariance and precision matrices with the graphical lasso.
#Covariance estimation
n_samples, n_features = train_df_std.shape
#Empirical covariance (for reference)
emp_cov = np.dot(train_df_std.T, train_df_std) / n_samples
#Sparse estimation; the regularization parameter is chosen by cross-validation
model = GraphicalLassoCV()
model.fit(train_df_std)
cov_ = model.covariance_
prec_ = model.precision_
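As a quick check (my own addition, not part of the original procedure), you can inspect how sparse the estimated precision matrix actually is:
#Count the (near-)zero elements of the estimated precision matrix
n_zero = np.sum(np.isclose(prec_, 0.0))
print("zero elements in precision matrix: %d / %d" % (n_zero, prec_.size))
#Visualize the sparsity pattern (non-zero entries correspond to direct dependencies between variables)
plt.figure(figsize=(6, 5))
plt.imshow(prec_ != 0, cmap="gray_r")
plt.title("Sparsity pattern of precision matrix")
plt.show()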
Next, define a function to find the KL distance.
def Calc_KL(cov_, prec_, xtest):
    """Calculate the KL distance
    Parameters
    ----------
    cov_ : np.array
        Covariance matrix of the training data
    prec_ : np.array
        Precision matrix of the training data
    xtest : np.array
        Evaluation data (one window)
    Returns
    -------
    d_ab : np.array
        KL distance (contribution) per variable
    """
    n_samples, n_features = xtest.shape
    
    d_abp=np.zeros(n_features)
    d_abm=np.zeros(n_features)
    d_ab=np.zeros(n_features)
    
    model_test = GraphicalLassoCV()
    try:
        model_test.fit(xtest)
    except FloatingPointError:
        print("floating error")
        return d_ab
    cov__test = model_test.covariance_
    prec__test = model_test.precision_  
    
    #Calculate the magnitude of correlation collapse for each variable
    for i in range(n_features):
        temp_prec_a = np.r_[prec_[i:n_features,:], prec_[0:i,:]] 
        temp_prec_a = np.c_[temp_prec_a[:,i:n_features], temp_prec_a[:,0:i]]
        temp_prec_b = np.r_[prec__test[i:n_features,:], prec__test[0:i,:]] 
        temp_prec_b = np.c_[temp_prec_b[:,i:n_features], temp_prec_b[:,0:i]]
        temp_cov_a = np.r_[cov_[i:n_features,:], cov_[0:i,:]] 
        temp_cov_a = np.c_[temp_cov_a[:,i:n_features], temp_cov_a[:,0:i]]
        temp_cov_b = np.r_[cov__test[i:n_features,:], cov__test[0:i,:]] 
        temp_cov_b = np.c_[temp_cov_b[:,i:n_features], temp_cov_b[:,0:i]]
        La = temp_prec_a[:-1, :-1]
        la = temp_prec_a[:-1, -1]
        lama = temp_prec_a[-1, -1]
        Wa = temp_cov_a[:-1, :-1]
        wa = temp_cov_a[:-1, -1]
        sigmaa = temp_cov_a[-1, -1]
        Lb = temp_prec_b[:-1, :-1]
        lb = temp_prec_b[:-1, -1]
        lamb = temp_prec_b[-1, -1]
        Wb = temp_cov_b[:-1, :-1]
        wb = temp_cov_b[:-1, -1]
        sigmab = temp_cov_b[-1, -1]
        
        d_abp[i] = np.dot(wa, la)+0.5*(np.dot(np.dot(lb, Wb), lb)-np.dot(np.dot(la, Wa), la))+0.5*(np.log(lama/lamb)+sigmaa-sigmab)
        d_abm[i] = np.dot(wb, lb)+0.5*(np.dot(np.dot(la, Wa), la)-np.dot(np.dot(lb, Wb), lb))+0.5*(np.log(lamb/lama)+sigmab-sigmaa)
        d_ab[i] = max(-d_abp[i], -d_abm[i])
        
    return d_ab
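For reference, the quantities computed in the loop above correspond to the following expression (this is just a transcription of the code into the notation of the book cited above, with subscript $a$ for the training-side model and $b$ for the test-side model):

$$
d_{ab}^{(i)} = \mathbf{w}_a^{\top}\mathbf{l}_a + \frac{1}{2}\left(\mathbf{l}_b^{\top}W_b\,\mathbf{l}_b - \mathbf{l}_a^{\top}W_a\,\mathbf{l}_a\right) + \frac{1}{2}\left(\ln\frac{\lambda_a}{\lambda_b} + \sigma_a - \sigma_b\right)
$$

$d_{ba}^{(i)}$ is obtained by swapping $a$ and $b$, and the code takes $\max\!\left(-d_{ab}^{(i)}, -d_{ba}^{(i)}\right)$ as the final contribution of variable $i$.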
Next, prepare a function that calculates the control limit by kernel density estimation.
def cl_limit(x, cr=0.99):
    """Calculate the control limit
    Parameters
    ----------
    x : np.array
        KL distance
    cr : float
        Quantile that defines the control limit
    Returns
    -------
    cl : float
        Control limit (threshold)
    """
    X = x.reshape(np.shape(x)[0],1)
    bw= (np.max(X)-np.min(X))/100
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)
    X_plot = np.linspace(np.min(X), np.max(X), 1000)[:, np.newaxis]
    log_dens = kde.score_samples(X_plot)
    prob = np.exp(log_dens) / np.sum(np.exp(log_dens))
    calprob = np.zeros(np.shape(prob)[0])
    calprob[0] = prob[0]
    for i in range(1,np.shape(prob)[0]):
        calprob[i]=calprob[i-1]+prob[i]
    
    cl = X_plot[np.min(np.where(calprob>cr))]
    
    return cl
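As a small sanity check of cl_limit (a hypothetical example of my own, not part of the original post), the returned threshold should sit near the empirical quantile for data drawn from a known distribution:
#Compare the KDE-based control limit with the empirical 99% quantile on synthetic data
sample = np.random.gamma(shape=2.0, scale=1.0, size=1000)
print(cl_limit(sample, cr=0.99))
print(np.quantile(sample, 0.99))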
Cross-validate the training data to calculate the control limits.
K = 5
cv_data_size = int(np.shape(train_df_std)[0] / K)
n_train_samples = np.shape(train_df_std)[0]
counter = 0
for i in range(K):
    
    cv_train_data=np.r_[train_df_std[0:i*cv_data_size,], train_df_std[(i+1)*cv_data_size:,]]
    if i < K-1:
        cv_test_data=train_df_std[i*cv_data_size:(i+1)*cv_data_size,]
    else:
        cv_test_data=train_df_std[i*cv_data_size:,]
    
    model_cv = GraphicalLassoCV()
    model_cv.fit(cv_train_data)
    cov__cv = model_cv.covariance_
    prec__cv = model_cv.precision_
    
    for n in range(window_size, np.shape(cv_test_data)[0]):
        count = i*cv_data_size + n
        tempX = cv_test_data[n-window_size:n,:]
        d_ab_temp = Calc_KL(cov__cv, prec__cv, tempX)      
        
        if 0 == counter:
            d_ab = d_ab_temp.reshape(1,n_features)
            TimeIndices2 = TimeIndices[count]
        else:
            d_ab = np.r_[d_ab,d_ab_temp.reshape(1,n_features)]
            TimeIndices2 = np.vstack((TimeIndices2,TimeIndices[count]))
        
        counter = counter + 1
#Number of d_ab rows obtained from cross-validation (used later to mark the train/test boundary in the plots)
split_point = np.shape(d_ab)[0]
#Drop windows where the graphical lasso fit failed (all-zero rows)
d_ab_cv = d_ab[np.sum(d_ab,axis=1)!=0,:]
cl = np.zeros([n_features])
for i in range(n_features):
    cl[i] = cl_limit(d_ab_cv[:,i],cr)
Finally, we compute the per-variable contributions for the evaluation data and visualize them together with the control limits.
#Estimate covariance for evaluation data
n_test_samples = np.shape(test_df_std)[0]
for n in range(window_size, n_test_samples):    
    tempX = test_df_std[n-window_size:n,:]
    
    d_ab_temp = Calc_KL(cov_, prec_, tempX)
    d_ab = np.r_[d_ab,d_ab_temp.reshape(1,n_features)]
    TimeIndices2 = np.vstack((TimeIndices2,TimeIndices[n+n_train_samples]))
x2 = [0, np.shape(d_ab)[0]]
x3 = [split_point, split_point]
x = range(0, np.shape(TimeIndices2)[0],20)
NewTimeIndices = np.array(TimeIndices2[x])
for i in range(n_features):
    plt.figure(figsize=(200, 3))
    plt.subplot(1, n_features, i+1)
    plt.title('%s Contribution' % (i))
    plt.plot(d_ab[:, i], marker="o")
    plt.xlabel("Time")
    plt.ylabel("Contribution") 
    plt.xticks(x,NewTimeIndices,rotation='vertical') 
    y2 = [cl[i],cl[i]]
    plt.plot(x2,y2,ls="-", color = "r")
    y3 = [0, np.nanmax(d_ab[:,i])]
    plt.plot(x3,y3,ls="--", color = "b")
plt.show()
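If you also want to flag anomalous windows programmatically rather than only reading them off the plots (my own addition), a simple way is to compare each contribution against its control limit:
#Flag time points where any variable's contribution exceeds its control limit
anomaly_flags = (d_ab > cl).any(axis=1)
print("number of flagged windows: %d / %d" % (np.sum(anomaly_flags), d_ab.shape[0]))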
Thank you for reading to the end. This time, I implemented anomaly detection by sparse structure learning.
If you find anything that should be corrected, I would appreciate it if you could let me know.