As a continuation of the previous theory article, this time I will actually implement the Mahalanobis distance in Python. For the theory, see the previous post: Anomaly detection by unsupervised learning: Mahalanobis distance (theory).
Python's scikit-learn library provides a function that computes the Mahalanobis distance: Robust covariance estimation and Mahalanobis distances relevance.
However, an article that just says "run this code!" would amount to little more than a translation of the documentation. So this time I will implement it myself and trace how the calculation actually proceeds in the program. The goal is that, by the end, you can connect all the code snippets and compute the distance yourself.
Also, assume the file handled in this experiment has the following format, where each column is a machine number and each row is a machine score. (Some may object that it should be the other way around, but of course the transposed orientation works just as well.)
test.csv
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| a | 0 | 6 | 7 | 2 | 3 | 3 | 1 | 0 | 0 | 1 |
| b | 1 | 1 | 11 | 6 | 0 | 2 | 1 | 4 | 1 | 2 |
| c | 2 | 12 | 32 | 5 | 0 | 1 | 3 | 4 | 1 | 1 |
| d | 3 | 3 | 7 | 3 | 2 | 2 | 2 | 1 | 2 | 5 |
| e | 4 | 6 | 6 | 3 | 5 | 1 | 1 | 1 | 1 | 3 |
| f | 5 | 7 | 9 | 5 | 0 | 2 | 2 | 1 | 1 | 2 |
In this data, only machine number 3 (the third column) has conspicuously high scores.
With the Mahalanobis distance, you can set a threshold yourself and judge any record whose value exceeds it as an anomaly. Note that you have to choose the threshold yourself each time. This time, rather than setting a threshold, we will visualize the anomalous data by plotting it as a graph.
The modules to import are as follows. Installation steps are omitted this time; in general, pip or easy_install is the way to install them.
# -*- coding: utf-8 -*-
import numpy as np
import scipy as sc
from scipy import linalg
from scipy import spatial
import scipy.spatial.distance
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager
import pylab
This time there are 6 rows (ROW) and 10 columns (COLUMN), so define them as follows. We also read in the test.csv shown earlier.
ROW = 6
COLUMN = 10
csv = pd.read_csv('test.csv')
At the same time, the variables to be handled in the future are also defined here first.
# row: rows, column: columns, ave: mean, vcm: variance-covariance matrices
row = []
column = []
ave = [0.0 for i in range(ROW)]
vcm = np.zeros((COLUMN, ROW, ROW))
diff = np.zeros((1, ROW))
mahal = np.zeros(COLUMN)
tmp = np.zeros(ROW)
Real-world data is rarely as fully populated as this test.csv. When values are missing, I drop them as follows.
# Drop missing data
# axis=1 drops columns that contain missing values
trans_data = csv.dropna(axis=1)
print(trans_data)
If you print this,
Unnamed: 0 1 2 3 4 5 6 7 8 9 10
0 a 0 6 7 2 3 3 1 0 0 1
1 b 1 1 11 6 0 2 1 4 1 2
2 c 2 12 32 5 0 1 3 4 1 1
3 d 3 3 7 3 2 2 2 1 2 5
4 e 4 6 6 3 5 1 1 1 1 3
5 f 5 7 9 5 0 2 2 1 1 2
It will be output like this.
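Since test.csv happens to be fully populated, dropna leaves it untouched here. As a minimal sketch of what dropna(axis=1) does when a value really is missing (the small DataFrame below is hypothetical):

# A hypothetical frame with a missing value in column 'y'
df = pd.DataFrame({'x': [1, 2, 3],
                   'y': [4, np.nan, 6],
                   'z': [7, 8, 9]})
print(df.dropna(axis=1))  # column 'y' is dropped entirely
#    x  z
# 0  1  7
# 1  2  8
# 2  3  9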
Next, the formatted data is converted to list form and stored in the row and column lists declared earlier.
# Append each row of trans_data to `row` as a list
# (.ix is deprecated, so .iloc is used here)
for i in range(ROW):
    row.append(list(trans_data.iloc[i]))
print(row)

# Append the columns in the same way
for i in range(1, COLUMN+1):
    column.append(list(trans_data.iloc[:, i]))
print(column)
#row calculation result
[['a', 0, 6, 7, 2, 3, 3, 1, 0, 0, 1], ['b', 1, 1, 11, 6, 0, 2, 1, 4, 1, 2], ['c', 2, 12, 32, 5, 0, 1, 3, 4, 1, 1], ['d', 3, 3, 7, 3, 2, 2, 2, 1, 2, 5], ['e', 4, 6, 6, 3, 5, 1, 1, 1, 1, 3], ['f', 5, 7, 9, 5, 0, 2, 2, 1, 1, 2]]
#column calculation result
[[0, 1, 2, 3, 4, 5], [6, 1, 12, 3, 6, 7], [7, 11, 32, 7, 6, 9], [2, 6, 5, 3, 3, 5], [3, 0, 0, 2, 5, 0], [3, 2, 1, 2, 1, 2], [1, 1, 3, 2, 1, 2], [0, 4, 4, 1, 1, 1], [0, 1, 1, 2, 1, 1], [1, 2, 1, 5, 3, 2]]
You can see that the data is stored row by row (and column by column) as nested lists.
I explained that the mean vector can be expressed by the following formula.
\mu = \frac{1}{m} \sum_{i=1}^m x_i
If you actually write this in python, it will be as follows.
# Calculate the mean of each row
for i in range(ROW):
    # row[i][1:] slices off the label in the first element (a technique called slicing)
    ave[i] = np.average(row[i][1:len(row[i])])
print(ave)
[2.2999999999999998, 2.8999999999999999, 6.0999999999999996, 3.0, 3.1000000000000001, 3.3999999999999999]
One value is displayed per row: the mean of row a, the mean of row b, and so on.
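As an aside, the same per-row means can also be obtained in a single vectorized call, assuming trans_data is the DataFrame read above:

# Row-wise means over the numeric columns, skipping the label column
print(trans_data.iloc[:, 1:].mean(axis=1).values)
# [ 2.3  2.9  6.1  3.   3.1  3.4]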
As mentioned in the previous article, the covariance matrix can be calculated as

\Sigma = \frac{1}{m}\sum_{i=1}^m (x_i - \mu)(x_i - \mu)^{\mathrm{T}}
# Convert the lists with np.array() so that NumPy methods can be used
column = np.array([column])
ave = np.array(ave)

# Find the covariance matrix for each column
# np.swapaxes() swaps two axes; here it acts as a transpose
for i in range(COLUMN):
    diff = (column[0][i] - ave)
    diff = np.array([diff])
    vcm[i] = (diff * np.swapaxes(diff, 0, 1)) / COLUMN
print(vcm)
The diff variable is the difference term, (x_i - \mu), in the formula. This difference is computed for each column, yielding one matrix per column.
# vcm matrix for column 1
[[[ 5.29000000e-01 4.37000000e-01 9.43000000e-01 -0.00000000e+00
-2.07000000e-01 -3.68000000e-01]
[ 4.37000000e-01 3.61000000e-01 7.79000000e-01 -0.00000000e+00
-1.71000000e-01 -3.04000000e-01]
[ 9.43000000e-01 7.79000000e-01 1.68100000e+00 -0.00000000e+00
-3.69000000e-01 -6.56000000e-01]
[ -0.00000000e+00 -0.00000000e+00 -0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ -2.07000000e-01 -1.71000000e-01 -3.69000000e-01 0.00000000e+00
8.10000000e-02 1.44000000e-01]
[ -3.68000000e-01 -3.04000000e-01 -6.56000000e-01 0.00000000e+00
1.44000000e-01 2.56000000e-01]]
# vcm matrix for column 2
[[ 1.36900000e+00 -7.03000000e-01 2.18300000e+00 0.00000000e+00
1.07300000e+00 1.33200000e+00]
[ -7.03000000e-01 3.61000000e-01 -1.12100000e+00 -0.00000000e+00
-5.51000000e-01 -6.84000000e-01]
[ 2.18300000e+00 -1.12100000e+00 3.48100000e+00 0.00000000e+00
1.71100000e+00 2.12400000e+00]
[ 0.00000000e+00 -0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 1.07300000e+00 -5.51000000e-01 1.71100000e+00 0.00000000e+00
8.41000000e-01 1.04400000e+00]
[ 1.33200000e+00 -6.84000000e-01 2.12400000e+00 0.00000000e+00
1.04400000e+00 1.29600000e+00]]
…
# vcm matrix for column 10
[[ 1.69000000e-01 1.17000000e-01 6.63000000e-01 -2.60000000e-01
1.30000000e-02 1.82000000e-01]
[ 1.17000000e-01 8.10000000e-02 4.59000000e-01 -1.80000000e-01
9.00000000e-03 1.26000000e-01]
[ 6.63000000e-01 4.59000000e-01 2.60100000e+00 -1.02000000e+00
5.10000000e-02 7.14000000e-01]
[ -2.60000000e-01 -1.80000000e-01 -1.02000000e+00 4.00000000e-01
-2.00000000e-02 -2.80000000e-01]
[ 1.30000000e-02 9.00000000e-03 5.10000000e-02 -2.00000000e-02
1.00000000e-03 1.40000000e-02]
[ 1.82000000e-01 1.26000000e-01 7.14000000e-01 -2.80000000e-01
1.40000000e-02 1.96000000e-01]]]
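Note that each vcm[i] above is a rank-one outer product built from a single column, following this article's hand-rolled approach. For comparison, the standard sample covariance matrix of the six rows (treating each machine as one observation) could be computed with np.cov; a sketch, keeping in mind that np.cov normalizes by m-1 by default rather than the 1/m in the formula above:

# Standard 6x6 sample covariance: rows a-f as variables,
# the 10 machines as observations (column[0] has shape (10, 6))
cov = np.cov(column[0].T)
print(cov.shape)  # (6, 6)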
The Mahalanobis distance is given by the following formula; a record is judged anomalous when its distance exceeds a threshold \theta:

\theta < \sqrt{(x_i - \mu)^{\mathrm{T}} \Sigma^{-1} (x_i - \mu)}
# Find the Mahalanobis distance
for i in range(COLUMN):
    # Compute the pseudo-inverse and transpose it for convenience
    # in the calculation
    vcm[i] = sc.linalg.pinv(vcm[i])
    vcm[i] = vcm[i].transpose()
    # NOTE: this line overwrites the pseudo-inverse with the identity
    # matrix, so the quantity below is effectively a Euclidean distance
    vcm[i] = np.identity(ROW)
    # Generate the difference vector
    diff = (column[0][i] - ave)
    for j in range(ROW):
        tmp[j] = np.dot(diff, vcm[i][j])
    # Squared distance; the square root is taken at the plotting step
    mahal[i] = np.dot(tmp, diff)
#Mahalanobis distance
[ 5.39258751 8.5720476 28.53559181 3.67151195 7.89176786
5.88897275 4.72016949 5.00799361 6.7882251 5.8719673 ]
Now you can calculate the distance for the records in each column!
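As a cross-check, scipy.spatial.distance (imported earlier) has a built-in mahalanobis(u, v, VI) that computes exactly this quantity given an inverse covariance matrix VI. A minimal sketch using the identity matrix, which is what the vcm[i] = np.identity(ROW) line above effectively leaves in place:

# Distance of column 3 from the mean, with VI = identity
d3 = scipy.spatial.distance.mahalanobis(column[0][2], ave, np.identity(ROW))
print(d3)  # approximately 28.5356, matching the third value in the output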
Let's plot the Mahalanobis distance obtained in this way on a graph.
mahal = np.sqrt(mahal)
print("Mahalanobis distance")
print(mahal)
plt.bar(range(COLUMN), mahal)
plt.title("")
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("plot1.png")
#Mahalanobis distance
[ 5.39258751 8.5720476 28.53559181 3.67151195 7.89176786
5.88897275 4.72016949 5.00799361 6.7882251 5.8719673 ]
Plotting shows that the record in the third column stands far apart from the other elements. Now you can see at a glance that this data is anomalous.
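If you did want the threshold-based judgment mentioned at the beginning, a minimal sketch might look like the following; the value 10.0 is an arbitrary choice for illustration and would have to be tuned per dataset:

THRESHOLD = 10.0  # arbitrary illustrative value
for i, d in enumerate(mahal):
    if d > THRESHOLD:
        print("machine %d looks anomalous (distance %.2f)" % (i + 1, d))
# machine 3 looks anomalous (distance 28.54)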
How was it? This time, building on the theory article, I actually computed the Mahalanobis distance in Python. The program used here is published as a gist on GitHub.
Since this was implemented from scratch, the code may have flaws. If you need precise calculations, I strongly recommend using the library instead, with reference to [Robust covariance estimation and Mahalanobis distances relevance](http://scikit-learn.org/stable/auto_examples/covariance/plot_mahalanobis_distances.html).
I would appreciate it if you could let me know of any deficiencies or mistakes.