This article explains outlier detection based on the Mahalanobis distance. The implementation is covered separately.
This method learns the pattern of the data by unsupervised learning and flags data that deviates significantly from that pattern as outliers. The Mahalanobis distance is one of the distance measures used in statistics. A similar and more familiar measure is the Euclidean distance.
First, let us review the Euclidean distance. The Euclidean distance is distance in the ordinary, everyday sense, the one obtained from the Pythagorean theorem. In a simple two-dimensional plane, given the two points
{ \displaystyle
p = (p_1,p_2)
}
and
{ \displaystyle
q = (q_1,q_2)
}
their distance is
{ \displaystyle
d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}
}
This is the ordinary way of calculating distance.
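The formula above can be tried out directly. The following is a minimal sketch in plain Python; the function name `euclidean_distance` and the sample points are chosen only for illustration.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Illustrative points: the differences are 3 and 4, so the distance is 5.
p = (1.0, 2.0)
q = (4.0, 6.0)
print(euclidean_distance(p, q))  # 5.0
```

The same function works for any number of dimensions, because it simply sums the squared differences of all coordinates.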
As explained earlier, the Mahalanobis distance is one of the distance measures used in statistics. It is typically used when the dimensions of multidimensional data are correlated.
Suppose each data point is an n-dimensional continuous-valued vector, and the observed data sequence is
x^m = x_1, x_2, \ldots, x_m
(Think of this as a set of m vectors.)
The i-th data point is
{
x_i = \left(
\begin{array}{c}
x_{i,1} \\
x_{i,2} \\
\vdots \\
x_{i,n}
\end{array}
\right)
}
With this notation, the mean vector μ and the covariance matrix Σ are calculated as follows.
Mean vector
{
μ = \frac{1}{m} \sum_{i=1}^m x_i
}
Covariance matrix
{
\Sigma = \frac{1}{m}\sum_{i=1}^m (x_i - μ)(x_i - μ)^{\mathrm{T}}
}
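These two definitions translate almost line by line into code. The sketch below uses plain Python lists so that it has no dependencies; the function names `mean_vector` and `covariance_matrix` and the four sample points are assumptions made for the example. Note that it divides by m, as in the formula above, rather than by m - 1.

```python
def mean_vector(data):
    """Component-wise mean of a list of equal-length vectors."""
    m = len(data)
    n = len(data[0])
    return [sum(x[j] for x in data) / m for j in range(n)]


def covariance_matrix(data):
    """Covariance matrix (1/m) * sum of (x_i - mu)(x_i - mu)^T."""
    m = len(data)
    mu = mean_vector(data)
    n = len(mu)
    cov = [[0.0] * n for _ in range(n)]
    for x in data:
        diff = [x[j] - mu[j] for j in range(n)]
        # Accumulate the outer product of the deviation with itself.
        for a in range(n):
            for b in range(n):
                cov[a][b] += diff[a] * diff[b]
    return [[entry / m for entry in row] for row in cov]


# Illustrative two-dimensional data (m = 4, n = 2).
data = [(2.0, 1.0), (4.0, 3.0), (6.0, 2.0), (8.0, 6.0)]
print(mean_vector(data))        # [5.0, 3.0]
print(covariance_matrix(data))  # [[5.0, 3.5], [3.5, 3.5]]
```

The positive off-diagonal entry 3.5 reflects that the two coordinates of this sample tend to increase together.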
Then, with θ as the threshold parameter, a new data point x is judged to be an outlier if
θ < \sqrt{(x - μ)^{\mathrm{T}} \Sigma^{-1} (x - μ)}
is satisfied. The right-hand side of this inequality is called the Mahalanobis distance.
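The threshold test can be sketched in plain Python for the two-dimensional case. This is a minimal illustration under assumptions, not the implementation from the follow-up article: the function names `mahalanobis_2d` and `is_outlier`, the values of `mu` and `sigma`, and the threshold `theta = 3.0` are all made up for the example, and the 2x2 inverse is written out by hand so that no library is needed.

```python
import math


def mahalanobis_2d(x, mu, sigma):
    """Mahalanobis distance of a 2-D point x for mean mu and 2x2 covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular and cannot be inverted")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T * inv * (x - mu).
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)


def is_outlier(x, mu, sigma, theta):
    """True when the Mahalanobis distance of x exceeds the threshold theta."""
    return theta < mahalanobis_2d(x, mu, sigma)


# Illustrative (assumed) parameters: the two coordinates are positively correlated.
mu = (5.0, 3.0)
sigma = [[5.0, 3.5], [3.5, 3.5]]
theta = 3.0  # hypothetical threshold

for point in [(8.0, 6.0), (8.0, 1.0)]:
    print(point, mahalanobis_2d(point, mu, sigma), is_outlier(point, mu, sigma, theta))
```

With these values, (8, 6) follows the correlation and stays below the threshold, while (8, 1) is flagged even though it is closer to the mean in Euclidean terms, because it runs against the correlation.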
Here, when Σ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. When Σ is a diagonal matrix, each dimension is scaled differently. When Σ also has nonzero off-diagonal elements, the contours of equal distance are additionally rotated.
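The first two cases can be checked by direct substitution. In this short derivation x_j and μ_j denote the j-th components of x and μ, and σ_j^2 is a name introduced here for the j-th diagonal entry of the covariance matrix. With the identity matrix I as the covariance matrix, the inverse is also I, so
{ \displaystyle
\sqrt{(x - μ)^{\mathrm{T}} I^{-1} (x - μ)} = \sqrt{\sum_{j=1}^n (x_j - μ_j)^2}
}
which is the Euclidean distance between x and μ. With a diagonal covariance matrix whose diagonal entries are σ_1^2, ..., σ_n^2,
{ \displaystyle
\sqrt{(x - μ)^{\mathrm{T}} \Sigma^{-1} (x - μ)} = \sqrt{\sum_{j=1}^n \frac{(x_j - μ_j)^2}{σ_j^2}}
}
so each coordinate is divided by its own standard deviation before the contributions are combined.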
The figure below illustrates outlier detection based on the Mahalanobis distance for two-dimensional vectors.
The plotted points are the data points, and an ellipse is drawn so as to cover them.
Points outside this ellipse are outliers. The threshold θ therefore determines the size of the ellipse, and the proportion of outliers changes depending on how much deviation is allowed.
The figure makes this easy to see. The difference between the Euclidean distance and the Mahalanobis distance is whether the correlation within the multidimensional data is taken into account when defining distance. With the Mahalanobis distance, a displacement along a strongly correlated direction counts as relatively shorter than its actual distance. With the Euclidean distance, a unit of distance is the same along the x-axis and the y-axis. The Mahalanobis distance, by contrast, is scaled by the variance along each direction. In other words, where the variance is large, a point is treated as not very far from the origin; where the variance is small, the same displacement is treated as far from the origin.
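A small numerical example makes the effect of the variance concrete. The sketch below assumes a diagonal covariance matrix, in which case the Mahalanobis distance is the Euclidean distance after dividing each squared deviation by that axis's variance; the function name `diagonal_mahalanobis` and the variances 9 and 1 are made up for illustration.

```python
import math


def diagonal_mahalanobis(x, mu, variances):
    """Mahalanobis distance when the covariance matrix is diagonal."""
    return math.sqrt(sum((xj - mj) ** 2 / v for xj, mj, v in zip(x, mu, variances)))


mu = (0.0, 0.0)
variances = (9.0, 1.0)  # assumed: large variance along x, small along y

along_x = (3.0, 0.0)
along_y = (0.0, 3.0)

# Both points are 3.0 from the origin in Euclidean terms.
print(math.hypot(*along_x), diagonal_mahalanobis(along_x, mu, variances))  # 3.0 1.0
print(math.hypot(*along_y), diagonal_mahalanobis(along_y, mu, variances))  # 3.0 3.0
```

The point along the high-variance axis is only 1 unit away in Mahalanobis terms, while the point along the low-variance axis is 3 units away.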
In this figure, several ellipses are drawn around the origin. The size of each ellipse corresponds to the outlier threshold: the larger the threshold, the larger the ellipse and the smaller the proportion of outliers.
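The relationship between the threshold and the proportion of outliers can be sketched as follows. The list of distances is made-up illustrative data standing in for Mahalanobis distances that have already been computed, and `outlier_fraction` is a hypothetical helper name.

```python
def outlier_fraction(distances, theta):
    """Fraction of points whose distance exceeds the threshold theta."""
    flagged = [d for d in distances if theta < d]
    return len(flagged) / len(distances)


# Made-up Mahalanobis distances of eight data points.
distances = [0.3, 0.8, 1.1, 1.4, 1.9, 2.2, 2.7, 3.5]

for theta in (1.0, 2.0, 3.0):
    print(theta, outlier_fraction(distances, theta))
# 1.0 0.75
# 2.0 0.375
# 3.0 0.125
```

Raising the threshold from 1.0 to 3.0 shrinks the flagged share from 75% to 12.5% of the points.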
I mentioned that the shape of the ellipse is determined by the variance; you can also read this figure like the contour lines of a mountain. A contour line joins the parts of the mountain that are at the same height, and the Mahalanobis distance works the same way. If you set the axes aside and regard each ellipse as the set of points at the same threshold value, I think you can see how the circle has been distorted.
In detection based on the Mahalanobis distance, the notion of an outlier is formulated through basic quantities such as the mean vector and the variance. However, because the mean itself is strongly affected by outliers, methods that detect outliers using the median instead have also been proposed (omitted here).
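The sensitivity of the mean is easy to see with a tiny one-dimensional example. This sketch only compares the mean and the median on made-up values; it is not one of the median-based detection methods mentioned above.

```python
import statistics

# Made-up measurements clustered around 10.
values = [8.0, 9.0, 10.0, 11.0, 12.0]
# The same data with a single extreme value added.
with_outlier = values + [100.0]

print(statistics.mean(values), statistics.median(values))              # 10.0 10.0
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 25.0 10.5
```

One extreme value moves the mean from 10 to 25, while the median barely changes.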
Also, and this matters to me personally: distance calculation with the Mahalanobis distance cannot be applied meaningfully to data consisting only of boolean values. (The calculation takes the difference between a logical value of 1 or 0 and the mean of those 1s and 0s, so think about what happens ...)
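One way to read this hint is shown in the sketch below, which is my own illustration and an assumption about what the remark means. For a single boolean feature the deviation from the mean can take only two values, so the distance can take only two values as well. The sample `flags` is made up, and in one dimension the Mahalanobis distance is simply the absolute deviation divided by the standard deviation.

```python
import math

# Made-up boolean feature: two 1s and eight 0s.
flags = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

m = len(flags)
mu = sum(flags) / m                          # share of 1s
var = sum((x - mu) ** 2 for x in flags) / m  # variance, dividing by m

# One-dimensional Mahalanobis distance for each distinct value.
distances = {x: abs(x - mu) / math.sqrt(var) for x in set(flags)}
for value, distance in sorted(distances.items()):
    print(value, round(distance, 3))
```

Every 0 gets one distance and every 1 gets another, so a threshold can only flag all of the 1s, all of the 0s, everything, or nothing.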
Next time is the implementation part, where I will actually implement this in Python.