Aidemy 2020/11/10
Hello, this is Yope! I am a liberal arts student, but I was interested in the possibilities of AI, so I studied at the AI-specialized school "Aidemy". I am summarizing what I learned there on Qiita, and I am very happy that many people read the previous summary article. Thank you! This is the second post in my introduction to anomaly detection. Nice to meet you.
What to learn this time
・Anomaly detection with the k-nearest neighbor method
・Anomaly detection with one-class SVM
・Anomaly detection for direction data
(I covered the k-nearest neighbor method before, but this time I will revisit it in the context of anomaly detection.)
・The __k-nearest neighbor method__ calculates the degree of anomaly from the ratio of anomalous data among the k data points closest to the point being judged. As with the Hotelling method, whether a point is anomalous is determined by whether its anomaly score exceeds a __threshold__.
・The special case where k is 1 is called the __"nearest neighbor method"__; k = 1 means that only the single closest data point is referenced.
・The Hotelling method of the previous post had the preconditions that "the data come from a single normal distribution" and "the data contain almost no outliers". The k-nearest neighbor method has __no such restrictions__.
・The k-nearest neighbor method is very easy to understand when illustrated. When judging whether the green point in the figure below is normal or anomalous, it is judged __"normal"__ because it is close to the blue cluster, which is a set of normal data.
・The advantages of the k-nearest neighbor method are: __no assumptions about the data distribution are required; it works even when the normal data forms multiple clusters; and the anomaly score formula is simple__.
・The disadvantages are: __because there are no distributional assumptions, the threshold cannot be derived from a formula; rigorous tuning of the parameter k requires complicated mathematics; and because it is lazy learning that builds no model in advance, classifying new data takes a long time__.
・The anomaly score of the k-nearest neighbor method is calculated by the following formula:

$$
a(x) = \ln\frac{\pi_0\, N^1(x)}{\pi_1\, N^0(x)}
$$
・About each variable:
・$\pi_0$ is the proportion of normal data in all samples, and $\pi_1$ is the proportion of anomalous data in all samples.
・$N^0(x)$ is the proportion of normal data among the $k$ points nearest to $x$, and $N^1(x)$ is the proportion of anomalous data among the $k$ points nearest to $x$.
・$\ln$ is the natural logarithm, i.e. the logarithm with base $e$ ($\log_e$).
・Calculate each of these values. With the data labels stored in "y" (0 = normal, 1 = anomalous), $\pi_0$ is computed as `y[y == 0].size / y.size`, that is, __"number of normal labels / total number of labels"__ gives $\pi_0$.
・Likewise, $N^0(x)$ can be computed by first finding the neighbor labels "neighbors" as shown below and then evaluating `neighbors[neighbors == 0].size / k`.
・ Calculation of neighbors
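The original screenshot is not reproduced here, so below is a minimal sketch of how "neighbors" and the ratios above could be computed with scikit-learn. The toy data and the variable names `X`, `y`, and `k` are my assumptions, not the course's exact code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (assumption): 2-D points, labels 0 = normal, 1 = anomalous
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [3.0, 3.0], [3.1, 2.9], [1.1, 1.0]])
y = np.array([0, 0, 0, 1, 1, 0])

k = 3
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X, y)

# Labels of the k nearest training points for each sample.
# (When querying the training data itself, the nearest neighbor is the
# point itself -- the leave-one-out trick below addresses this.)
dist, idx = clf.kneighbors(X)
neighbors = y[idx]

pi0 = y[y == 0].size / y.size            # proportion of normal labels overall
pi1 = y[y == 1].size / y.size            # proportion of anomalous labels overall
N0 = (neighbors == 0).sum(axis=1) / k    # per-point proportion of normal neighbors
N1 = (neighbors == 1).sum(axis=1) / k    # per-point proportion of anomalous neighbors
```

For a single point, `neighbors[neighbors == 0].size / k` from the text is equivalent to one row of `N0` here.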
・Calculate the anomaly score using the variables computed so far. Writing the code to match the formula above, it can be expressed as follows:
```python
np.log((N1 * pi0) / (N0 * pi1))
```
・The method of training on N − 1 of the N data points and using the remaining one as test data to check the model's accuracy is called __leave-one-out cross-validation__. This time, we will use it to evaluate accuracy.
・Leave-one-out cross-validation can be implemented with KNeighborsClassifier.
・In code, the data is trained as in the previous section to compute "neighbors" and the anomaly score, but two extra steps are added: __"increase k by one"__ and __"exclude the nearest neighbor"__.
・To "increase k by one", pass __`k + 1`__ as "n_neighbors" to KNeighborsClassifier().
・To "exclude the nearest neighbor" (each point's nearest neighbor is the point itself), apply __`[:, 1:]`__ when computing "neighbors" to drop the first column of the array.
・Code
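A minimal sketch of what this leave-one-out computation might look like, wrapped as a helper reused in later sketches; the function name and the `eps` guard against log(0) are my additions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def loo_anomaly(X, y, k, eps=1e-12):
    """Leave-one-out k-NN anomaly score for every point in X (sketch)."""
    pi0 = y[y == 0].size / y.size
    pi1 = y[y == 1].size / y.size

    # Query k+1 neighbors so the point itself can be dropped afterwards
    clf = KNeighborsClassifier(n_neighbors=k + 1)
    clf.fit(X, y)
    dist, idx = clf.kneighbors(X)

    neighbors = y[idx[:, 1:]]              # exclude the nearest neighbor (the point itself)
    N0 = (neighbors == 0).sum(axis=1) / k
    N1 = (neighbors == 1).sum(axis=1) / k

    # eps guards against log(0) when a neighborhood is purely one class
    return np.log((N1 * pi0 + eps) / (N0 * pi1 + eps))
```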
・As mentioned above, the __threshold is not derived from a formula__, so you need to set it yourself. The __number of neighbors k must also be set yourself__, so we create a procedure to find the __optimal values__ of these two.
・Whether a value is optimal can be checked with the model's accuracy (the __F score__). Prepare lists of k candidates (param_k) and threshold candidates (param_ath) in advance, take each combination out with a __for loop__, compute the anomaly scores as in the previous section, compare them with the threshold, and update the best values whenever the accuracy reaches a new maximum.
・For the accuracy calculation, use the __leave-one-out cross-validation__ of the previous section. After comparing the anomaly scores with the threshold, __create an array in which points judged "anomalous" are set to 1 and the others to 0__. An array that encodes a condition as 0s and 1s can be created with __`np.asarray(conditional expression)`__.
・The accuracy (score) is obtained by computing the F score with __`f1_score(y, y_valid)`__.
・In the following, the k candidates are "1 to 5", and the threshold candidates are "21 values in 0.1 increments centered on 0". Centering the threshold candidates on 0 is a workable default, but as noted above there is no exact way to determine the k candidates, so this time they are simply listed as above.
・Code![Screenshot 2020-11-05 11.33.41.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/e84371c0-4c72-8704-82ba-cc982ca1fe60.png)
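A sketch of the parameter search described above, reusing `X`, `y`, and `loo_anomaly` from the earlier sketches; the concrete candidate values are assumptions based on the text.

```python
import numpy as np
from sklearn.metrics import f1_score

param_k = [1, 2, 3, 4, 5]
param_ath = np.arange(-1.0, 1.1, 0.1)   # 21 threshold candidates, 0.1 steps, centered on 0

best_score, best_k, best_th = 0.0, None, None
for k in param_k:
    anomaly = loo_anomaly(X, y, k)                     # leave-one-out anomaly scores
    for th in param_ath:
        y_valid = np.asarray(anomaly > th, dtype=int)  # 1 = judged anomalous
        score = f1_score(y, y_valid)
        if score > best_score:                         # keep the best (k, threshold) pair
            best_score, best_k, best_th = score, k, th
```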
・As in the previous section, when judging anomalies with the k-nearest neighbor method, passing the condition "anomaly score is greater than the threshold" as the argument of __`np.asarray()`__ yields "0" for normal and "1" for anomalous.
・The following code first finds the optimal k and threshold with the procedure of the previous section, then uses them to plot the anomaly scores of x as a graph.
・Code![Screenshot 2020-11-05 12.56.18.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/ff93cbbf-f366-54a1-d1b7-a6ca66a5904f.png)
・Graph (red = anomalous, blue = normal; the lighter colors show the classification of the training data)
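The plotting code itself is only shown as a screenshot above; a minimal matplotlib sketch under the same assumptions could look like this.

```python
import matplotlib.pyplot as plt

anomaly = loo_anomaly(X, y, best_k)
pred = np.asarray(anomaly > best_th, dtype=int)   # 0 = normal, 1 = anomalous

plt.scatter(X[pred == 0, 0], X[pred == 0, 1], c="blue", label="normal")
plt.scatter(X[pred == 1, 0], X[pred == 1, 1], c="red", label="anomalous")
plt.legend()
plt.show()
```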
・One-class SVM is a method for detecting outliers in __unlabeled data__; in other words, it is __unsupervised learning__.
・The basic idea of outlier detection, as in the Hotelling method, is to __create a sphere that encloses almost all of the data and treat the points that deviate from it as outliers__. The Hotelling method did this with the __Mahalanobis distance__; one-class SVM instead detects outliers by __distorting the data space so that almost all the data fits inside the sphere__.
・When this sphere is created, the outermost data points lie on its surface, and the sphere is built from them. These points are called __support vectors__. One-class SVM separates normal from anomalous by finding them.
・The flow of one-class SVM is as follows: (1) create an SVM classifier and compute the __anomaly scores__, (2) set the expected rate of anomalies and determine the __threshold__, (3) make the __anomaly judgment__ from the threshold and the anomaly scores.
・First, create the SVM classifier. This is done with __`OneClassSVM()`__. The parameters __"kernel" and "gamma"__ need to be set.
・"kernel" specifies __how to distort the space__. Basically, specify __`'rbf'`__. When `'rbf'` is specified, "gamma" can be set to __`'auto'`__.
・Next, the anomaly scores are calculated with this SVM classifier by passing the data to __`clf.decision_function()`__. The result (multidimensional data) is flattened to one dimension with __`ravel()`__.
・For the anomaly score computed by one-class SVM, __the smaller the value, the more anomalous the point__.
・Code![Screenshot 2020-11-05 14.17.10.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/82932f52-f73d-95db-672c-60abe9b7e5ac.png)
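A minimal sketch of this step; here `X` stands for the unlabeled data, and the variable name `outliers` is my assumption.

```python
import numpy as np
from sklearn.svm import OneClassSVM

clf = OneClassSVM(kernel="rbf", gamma="auto")
clf.fit(X)

# decision_function: smaller (more negative) = more anomalous
outliers = clf.decision_function(X).ravel()
```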
・Next, set the threshold. In one-class SVM, the threshold must be set under the assumption that __"the data contains anomalous data at a certain rate"__. Below, this rate is written __"a (%)"__.
・The threshold is computed with __`st.scoreatpercentile()`__. Pass the __anomaly scores__ from the previous section as the first argument and the __anomaly rate "a"__ as the second. The threshold is then set so that __the anomalous data make up a% of the whole__.
・The following code sets the anomaly rate to 10, 25, 50, and 75 (%) and prints each threshold.
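A sketch of the threshold computation, reusing `outliers` from the previous sketch.

```python
import scipy.stats as st

# Threshold for several assumed anomaly rates a (%)
for a in [10, 25, 50, 75]:
    print(a, st.scoreatpercentile(outliers, a))

# e.g. assume 10% of the data is anomalous
threshold = st.scoreatpercentile(outliers, 10)
```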
・Finally, judge anomalies by comparing the anomaly scores with the threshold. This works the same way as for the k-nearest neighbor method (except that this time the judgment is expressed as __(True, False)__ instead of (0, 1)).
・Code![Screenshot 2020-11-05 15.08.06.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/aaf00f76-6b78-4f2a-28be-20700317f681.png)
・Graph (blue = anomalous data, 10% of the total)![graph](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/4228ca8b-d9e8-185c-9ea4-e78f516da00d.png)
・For example, with document data, a 10-word document and a 100-word document may not be classified well because their sizes as vectors differ. In such cases, the problem is solved by __unifying the length of every data vector to 1__. This operation is called __normalization__, and the normalized vector data are called __"direction data"__.
・This time, anomaly detection is performed on this direction data. As with one-class SVM, __unlabeled data is handled__, but there are restrictions: __"the data are distributed around a single direction" and "almost no outliers are included"__.
・The flow is as follows: (1) __normalize__ the data, (2) compute the data's __anomaly scores__, (3) set the __threshold from the sample distribution__, (4) make the __anomaly judgment__.
・To make the length of each data vector 1, __divide the data by its own length__. So first the __length of the data__ is needed.
・The length is computed with __`np.linalg.norm(data, ord=2, axis=1)[:, np.newaxis]`__. The same function appeared in deep learning for regularization (a technique to prevent overfitting). __`ord=2`__ specifies the L2 (Euclidean) norm, and __`axis=1`__ means the length is computed __along the row direction__. The __`[:, np.newaxis]`__ part reshapes the result so that its axes line up with the data when normalizing.
・Code![Screenshot 2020-11-05 16.10.52.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/cdbb90ea-2d7a-2be0-17ae-1d767e6427fa.png)
・Result (all lengths are 1)![Screenshot 2020-11-05 16.10.28.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/66dc9f45-1f73-5b91-7311-fbd0bca54d54.png)
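A minimal sketch of the normalization step shown in the screenshots above; the toy `data` array is my assumption.

```python
import numpy as np

# Toy 2-D data (assumption)
data = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]])

length = np.linalg.norm(data, ord=2, axis=1)[:, np.newaxis]  # L2 norm of each row
normalized = data / length                                   # every row now has length 1

print(np.linalg.norm(normalized, ord=2, axis=1))             # -> [1. 1. 1.]
```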
・Since direction data is assumed to be distributed around a single direction, the __anomaly score is determined by how far a point deviates from that direction__. To measure this deviation, a __reference direction__ is needed. This time, we use the mean of the normalized data.
・For the mean, just pass the normalized data of the previous section to __`np.mean()`__, specifying __`axis=0`__. Note, however, that this mean vector __must itself be normalized again__.
・The anomaly score is computed as __"1 − (inner product of the data and the mean vector)"__. The inner product of vectors can be computed with "np.dot()", but since the data here is a set of vectors, i.e. a __matrix__, __`np.matmul()`__ is used in such cases. Its usage is the same as dot.
・Code
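A sketch of this step, continuing from the normalization sketch above.

```python
# Reference direction: mean of the normalized data, re-normalized to length 1
mean_vec = np.mean(normalized, axis=0)
mean_vec = mean_vec / np.linalg.norm(mean_vec, ord=2)

# Anomaly score: 1 - (inner product with the reference direction)
anomaly = 1 - np.matmul(normalized, mean_vec)
```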
・There is a theorem that this anomaly score approximately follows a __chi-square distribution__. So the distribution is fitted using the anomaly scores computed in the previous section, and the threshold is derived from it.
・See Chapter 1 for how to use the chi-square distribution. As a review, it is implemented with __`st.chi2.ppf()`__. Pass __"1 − false alarm rate"__ as the first argument; the false alarm rate is something you set yourself.
・This time, also set __"df"__ as the second argument and __"scale"__ as the third. "df" is the __degrees of freedom__, computed as __2 × (mean of the anomaly scores)² / (mean of the squared anomaly scores − (mean of the anomaly scores)²)__, i.e. 2 × mean² / variance. "scale" is the scale of the chi-square distribution, computed as __(mean of the anomaly scores) / (degrees of freedom)__.
・Code
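A sketch of the threshold computation under these moment-matching formulas; the 5% false alarm rate is an assumed value.

```python
import scipy.stats as st

m = anomaly.mean()               # mean of the anomaly scores
s = (anomaly ** 2).mean()        # mean of the squared anomaly scores

df = 2 * m ** 2 / (s - m ** 2)   # degrees of freedom: 2 * mean^2 / variance
scale = m / df                   # scale of the chi-square distribution

false_alarm = 0.05               # false alarm rate (assumed)
threshold = st.chi2.ppf(1 - false_alarm, df=df, scale=scale)
```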
・Finally, judge whether data is anomalous. Here, observation data "X", prepared separately from the sample data so far, is __judged for anomalies__.
・Since the threshold created in the previous section must be compared with X's anomaly scores, first calculate the __anomaly scores of X__.
・For X's anomaly scores, pass "X" (normalized in the same way) and the mean vector of the sample data to __`np.matmul()`__ to take the inner product, and subtract it from 1 as before.
・Once that is done, classify each point as anomalous or normal with __`np.asarray(conditional expression)`__.
・Code![Screenshot 2020-11-05 17.59.14.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/3eb5d367-3875-ce4c-efe1-7cb3bf516a1f.png)
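A sketch of the final judgment; the observation data `X_obs` is a made-up example.

```python
# Observation data (assumption), normalized the same way as the samples
X_obs = np.array([[0.5, 0.5], [4.0, -1.0]])
X_norm = X_obs / np.linalg.norm(X_obs, ord=2, axis=1)[:, np.newaxis]

anomaly_X = 1 - np.matmul(X_norm, mean_vec)
pred = np.asarray(anomaly_X > threshold, dtype=int)   # 0 = normal, 1 = anomalous
```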
・Anomaly detection with the __k-nearest neighbor method__ judges a point anomalous when the anomaly score based on the __"ratio of anomalous data among the k nearest points"__ exceeds the threshold. The k-nearest neighbor method has __no constraints like those of the Hotelling method__, so it can be used on a wide variety of data.
・One accuracy-evaluation method is __leave-one-out cross-validation__, which uses one of the data points as test data for accuracy evaluation.
・The method used to detect anomalies in data __without labels__ is the __one-class SVM__. It performs anomaly detection by computing anomaly scores with an SVM, setting in advance __what percentage of the data is anomalous__, and deriving the threshold from that.
・Unlabeled anomaly detection can also be performed on normalized vector data, that is, direction data. The anomaly score is computed from the __inner product of the data and the reference direction__, and the threshold is set using the __chi-square distribution__.
That's all for this time. Thank you for reading to the end.