Introduction

This article is posted as Day 6 of the Cisco Advent Calendar 2019 by members of Cisco Systems LLC.

Seven years have passed since the SuperVision team at the University of Toronto, Canada, won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012. That win was achieved with a deep learning model, and in the third AI boom that followed, machine learning and deep learning models have been built for many tasks. Today it is common knowledge that AI relies on machine learning and deep learning.

However, according to a survey released by the Ministry of Internal Affairs and Communications in 2017, only 14.1% of companies have actually introduced AI solutions, and 22.8% are considering doing so. Even including those at the consideration stage, the total is only 36.9%. (Source: AI / IoT introduction status and schedule, Ministry of Internal Affairs and Communications)

Open source tools for machine learning are now so plentiful that they are easy to use on a small scale. With that in mind, over the last two years I have written examples of how to use machine learning easily.
2017 article: I tried to machine learn the location information obtained using API
2018 article: Collecting machine learning training data using collaboration tools

This time, I would like to introduce an example that makes it easy to build a machine learning model from time-series data, which is probably the kind of data acquired most often. In particular, I will focus on how machine learning is applied in the field of anomaly detection.

Time-series data

Time-series data refers to any data that is observed over time, where the order of observation is significant. For Cisco products, Syslog data is time-series data, and there is a wide variety of other examples, such as flow data from Stealthwatch and connection destination data from Cloudlock.

When applying time-series data to machine learning or deep learning, one of two approaches is often taken: a similarity-based method or a model-based method. In the similarity-based method, accuracy varies greatly depending on how the features are extracted, how the window size is set (that is, how the series is divided into segments where the features are likely to appear), and which machine learning method is used afterwards.

In the model-based method, a model from econometric time series analysis is first fitted to the data. Such models include hidden Markov models, ARMA models, VAR models, SARIMA models, and so on. They treat changes over time as state transitions and describe how the data evolves. Each of these models has parameters, so the fitted parameters are then passed to the machine learning method.
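As a rough illustration of the model-based idea, the sketch below fits the simplest autoregressive model, AR(1), to a window by least squares and returns its coefficient, which could then be used as a feature for a machine learning method. AR(1) without an intercept is a simplification chosen here for brevity, not one of the models named above, and `fit_ar1` is a hypothetical helper. In practice a statistics library would normally be used to fit ARMA or SARIMA models.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise.

    Assumption: the series is already centered around zero (no intercept term).
    """
    previous = series[:-1]
    current = series[1:]
    denominator = sum(p * p for p in previous)
    if denominator == 0:
        raise ValueError("series needs a non-zero value before its last sample")
    return sum(c * p for c, p in zip(current, previous)) / denominator


# Made-up, noise-free series that halves at every step, so phi should be 0.5.
halving = [1.0]
for _ in range(10):
    halving.append(0.5 * halving[-1])

phi = fit_ar1(halving)
```

One coefficient per window gives a very compact feature; richer models simply yield more parameters per window.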

Figure 1: Cisco Web Security Appliance Time Series Data

Figure 2: Cisco Cloudlock Time Series Data

This time, the goal is to "find anomalous data by comparing it with normal data" for time-series data obtained as numerical values.

Anomaly detection

Anomaly detection strikes me as a very difficult field, full of formulas from statistics, probability theory, and optimization theory, but it can be applied in many areas, for example:
- Discovering signs of abnormality in factories
- Discovering problems such as malware in the security field
- Discovering health-related abnormalities from human body measurement data

Being able to detect anomalies in these fields has the advantage that problems can be dealt with sooner than if they had to be noticed by humans.

Many machine learning methods have been proposed for anomaly detection; this article touches on only some of them.

Personally, I see two important viewpoints in anomaly detection: emphasizing accuracy (especially reducing false negatives) and emphasizing speed of detection. The choice matters because it also affects how the system is actually operated.

1. Concept that emphasizes the accuracy of anomaly detection

The idea of emphasizing accuracy is especially common in the field of failure detection. This is because we want to avoid cases where equipment is actually broken but is predicted to be "not broken". In machine learning terms, the idea is to increase recall.

(Reference) What is recall? When considering accuracy, consider the following confusion matrix.
Figure 3: Confusion matrix

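Recall is the metric expressed by the following formula, where TP is the number of true positives (actually abnormal and predicted abnormal) and FN is the number of false negatives (actually abnormal but predicted normal):

Recall = TP / (TP + FN)

In other words, recall is an index of how often the model judges a case as abnormal when it is actually abnormal.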
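The calculation can be sketched in a few lines of Python. The labels below are made up for illustration (1 means abnormal, 0 means normal), and `confusion_counts` and `recall` are hypothetical helper names, not part of any library.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels where 1 means 'abnormal'."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def recall(true_positive, false_negative):
    """Recall = TP / (TP + FN): the share of actual anomalies that were detected."""
    actual_anomalies = true_positive + false_negative
    if actual_anomalies == 0:
        raise ValueError("recall is undefined when there are no actual anomalies")
    return true_positive / actual_anomalies


# Made-up labels: four real anomalies, one of which the model misses.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
model_recall = recall(tp, fn)  # 3 / (3 + 1) = 0.75
```

The false positive at the fifth position does not affect recall at all, which is why a recall-focused system can still raise many false alarms.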

In reality, it is difficult to decide whether equipment, for example in a factory, is out of order without any human involvement. However, if there are many false negatives, trust in the AI application will be low, operations are unlikely to change after its introduction, and the burden on the administrator will not be reduced. To avoid this situation, what is wanted is an AI application that produces few false negatives and so lightens the burden of failure detection.

2. The idea of emphasizing the speed of detection

When using machine learning or deep learning, you must also consider computational cost. No matter how accurate the model is, there is little point in deploying an AI application if detection takes a long time.

Take a security AI application as an example. Even if it can detect unusual traffic, detection on the following day is of little use, because important information may already have been stolen. In other words, unless you can spend heavily on infrastructure such as servers, you need to take the model's computational cost into account and emphasize speed of detection.

Machine learning methods with low computational cost include the naive Bayes classifier and the k-means / k-medoids methods. I will not explain them in detail here, and many other methods are also used for anomaly detection.

Application example

Here, I would like to consider an application example. The data is assumed to be two-dimensional time-series data, such as the number of flows in Stealthwatch. I use assumed data because I want to keep the example as general as possible.
Figure 4: Cisco Stealthwatch Time Series Data

From the time-series data above, a segment of window-size length is extracted as a pattern and compared with the pattern in the normal state.
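Cutting a series into windows can be sketched as follows. This is a minimal illustration in plain Python: the flow counts are made up, `split_into_windows` is a hypothetical helper, and dropping an incomplete trailing window is a design choice made here, not something the method requires.

```python
def split_into_windows(series, window_size, step=None):
    """Cut a series into windows of `window_size` samples.

    `step` is the offset between window starts; by default windows do not overlap.
    A trailing remainder shorter than `window_size` is dropped.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if step is None:
        step = window_size
    if step <= 0:
        raise ValueError("step must be positive")
    last_start = len(series) - window_size
    return [series[start:start + window_size]
            for start in range(0, last_start + 1, step)]


# Made-up flow counts sampled at a fixed interval.
flows = [120, 130, 125, 400, 420, 410, 128, 122, 119, 131]

windows = split_into_windows(flows, window_size=3)
overlapping = split_into_windows(flows, window_size=3, step=1)
```

Non-overlapping windows keep the number of comparisons small, while overlapping windows make it less likely that a pattern is split across a window boundary.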

First, regarding pattern extraction: what has to be considered here is that the number of flows rarely drops to 0. Traffic can be assumed to be flowing at all times, especially over the window size of several minutes to several hours. If you know that the count will drop to 0, you can split the series at those points, but when it does not, you have to think about how to extract the patterns.

If the window size is small, the time of day and the duration at which the same pattern appears can be expected to differ from case to case. For example, suppose we know that employees watch YouTube during the lunch break, so the number of flows rises sharply between 12:00 and 13:00 and settles down after 13:00. Even so, the time of the peak will vary from day to day (it might come at 12:31 on one day and at 12:36 on another).

When the shapes are similar regardless of the time width, as in this case, the degree of similarity is calculated with a method called Dynamic Time Warping (DTW). With this method, a comparison is possible even when the data lengths are not uniform, in other words, even when the time widths differ.

On the other hand, if deviation along the time axis is considered important, use the Euclidean distance. An example is the difference in the number of flows between daytime and nighttime when the window size is large.
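The sensitivity of the Euclidean distance to shifts along the time axis can be seen in a small sketch. The two windows below are made up: they contain the same peak, one time step apart. `euclidean_distance` is a hypothetical helper written for this illustration.

```python
import math


def euclidean_distance(window_a, window_b):
    """Point-by-point Euclidean distance; both windows must have the same length."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(window_a, window_b)))


# Made-up windows: the same peak of height 5, one time step apart.
normal = [0, 0, 5, 0, 0, 0]
shifted = [0, 0, 0, 5, 0, 0]

same_time = euclidean_distance(normal, normal)   # identical windows
late_peak = euclidean_distance(normal, shifted)  # sqrt(5**2 + 5**2)
```

Because each sample is compared only with the sample at the same position, a small time shift is penalized as heavily as two unrelated spikes. That is the desired behavior when timing matters and an unwanted one when only the shape matters. Note also that the windows must have equal length.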

Now, let's actually implement DTW in Python, using dynamic programming. The distance between two points may be measured either as an absolute value or as a Euclidean distance, so the choice is made through the method argument. The first and second arguments are two time-series segments cut to the window size. When comparing against time-series data for several days, this function is called multiple times.

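import numpy as np


def dtw(wave_x, wave_y, method="abs"):
    """Return the DTW distance, the accumulated-cost matrix, and the local-cost matrix."""
    # d[i, j] is the minimum accumulated cost of aligning wave_x[:i] with wave_y[:j]
    d = np.full((len(wave_x) + 1, len(wave_y) + 1), np.inf)
    d[0, 0] = 0
    # matrix[i, j] is the point-to-point cost between wave_x[i] and wave_y[j]
    matrix = np.zeros((len(wave_x), len(wave_y)))
    for i in range(1, d.shape[0]):
        for j in range(1, d.shape[1]):
            diff = np.asarray(wave_x[i-1]) - np.asarray(wave_y[j-1])
            if method == "euclid":
                # Euclidean distance (identical to the absolute value for scalar samples)
                cost = np.sqrt(np.sum(diff ** 2))
            else:
                # Absolute difference (summed over components for vector samples)
                cost = np.sum(np.abs(diff))
            matrix[i-1, j-1] = cost
            # Extend the cheapest of the three admissible predecessor alignments
            d[i, j] = cost + min(d[i-1, j], d[i, j-1], d[i-1, j-1])
    return d[-1, -1], d, matrix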

By computing DTW for every pair of windows, the distance matrix between multiple time series can be obtained. Using this distance matrix, consider classification with the k-medoids method. The number of clusters is set to 2 here, because the aim is to separate two patterns, normal and abnormal.

self.n_cluster = 2
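Building the distance matrix mentioned above can be sketched as follows. To keep the block self-contained it uses a compact, list-based DTW with the absolute difference as the local cost. `dtw_distance` and `distance_matrix` are hypothetical helper names, and the windows are made up.

```python
def dtw_distance(wave_x, wave_y):
    """DTW distance between two sequences, using |x - y| as the local cost."""
    inf = float("inf")
    # d[i][j]: minimum accumulated cost of aligning wave_x[:i] with wave_y[:j]
    d = [[inf] * (len(wave_y) + 1) for _ in range(len(wave_x) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(wave_x) + 1):
        for j in range(1, len(wave_y) + 1):
            cost = abs(wave_x[i - 1] - wave_y[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[-1][-1]


def distance_matrix(windows):
    """Symmetric matrix of pairwise DTW distances; the diagonal is zero."""
    n = len(windows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # DTW with a symmetric local cost is itself symmetric, so compute once
            matrix[i][j] = matrix[j][i] = dtw_distance(windows[i], windows[j])
    return matrix


# Made-up windows; the third has a different length, which DTW can handle.
windows = [
    [0, 0, 5, 0, 0],  # peak in the middle
    [0, 0, 0, 5, 0],  # same peak, one step later
    [9, 9, 9, 9],     # constantly high
]
D = distance_matrix(windows)
```

The first two windows end up at distance 0 because DTW absorbs the one-step shift, while the third is far from both. If the clustering code expects a NumPy array, the nested list can be converted with `np.array(D)`.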

The k-medoids method is not implemented in scikit-learn, so implement it as follows. Pass the distance matrix as D_matrix.

import numpy as np
import pandas as pd


class KMedoids():
    def __init__(self, max_iter=300):
        self.n_cluster = 2
        self.max_iter = max_iter

    def fit_predict(self, D_matrix):
        m, n = D_matrix.shape
        # Pick the initial medoids at random and assign each sample to the nearest one
        ini_medoids = np.random.choice(range(m), self.n_cluster, replace=False)
        tmp_D = D_matrix[:, ini_medoids]

        labels = np.argmin(tmp_D, axis=1)

        results = pd.DataFrame({'id': np.arange(m), 'label': labels})

        col_names = ['x_' + str(i + 1) for i in range(n)]
        results = pd.concat([results, pd.DataFrame(D_matrix, columns=col_names)], axis=1)

        old_medoids = ini_medoids
        new_medoids = []

        loop = 0
        # Repeat until the medoids stop changing or max_iter is reached
        while ((len(set(old_medoids).intersection(set(new_medoids))) != self.n_cluster)
               and (loop < self.max_iter)):
            if loop > 0:
                old_medoids = new_medoids.copy()
                new_medoids = []
            for i in range(self.n_cluster):
                # New medoid: the member with the smallest total distance to its own cluster
                tmp = results[results['label'] == i].copy()
                tmp['distance'] = np.sum(tmp.loc[:, ['x_' + str(k + 1) for k in tmp['id']]].values, axis=1)
                tmp = tmp.reset_index(drop=True)
                new_medoids.append(tmp.loc[tmp['distance'].idxmin(), 'id'])

            new_medoids = sorted(new_medoids)
            tmp_D = D_matrix[:, new_medoids]

            # Reassign every sample to its nearest medoid
            clustering_labels = np.argmin(tmp_D, axis=1)
            results['label'] = clustering_labels
            loop += 1

        results = results.loc[:, ['id', 'label']]
        results['flag_medoid'] = 0

        for medoid in new_medoids:
            results.loc[results['id'] == medoid, 'flag_medoid'] = 1
        tmp_D = pd.DataFrame(tmp_D, columns=['medoid_distance'+str(i) for i in range(self.n_cluster)])
        results = pd.concat([results, tmp_D], axis=1)

        self.results = results
        self.cluster_centers_ = new_medoids
        return results['label'].values

The above classifies the data into two classes, normal and abnormal. The details of the k-medoids method are omitted here. Its characteristics are close to those of the k-means method, but because it classifies using medoids rather than means, it is more resistant to outliers. In addition, it can classify anything for which a distance matrix can be obtained, which makes it widely applicable. The medoid of a cluster is the member whose total distance to the other members of the cluster is smallest.
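How a medoid is found can be sketched in isolation. The distance matrix below is made up, with the sample at index 3 acting as an outlier, and `find_medoid` is a hypothetical helper, not a library function.

```python
def find_medoid(distance_matrix, members):
    """Return the member index whose summed distance to all members is smallest."""
    if not members:
        raise ValueError("a cluster needs at least one member")
    return min(members,
               key=lambda i: sum(distance_matrix[i][j] for j in members))


# Made-up symmetric distances: samples 0-2 are close together, sample 3 is far away.
D = [
    [0, 1, 2, 9],
    [1, 0, 1, 9],
    [2, 1, 0, 9],
    [9, 9, 9, 0],
]

medoid_with_outlier = find_medoid(D, [0, 1, 2, 3])
medoid_without_outlier = find_medoid(D, [0, 1, 2])
```

In this made-up case the same sample is chosen with or without the outlier. This illustrates why medoids are less affected by outliers than means: a medoid is always an actual data point.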

Finally

This time, I wrote about approaches to anomaly detection and how they are applied, and then showed a code example for performing anomaly detection on two-dimensional time-series data.

Since the k-medoids method can classify anything for which a distance matrix can be obtained, even character strings can be classified once a distance between them is defined. Distances between strings include the Jaro-Winkler distance and the Levenshtein distance, so please look those up as well.
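As a sketch of the string case, the Levenshtein distance can be computed with dynamic programming and used to fill a distance matrix in the same way as DTW. The strings below are made up for illustration, and `levenshtein` is a hypothetical helper written here rather than taken from a library.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


# Made-up strings; the second differs from the first by one character.
names = ["example.com", "examp1e.com", "test.example"]
D = [[levenshtein(x, y) for y in names] for x in names]
```

The resulting matrix is symmetric with a zero diagonal, so it can be used wherever a DTW distance matrix would be.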

We hope that you will refer to this article and try anomaly detection using the data you have at hand.

Disclaimer

The opinions expressed on this site and the corresponding comments are the personal opinions of the contributor and not the opinions of Cisco. The content of this site is provided for informational purposes only and is not intended as an endorsement or representation by Cisco or any other party. By posting on this website, you agree that you are solely responsible for the content of all information you upload by posting, linking, or otherwise, and that you release Cisco from any liability regarding the use of this website.

Time series data anomaly detection for beginners