I like history books and read them from time to time. Recently I read Essence of Failure, which was very interesting because it examines the turning points of the Pacific War from the perspective of organizational theory. While reading it, I wondered: can a turning point be identified quantitatively in hindsight? It turns out there is a way: algorithms that extract change points from time series.
First, what kinds of problems fall under anomaly detection and change detection? Following page 12 of Introduction to Anomaly Detection, I classify them into three tasks.
| Task | Overview |
|---|---|
| Outlier detection | Static. Is a given data point significantly outside the distribution? |
| Change point detection | Dynamic. Has the behavior of the subject changed, and if so, where is the change point? |
| Anomalous state detection | Dynamic. Is the current state of the subject abnormal or normal? |
Writing this out in full would make for a very long article, so I will skip the details here. For an overview of anomaly detection and change detection, I recommend the article Summary of anomaly detection and change detection, no formulas.
This article focuses on change point detection: I introduce an algorithm used for change point detection in time series data and actually try it out, since a Python module for it is available.
The idea advocated in the past is very simple. Prepare two kinds of models:

1. A time series model (an AR model, etc.) built from the entire time series.
2. A pair of time series models built by splitting the series at some point t.

Then:

1. Compute predictions with each of the models.
2. Compare each model's predictions with the actual values and compute the squared errors.

Finally, subtract the squared error of the models built by splitting at t from the squared error of the model built on the whole series. If the difference is large enough, we say "a change occurred at t".
For reference, only the formula of the AR model is given here; please check other documents and websites for a detailed explanation. A d-dimensional AR model can be written as follows.
z_t = \sum_{i=1}^{k}A_iz_{t-i}+\varepsilon
Here $A_i$ is a coefficient matrix, and $\varepsilon$ is a noise term with mean 0 and variance-covariance matrix $\Sigma$.
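To make the recursion concrete, here is a small sketch that simulates a d-dimensional AR(k) process following the formula above (the dimensions, coefficient matrices, and noise covariance are made-up examples):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 2, 2, 200
A = [np.array([[0.5, 0.1], [0.0, 0.4]]),   # A_1
     np.array([[0.2, 0.0], [0.1, 0.2]])]   # A_2
Sigma = 0.05 * np.eye(d)                   # noise covariance

z = np.zeros((T, d))
for t in range(k, T):
    eps = rng.multivariate_normal(np.zeros(d), Sigma)
    z[t] = sum(A[i] @ z[t - 1 - i] for i in range(k)) + eps
```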
The above algorithm had the following problems.
- Every time you predict the value at time t+1, you must re-estimate the parameters using all the data up to time t.
- In addition, the split point t must be searched for exhaustively, so the computational cost becomes enormous.
- The time series models used here assume stationary data, so non-stationary data cannot be handled; since we are dealing with data that is changing, this assumption is hard to justify.
To address these problems, a method called ChangeFinder was devised. It uses the SDAR (Sequentially Discounting Auto Regressive) algorithm, which estimates the parameters of the AR model online.
1. A change score is computed for each point of the time series, and points with high scores are regarded as changes.
2. Parameters are estimated by (a slightly modified) maximum likelihood estimation.
3. When the data is updated, the new parameters are obtained from the previous parameters and the new data point. Together with 2, this reduces the amount of computation compared with the conventional method.
4. When estimating the model parameters, a "forgetting parameter" is introduced to reduce the influence of past data. Set appropriately, this reduces the influence of older data whose properties differ, making it possible to handle non-stationary data to some extent.
5. Outliers and change points are extracted separately; to distinguish them, learning is carried out in two stages (described later).
In short, it means "faster and more accurate than before!"
In the SDAR model, a probability density function based on the AR model is obtained from the time series data, and the log-likelihood function is approximated (the log-likelihood function itself is omitted in this article; if you want the details, please check [here](http://www.viewcom.or.jp/seika/report0703.pdf)). The model parameters are estimated as the values that maximize the log-likelihood weighted by the forgetting coefficient.
Specifically, it is as follows. For a stochastic process $p$, write the conditional density of $x_t$ given $x^{t-1} = x_1, x_2, \ldots, x_{t-1}$ as $p(x_t|x^{t-1})$. The parameters $A$, $\mu$, $\Sigma$ of the SDAR model are taken to be the values that maximize $I$ below.
I = \sum_{i=1}^{t}(1-r)^{t-i}\log P(x_i|x^{i-1},A_1,…,A_k,\mu,\Sigma)
Here $r$ is the forgetting parameter, $A_1, \ldots, A_k$ are the AR model coefficients, $k$ is the order of the AR model, and $\Sigma$ is the covariance matrix.
This article also omits the derivation of the log-likelihood function and the derivation of the parameters via the Yule-Walker equations. If you would like to know more, please check the relevant part of the paper here.
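Since the derivation is omitted, here is a minimal 1-D sketch of what an SDAR update might look like, following the usual description (discounted mean and autocovariances, then Yule-Walker for the coefficients). The class and its details are illustrative assumptions and may well differ from the changefinder package's internals:

```python
import numpy as np

class SDAR1D:
    def __init__(self, r=0.01, k=1):
        self.r, self.k = r, k
        self.mu = 0.0
        self.C = np.zeros(k + 1)   # discounted autocovariances C_0..C_k
        self.sigma2 = 1.0          # discounted prediction-error variance
        self.hist = []             # last k centered observations

    def _predict(self):
        k = self.k
        # Yule-Walker: Toeplitz(C_0..C_{k-1}) a = (C_1..C_k)
        T = np.array([[self.C[abs(i - j)] for j in range(k)] for i in range(k)])
        a = np.linalg.solve(T + 1e-8 * np.eye(k), self.C[1:])
        m = min(len(self.hist), k)
        return self.mu + sum(a[i] * self.hist[-(i + 1)] for i in range(m))

    def update(self, x):
        xhat = self._predict()               # predict before seeing x
        r, k = self.r, self.k
        self.mu = (1 - r) * self.mu + r * x  # discounted mean
        xc = x - self.mu
        for j in range(min(len(self.hist), k) + 1):
            past = xc if j == 0 else self.hist[-j]
            self.C[j] = (1 - r) * self.C[j] + r * xc * past
        self.sigma2 = (1 - r) * self.sigma2 + r * (x - xhat) ** 2
        self.hist = (self.hist + [xc])[-k:]
        return xhat, self.sigma2
```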
1. At each time point, build a time series model (a probability density function based on the data up to that point) with the SDAR model.
2. Using the model built in step 1, compute the likelihood of the observed value at the next time point.
3. Take the logarithmic loss of this probability as the outlier score.
The outlier score is calculated as follows.
Score(x_t) = -\log P_{t-1}(x_t|x^{t-1})
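Under the AR model's Gaussian noise assumption, the one-step-ahead predictive density is normal, so this log loss has a simple closed form. A minimal sketch (`log_loss` is a hypothetical helper, paired with the SDAR sketch above):

```python
import numpy as np

def log_loss(x, xhat, sigma2):
    # -log N(x; xhat, sigma2): outlier score of x under a Gaussian
    # one-step-ahead predictive density
    return 0.5 * (np.log(2 * np.pi * sigma2) + (x - xhat) ** 2 / sigma2)
```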
Next, set a width $W > 0$ and smooth the outlier scores within that window. Smoothing makes it possible to judge whether a state has persisted for some time (i.e., whether a change has occurred) rather than reacting to momentary outliers.
Expressed as a formula, if $y_t$ is the smoothed score at time t, it is computed as follows.
y_t = \frac{1}{W}\sum_{i=t-W+1}^{t}Score(x_i)
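As a sketch, this is just a trailing moving average of width W (`smooth` is a hypothetical helper name):

```python
import numpy as np

def smooth(scores, W):
    # Trailing moving average: element t is the mean of scores[t-W+1 .. t],
    # defined for t >= W-1
    return np.convolve(scores, np.ones(W) / W, mode='valid')
```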
1. Learn the smoothed scores with another SDAR model.
2. Using the model built in step 1, compute the probability (discounted by the forgetting coefficient) of the observed value at the next time point.
3. Take the logarithmic loss of this probability as the change point score.

Chaining the earlier sketches gives a rough picture of the whole pipeline, as shown below.
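A minimal, illustrative chaining of the sketches above, assuming a 1-D array `data` (the changefinder package used next does all of this internally):

```python
import numpy as np

first, second = SDAR1D(r=0.01, k=1), SDAR1D(r=0.01, k=1)

# Stage 1: outlier scores from the first SDAR model
outlier_scores = []
for x in data:
    xhat, s2 = first.update(x)
    outlier_scores.append(log_loss(x, xhat, s2))

# Smooth, then Stage 2: change point scores from the second SDAR model
change_scores = []
for y in smooth(np.array(outlier_scores), W=7):
    yhat, s2 = second.update(y)
    change_scores.append(log_loss(y, yhat, s2))
```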
I quote the sample code from here as-is (with only the formatting repaired). It generates random numbers following four normal distributions with different means and watches how the change score responds.
```python
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import changefinder
import numpy as np

# Random numbers drawn from normal distributions with different means
data = np.concatenate([np.random.normal(0.7, 0.05, 300),
                       np.random.normal(1.5, 0.05, 300),
                       np.random.normal(0.6, 0.05, 300),
                       np.random.normal(1.3, 0.05, 300)])

cf = changefinder.ChangeFinder(r=0.01, order=1, smooth=7)

# Feed the points one by one and collect the change scores
ret = []
for i in data:
    score = cf.update(i)
    ret.append(score)

# Plot the scores (left axis) and the original data (right axis, red)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(ret)
ax2 = ax.twinx()
ax2.plot(data, 'r')
plt.show()
```
The result is as follows. The red line is the original data and the blue line is the change point score.
In the sample code above, the parameters were set as `changefinder.ChangeFinder(r=0.01, order=1, smooth=7)`. A brief explanation of each parameter:
- `r`: the forgetting (discounting) parameter. It controls how strongly past data is discounted when computing the probability density function. As the SDAR formula shows, making this value smaller increases the influence of past data and makes the change scores fluctuate more.
- `order`: the order of the AR model, i.e., how far back past values enter the model. I don't yet have a good answer for how to choose this parameter. (Choosing the order with AIC or BIC is not very meaningful for non-stationary data, and if you preprocess the data into a stationary form before estimating the order, then the series this model is meant for is no longer the non-stationary one...)
- `smooth`: the width of the smoothing window. The longer it is, the better you capture genuine changes rather than outliers, but making it too long makes the changes themselves hard to capture.

A quick way to get a feel for these parameters is sketched below.
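This illustrative loop reruns ChangeFinder on the sample `data` from earlier with a few forgetting parameters (the values are arbitrary; `order` and `smooth` are held fixed):

```python
# Eyeball sensitivity to the forgetting parameter r
for r in (0.005, 0.01, 0.05):
    cf = changefinder.ChangeFinder(r=r, order=1, smooth=7)
    scores = [cf.update(x) for x in data]
    plt.plot(scores, alpha=0.6, label='r={}'.format(r))
plt.legend()
plt.show()
```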
Above, we created sample data and extracted its change points. Next, I tried to see whether a company's turning points can be visualized from its stock price. Below, I look at the stock price of a certain company whose main business is data analysis and providing analytics systems.
First, import the required libraries.
```python
# -*- coding: utf-8 -*-
## Libraries used this time, plus notebook magic

# Needed for changefinder
import changefinder
# If necessary: pip install changefinder
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from statsmodels.tsa import ar_model

# Needed for scraping
import urllib2
import requests
import lxml
from bs4 import BeautifulSoup
import time
```
Next, get the stock price data. This time, I got the data from this site.
```python
## Data acquisition
dates = []
stock = []
for i in xrange(1, 7):
    # Get the company's stock prices for 2011-2016
    url = urllib2.urlopen('http://k-db.com/stocks/3655-T/1d/201' + str(i))
    # Be polite to the server
    time.sleep(5)
    soup = BeautifulSoup(url)
    # Number of trading dates on the page
    n_dates = len(soup.find_all("td", {"class": "b"}))
    for j in range(n_dates):
        date = str(soup.find_all("td", {"class": "l"})[j].text)
        stock_day = float(soup.find_all("td", {"class": "b"})[j].text)
        dates.append(date)
        stock.append(stock_day)

stock_pd = pd.Series(stock, index=dates).sort_index()
print stock_pd.head(10)
```
Let's visualize the acquired data.
```python
## Visualization
plt.style.use('ggplot')
stock_pd.plot(figsize=(12, 4), alpha=0.5, color='b')
plt.title('stockprice', size=14)
```
This company's stock price was high from 2011 to 2012 and fell from there. Since then it appears to have moved steadily (albeit drifting slightly downward).
Now, let's finally calculate the change score with changefinder.
```python
## ChangeFinder
cf = changefinder.ChangeFinder(r=0.01, order=3, smooth=14)
change = []
for i in stock_pd:
    score = cf.update(i)
    change.append(score)
# Align the scores with the sorted date index of stock_pd
change_pd = pd.Series(change, index=stock_pd.index)

# Overlay the change score (red) on the stock price (blue)
stock_price = stock_pd.plot(figsize=(12, 4), alpha=0.5, color='b', x_compat=True)
ax2 = stock_price.twinx()
ax2.plot(change_pd, alpha=0.5, color='r')
plt.title('stockprice(blue) & changescore(red)', size=14)
```
The result is as follows.
There is a run of changes from 2011 to 2012, and another change around January 2014. In fact, this company was listed on the First Section of the Tokyo Stock Exchange in September 2011, which explains why the stock price was unstable at first. Around January 2014, it was establishing a joint venture with another company, so it was arguably an aggressive period in terms of management. After that, the change score stays low; basically, it seems to be a stable company that has not been rattled by drastic changes.
1. When you actually run this algorithm, you can see that the detected change points shift as soon as you change the parameters. I couldn't find a principled way to set them, so this time I chose the parameters exploratorily while checking the results. I would appreciate any advice on how to tune them well.
2. Can we really claim this handles non-stationary data? The forgetting parameter certainly shrinks the influence of the past, but I am not yet convinced why that alone makes non-stationarity "okay".
Also, this time we extracted change points based on the AR model, but changefinder also offers a way to compute the change score based on an ARIMA model. I haven't studied that theory yet, so I may write about it in the future.
- For change point extraction from time series, there is an algorithm called ChangeFinder that uses the SDAR model, combining online processing with a forgetting effect.
- It is cheaper to compute than classical change point extraction logic and can (apparently) handle non-stationary data.
- There is room for improvement in parameter tuning and the like (please do tell me how).