Time series analysis #6: Spurious regression and cointegration
1. Overview
- Continuing from Part 5, I am working through "Econometric Time Series Analysis of Economic and Financial Data".
- This article covers spurious regression and cointegration from Chapter 6.
2. Spurious regression
Definition
The phenomenon in which a regression $y_t = \alpha + \beta x_t + \epsilon_t$ of two unrelated unit root processes $x_t$ and $y_t$ appears to show a significant relationship between them is called spurious regression.
Verification
- Generate two independent processes
$\qquad x_t = x_{t-1} + \epsilon_{x,t}, \quad \epsilon_{x,t} \sim iid(0, \sigma_x^2)$
$\qquad y_t = y_{t-1} + \epsilon_{y,t}, \quad \epsilon_{y,t} \sim iid(0, \sigma_y^2)$
and regress them on the model
$\qquad y_t = \alpha + \beta x_t + \epsilon_t.$
import numpy as np

# Data generation: two independent random walks (unit root processes)
sigma_x, sigma_y = 1, 2
T = 10000
xt = np.cumsum(np.random.randn(T) * sigma_x).reshape(-1, 1)
yt = np.cumsum(np.random.randn(T) * sigma_y).reshape(-1, 1)
- When plotted, the two series wander around like independent random walks; a minimal plotting sketch is shown below.
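A minimal sketch for reproducing the plot (the original article does not show its plotting code, so the use of matplotlib here is an assumption):

```python
import matplotlib.pyplot as plt

# Plot the two simulated random walks on a single axis
plt.plot(xt, label='x_t')
plt.plot(yt, label='y_t')
plt.xlabel('t')
plt.legend()
plt.show()
```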
- First, regression was performed with scikit-learn.
from sklearn.linear_model import LinearRegression

# OLS fit of y_t on x_t (fit_intercept=True by default, so a constant term is included)
reg = LinearRegression().fit(xt, yt)
print('R-squared : ', reg.score(xt, yt))
print('coef : ', reg.coef_, 'intercept', reg.intercept_)
R-squared : 0.4794854506874714
coef : [[-0.62353254]] intercept [-24.27600549]
- The coefficient of determination ($R^2$) was 0.479, which is reasonably high even though the two series were generated independently. The fitted regression was $\alpha = -24.28, \quad \beta = -0.6235$.
- To check whether $x_t$ and $y_t$ are related, we want to test $H_0: \beta = 0$. However, scikit-learn does not seem to provide that functionality.
- Searching for other libraries, statsmodels looked suitable, so I ran the regression again with statsmodels.
import statsmodels.api as sm

# OLS with a constant term; add_constant appends a column of ones (after x, since prepend=False)
reg = sm.OLS(yt, sm.add_constant(xt, prepend=False)).fit()
reg.summary()
|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.479 |
| Model: | OLS | Adj. R-squared: | 0.479 |
| Method: | Least Squares | F-statistic: | 9210. |
| Date: | Tue, 07 Jan 2020 | Prob (F-statistic): | 0.00 |
| Time: | 22:36:57 | Log-Likelihood: | -51058. |
| No. Observations: | 10000 | AIC: | 1.021e+05 |
| Df Residuals: | 9998 | BIC: | 1.021e+05 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -24.2760 | 0.930 | -26.113 | 0.000 | -26.098 | -22.454 |
| x1 | -0.6235 | 0.006 | -95.968 | 0.000 | -0.636 | -0.611 |
- add_constant controls whether a constant term ($\alpha$ in the regression equation) is included in the model: applying add_constant adds the constant term to the design matrix. In scikit-learn, setting the argument fit_intercept to False gives a regression without a constant term; it was not specified above because fit_intercept=True is the default.
- As with scikit-learn, the coefficient of determination was 0.479 with $\alpha = -24.28, \quad \beta = -0.6235$, confirming that an equivalent regression was obtained.
- The nice thing about statsmodels is that it also reports 95% confidence intervals. The interval for $\beta$ is $[-0.636, -0.611]$, which does not contain 0, so $H_0: \beta = 0$ is rejected at the 5% significance level even though the two series are unrelated. This is a spurious regression. (These quantities can also be extracted programmatically, as in the sketch below.)
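A minimal sketch for pulling the estimates, p-values, and confidence intervals out of the fitted results object (these are standard attributes of statsmodels OLS results):

```python
# Point estimates, p-values for H0: coefficient = 0, and 95% confidence intervals
print(reg.params)      # coefficient estimates (order follows the columns of the design matrix)
print(reg.pvalues)     # p-values of the t-tests for each coefficient
print(reg.conf_int())  # 95% confidence intervals, one row per coefficient
```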
How to avoid
Include lag variables in the model
- Change the regression model as follows:
$\qquad y_t = \alpha + \beta_1 x_t + \beta_2 y_{t-1} + \epsilon_t$
That is, the lag $y_{t-1}$ is added to the explanatory variables for $y_t$.
Regressing with statsmodels gives the following. sm.OLS takes the explained variable and the explanatory variables as arguments, and the explanatory variables need to be passed together as a single array, as shown below.
# Align y_t with its lag y_{t-1} and stack the explanatory variables [x_t, y_{t-1}]
x_t, y_t, y_t_1 = xt[1:], yt[1:], yt[:-1]
X = np.column_stack((x_t, y_t_1))
reg = sm.OLS(y_t, sm.add_constant(X)).fit()
reg.summary()
|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.999 |
| Model: | OLS | Adj. R-squared: | 0.999 |
| Method: | Least Squares | F-statistic: | 3.712e+06 |
| Date: | Thu, 09 Jan 2020 | Prob (F-statistic): | 0.00 |
| Time: | 22:12:59 | Log-Likelihood: | -21261. |
| No. Observations: | 9999 | AIC: | 4.253e+04 |
| Df Residuals: | 9996 | BIC: | 4.255e+04 |
| Df Model: | 2 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.0815 | 0.049 | -1.668 | 0.095 | -0.177 | 0.014 |
| x1 | -0.0004 | 0.000 | -0.876 | 0.381 | -0.001 | 0.000 |
| x2 | 0.9989 | 0.001 | 1964.916 | 0.000 | 0.998 | 1.000 |
- With this model, the result was $\alpha = -0.0815, \quad \beta_1 = -0.0004, \quad \beta_2 = 0.9989$. $\alpha$ and $\beta_1$ are almost 0, so most of $y_t$ is explained by $y_{t-1}$, and the coefficient of determination is 0.999, essentially 1. Also note that $H_0: \beta_1 = 0$ is no longer rejected (p-value 0.381).
Regression after differencing the unit root processes to make them stationary
- Change the regression model as follows.
$\qquad \Delta y_t = \alpha + \beta \Delta x_t + \epsilon_t$
# First differences of the two random walks; the differenced series are stationary
x_t, y_t = np.diff(xt.flatten()).reshape(-1, 1), np.diff(yt.flatten()).reshape(-1, 1)
reg = sm.OLS(y_t, sm.add_constant(x_t)).fit()
reg.summary()
|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.000 |
| Model: | OLS | Adj. R-squared: | 0.000 |
| Method: | Least Squares | F-statistic: | 3.297 |
| Date: | Thu, 09 Jan 2020 | Prob (F-statistic): | 0.0694 |
| Time: | 22:33:26 | Log-Likelihood: | -21262. |
| No. Observations: | 9999 | AIC: | 4.253e+04 |
| Df Residuals: | 9997 | BIC: | 4.254e+04 |
| Df Model: | 1 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.0138 | 0.020 | -0.681 | 0.496 | -0.054 |  |
| x1 | -0.0374 | 0.021 | -1.816 | 0.069 | -0.078 |  |
- In this case, the coefficient of determination is 0.000 and $\beta = -0.0374$, which is almost 0. $H_0: \beta = 0$ cannot be rejected either (p-value 0.069), leading to the conclusion that there is no significant relationship between $\Delta x_t$ and $\Delta y_t$. A unit root test, as sketched below, confirms that the levels behave like unit root processes while the differences are stationary.
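As a supplementary check (not part of the original text), the ADF test in statsmodels can be used to verify this; the null hypothesis of adfuller is that the series has a unit root:

```python
from statsmodels.tsa.stattools import adfuller

# Expect a large p-value for the level of x_t (unit root not rejected)
# and a tiny p-value for its first difference (stationary)
for name, series in [('x_t', xt.flatten()), ('diff x_t', np.diff(xt.flatten()))]:
    stat, pvalue = adfuller(series)[:2]
    print(name, 'ADF statistic:', round(stat, 3), 'p-value:', round(pvalue, 3))
```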
3. Cointegration
Definition
- Let $x_t$ and $y_t$ be unit root processes ($\rm I(1)$). If there exist constants $a$ and $b$ such that $a x_t + b y_t \sim \rm I(0)$, i.e. the linear combination is a stationary process, then $x_t$ and $y_t$ are said to be in a cointegration relationship, and $(a, b)'$ is called a cointegration vector.
- More generally, for $\boldsymbol{y}_t \sim \rm I(1)$, if there exists $\boldsymbol{a}$ such that $\boldsymbol{a}' \boldsymbol{y}_t \sim \rm I(0)$, then $\boldsymbol{y}_t$ has a cointegration relationship, and $\boldsymbol{a}$ is called a cointegration vector.
- For example, let $u_{1t}, u_{2t}$ be independent stationary processes and $w_{1t}, w_{2t}$ independent unit root processes, and consider
$\qquad \left\{ \begin{array}{l} x_t = \alpha w_{1t} + u_{1t} \\ y_t = \beta w_{1t} + u_{2t} \end{array} \right.$
Both $x_t$ and $y_t$ are then $\rm I(1)$ processes, but
$\qquad x_t - \frac{\alpha}{\beta} y_t = u_{1t} - \frac{\alpha}{\beta} u_{2t} \sim \rm I(0),$
so there is a cointegration relationship between $x_t$ and $y_t$, and the cointegration vector is $(1, -\frac{\alpha}{\beta})'$. A small simulation of this example is sketched below.
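A minimal simulation sketch of this example (the parameter values $\alpha = 0.5$, $\beta = 2$ are arbitrary assumptions), checking with adfuller that the linear combination is stationary:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Common stochastic trend w_{1t} plus independent stationary noises u_{1t}, u_{2t}
T = 10000
alpha, beta = 0.5, 2.0
w1 = np.cumsum(np.random.randn(T))      # unit root process
x = alpha * w1 + np.random.randn(T)     # I(1)
y = beta * w1 + np.random.randn(T)      # I(1)

# x_t - (alpha/beta) * y_t should be I(0): expect a very small ADF p-value
z = x - (alpha / beta) * y
print('ADF p-value of z_t:', adfuller(z)[1])
```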
Implication
- When $x_t$ and $y_t$ are unit root processes, the errors of long-horizon forecasts of $x_t$ and $y_t$ become large.
- However, if there is a cointegration relationship between $x_t$ and $y_t$, then there exists $a$ such that $z_t = y_t - a x_t$ is a stationary process, and $z_t$ can be forecast with a certain degree of accuracy even over long horizons. A small simulation illustrating this is sketched below.
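A supplementary simulation sketch (not part of the original text) comparing the h-step-ahead forecast error of the random walk $y_t$ (forecast by its last observed value) with that of the stationary combination $z_t$ (forecast by its sample mean); all parameter values are arbitrary assumptions:

```python
import numpy as np

# For a random walk the forecast error variance grows with the horizon h,
# while the stationary combination z_t keeps a bounded error variance.
T, h, n_sims = 1000, 100, 2000
errs_y, errs_z = [], []
for _ in range(n_sims):
    w = np.cumsum(np.random.randn(T + h))       # common trend
    x = 0.5 * w + np.random.randn(T + h)        # I(1)
    y = 2.0 * w + np.random.randn(T + h)        # I(1)
    z = y - 4.0 * x                             # stationary: a = beta/alpha = 4
    errs_y.append(y[-1] - y[T - 1])             # naive h-step forecast error for y_t
    errs_z.append(z[-1] - np.mean(z[:T]))       # forecast z_t by its sample mean
print('h-step forecast error variance of y_t:', np.var(errs_y))
print('h-step forecast error variance of z_t:', np.var(errs_z))
```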
Granger Representation theorem
- A VAR model that includes a cointegration relationship can be expressed as a vector error correction model (VECM).
- For a cointegrated system $\boldsymbol{y}_t$ with a VAR(p) representation,
$\qquad \begin{align} \Delta \boldsymbol{y}_t &= \zeta_1 \Delta \boldsymbol{y}_{t-1} + \zeta_2 \Delta \boldsymbol{y}_{t-2} + \cdots + \zeta_{p-1} \Delta \boldsymbol{y}_{t-p+1} + \boldsymbol{\alpha} + \zeta_0 \boldsymbol{y}_{t-1} + \boldsymbol{\epsilon}_t \\ &= \zeta_1 \Delta \boldsymbol{y}_{t-1} + \zeta_2 \Delta \boldsymbol{y}_{t-2} + \cdots + \zeta_{p-1} \Delta \boldsymbol{y}_{t-p+1} + \boldsymbol{\alpha} - \boldsymbol{B} \boldsymbol{A}' \boldsymbol{y}_{t-1} + \boldsymbol{\epsilon}_t \end{align}$
holds, i.e. the system can be expressed as a VECM(p-1).
- $-\boldsymbol{B} \boldsymbol{A}' \boldsymbol{y}_{t-1}$ is called the error correction term. Here $\boldsymbol{A}$ contains the cointegration vectors, and the error correction term expresses the force that pulls the system back toward equilibrium when the deviation from equilibrium grows large. A minimal VECM fitting sketch with statsmodels follows.
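As a supplementary sketch (not from the original text), statsmodels provides a VECM class; a minimal fit on a simulated cointegrated pair might look like this (k_ar_diff=1 and coint_rank=1 are illustrative assumptions):

```python
import numpy as np
from statsmodels.tsa.vector_ar.vecm import VECM

# Simulated cointegrated pair: both series load on the same random-walk trend
T = 10000
w1 = np.cumsum(np.random.randn(T))
x = 0.5 * w1 + np.random.randn(T)
y = 2.0 * w1 + np.random.randn(T)

# VECM with one lagged difference and cointegration rank 1
model = VECM(np.column_stack((x, y)), k_ar_diff=1, coint_rank=1)
res = model.fit()
print(res.alpha)  # adjustment (loading) coefficients of the error correction term
print(res.beta)   # estimated cointegration vector
```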