This is an article trying out the time-varying coefficient regression model, one of the variations of the state-space model. Since R's dlm package makes this easy, I call dlm from Python through rpy2. For how to use the dlm time-varying coefficient model in R, Logics of Blue's "[Time-varying coefficient model by dlm](http://logics-of-blue.com/dlm%E3%81%AB%E3%82%88%E3%82%8B%E6%99%82%E5%A4%89%E4%BF%82%E6%95%B0%E3%83%A2%E3%83%87%E3%83%AB/)" was a great reference. In this article, I extend that model from one explanatory variable to two and run it from Python through rpy2.
As the name implies, it is a model in which the regression coefficient changes with time.
y_t = a_t x_t + b_t + v_t \quad \cdots (1)
- $y_t$: dependent variable at time $t$
- $x_t$: explanatory variable at time $t$
- $a_t$: regression coefficient at time $t$
- $b_t$: intercept at time $t$
- $v_t$: observation error at time $t$
As shown above, the regression coefficient changes with time. At each time step,
a_t = a_{t-1} + e_t \quad \cdots (2)
That is, the error $e_t$ is added to the previous period's regression coefficient to give the regression coefficient at time $t$. In state-space terms, (1) is the observation model and (2) is the system model.
This time I would like to try the case where there are two explanatory variables, so the series of equations is as follows.
\begin{aligned}
y_t &= a^{(1)}_t x^{(1)}_t + a^{(2)}_t x^{(2)}_t +b_t + v_t \\
a^{(1)}_t &= a^{(1)}_{t-1} + e^{(1)}_t \\
a^{(2)}_t &= a^{(2)}_{t-1} + e^{(2)}_t \\
b_t &= b_{t-1} + w_t \\
\\
v_t &\sim N(0, \sigma_{v})\\
e^{(1)}_t &\sim N(0, \sigma_{e}^{(1)}) \\
e^{(2)}_t &\sim N(0, \sigma_{e}^{(2)}) \\
w_t &\sim N(0, \sigma_{w})\\
\end{aligned}
There is one observation model and three system models. The latent variables are $a^{(1)}_t, a^{(2)}_t, b_t$, three at each time point $t$, so $3T$ in total over $T$ time points.
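For reference, the same model can be written compactly in the matrix form that dlm works with. This rewriting is my own addition; it follows the state ordering that `dlmModReg` uses, with the intercept first and then the two regression coefficients:

\begin{aligned}
y_t &= F_t \theta_t + v_t, \qquad F_t = \begin{pmatrix} 1 & x^{(1)}_t & x^{(2)}_t \end{pmatrix}, \qquad \theta_t = \begin{pmatrix} b_t \\ a^{(1)}_t \\ a^{(2)}_t \end{pmatrix} \\
\theta_t &= G \theta_{t-1} + \omega_t, \qquad G = I_3, \qquad \omega_t \sim N\left(0,\ \mathrm{diag}(\sigma_w, \sigma_e^{(1)}, \sigma_e^{(2)})\right)
\end{aligned}

Here $F_t$ is the time-varying observation matrix built from the explanatory variables, and $G$ is the identity because every latent variable follows a random walk; `dV` corresponds to the noise of $v_t$ and `dW` to the diagonal of the system noise covariance.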
Following the equations above, set the parameters and generate random numbers to create artificial data. The parameter settings for this artificial data follow [1].
# Imports used throughout this article
import numpy as np
import numpy.random as rd
import pandas as pd
import matplotlib.pyplot as plt

rd.seed(71)
T = 500
v_sd = 20   # Standard deviation of observation error
i_sd = 10   # Standard deviation of intercept variation
a10 = 10
e_sd1 = .5  # Standard deviation of variation of regression coefficient 1
a20 = 20
e_sd2 = .8  # Standard deviation of variation of regression coefficient 2

# Two time-varying regression coefficients (random walks)
e1 = np.random.normal(0, e_sd1, size=T)
a1 = e1.cumsum() + a10
e2 = np.random.normal(0, e_sd2, size=T)
a2 = e2.cumsum() + a20

# Intercept (random walk)
intercept = np.cumsum(np.random.normal(0, i_sd, size=T))

# Explanatory variables
x1 = rd.normal(10, 10, size=T)
x2 = rd.normal(10, 10, size=T)

# Dependent variable
v = np.random.normal(0, v_sd, size=T)  # Observation error
y = intercept + a1*x1 + a2*x2 + v
y_noerror = intercept + a1*x1 + a2*x2  # y without observation error
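The code later in this article refers to a DataFrame `df` (e.g. `df[["x1", "x2"]]`, `df.a1`, `df.intercept`) whose construction is not shown in the article; presumably the notebook collects the generated series into one. A minimal sketch, assuming column names matching the later accesses:

python
# Collect the generated series into one DataFrame for later use
df = pd.DataFrame({
    "y": y,
    "x1": x1,
    "x2": x2,
    "a1": a1,
    "a2": a2,
    "intercept": intercept,
})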
The resulting graphs are below. First, plots of the three latent variables we want to estimate.
Graph of regression coefficient 1
Graph of regression coefficient 2
Intercept graph
Explanatory variables are data that follow two independent normal distributions.
Explanatory variables $x^{(1)}_t, x^{(2)}_t$
Result $y$
To the human eye, the resulting graph of $y$ looks like nothing more than random numbers with no structure. Let's see whether the latent variables can be estimated well from this data.
Now for the main part. Let's call R's dlm from Python through rpy2 and estimate the time-varying coefficient regression model.
Import rpy2.
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
from rpy2.rinterface import R_VERSION_BUILD
from rpy2.robjects import pandas2ri
pandas2ri.activate()
Import the R library, dlm.
dlm = importr("dlm")
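If the dlm package is not installed in your R environment, `importr("dlm")` will fail. As a one-off setup step (not part of the original article), it can be installed from R with `install.packages("dlm")`, or from Python through rpy2 like this:

python
from rpy2.robjects.packages import importr
from rpy2.robjects.vectors import StrVector

# One-time setup: install dlm from CRAN into the local R library
utils = importr("utils")
utils.chooseCRANmirror(ind=1)             # pick the first CRAN mirror
utils.install_packages(StrVector(["dlm"]))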
Here we specify the type of state-space model. Since this is a regression model with explanatory variables, we use `dlmModReg`. `X` is the matrix of explanatory variables; this time there are two, `x1` and `x2`, so we pass that data. `dV` is the error variance of the observation model, corresponding to $\sigma_v$ of $v_t$. `dW` is the error variance of the system models; there are three system models this time, for $e^{(1)}_t, e^{(2)}_t, w_t$, corresponding to $\sigma_{e}^{(1)}, \sigma_{e}^{(2)}, \sigma_w$ respectively.
python
buildDlmReg = lambda theta: dlm.dlmModReg(
X=df[["x1", "x2"]],
dV=np.exp(theta[0]),
dW=[np.exp(theta[1]), np.exp(theta[2]), np.exp(theta[3])]
)
Now that the model type has been determined, set the observed values `y` and estimate the parameters by maximum likelihood. Since the estimation can go wrong depending on the initial values, `dlmMLE` is run twice: the result of the first `dlmMLE` is used as the initial value for the second `dlmMLE`, which improves stability.
python
#Maximum likelihood estimation in two steps
%time parm = dlm.dlmMLE(y, parm=[2, 1, 1, 1], build=buildDlmReg, method="L-BFGS-B")
par = np.array(get_method(parm, "par"))
print("par:", par)
%time fitDlmReg = dlm.dlmMLE(y, parm=par, build=buildDlmReg, method="SANN")
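The code above, and the calls that follow, use two small helpers, `get_method` and `show_result`, whose definitions are not shown in this article (they are in the notebook linked in [2]). A minimal sketch of what they could look like, assuming `get_method` simply extracts a named element from an R list via rpy2's `rx2` accessor and `show_result` prints every element:

python
def get_method(r_obj, name):
    # Extract the element with the given name (e.g. "par", "a", "s") from an R named list
    return r_obj.rx2(name)

def show_result(r_obj):
    # Print every named element of an R result list
    for key in r_obj.names:
        print(key, r_obj.rx2(key))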
Let's display the result.
python
show_result(fitDlmReg)
The `par` below corresponds to the estimated parameters for the four standard deviations $\sigma_v, \sigma_w, \sigma_{e}^{(1)}, \sigma_{e}^{(2)}$. When setting up `dlmModReg` we wrapped each parameter in `np.exp` to keep the variances from becoming negative, so applying `np.exp` to the result and then `np.sqrt` gives the standard deviations.
out
$par
[1]  5.9998846  4.5724213 -1.6572242 -0.9232988

$value
[1] 2017.984

$counts
function gradient 
   10000       NA 

$convergence
[1] 0

$message
NULL
Converting these back, you can see that the parameters used to generate the artificial data are reproduced fairly well.
python
par = np.asanyarray(get_method(fitDlmReg, "par"))
modDlmReg = buildDlmReg(par)
estimated_sd = np.sqrt(np.exp(np.array(get_method(fitDlmReg, "par"))))
pd.DataFrame({"estimated_sd":estimated_sd, "sd":[v_sd, i_sd, e_sd1, e_sd2]})
|   | estimated_sd | sd   |
|---|--------------|------|
| 0 | 20.084378    | 20.0 |
| 1 | 9.837589     | 10.0 |
| 2 | 0.436655     | 0.5  |
| 3 | 0.630243     | 0.8  |
Now, using this result, compute the values filtered by the Kalman filter, and then the smoothed values from those filtered values.
python
#Kalman filter
filterDlmReg = dlm.dlmFilter(y, modDlmReg)
#Smoothing
smoothDlmReg = dlm.dlmSmooth(filterDlmReg)
The obtained filtering and smoothing results are brought back to the Python side and stored as `ndarray`.
python
#Get filtered values
filtered_a = np.array(get_method(filterDlmReg, "a"))
#Get smoothed value
smoothed_a = np.array(get_method(smoothDlmReg, "s"))
Below are graphs of the intercept `intercept` and the two regression coefficients $a^{(1)}_t$ and $a^{(2)}_t$. The original values from the artificial data are shown in blue, the filtered values in green, and the smoothed values in red; I think the movements of the original data are reproduced quite well. dlm is given only the three series `x1, x2, y` and the type of the state-space model, yet the intercept and regression coefficients are recovered properly! (Only near the initial values do the estimates swing wildly.)
python
plt.figure(figsize=(12,5))
plt.plot(df.intercept, label="intercept(original)")
plt.plot(filtered_a[1:,0], label="intercept(filtered)")
plt.plot(smoothed_a[1:,0], label="intercept(smoothed)")
plt.legend(loc="best")
plt.title("plot of intercept")
python
plt.figure(figsize=(12,5))
plt.plot(df.a1, label="a1(original)")
plt.plot(filtered_a[1:,1], label="a1(filtered)")
plt.plot(smoothed_a[1:,1], label="a1(smoothed)")
plt.legend(loc="best")
plt.title("plot of regression coeff a1")
python
plt.figure(figsize=(12,5))
plt.plot(df.a2, label="a2(original)")
plt.plot(filtered_a[1:,2], label="a2(filtered)")
plt.plot(smoothed_a[1:,2], label="a2(smoothed)")
plt.legend(loc="best")
plt.title("plot of regression coeff a2")
Finally, let's compare the $y$ values without observation error against the $y$ values reconstructed from the smoothed latent variables. You can see that, here as well, the true $y$ values with the observation error removed are reproduced quite well!
python
estimatedLevel = smoothed_a[1:,0] + df.x1*smoothed_a[1:,1] + df.x2*smoothed_a[1:,2]
python
plt.figure(figsize=(12,5))
plt.plot(y_noerror, lw=4, label="y(original)")
plt.plot(estimatedLevel, "r", lw=1, label="y(estimated) [no observation error]")
plt.legend(loc="best")
plt.title("plot of target value.")
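As a quick quantitative check (my addition, not in the original article), you can compare the reconstruction against the error-free $y$ with, for example, the RMSE:

python
# RMSE between the error-free y and the reconstruction from the smoothed states;
# if the latent variables are recovered well, this should be well below v_sd (= 20)
rmse = np.sqrt(np.mean((np.asarray(y_noerror) - np.asarray(estimatedLevel)) ** 2))
print("RMSE:", rmse)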
[1] Logics of Blue, "Time-varying coefficient model by dlm" http://logics-of-blue.com/dlm%E3%81%AB%E3%82%88%E3%82%8B%E6%99%82%E5%A4%89%E4%BF%82%E6%95%B0%E3%83%A2%E3%83%87%E3%83%AB/
[2] The code for this article (GitHub) https://github.com/matsuken92/Qiita_Contents/blob/master/General/dlm_time_varying_regression.ipynb
[3] A version of this code generalized to n explanatory variables (GitHub) https://github.com/matsuken92/Qiita_Contents/blob/master/General/dlm_time_varying_regression_n-var.ipynb