TL;DR If you want to perform a survival time analysis where the covariates are time-varying, you can use a technique called discrete-time logistic regression. We will introduce discrete-time logistic regression and implement it with stan.

At the beginning

Consider a situation where you want to estimate the factors for graduating without repeating a grade from the transition of university grades. The grade of student $ i $ and grade $ t $ is $ X_ {i, t} $, and the grade t is $ Y_ {i, t} $. If you think of a state in which you continue to succeed in promotion as "survival," this is captured within the framework of survival time analysis.

It is expected that the magnitude of the factors that $ X_ {i, t} $ has on promotion varies greatly depending on the grade. For example, it is difficult for 1st and 2nd graders to repeat a year because they can recover later even if they do not get some credits, and 3rd to 4th graders must meet the graduation requirements, so it is easy to lead to a repeat year. I will.

The first thing that comes to mind when thinking about the framework of whether you can graduate without repeating a year is the proportional hazard model used in pharmacy. However, the proportional hazard model assumes that the covariates do not change over time, so if $ X_ {i, t} $ changes over time, you need to scrutinize it.

Therefore, we use a discrete-time logistic regression model.

Discrete-time logistic regression model

In the survival time analysis, the survival function $ S (t) $ (probability of surviving for t periods or longer) and the hazard function $ h (t) $ (probability of surviving to time point t and dying at time point t) are defined as follows. I will. $ S(t) = P(T>t) = \int_{x}^{\infty} f(t)dt \\\ h(t) = \lim_{\Delta \rightarrow \infty} \frac{P(t \leq T < t+\Delta t | T \geq t)}{\Delta t} \\\ $

In the discrete-time logistic regression model, the hazard function is expressed as follows.

h(x,t) = \frac{1}{1 + \exp(-z(x,t))} \\\ z(x,t) = \alpha + \beta_t * X_{t} \\\

This is the same as regular logistic regression. $ \ Beta_t $ represents the difference in the effectiveness of the covariates over time. Change the form of the z (x, t) function to suit your situation.

Implementation example of stan

Suppose the user keeps choosing one of multiple choices at any given time. Estimate which option will prevent withdrawal the most and how the effect of the option will change over time.

model


stan_code = '''
//reference: https://www.slideshare.net/akira_11/dt-logistic-regression
data {
  int T_max;
  int ST;
  int C;
  int X_T[ST];
  matrix[ST, C] X_score;
  int Y[ST];
}

parameters {
  matrix[T_max, C-1] beta;
  real alpha; //Constant term
}

model {
  vector[ST] zeros = rep_vector(0.0, ST); //Vector to fix to 0 when estimating coefficient
  vector[C] ones = rep_vector(1.0, C); //Matrix for summing columns
  vector[ST] mu = alpha + (X_score .* append_col(zeros, beta[X_T, :])) * ones; //Withdrawal probability vector

  for (st in 1:ST) {
    target += bernoulli_logit_lpmf(Y[st] | mu[st]); //Binary classification model for predicting withdrawal
  }
}
'''

Virtual data generation

def logistic_func(array):
    """-∞~Input value of ∞ according to logistic function[0,1]Convert to and return"""
    return 1/(1+np.exp(-array))

#Data generation,Censor to the right
S = 10000
C = 3
alpha = -1 #Default exit rate before multiplying by logistic function, this is about 20%About
T_max = 6
beta = np.array([[0,i,j] for i, j in zip(list(range(-3, 3)), list(range(-3, 3))[::-1])]) * 0.2 #Increase columns to match R,Adjust the effectiveness of the coefficient with the last value
stan_dict = dict()
stan_dict["T_max"] = T_max
stan_dict["C"] = C
stan_dict["class"] = list()
stan_dict["rate"] = list()
stan_dict["X_T"] = list()
stan_dict["X_score"] = list() #Note that this stores an array inside
stan_dict["Y"] = list()
stan_dict["S"] = list() #For debugging

for s in range(S):
    idx = 0
    
    class_ = np.random.choice(list(range(C)), size=1)[0]
    x_score = np.zeros((C))
    x_score[score] = 1.0
    rate = logistic_func(alpha+beta[idx,score])
    
    stan_dict["class"].append(score)
    stan_dict["rate"].append(rate)
    stan_dict["X_T"].append(idx+1)
    stan_dict["X_score"].append(x_score)
    
    while True: #Survival judgment
        if int(np.random.binomial(n=1, p=rate, size=1)):
            y = 1
            stan_dict["Y"].append(y)
            stan_dict["S"].append(idx+1)
            break
        elif idx >= T_max-1: #Censored when it exceeds a certain level
            y = 0
            stan_dict["Y"].append(y)
            stan_dict["S"].append(idx+1)
            break
        y = 0
        stan_dict["Y"].append(y)
        idx += 1
        score = np.random.choice(list(range(R)), size=1)[0]
        x_score = np.zeros((R))
        x_score[score] = 1.0
        rate = logistic_func(alpha+beta[idx,score])
        
        stan_dict["class"].append(score)
        stan_dict["rate"].append(rate)
        stan_dict["X_T"].append(idx+1)
        stan_dict["X_score"].append(x_score)

The true value of beta was set like this. スクリーンショット 2020-05-08 0.42.30.png Choosing the middle option in the early stages will increase the survival rate, but in the final stages, choosing the right option will increase the survival rate than the middle option.

The distribution of survival time looks like this. スクリーンショット 2020-05-08 0.38.43.png

Estimate

#Post-processing of data given to stan
stan_dict["ST"] = np.sum(stan_dict["S"])
stan_dict["X_score"] = np.array(stan_dict["X_score"])

#Model and infer
sm = pystan.StanModel(model_code = stan_code)
fit = sm.sampling(stan_dict, iter=1000, chains=4)
print(fit)

Rhat does not exceed 1.1, so it seems to be converging.

Check the result

True value of beta スクリーンショット 2020-05-08 0.42.30.png

beta estimate スクリーンショット 2020-05-08 0.42.36.png

The values are close to each other. However, the estimation accuracy is worse in the latter half of the survival period.

reference

https://www.slideshare.net/akira_11/dt-logistic-regression

Implement a discrete-time logistic regression model with stan