TL;DR If you want to perform a survival time analysis where the covariates are time-varying, you can use a technique called discrete-time logistic regression. We will introduce discrete-time logistic regression and implement it with stan.
Consider a situation where you want to estimate the factors for graduating without repeating a grade from the transition of university grades. The grade of student $ i $ and grade $ t $ is $ X_ {i, t} $, and the grade t is $ Y_ {i, t} $. If you think of a state in which you continue to succeed in promotion as "survival," this is captured within the framework of survival time analysis.
It is expected that the magnitude of the factors that $ X_ {i, t} $ has on promotion varies greatly depending on the grade. For example, it is difficult for 1st and 2nd graders to repeat a year because they can recover later even if they do not get some credits, and 3rd to 4th graders must meet the graduation requirements, so it is easy to lead to a repeat year. I will.
The first thing that comes to mind when thinking about the framework of whether you can graduate without repeating a year is the proportional hazard model used in pharmacy. However, the proportional hazard model assumes that the covariates do not change over time, so if $ X_ {i, t} $ changes over time, you need to scrutinize it.
Therefore, we use a discrete-time logistic regression model.
In the survival time analysis, the survival function $ S (t) $ (probability of surviving for t periods or longer) and the hazard function $ h (t) $ (probability of surviving to time point t and dying at time point t) are defined as follows. I will.
In the discrete-time logistic regression model, the hazard function is expressed as follows.
This is the same as regular logistic regression. $ \ Beta_t $ represents the difference in the effectiveness of the covariates over time. Change the form of the z (x, t) function to suit your situation.
Suppose the user keeps choosing one of multiple choices at any given time. Estimate which option will prevent withdrawal the most and how the effect of the option will change over time.
stan_code = '''
//reference: https://www.slideshare.net/akira_11/dt-logistic-regression
data {
int T_max;
int ST;
int C;
int X_T[ST];
matrix[ST, C] X_score;
int Y[ST];
}
parameters {
matrix[T_max, C-1] beta;
real alpha; //Constant term
}
model {
vector[ST] zeros = rep_vector(0.0, ST); //Vector to fix to 0 when estimating coefficient
vector[C] ones = rep_vector(1.0, C); //Matrix for summing columns
vector[ST] mu = alpha + (X_score .* append_col(zeros, beta[X_T, :])) * ones; //Withdrawal probability vector
for (st in 1:ST) {
target += bernoulli_logit_lpmf(Y[st] | mu[st]); //Binary classification model for predicting withdrawal
}
}
'''
def logistic_func(array):
"""-∞~Input value of ∞ according to logistic function[0,1]Convert to and return"""
return 1/(1+np.exp(-array))
#Data generation,Censor to the right
S = 10000
C = 3
alpha = -1 #Default exit rate before multiplying by logistic function, this is about 20%About
T_max = 6
beta = np.array([[0,i,j] for i, j in zip(list(range(-3, 3)), list(range(-3, 3))[::-1])]) * 0.2 #Increase columns to match R,Adjust the effectiveness of the coefficient with the last value
stan_dict = dict()
stan_dict["T_max"] = T_max
stan_dict["C"] = C
stan_dict["class"] = list()
stan_dict["rate"] = list()
stan_dict["X_T"] = list()
stan_dict["X_score"] = list() #Note that this stores an array inside
stan_dict["Y"] = list()
stan_dict["S"] = list() #For debugging
for s in range(S):
idx = 0
class_ = np.random.choice(list(range(C)), size=1)[0]
x_score = np.zeros((C))
x_score[score] = 1.0
rate = logistic_func(alpha+beta[idx,score])
stan_dict["class"].append(score)
stan_dict["rate"].append(rate)
stan_dict["X_T"].append(idx+1)
stan_dict["X_score"].append(x_score)
while True: #Survival judgment
if int(np.random.binomial(n=1, p=rate, size=1)):
y = 1
stan_dict["Y"].append(y)
stan_dict["S"].append(idx+1)
break
elif idx >= T_max-1: #Censored when it exceeds a certain level
y = 0
stan_dict["Y"].append(y)
stan_dict["S"].append(idx+1)
break
y = 0
stan_dict["Y"].append(y)
idx += 1
score = np.random.choice(list(range(R)), size=1)[0]
x_score = np.zeros((R))
x_score[score] = 1.0
rate = logistic_func(alpha+beta[idx,score])
stan_dict["class"].append(score)
stan_dict["rate"].append(rate)
stan_dict["X_T"].append(idx+1)
stan_dict["X_score"].append(x_score)
The true value of beta was set like this. Choosing the middle option in the early stages will increase the survival rate, but in the final stages, choosing the right option will increase the survival rate than the middle option.
The distribution of survival time looks like this.
#Post-processing of data given to stan
stan_dict["ST"] = np.sum(stan_dict["S"])
stan_dict["X_score"] = np.array(stan_dict["X_score"])
#Model and infer
sm = pystan.StanModel(model_code = stan_code)
fit = sm.sampling(stan_dict, iter=1000, chains=4)
print(fit)
Rhat does not exceed 1.1, so it seems to be converging.
True value of beta
beta estimate
The values are close to each other. However, the estimation accuracy is worse in the latter half of the survival period.
https://www.slideshare.net/akira_11/dt-logistic-regression
Recommended Posts