Bayesian Survival Analysis with Data Augmentation
Motivation
When dealing with time-to-event data, right-censoring is a common occurrence. Although most are familiar with likelihood construction under right-censoring (and corresponding frequentist estimation), there’s very little available online about Bayesian approaches even for fully parametric models. Here I’ll briefly outline a Bayesian estimation procedure for a Weibull model with right-censoring. The estimation procedure is MCMC-based, using a data augmentation approach.
As with most of my posts, all MCMC is coded from scratch. It helps me and it helps readers understand the underlying algorithm – an intuition that is more difficult to get if you’re just specifying the model in Stan.
Model Set Up
Suppose we observe $i = 1, \dots, r$ survival times, $T^o_i$. Survival times past the end of our study (at time $\tau$) are censored for subjects $i = r+1, \dots, n$. We know that the survival times for these subjects are greater than $\tau$, but that is all. Say we also have some $p \times 1$ covariate vector, $x_i$. Finally, for each subject we have a censoring indicator, $\delta_{1:n}$: $\delta_i = 0$ if the event time is observed and $\delta_i = 1$ if subject $i$ is censored at $\tau$. A parametric approach follows by assuming a model for $T$; we choose the Weibull
$$T^o_i \sim \text{Weibull}(\alpha, \lambda_i)$$
We’ll consider the setting where we regress on a binary treatment indicator, with $\lambda_i = e^{-\mu_i \alpha}$ and $\mu_i = \beta_0 + \beta_1 A_i$, where $A_i = 1$ indicates treated and $A_i = 0$ indicates untreated/placebo. This is a funky reparameterization, but it yields intuitive interpretations for $\beta_1$ in terms of the Weibull’s hazard function, $h(t \mid \beta, x, \alpha) = \lambda_i \alpha t^{\alpha - 1}$. Substituting $\lambda_i$, we see the hazard for treated subjects is $h(t \mid A = 1) = e^{-(\beta_0 + \beta_1)\alpha}\,\alpha t^{\alpha - 1}$, and for untreated subjects it is $h(t \mid A = 0) = e^{-\beta_0 \alpha}\,\alpha t^{\alpha - 1}$. The hazard ratio is,
$$HR = \frac{h(t \mid A = 1)}{h(t \mid A = 0)} = e^{-\beta_1 \alpha}$$
From a Bayesian point of view, we are interested in the posterior $p(\beta, \alpha \mid T^o_{1:r}, \delta_{1:n}, \tau)$. Once we have this, we can get a whole posterior distribution for the survival function itself, as well as for any quantity derived from it: for example, a posterior mean and credible interval for $HR$ (just a function of $\beta_1$ and $\alpha$). We can also get posterior survival curve estimates for each treatment group. For the Weibull, the survival curve is given by $S(t \mid \beta, \alpha, A) = \exp(-\lambda t^\alpha)$, again just a function of $\beta$ and $\alpha$.
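To make these quantities concrete, here is a small R sketch computing the hazard ratio and survival curves implied by a given $(\beta, \alpha)$. The parameter values are purely illustrative, not estimates from this post.

```r
# Illustrative parameter values (not estimates from the post)
alpha <- 1.5                      # Weibull shape
beta  <- c(b0 = 0.5, b1 = 1.0)    # regression coefficients

# Hazard ratio: HR = exp(-beta1 * alpha)
hazard_ratio <- unname(exp(-beta["b1"] * alpha))

# Survival curve S(t | A) = exp(-lambda * t^alpha),
# with lambda = exp(-(b0 + b1 * A) * alpha)
surv <- function(t, A) {
  lambda <- unname(exp(-(beta["b0"] + beta["b1"] * A) * alpha))
  exp(-lambda * t^alpha)
}

t_grid      <- seq(0, 5, by = 0.1)
S_treated   <- surv(t_grid, A = 1)
S_untreated <- surv(t_grid, A = 0)
```

With $\beta_1 > 0$, the treated hazard is lower, so the treated survival curve sits above the untreated one at every $t$.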
Data Augmentation
We’ll first look at the joint data distribution (the likelihood) for this problem. The central idea is to view the survival times of the $n - r$ censored subjects as missing data, $T^m_{r+1:n}$. We refer to the full data as $T_{1:n} = (T^o_{1:r}, T^m_{r+1:n})$. Now we construct a complete-data (augmented) likelihood with these values. The observed likelihood and complete-data likelihood are related by
$$p(T^o_{1:r}, \delta_{1:n} \mid \tau, \beta, \alpha) = \int p(T_{1:n}, \delta_{1:n} \mid \tau, \beta, \alpha)\, dT^m_{r+1:n} = \int p(\delta_{1:n} \mid T_{1:n}, \tau, \beta, \alpha)\, p(T_{1:n} \mid \tau, \beta, \alpha)\, dT^m_{r+1:n}$$
We also assume that subjects are independent, so that $p(T_{1:n} \mid \tau, \beta, \alpha) = p(T^o_{1:r} \mid \tau, \beta, \alpha)\, p(T^m_{r+1:n} \mid \tau, \beta, \alpha) = \prod_{i=1}^n p(T_i \mid \tau, \beta, \alpha)$. So the likelihood simplifies to:

$$\begin{aligned}
p(T^o_{1:r}, \delta_{1:n} \mid \tau, \beta, \alpha) &= \prod_{i=1}^{n} \int p(\delta_i \mid T_i, \tau, \beta, \alpha)\, p(T_i \mid \tau, \beta, \alpha)\, dT^m_{r+1:n} \\
&= \prod_{i \mid \delta_i = 0} p(T^o_i \mid \tau, \beta, \alpha) \prod_{i \mid \delta_i = 1} \int p(\delta_i \mid T^m_i, \tau, \beta, \alpha)\, p(T^m_i \mid \tau, \beta, \alpha)\, dT^m_i \\
&= \prod_{i \mid \delta_i = 0} p(T^o_i \mid \tau, \beta, \alpha) \prod_{i \mid \delta_i = 1} \int I(T^m_i > \tau)\, p(T^m_i \mid \tau, \beta, \alpha)\, dT^m_i \\
&= \prod_{i \mid \delta_i = 0} p(T^o_i \mid \tau, \beta, \alpha) \prod_{i \mid \delta_i = 1} \int_\tau^\infty p(T^m_i \mid \tau, \beta, \alpha)\, dT^m_i
\end{aligned}$$
In the second line, the integral runs over the whole region $T^m_i \in (0, \infty)$. But in this region $p(\delta_i = 1 \mid T^m_i, \tau, \beta, \alpha)$ equals 1 only when $T^m_i > \tau$ (and 0 otherwise), which yields the indicator in the third line.
This is the usual likelihood for frequentist survival models: uncensored subjects contribute to the likelihood via the density, while censored subjects contribute via the survival function $\int_\tau^\infty p(T^m_i \mid \tau, \beta, \alpha)\, dT^m_i$. Functions for this integral exist in R for most basic distributions. For our Weibull model, it is `1 - pweibull()`. We would simply place priors on $\beta$ and $\alpha$, then sample from the posterior using MCMC.
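As a minimal sketch, the observed-data log-likelihood can be coded directly with `dweibull()` and `pweibull()`, using the fact that this post's $\lambda_i = e^{-\mu_i \alpha}$ corresponds to a Weibull scale parameter of $e^{\mu_i}$ in R's parameterization. The function name and argument layout are my own choices here.

```r
# Observed-data Weibull log-likelihood with right-censoring at tau.
# Convention: delta = 0 means the event time was observed, delta = 1 censored.
log_lik <- function(beta, alpha, t, A, delta, tau) {
  mu    <- beta[1] + beta[2] * A   # linear predictor
  scale <- exp(mu)                 # R's Weibull scale; lambda = scale^(-alpha)
  obs   <- delta == 0
  # Observed subjects contribute the density...
  ll_obs  <- sum(dweibull(t[obs], shape = alpha, scale = scale[obs], log = TRUE))
  # ...censored subjects contribute the survival function, log(1 - pweibull(tau))
  ll_cens <- sum(pweibull(tau, shape = alpha, scale = scale[!obs],
                          lower.tail = FALSE, log.p = TRUE))
  ll_obs + ll_cens
}
```

A quick sanity check: with shape 1 and scale 1, the model reduces to an Exponential(1), where an observed time $t$ contributes $-t$ and a censored subject contributes $-\tau$ to the log-likelihood.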
But what if this integral were too hard to evaluate (as it may be for more complicated censoring mechanisms), and the complete-data likelihood given below were easier to work with?
$$p(T^o_{1:r}, T^m_{r+1:n}, \delta_{1:n} \mid \tau, \beta, \alpha) = \prod_{i \mid \delta_i = 0} p(T^o_i \mid \tau, \beta, \alpha) \prod_{i \mid \delta_i = 1} I(T^m_i > \tau)\, p(T^m_i \mid \tau, \beta, \alpha)$$
Metropolis-in-Gibbs Sampler
The target posterior of interest is $p(\beta, \alpha, T^m_{r+1:n} \mid T^o_{1:r}, \delta_{1:n})$. We'll sample from it by alternating between two full conditionals: $p(\beta, \alpha \mid T^m_{r+1:n}, T^o_{1:r}, \delta_{1:n})$ and $p(T^m_{r+1:n} \mid \beta, \alpha, T^o_{1:r}, \delta_{1:n})$.
The first conditional is

$$p(\beta, \alpha \mid T^m_{r+1:n}, T^o_{1:r}, \delta_{1:n}) \propto p(\beta, \alpha) \prod_{i \mid \delta_i = 0} p(T^o_i \mid \tau, \beta, \alpha) \prod_{i \mid \delta_i = 1} I(T^m_i > \tau)\, p(T^m_i \mid \tau, \beta, \alpha) \propto p(\beta, \alpha) \prod_{i=1}^{n} p(T_i \mid \tau, \beta, \alpha)$$

This is not a standard distribution, so we sample from it with a Metropolis step.
The second conditional, $p(T^m_i \mid \beta, \alpha, T^o_{1:r}, \delta_{1:n}) \propto I(T^m_i > \tau)\, p(T^m_i \mid \tau, \beta, \alpha)$, is a truncated Weibull distribution (truncated at the bottom by $\tau$). We can also sample from this using a Metropolis step.
The Gibbs sampler alternates between sampling from these two conditionals:
- Given parameters $(\beta, \alpha)$, impute each missing survival time by drawing from $p(T^m_i \mid \beta, \alpha, T^o_{1:r}, \delta_{1:n})$, for $i = r+1, \dots, n$.
- Combine these imputed values, $T^m_{r+1:n}$, with the observed data $T^o_{1:r}$, and update the parameters $(\beta, \alpha)$ by drawing from $p(\beta, \alpha \mid T^m_{r+1:n}, T^o_{1:r}, \delta_{1:n})$.
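The steps above can be sketched from scratch on a small simulated dataset (the post's actual implementation is on GitHub). The sample size, proposal scales, and true parameter values here are illustrative assumptions, and for brevity I impute the censored times by inverse-CDF sampling of the truncated Weibull rather than a Metropolis step.

```r
set.seed(1)

## Simulate data: lambda_i = exp(-mu_i * alpha) <=> R scale = exp(mu_i)
n <- 200; tau <- 2
A <- rbinom(n, 1, 0.5)
true_alpha <- 1.5; true_beta <- c(0.5, 1.0)
t_full <- rweibull(n, shape = true_alpha,
                   scale = exp(true_beta[1] + true_beta[2] * A))
delta <- as.numeric(t_full > tau)   # 1 = censored at tau
t_obs <- pmin(t_full, tau)

## Complete-data log-posterior: Weibull likelihood + N(0, 100) and Exp(1) priors
log_post <- function(beta, alpha, t, A) {
  if (alpha <= 0) return(-Inf)
  sum(dweibull(t, shape = alpha,
               scale = exp(beta[1] + beta[2] * A), log = TRUE)) +
    sum(dnorm(beta, 0, 100, log = TRUE)) + dexp(alpha, 1, log = TRUE)
}

iters <- 2000
draws <- matrix(NA, iters, 3, dimnames = list(NULL, c("b0", "b1", "alpha")))
beta <- c(0, 0); alpha <- 1; t_cur <- t_obs
cens <- delta == 1

for (s in 1:iters) {
  ## 1. Impute censored times from the Weibull truncated below at tau
  sc   <- exp(beta[1] + beta[2] * A)[cens]
  Ftau <- pweibull(tau, alpha, sc)
  u    <- runif(sum(cens))
  t_cur[cens] <- qweibull(Ftau + u * (1 - Ftau), alpha, sc)

  ## 2. Random-walk Metropolis update for (beta, alpha) given completed data
  prop_b <- beta + rnorm(2, 0, 0.1)
  prop_a <- alpha + rnorm(1, 0, 0.1)
  if (log(runif(1)) < log_post(prop_b, prop_a, t_cur, A) -
                      log_post(beta, alpha, t_cur, A)) {
    beta <- prop_b; alpha <- prop_a
  }
  draws[s, ] <- c(beta, alpha)
}

## Posterior draws of the hazard ratio, HR = exp(-beta1 * alpha)
hr_draws <- exp(-draws[, "b1"] * draws[, "alpha"])
```

Each pass through the loop does one imputation sweep and one joint Metropolis update, so the retained rows of `draws` are (after burn-in) draws from the marginal posterior of $(\beta, \alpha)$.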
As the parameter estimates update, the imputations get better. As the imputations get better, the parameter estimates improve. Over time the process yields draws from the joint posterior $p(\beta, \alpha, T^m_{r+1:n} \mid T^o_{1:r}, \delta_{1:n})$. We retain the samples of $(\beta, \alpha)$ for inference and toss the samples of $T^m$.
Simulation Example in R
All of the code implementing the augmented sampler (from scratch!) can be found on my GitHub. Basically, I simulate a dataset with a binary treatment indicator for 1,000 subjects, with censoring and survival times independently drawn from a Weibull.
For the $\beta$ vector, I use independent $N(0, sd = 100)$ priors. For the shape parameter $\alpha$, I use an $Exp(1)$ prior. I run a single MCMC chain for 20,000 iterations and toss the first 15,000 out as burn-in.
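Post-processing is then just discarding burn-in and summarizing the retained draws. Here is a sketch, with a placeholder `draws` matrix standing in for real MCMC output (the column names are my own):

```r
set.seed(2)
# Placeholder 20,000 x 3 matrix of draws; a real run would produce these via MCMC
draws <- cbind(b0    = rnorm(20000, 0.5, 0.05),
               b1    = rnorm(20000, 1.0, 0.05),
               alpha = rnorm(20000, 1.5, 0.05))

burn <- 15000
keep <- draws[(burn + 1):nrow(draws), ]          # toss burn-in

hr   <- exp(-keep[, "b1"] * keep[, "alpha"])     # HR draw per retained iteration
summ <- c(mean = mean(hr), quantile(hr, c(0.025, 0.975)))
```

The same pattern (a function of the retained $(\beta, \alpha)$ draws, then `mean()` and `quantile()`) gives posterior summaries for any derived quantity, including the survival curves.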
Here is the estimated survival function for each treatment group. Overlaid are the non-parametric estimates from a stratified Kaplan-Meier (KM) estimator. Note the parametric model is correctly specified here, so it does just as well as the KM in terms of estimating the mean curve. But the parametric model provides a less noisy fit: notice the credible bands are narrower at later time points, when the at-risk counts get low in each treatment arm.
That’s just a helpful reminder of the efficiency gains parametric models have over nonparametric ones (when they’re correctly specified). Let’s take a look at the posterior distribution of the hazard ratio. The true value is indicated by the red line.
We could have run this thing for longer (and with multiple chains with different starting values), but I think this gets the point across. The posterior mean and 95% credible interval are $.32$ $(.24, .40)$. The true value is $.367$. Not too bad. Remember, this is only a single simulated dataset.