Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The ability of PGAMs to estimate the log-baseline hazard rate, endows them with the capability to be used as smooth alternatives to the Kaplan Meier curve. If we assume for the shake of simplicity that there are no proportional co-variates in the PGAM regression, then the quantity modeled corresponds to the log-hazard of the survival function. Note that the only assumptions made by the PGAM is that the a) log-hazard is a smooth function, with b) a given maximum complexity (number of degrees of freedom) and c) continuous second derivatives. A PGAM provides estimates of the log-hazard constant,
From the above definition it is obvious that the value of the survival distribution at any given time point is a non-linear function of the PGAM estimate. Consequently, the predicted survival value,
## Calculate survival and confidence interval over a grid of points ## using a GAM SurvGAM<-Vectorize(function(t,gm,gl.rule,CI=0.95,Nsim=1000,seed=0) ## t : time at which to calculate relative risk ## gm : gam model for the fit ## gl.rule : GL rule (list of weights and nodes) ## CI : CI to apply ## Nsim : Number of replicates to draw ## seed : RNG seed { q<-(1-CI)/2.0 ## create the nonlinear contrast pdfnc<-data.frame(stop=t,gam.dur=1) L<-length(gl.rule$x) start<-0; ## only for right cens data ## map the weights from [-1,1] to [start,t] gl.rule$w<-gl.rule$w*0.5*(t-start) ## expand the dataset df<-Survdataset(gl.rule,pdfnc,fu=1) ## linear predictor at each node Xp <- predict(gm,newdata=df,type="lpmatrix") ## draw samples set.seed(seed) br <- rmvn(Nsim,coef(gm),gm$Vp) res1<-rep(0,Nsim) for(i in 1:Nsim){ ## hazard function at the nodes hz<-exp(Xp%*%br[i,]) ## cumumative hazard chz1<-gl.rule$w %*% hz[1:L,] ##survival res1[i]<-exp(-chz1) } ret<-data.frame(t=t,S=mean(res1), LCI=quantile(res1,prob=q), UCI=quantile(res1,prob=1-q)) ret },vectorize.args=c("t"))
The function makes use of another function, Survdataset, that expands internally the vector of time points into a survival dataset. This dataset is used to obtain predictions of the log-hazard function by calling the predict function from the mgcv package.
## Function that expands a prediction dataset ## so that a GL rule may be applied ## Used in num integration when generating measures of effect Survdataset<-function(GL,data,fu) ## GL : Gauss Lobatto rule ## data: survival data ## fu: column number containing fu info { ## append artificial ID in the set data$id<-1:nrow(data) Gllx<-data.frame(stop=rep(GL$x,length(data$id)), t=rep(data[,fu],each=length(GL$x)), id=rep(data$id,each=length(GL$x)), start=0) ## Change the final indicator to what ## was observed, map node positions, ## weights from [-1,1] back to the ## study time Gllx<-transform(Gllx, stop=0.5*(stop*(t-start)+(t+start))) ## now merge the remaining covariate info Gllx<-merge(Gllx,data[,-c(fu)]) nm<-match(c("t","start","id"),colnames(Gllx)) Gllx[,-nm] }
The ability to draw samples from the multivariate normal distribution corresponding to the model estimates and its covariance matrix is provided by another function, rmvn:
## function that draws multivariate normal random variates with ## a given mean vector and covariance matrix ## n : number of samples to draw ## mu : mean vector ## sig : covariance matrix rmvn <- function(n,mu,sig) { ## MVN random deviates L <- mroot(sig);m <- ncol(L); t(mu + L%*%matrix(rnorm(m*n),m,n)) }
To illustrate the use of these functions we revisit the PBC example from the 2nd part of this blog series. Firstly, let’s obtain a few Gauss-Lobatto lists of weights/nodes for the integration of the survival function:
## Obtain a few Gauss Lobatto rules to integrate the survival ## distribution GL5<-GaussLobatto(5); GL10<-GaussLobatto(10); GL20<-GaussLobatto(20);
Subsequently, we fit the log-hazard rate to the coarsely (5 nodes) and more finely discretized (using a 10 point Gauss Lobatto rule) versions of the PBC dataset, created in Part 2. The third command obtains the Kaplan Meier estimate in the PBC dataset.
fSurv1<-gam(gam.ev~s(stop,bs="cr")+offset(log(gam.dur)), data=pbcGAM,family="poisson",scale=1,method="REML") fSurv2<-gam(gam.ev~s(stop,bs="cr")+offset(log(gam.dur)), data=pbcGAM2,family="poisson",scale=1,method="REML") KMSurv<-survfit(Surv(time,status)~1,data=pbc)
We obtained survival probability estimates for the 6 combinations of time discretization for fitting (either a 5 or 10th order Lobatto rule) and prediction (a 5th, 10th or 20th order rule):
t<-seq(0,4500,100) s1<-SurvGAM(t,fSurv1,GL5) s2<-SurvGAM(t,fSurv1,GL10) s3<-SurvGAM(t,fSurv1,GL20) s4<-SurvGAM(t,fSurv2,GL5) s5<-SurvGAM(t,fSurv2,GL10) s6<-SurvGAM(t,fSurv2,GL20)
In all cases 1000 Monte Carlo samples were obtained for the calculation of survival probability estimates and their pointwise 95% confidence intervals. We can plot these against the Kaplan Meier curve estimates:
par(mfrow=c(2,3)) plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL5)") lines(s1[1,],s1[2,],col="blue",lwd=2) lines(s1[1,],s1[3,],col="blue",lwd=2,lty=2) lines(s1[1,],s1[4,],col="blue",lwd=2,lty=2) plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL10)") lines(s2[1,],s2[2,],col="blue",lwd=2) lines(s2[1,],s2[3,],col="blue",lwd=2,lty=2) lines(s2[1,],s2[4,],col="blue",lwd=2,lty=2) plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL5)/Predict(GL20)") lines(s3[1,],s3[2,],col="blue",lwd=2) lines(s3[1,],s3[3,],col="blue",lwd=2,lty=2) lines(s3[1,],s3[4,],col="blue",lwd=2,lty=2) plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL5)") lines(s4[1,],s4[2,],col="blue",lwd=2) lines(s4[1,],s4[3,],col="blue",lwd=2,lty=2) lines(s4[1,],s4[4,],col="blue",lwd=2,lty=2) plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL10)") lines(s5[1,],s5[2,],col="blue",lwd=2) lines(s5[1,],s5[3,],col="blue",lwd=2,lty=2) lines(s5[1,],s5[4,],col="blue",lwd=2,lty=2) plot(KMSurv,xlab="Time (days)",ylab="Surv Prob",ylim=c(0.25,1),main="Fit(GL10)/Predict(GL20)") lines(s6[1,],s6[2,],col="blue",lwd=2) lines(s6[1,],s6[3,],col="blue",lwd=2,lty=2) lines(s6[1,],s6[4,],col="blue",lwd=2,lty=2)
Overall, there is a close agreement between the Kaplan Meier estimate and the PGAM estimates despite the different function spaces that the corresponding estimators “live”: the space of all piecewise constant functions (KM) v.s. that of the smooth functions with bounded, continuous second derivatives (PGAM). Furthermore, the 95% confidence interval of each estimator (dashed lines) contain the expected value of the other estimator. This suggests that there is no systematic difference between the KM and the PGAM survival estimators. This was confirmed in simulated datasets (see Fig 2 in our PLoS One paper).
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.