Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In the second part of the series we will consider the time discretization that makes the Poisson GAM approach to survival analysis possible.
Consider a set of s
where
where
In order for the construct to work one has to ensure that the corresponding lifetimes are mapped to a node of the integration scheme. In our paper, this was accomplished by the adoption of the Gauss-Lobatto rule. The nodes and weights of the Gauss-Lobatto rule (which is defined in the interval
GaussLobatto<-function(N) { N1<-N N<-N-1 x=matrix(cos(pi*(0:N)/N),ncol=1) x=cos(pi*(0:N)/N) P<-matrix(0,N1,N1) xold<-2 while (max(abs(x-xold))>2.22044604925031e-16) { xold<-x P[,1]<-1 P[,2]<-x for (k in 2:N) { P[,k+1]=( (2*k-1)*x*P[,k]-(k-1)*P[,k-1] )/k; } x<-xold-( x*P[,N1]-P[,N] )/( N1*P[,N1] ) } w<-2./(N*N1*P[,N1]^2); ret<-list(x=rev(x),w=w) attr(ret,"order")<-N ret }
which can be called to return a list of the nodes and their weights:
> GaussLobatto(5) $x [1] -1.0000000 -0.6546537 0.0000000 0.6546537 1.0000000 $w [1] 0.1000000 0.5444444 0.7111111 0.5444444 0.1000000 attr(,"order") [1] 4
To prepare a survival dataset for GAM fitting, one needs to call this function to obtain a Gauss Lobatto rule of the required order. Once this has been obtained, the following R function will expand the (right-censored) dataset to include the pseudo-observations at the nodes of the quadrature rule:
GAMSurvdataset<-function(GL,data,fu,d) ## GL : Gauss Lobatto rule ## data: survival data ## fu: column number containing fu info ## d: column number with event indicator { ## append artificial ID in the set data$id<-1:nrow(data) Gllx<-data.frame(stop=rep(GL$x,length(data$id)), gam.dur=rep(GL$w,length(data$id)), t=rep(data[,fu],each=length(GL$x)), ev=rep(data[,d],each=length(GL$x)), id=rep(data$id,each=length(GL$x)), gam.ev=0,start=0) ## Change the final indicator to what ## was observed, map node positions, ## weights from [-1,1] back to the ## study time Gllx<-transform(Gllx, gam.ev=as.numeric((gam.ev | ev)*I(stop==1)), gam.dur=0.5*gam.dur*(t-start), stop=0.5*(stop*(t-start)+(t+start))) ## now merge the remaining covariate info Gllx<-merge(Gllx,data[,-c(fu,d)]) Gllx }
We illustrate the use of these functions on the Primary Biliary Cirrhosis dataset that comes with R:
data(pbc) > ## Change transplant to alive > pbc$status[pbc$status==1]<-0 > ## Change event code of death(=2) to 1 > pbc$status[pbc$status==2]<-1 > > head(pbc) id time status trt age sex ascites hepato spiders edema bili chol albumin copper 1 1 400 1 1 58.76523 f 1 1 1 1.0 14.5 261 2.60 156 2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302 4.14 54 3 3 1012 1 1 70.07255 m 0 0 0 0.5 1.4 176 3.48 210 4 4 1925 1 1 54.74059 f 0 1 1 0.5 1.8 244 2.54 64 5 5 1504 0 2 38.10541 f 0 1 1 0.0 3.4 279 3.53 143 6 6 2503 1 2 66.25873 f 0 1 0 0.0 0.8 248 3.98 50 alk.phos ast trig platelet protime stage 1 1718.0 137.95 172 190 12.2 4 2 7394.8 113.52 88 221 10.6 3 3 516.0 96.10 55 151 12.0 4 4 6121.8 60.63 92 183 10.3 4 5 671.0 113.15 72 136 10.9 3 6 944.0 93.00 63 NA 11.0 3 > > GL<-GaussLobatto(5) > pbcGAM<-GAMSurvdataset(GL,pbc,2,3) > head(pbcGAM) id stop gam.dur t ev gam.ev start trt age sex ascites hepato spiders 1 1 0.00000 20.0000 400 1 0 0 1 58.76523 f 1 1 1 2 1 69.06927 108.8889 400 1 0 0 1 58.76523 f 1 1 1 3 1 200.00000 142.2222 400 1 0 0 1 58.76523 f 1 1 1 4 1 330.93073 108.8889 400 1 0 0 1 58.76523 f 1 1 1 5 1 400.00000 20.0000 400 1 1 0 1 58.76523 f 1 1 1 6 2 0.00000 225.0000 4500 0 0 0 1 56.44627 f 0 1 1 edema bili chol albumin copper alk.phos ast trig platelet protime stage 1 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4 2 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4 3 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4 4 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4 5 1 14.5 261 2.60 156 1718.0 137.95 172 190 12.2 4 6 0 1.1 302 4.14 54 7394.8 113.52 88 221 10.6 3 > > dim(pbc) [1] 418 20 > dim(pbcGAM) [1] 2090 24
The original (pbc) dataset has been expanded to include the pseudo-observations at the nodes of the Lobatto rule. There are multiple records (5 per individual in this particular case) as can be seen by examining the data for the first patient (id=1). The corresponding times are found in the variable stop, their associated weights in the variable gam.dur and the event indicators are in the column gam.ev. Note that nodes and weights are expressed on the scale of the survival dataset, not in the scale of the Lobatto rule (
The following code will obtain an adjusted (for age and sex) hazard ratio using the PGAM or the Cox model:
library(survival) ## for coxph > library(mgcv) ## for mgcv > > ## Prop Hazards Modeling with PGAM > fGAM<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)), + data=pbcGAM,family="poisson",scale=1,method="REML") > > ## Your Cox Model here > f<-coxph(Surv(time,status)~trt+age+sex,data=pbc) > > > summary(fGAM) Family: poisson Link function: log Formula: gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur)) Parametric coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -10.345236 0.655176 -15.790 < 2e-16 *** trt 0.069546 0.181779 0.383 0.702 age 0.038488 0.008968 4.292 1.77e-05 *** sexf -0.370260 0.237726 -1.558 0.119 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df Chi.sq p-value s(stop) 1.008 1.015 4.186 0.0417 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = -0.249 Deviance explained = 2.25% -REML = 693.66 Scale est. = 1 n = 1560 > f Call: coxph(formula = Surv(time, status) ~ trt + age + sex, data = pbc) coef exp(coef) se(coef) z p trt 0.0626 1.065 0.182 0.344 7.3e-01 age 0.0388 1.040 0.009 4.316 1.6e-05 sexf -0.3377 0.713 0.239 -1.414 1.6e-01 Likelihood ratio test=22.5 on 3 df, p=5.05e-05 n= 312, number of events= 125 (106 observations deleted due to missingness)
The estimates for log-hazard ratio of the three covariates (trt, age, and female gender) are numerically very close. Any numerical differences reflect the different assumptions made about the baseline hazard: flexible spline (PGAM) v.s. piecewise exponential (Cox).
Increasing the number of nodes of the Lobatto rule does not materially affect the estimates of the PGAM:
GL<-GaussLobatto(10) > pbcGAM2<-GAMSurvdataset(GL,pbc,2,3) > fGAM2<-gam(gam.ev~s(stop,bs="cr")+trt+age+sex+offset(log(gam.dur)), + data=pbcGAM2,family="poisson",scale=1,method="REML") > > summary(fGAM2) Family: poisson Link function: log Formula: gam.ev ~ s(stop, bs = "cr") + trt + age + sex + offset(log(gam.dur)) Parametric coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -10.345288 0.655177 -15.790 < 2e-16 *** trt 0.069553 0.181780 0.383 0.702 age 0.038487 0.008968 4.292 1.77e-05 *** sexf -0.370340 0.237723 -1.558 0.119 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df Chi.sq p-value s(stop) 1.003 1.005 4.163 0.0416 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = -0.124 Deviance explained = 1.7% -REML = 881.67 Scale est. = 1 n = 3120
Nevertheless, the estimates of the “baseline log-hazard” become more accurate (decreased standard errors and significance of the smooth term) as the number of nodes increases.
In simulations (see Fig 3) we show that the estimates of the hazard ratio generated by the GAM are comparable in bias, variance and coverage to those obtained by the Cox model. Even though this is an importance benchmark for the proposed method, it does not provide a compelling reason for replacing the Cox model with the PGAM. In fact, the advantages of the PGAM will only become apparent once we consider contexts which depend on the baseline hazard function, or problems in which the proportionality of hazards assumption is violated. So stay tuned.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.