Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last post on modelling survival data from Veterinary Epidemiologic Research: parametric analyses. The Cox proportional hazards model described in the last post make no assumption about the shape of the baseline hazard, which is an advantage if you have no idea about what that shape might be. With a parametric survival model, the survival time is assumed to follow a known distribution: Weibull, exponential (which is a special case of the Weibull), log-logistic, log-normal, and generalized gamma.
Exponential Model
The exponential model is the simplest, the hazard
temp <- tempfile() download.file( "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp) load(unz(temp, "ver2_data_R/pgtrial.rdata")) unlink(temp) library(Hmisc) pgtrial <- upData(pgtrial, labels = c(herd = 'Herd id', cow = 'Cow id', tx = 'Treatment', lact = 'Lactation number', thin = 'Body condition', dar = 'Days at risk', preg = 'Pregnant or censored'), levels = list(thin = list('normal' = 0, 'thin' = 1), preg = list('censored' = 0, 'pregnant' = 1))) pgtrial$herd <- as.factor(pgtrial$herd) library(survival) exp.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + (lact - 1) + thin, data = pgtrial, dist = "exponential") summary(exp.mod) Call: survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx + (lact - 1) + thin, data = pgtrial, dist = "exponential") Value Std. Error z p herd1 4.3629 0.1827 23.88 4.66e-126 herd2 4.6776 0.1711 27.34 1.41e-164 herd3 4.3253 0.1617 26.75 1.12e-157 tx -0.2178 0.1255 -1.74 8.26e-02 lact 0.0416 0.0413 1.01 3.14e-01 thinthin 0.1574 0.1383 1.14 2.55e-01 Scale fixed at 1 Exponential distribution Loglik(model)= -1459.9 Loglik(intercept only)= -1465.6 Chisq= 11.42 on 5 degrees of freedom, p= 0.044 Number of Newton-Raphson Iterations: 5 n= 319
Interpretation is the same as for a Cox model. Exponentiated coefficients are hazard ratios. R outputs the parameter estimates of the AFT (accelerated failure time) form of the exponential model. If you multiply the estimated coefficients by minus one you get estimates that are consistent with the proportional hazards parameterization of the model. So for tx, the estimated hazard ratio is exp(0.2178) = 1.24 (at any given point in time, a treated cow is 1.24 times more likely to conceive than a non-treated cow). The corresponding accelerating factor for an exponential model is the reciprocal of the hazard ratio, exp(-0.2178) = 0.80: treating a cow accelerates the time to conception by a factor of 0.80.
Weibull Model
In a Weibull model, the hazard function is
library(car) pgtrial$parity <- recode(pgtrial$lact, "1 = 1; 2:hi = '2+'") weib.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + parity + thin, data = pgtrial, dist = "weibull") summary(weib.mod) Call: survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx + parity + thin, data = pgtrial, dist = "weibull") Value Std. Error z p (Intercept) 4.23053 0.1937 21.8412 9.42e-106 herd2 0.36117 0.1947 1.8548 6.36e-02 herd3 -0.00822 0.1980 -0.0415 9.67e-01 tx -0.23386 0.1438 -1.6262 1.04e-01 parity2+ 0.33819 0.1490 2.2698 2.32e-02 thinthin 0.11222 0.1576 0.7119 4.77e-01 Log(scale) 0.13959 0.0509 2.7407 6.13e-03 Scale= 1.15 Weibull distribution Loglik(model)= -1453.7 Loglik(intercept only)= -1460.7 Chisq= 14 on 5 degrees of freedom, p= 0.016 Number of Newton-Raphson Iterations: 5 n= 319
The shape parameter is the reciprocal of what is called by R the scale parameter. The shape parameter is then 1/1.15 = 0.869.
We can also use a piecewise constant exponential regression model, which is a model allowing the baseline hazard to vary between time periods but forces it to remain constant within time periods. In order to run such a model, we need data in a counting process format with a start and stop time for each interval. However, survreg does not allow for a data in that format. The trick would be to use a glm and fitting a Poisson model, including time intervals. See this post by Stephanie Kovalchik which explains how to construct the data and model. The example below is using the same approach, for a time interval of 40 days:
interval.width <- 40 # function to compute time breaks given the exit time = dar cow.breaks <- function(dar) unique(c(seq(0, dar, by = interval.width), dar)) # list of each subject's time periods the.breaks <- lapply(unique(pgtrial$cow), function(id){ cow.breaks(max(pgtrial$dar[pgtrial$cow == id])) }) # the expanded period of observation: start <- lapply(the.breaks, function(x) x[-length(x)]) # for left time points stop <- lapply(the.breaks, function(x) x[-1]) # for right time points count.per.cow <- sapply(start, length) index <- tapply(pgtrial$cow, pgtrial$cow, length) index <- cumsum(index) # index of last observation for each cow event <- rep(0, sum(count.per.cow)) event[cumsum(count.per.cow)] <- pgtrial$preg[index] # creating the expanded dataset pw.pgtrial <- data.frame( cow = rep(pgtrial$cow[index], count.per.cow), dar = rep(pgtrial$dar[index], count.per.cow), herd = rep(pgtrial$herd[index], count.per.cow), tx = rep(pgtrial$tx[index], count.per.cow), lact = rep(pgtrial$lact[index], count.per.cow), thin = rep(pgtrial$thin[index], count.per.cow), start = unlist(start), stop = unlist(stop), event = event ) # create time variable which indicates the period of observation (offset in Poisson model) pw.pgtrial$time <- pw.pgtrial$stop - pw.pgtrial$start # length of observation # create a factor for each interval, allowing to have a different rate for each period pw.pgtrial$interval <- factor(pw.pgtrial$start) pw.pgtrial[100:110, ] cow dar herd tx lact thin start stop event time interval 100 61 113 1 1 4 thin 0 40 0 40 0 101 61 113 1 1 4 thin 40 80 0 40 40 102 61 113 1 1 4 thin 80 113 1 33 80 103 62 117 1 0 7 normal 0 40 0 40 0 104 62 117 1 0 7 normal 40 80 0 40 40 105 62 117 1 0 7 normal 80 117 2 37 80 106 63 121 1 1 1 thin 0 40 0 40 0 107 63 121 1 1 1 thin 40 80 0 40 40 108 63 121 1 1 1 thin 80 120 0 40 80 109 63 121 1 1 1 thin 120 121 2 1 120 110 64 122 1 1 3 normal 0 40 0 40 0 # Poisson model pw.model <- glm(event ~ offset(log(time)) + interval + herd + tx + lact + + thin, data = pw.pgtrial, family = "poisson") summary(pw.model) Call: glm(formula = event ~ offset(log(time)) + interval + herd + tx + lact + thin, family = "poisson", data = pw.pgtrial) Deviance Residuals: Min 1Q Median 3Q Max -1.858 -1.373 -1.227 1.357 3.904 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.602545 0.132436 -27.202 < 2e-16 *** interval40 -0.112838 0.106807 -1.056 0.29076 interval80 -0.064105 0.125396 -0.511 0.60920 interval120 -0.007682 0.147919 -0.052 0.95858 interval160 -0.005743 0.191778 -0.030 0.97611 interval200 -0.427775 0.309143 -1.384 0.16644 interval240 0.199904 0.297331 0.672 0.50137 interval280 0.737508 0.385648 1.912 0.05583 . interval320 0.622366 1.006559 0.618 0.53637 herd2 -0.254389 0.114467 -2.222 0.02626 * herd3 0.026851 0.119416 0.225 0.82209 tx 0.219584 0.084824 2.589 0.00963 ** lact -0.023528 0.027511 -0.855 0.39241 thinthin -0.139915 0.093632 -1.494 0.13509 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 2155.6 on 798 degrees of freedom Residual deviance: 2131.1 on 785 degrees of freedom AIC: 2959.1 Number of Fisher Scoring iterations: 7
Log-logistic Model
loglog.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + lact + thin, data = pgtrial, dist = "loglogistic") summary(loglog.mod) Call: survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx + lact + thin, data = pgtrial, dist = "loglogistic") Value Std. Error z p (Intercept) 3.9544 0.2531 15.625 4.91e-55 herd2 0.2537 0.2355 1.077 2.81e-01 herd3 -0.1019 0.2437 -0.418 6.76e-01 tx -0.3869 0.1768 -2.189 2.86e-02 lact 0.0612 0.0550 1.112 2.66e-01 thinthin 0.0400 0.1894 0.211 8.33e-01 Log(scale) -0.1260 0.0515 -2.447 1.44e-02 Scale= 0.882 Log logistic distribution Loglik(model)= -1467.2 Loglik(intercept only)= -1472.2 Chisq= 9.85 on 5 degrees of freedom, p= 0.08 Number of Newton-Raphson Iterations: 4 n= 319
Individual Frailty Model
In an individual frailty model, we add variance unique to individuals in order to account for additional variability in the hazard (like negative binomial model vs. Poisson model). For example, let’s fit a Weibull model with gamma individual frailty to the prostaglandin dataset:
library(parfm) pgtrial$preg.bin <- as.numeric(pgtrial$preg) - 1 indfr.mod <- parfm(Surv(dar, preg.bin) ~ herd + tx + lact + thin, cluster = "cow", data = pgtrial, dist = "weibull", frailty = "gamma") Execution time: 17.872 second(s) indfr.mod Frailty distribution: gamma Baseline hazard distribution: Weibull Loglikelihood: -1455.679 ESTIMATE SE p-val theta 0.000 0.003 rho 0.867 0.044 lambda 0.024 0.006 herd2 -0.289 0.169 0.088 . herd3 0.039 0.175 0.824 tx 0.204 0.125 0.103 lact -0.041 0.041 0.314 thinthin -0.136 0.138 0.323 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Shared Frailty
Shared frailty is a way to deal with clustered data. We will use the “culling” dataset and fit a shared frailty model with a Weibull distribution and a gamma distributed frailty common to all cows in a herd:
temp <- tempfile() download.file( "http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp) load(unz(temp, "ver2_data_R/culling.rdata")) unlink(temp) library(frailtypack) shfrw.mod <- frailtyPenal(Surv(dar, culled) ~ as.factor(lact_c3) + johnes + cluster(herd), hazard = 'Weibull', data = culling, Frailty = TRUE) shfrw.sum <- cbind(shfrw.mod$coef, sqrt(diag(shfrw.mod$varH)), shfrw.mod$coef / sqrt(diag(shfrw.mod$varH)), signif(1 - pchisq((shfrw.mod$coef/sqrt(diag(shfrw.mod$varH)))^2, 1)), exp(shfrw.mod$coef), exp(shfrw.mod$coef - abs(qnorm((1 - 0.95) / 2)) * sqrt(diag(shfrw.mod$varH))), exp(shfrw.mod$coef + abs(qnorm((1 - 0.95) / 2)) * sqrt(diag(shfrw.mod$varH)))) row.names(shfrw.sum) <- c("Lactation 2", "Lactation 3+", "Johnes") colnames(shfrw.sum) <- c("Coef.", "Std. Err.", "z", "p-value", "Hazard Ratio", "Lower CI", "Upper CI") shfrw.sum Coef. Std. Err. z p-value Hazard Ratio Lower CI Lactation 2 0.2518627 0.1450806 1.736019 8.25605e-02 1.286419 0.9680321 Lactation 3+ 0.7636558 0.1227840 6.219508 4.98717e-10 2.146108 1.6870874 Johnes 0.5914741 0.3045475 1.942141 5.21200e-02 1.806650 0.9945867 Upper CI Lactation 2 1.709525 Lactation 3+ 2.730017 Johnes 3.281748
That’s it for reproducing the examples from Dohoo’s book, chapter on modelling survival data. Next time I’ll look at mixed models.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.