Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last post on modelling survival data from Veterinary Epidemiologic Research: parametric analyses. The Cox proportional hazards model described in the last post make no assumption about the shape of the baseline hazard, which is an advantage if you have no idea about what that shape might be. With a parametric survival model, the survival time is assumed to follow a known distribution: Weibull, exponential (which is a special case of the Weibull), log-logistic, log-normal, and generalized gamma.
Exponential Model
The exponential model is the simplest, the hazard
temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/pgtrial.rdata"))
unlink(temp)
library(Hmisc)
pgtrial <- upData(pgtrial, labels = c(herd = 'Herd id', cow = 'Cow id',
tx = 'Treatment', lact = 'Lactation number',
thin = 'Body condition', dar = 'Days at risk',
preg = 'Pregnant or censored'),
levels = list(thin = list('normal' = 0, 'thin' = 1),
preg = list('censored' = 0, 'pregnant' = 1)))
pgtrial$herd <- as.factor(pgtrial$herd)
library(survival)
exp.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + (lact - 1) +
thin, data = pgtrial, dist = "exponential")
summary(exp.mod)
Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx +
(lact - 1) + thin, data = pgtrial, dist = "exponential")
Value Std. Error z p
herd1 4.3629 0.1827 23.88 4.66e-126
herd2 4.6776 0.1711 27.34 1.41e-164
herd3 4.3253 0.1617 26.75 1.12e-157
tx -0.2178 0.1255 -1.74 8.26e-02
lact 0.0416 0.0413 1.01 3.14e-01
thinthin 0.1574 0.1383 1.14 2.55e-01
Scale fixed at 1
Exponential distribution
Loglik(model)= -1459.9 Loglik(intercept only)= -1465.6
Chisq= 11.42 on 5 degrees of freedom, p= 0.044
Number of Newton-Raphson Iterations: 5
n= 319
Interpretation is the same as for a Cox model. Exponentiated coefficients are hazard ratios. R outputs the parameter estimates of the AFT (accelerated failure time) form of the exponential model. If you multiply the estimated coefficients by minus one you get estimates that are consistent with the proportional hazards parameterization of the model. So for tx, the estimated hazard ratio is exp(0.2178) = 1.24 (at any given point in time, a treated cow is 1.24 times more likely to conceive than a non-treated cow). The corresponding accelerating factor for an exponential model is the reciprocal of the hazard ratio, exp(-0.2178) = 0.80: treating a cow accelerates the time to conception by a factor of 0.80.
Weibull Model
In a Weibull model, the hazard function is
library(car)
pgtrial$parity <- recode(pgtrial$lact, "1 = 1; 2:hi = '2+'")
weib.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + parity +
thin, data = pgtrial, dist = "weibull")
summary(weib.mod)
Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx +
parity + thin, data = pgtrial, dist = "weibull")
Value Std. Error z p
(Intercept) 4.23053 0.1937 21.8412 9.42e-106
herd2 0.36117 0.1947 1.8548 6.36e-02
herd3 -0.00822 0.1980 -0.0415 9.67e-01
tx -0.23386 0.1438 -1.6262 1.04e-01
parity2+ 0.33819 0.1490 2.2698 2.32e-02
thinthin 0.11222 0.1576 0.7119 4.77e-01
Log(scale) 0.13959 0.0509 2.7407 6.13e-03
Scale= 1.15
Weibull distribution
Loglik(model)= -1453.7 Loglik(intercept only)= -1460.7
Chisq= 14 on 5 degrees of freedom, p= 0.016
Number of Newton-Raphson Iterations: 5
n= 319
The shape parameter is the reciprocal of what is called by R the scale parameter. The shape parameter is then 1/1.15 = 0.869.
We can also use a piecewise constant exponential regression model, which is a model allowing the baseline hazard to vary between time periods but forces it to remain constant within time periods. In order to run such a model, we need data in a counting process format with a start and stop time for each interval. However, survreg does not allow for a data in that format. The trick would be to use a glm and fitting a Poisson model, including time intervals. See this post by Stephanie Kovalchik which explains how to construct the data and model. The example below is using the same approach, for a time interval of 40 days:
interval.width <- 40
# function to compute time breaks given the exit time = dar
cow.breaks <- function(dar) unique(c(seq(0, dar, by = interval.width),
dar))
# list of each subject's time periods
the.breaks <- lapply(unique(pgtrial$cow), function(id){
cow.breaks(max(pgtrial$dar[pgtrial$cow == id]))
})
# the expanded period of observation:
start <- lapply(the.breaks, function(x) x[-length(x)]) # for left time points
stop <- lapply(the.breaks, function(x) x[-1]) # for right time points
count.per.cow <- sapply(start, length)
index <- tapply(pgtrial$cow, pgtrial$cow, length)
index <- cumsum(index) # index of last observation for each cow
event <- rep(0, sum(count.per.cow))
event[cumsum(count.per.cow)] <- pgtrial$preg[index]
# creating the expanded dataset
pw.pgtrial <- data.frame(
cow = rep(pgtrial$cow[index], count.per.cow),
dar = rep(pgtrial$dar[index], count.per.cow),
herd = rep(pgtrial$herd[index], count.per.cow),
tx = rep(pgtrial$tx[index], count.per.cow),
lact = rep(pgtrial$lact[index], count.per.cow),
thin = rep(pgtrial$thin[index], count.per.cow),
start = unlist(start),
stop = unlist(stop),
event = event
)
# create time variable which indicates the period of observation (offset in Poisson model)
pw.pgtrial$time <- pw.pgtrial$stop - pw.pgtrial$start # length of observation
# create a factor for each interval, allowing to have a different rate for each period
pw.pgtrial$interval <- factor(pw.pgtrial$start)
pw.pgtrial[100:110, ]
cow dar herd tx lact thin start stop event time interval
100 61 113 1 1 4 thin 0 40 0 40 0
101 61 113 1 1 4 thin 40 80 0 40 40
102 61 113 1 1 4 thin 80 113 1 33 80
103 62 117 1 0 7 normal 0 40 0 40 0
104 62 117 1 0 7 normal 40 80 0 40 40
105 62 117 1 0 7 normal 80 117 2 37 80
106 63 121 1 1 1 thin 0 40 0 40 0
107 63 121 1 1 1 thin 40 80 0 40 40
108 63 121 1 1 1 thin 80 120 0 40 80
109 63 121 1 1 1 thin 120 121 2 1 120
110 64 122 1 1 3 normal 0 40 0 40 0
# Poisson model
pw.model <- glm(event ~ offset(log(time)) + interval + herd + tx + lact +
+ thin, data = pw.pgtrial, family = "poisson")
summary(pw.model)
Call:
glm(formula = event ~ offset(log(time)) + interval + herd + tx +
lact + thin, family = "poisson", data = pw.pgtrial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.858 -1.373 -1.227 1.357 3.904
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.602545 0.132436 -27.202 < 2e-16 ***
interval40 -0.112838 0.106807 -1.056 0.29076
interval80 -0.064105 0.125396 -0.511 0.60920
interval120 -0.007682 0.147919 -0.052 0.95858
interval160 -0.005743 0.191778 -0.030 0.97611
interval200 -0.427775 0.309143 -1.384 0.16644
interval240 0.199904 0.297331 0.672 0.50137
interval280 0.737508 0.385648 1.912 0.05583 .
interval320 0.622366 1.006559 0.618 0.53637
herd2 -0.254389 0.114467 -2.222 0.02626 *
herd3 0.026851 0.119416 0.225 0.82209
tx 0.219584 0.084824 2.589 0.00963 **
lact -0.023528 0.027511 -0.855 0.39241
thinthin -0.139915 0.093632 -1.494 0.13509
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2155.6 on 798 degrees of freedom
Residual deviance: 2131.1 on 785 degrees of freedom
AIC: 2959.1
Number of Fisher Scoring iterations: 7
Log-logistic Model
loglog.mod <- survreg(Surv(dar, preg == 'pregnant') ~ herd + tx + lact +
thin, data = pgtrial, dist = "loglogistic")
summary(loglog.mod)
Call:
survreg(formula = Surv(dar, preg == "pregnant") ~ herd + tx +
lact + thin, data = pgtrial, dist = "loglogistic")
Value Std. Error z p
(Intercept) 3.9544 0.2531 15.625 4.91e-55
herd2 0.2537 0.2355 1.077 2.81e-01
herd3 -0.1019 0.2437 -0.418 6.76e-01
tx -0.3869 0.1768 -2.189 2.86e-02
lact 0.0612 0.0550 1.112 2.66e-01
thinthin 0.0400 0.1894 0.211 8.33e-01
Log(scale) -0.1260 0.0515 -2.447 1.44e-02
Scale= 0.882
Log logistic distribution
Loglik(model)= -1467.2 Loglik(intercept only)= -1472.2
Chisq= 9.85 on 5 degrees of freedom, p= 0.08
Number of Newton-Raphson Iterations: 4
n= 319
Individual Frailty Model
In an individual frailty model, we add variance unique to individuals in order to account for additional variability in the hazard (like negative binomial model vs. Poisson model). For example, let’s fit a Weibull model with gamma individual frailty to the prostaglandin dataset:
library(parfm)
pgtrial$preg.bin <- as.numeric(pgtrial$preg) - 1
indfr.mod <- parfm(Surv(dar, preg.bin) ~ herd + tx + lact + thin,
cluster = "cow", data = pgtrial, dist = "weibull",
frailty = "gamma")
Execution time: 17.872 second(s)
indfr.mod
Frailty distribution: gamma
Baseline hazard distribution: Weibull
Loglikelihood: -1455.679
ESTIMATE SE p-val
theta 0.000 0.003
rho 0.867 0.044
lambda 0.024 0.006
herd2 -0.289 0.169 0.088 .
herd3 0.039 0.175 0.824
tx 0.204 0.125 0.103
lact -0.041 0.041 0.314
thinthin -0.136 0.138 0.323
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Shared Frailty
Shared frailty is a way to deal with clustered data. We will use the “culling” dataset and fit a shared frailty model with a Weibull distribution and a gamma distributed frailty common to all cows in a herd:
temp <- tempfile()
download.file(
"http://ic.upei.ca/ver/sites/ic.upei.ca.ver/files/ver2_data_R.zip", temp)
load(unz(temp, "ver2_data_R/culling.rdata"))
unlink(temp)
library(frailtypack)
shfrw.mod <- frailtyPenal(Surv(dar, culled) ~ as.factor(lact_c3) + johnes +
cluster(herd),
hazard = 'Weibull', data = culling, Frailty = TRUE)
shfrw.sum <- cbind(shfrw.mod$coef, sqrt(diag(shfrw.mod$varH)),
shfrw.mod$coef / sqrt(diag(shfrw.mod$varH)),
signif(1 - pchisq((shfrw.mod$coef/sqrt(diag(shfrw.mod$varH)))^2, 1)),
exp(shfrw.mod$coef),
exp(shfrw.mod$coef - abs(qnorm((1 - 0.95) / 2)) * sqrt(diag(shfrw.mod$varH))),
exp(shfrw.mod$coef + abs(qnorm((1 - 0.95) / 2)) * sqrt(diag(shfrw.mod$varH))))
row.names(shfrw.sum) <- c("Lactation 2", "Lactation 3+", "Johnes")
colnames(shfrw.sum) <- c("Coef.", "Std. Err.", "z", "p-value", "Hazard Ratio",
"Lower CI", "Upper CI")
shfrw.sum
Coef. Std. Err. z p-value Hazard Ratio Lower CI
Lactation 2 0.2518627 0.1450806 1.736019 8.25605e-02 1.286419 0.9680321
Lactation 3+ 0.7636558 0.1227840 6.219508 4.98717e-10 2.146108 1.6870874
Johnes 0.5914741 0.3045475 1.942141 5.21200e-02 1.806650 0.9945867
Upper CI
Lactation 2 1.709525
Lactation 3+ 2.730017
Johnes 3.281748
That’s it for reproducing the examples from Dohoo’s book, chapter on modelling survival data. Next time I’ll look at mixed models.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
