Site icon R-bloggers

Survival Analysis with R

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, Survival Analysis is one of the oldest subfields of Statistics [1]. Basic life-table methods, including techniques for dealing with censored data, were known before 1700 [2]. In the early eighteenth century, the old masters, de Moivre working on annuities and Daniel Bernoulli studying competing risks for his work on smallpox inoculation, developed the foundations of time-to-event modeling [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing and Medicine and many more application areas. So, it is not surprising that the R Task View on Survival Analysis, a curated, organized and annotated list of relevant R packages and functions, is formidable.

Looking at the Task View on a small screen is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. (I would be remiss not to mention that we all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.) Newcomers, people either new to R or new to survival analysis or both, must find it overwhelming. So, it is with newcomers in mind that I offer the following slim trajectory through the task view that relies on just a few packages: survival, KMsurv, Oisurv and ranger

The survival package, which began life as an S package in the late ’90s, is the cornerstone of the entire R Survival Analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R.

KMsurv contains some interesting data sets from John Klein and Melvin Moeschberger’s classic text, Survival Analysis Techniques for Censored and Truncated Data.

My main reason for selecting the OIsurv package was to draw attention the very helpful guide to Survival Analysis in R, produced by the folks at OpenIntro.

ranger might be the surprise in my very short list of survival packages. The ranger() function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But it was news to me that ranger() also builds survival models. Benchmarks indicate that ranger() is suitable for building time-to-event models with the large, high dimensional data sets important to internet marketing applications. Since ranger() uses standard Surv() survival objects, it’s an ideal tool for getting acquainted with survival analysis in the in this machine learning age.

*** Load and transform the data This first block of code loads the required packages along with the bone marrow transplant data frame from the KMsurv package. I chose this because has number of covariates and no missing values. The only data preparation is to make the appropriate variables into factors.

library(survival)
library(dplyr)
library(OIsurv) # Aumatically loads KMsurv
library(ranger)
library(ggplot2)

#------------
data(bmt)
# sapply(bmt,class)

# Prepare new data frame for modeling
bmt2 <- select(bmt,-c(t2,d2,d3))
bmt2  <- mutate(bmt2,
                group = as.factor(group),
                d1 = as.factor(d1),
                da = as.factor(da),
                dc = as.factor(dc),
                dp = as.factor(dp),
                z3 = as.factor(z3),
                z4 = as.factor(z4),
                z5 = as.factor(z5),
                z6 = as.factor(z6),
                z8 = as.factor(z8),
                z9 = as.factor(z9),
                z10 = as.factor(z10)
)
head(bmt2)
##   group   t1 d1   ta da  tc dc tp dp z1 z2 z3 z4 z5 z6   z7 z8 z9 z10
## 1     1 2081  0   67  1 121  1 13  1 26 33  1  0  1  1   98  0  1   0
## 2     1 1602  0 1602  0 139  1 18  1 21 37  1  1  0  0 1720  0  1   0
## 3     1 1496  0 1496  0 307  1 12  1 26 35  1  1  1  0  127  0  1   0
## 4     1 1462  0   70  1  95  1 13  1 17 21  0  1  0  0  168  0  1   0
## 5     1 1433  0 1433  0 236  1 12  1 32 36  1  1  1  1   93  0  1   0
## 6     1 1377  0 1377  0 123  1 12  1 22 31  1  1  1  1 2187  0  1   0

Kaplan Meier Analysis

The first thing to do is to use Surv() to build the standard survival object. The variable t1 records the time to death or the censored time; d1 indicates that the patient died (d1 = 1) or that the patient survived until the end of the study (d1 = 0). Note that a “+” after the time in the print out of y_bmt indicates censoring. The formula y_bmt ~ 1 instructs the survfit() function to fit a model with intercept only, and produces the Kaplan-Meier estimate.

The confBands() function from the OIsurv package estimates Hall-Wellner confidence bands. These are large-sample, simultaneous estimates and are plotted in red. They contrast with the point-wise confidence bands rendered as black dashed lines.

# Kaplan Meier Survival Curve
y_bmt <- Surv(bmt$t1, bmt$d1)
y_bmt
##   [1] 2081+ 1602+ 1496+ 1462+ 1433+ 1377+ 1330+  996+  226+ 1199+ 1111+
##  [12]  530+ 1182+ 1167+  418   417   276   156   781   172   487   716 
##  [23]  194   371   526   122  1279   110   243    86   466   262   162 
##  [34]  262     1   107   269   350  2569+ 2506+ 2409+ 2218+ 1857+ 1829+
##  [45] 1562+ 1470+ 1363+ 1030+  860+ 1258+ 2246+ 1870+ 1799+ 1709+ 1674+
##  [56] 1568+ 1527+ 1324+  957+  932+  847+  848+ 1850+ 1843+ 1535+ 1447+
##  [67] 1384+  414  2204  1063   481   105   641   390   288   522    79 
##  [78] 1156   583    48   431  1074   393    10    53    80    35  1499+
##  [89]  704   653   222  1356+ 2640+ 2430+ 2252+ 2140+ 2133+ 1238+ 1631+
## [100] 2024+ 1345+ 1136+  845+  491   162  1298   121     2    62   265 
## [111]  547   341   318   195   469    93   515   183   105   128   164 
## [122]  129   122    80   677    73   168    74    16   248   732   105 
## [133]  392    63    97   153   363
fit1_bmt <- survfit(y_bmt ~ 1)
summary(fit1_bmt)
## Call: survfit(formula = y_bmt ~ 1)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    137       1    0.993 0.00727        0.979        1.000
##     2    136       1    0.985 0.01025        0.966        1.000
##    10    135       1    0.978 0.01250        0.954        1.000
##    16    134       1    0.971 0.01438        0.943        0.999
##    35    133       1    0.964 0.01602        0.933        0.995
##    48    132       1    0.956 0.01748        0.923        0.991
##    53    131       1    0.949 0.01881        0.913        0.987
##    62    130       1    0.942 0.02003        0.903        0.982
##    63    129       1    0.934 0.02117        0.894        0.977
##    73    128       1    0.927 0.02222        0.884        0.972
##    74    127       1    0.920 0.02322        0.875        0.966
##    79    126       1    0.912 0.02415        0.866        0.961
##    80    125       2    0.898 0.02588        0.848        0.950
##    86    123       1    0.891 0.02668        0.840        0.944
##    93    122       1    0.883 0.02744        0.831        0.939
##    97    121       1    0.876 0.02817        0.822        0.933
##   105    120       3    0.854 0.03017        0.797        0.915
##   107    117       1    0.847 0.03078        0.788        0.909
##   110    116       1    0.839 0.03137        0.780        0.903
##   121    115       1    0.832 0.03193        0.772        0.897
##   122    114       2    0.818 0.03300        0.755        0.885
##   128    112       1    0.810 0.03350        0.747        0.879
##   129    111       1    0.803 0.03399        0.739        0.872
##   153    110       1    0.796 0.03445        0.731        0.866
##   156    109       1    0.788 0.03490        0.723        0.860
##   162    108       2    0.774 0.03575        0.707        0.847
##   164    106       1    0.766 0.03615        0.699        0.841
##   168    105       1    0.759 0.03653        0.691        0.834
##   172    104       1    0.752 0.03690        0.683        0.828
##   183    103       1    0.745 0.03726        0.675        0.821
##   194    102       1    0.737 0.03760        0.667        0.815
##   195    101       1    0.730 0.03793        0.659        0.808
##   222    100       1    0.723 0.03825        0.651        0.802
##   243     98       1    0.715 0.03856        0.644        0.795
##   248     97       1    0.708 0.03886        0.636        0.788
##   262     96       2    0.693 0.03943        0.620        0.775
##   265     94       1    0.686 0.03969        0.612        0.768
##   269     93       1    0.678 0.03995        0.604        0.761
##   276     92       1    0.671 0.04019        0.597        0.755
##   288     91       1    0.664 0.04042        0.589        0.748
##   318     90       1    0.656 0.04063        0.581        0.741
##   341     89       1    0.649 0.04084        0.574        0.734
##   350     88       1    0.642 0.04104        0.566        0.727
##   363     87       1    0.634 0.04122        0.558        0.720
##   371     86       1    0.627 0.04140        0.551        0.713
##   390     85       1    0.619 0.04156        0.543        0.706
##   392     84       1    0.612 0.04172        0.535        0.699
##   393     83       1    0.605 0.04186        0.528        0.693
##   414     82       1    0.597 0.04199        0.520        0.686
##   417     81       1    0.590 0.04212        0.513        0.679
##   418     80       1    0.583 0.04223        0.505        0.671
##   431     79       1    0.575 0.04234        0.498        0.664
##   466     78       1    0.568 0.04243        0.490        0.657
##   469     77       1    0.560 0.04252        0.483        0.650
##   481     76       1    0.553 0.04259        0.476        0.643
##   487     75       1    0.546 0.04266        0.468        0.636
##   491     74       1    0.538 0.04271        0.461        0.629
##   515     73       1    0.531 0.04276        0.453        0.622
##   522     72       1    0.524 0.04280        0.446        0.615
##   526     71       1    0.516 0.04282        0.439        0.607
##   547     69       1    0.509 0.04285        0.431        0.600
##   583     68       1    0.501 0.04287        0.424        0.593
##   641     67       1    0.494 0.04288        0.416        0.585
##   653     66       1    0.486 0.04288        0.409        0.578
##   677     65       1    0.479 0.04286        0.402        0.571
##   704     64       1    0.471 0.04284        0.394        0.563
##   716     63       1    0.464 0.04281        0.387        0.556
##   732     62       1    0.456 0.04277        0.380        0.548
##   781     61       1    0.449 0.04272        0.372        0.541
##  1063     52       1    0.440 0.04276        0.364        0.533
##  1074     51       1    0.432 0.04278        0.355        0.524
##  1156     48       1    0.423 0.04282        0.346        0.515
##  1279     42       1    0.413 0.04297        0.336        0.506
##  1298     41       1    0.402 0.04308        0.326        0.496
##  2204      9       1    0.358 0.05696        0.262        0.489
cb <- confBands(y_bmt, type = "hall")
plot(fit1_bmt,
    main = 'Kaplan Meyer Plot with confidence bands')
lines(cb, col = "red",lty = 3)
legend(1000, 0.99, legend = c('K-M survival estimate',
'pointwise intervals', 'Hall-Werner conf bands'), lty = 1:3)

Cox Proportional Hazards Model

Next, I fit a Cox Proportional Hazards model, which makes use of several of the variables contained in the data set. I don’t pretend to have the best model here, or even a very good one. My intention is just to show the survival package’s coxph() function, and show how the covariates enter the formula. Note, however, that this model does achieve an R2 value of 0.8, and that three variables – ta, tc, and dc1 – show up as significant.

# Fit Cox Model
form <- formula(y_bmt ~ group + ta + tc + tp + dc + dp + 
                   z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z10)

cox_bmt <- coxph(form,data = bmt2)
summary(cox_bmt)
## Call:
## coxph(formula = form, data = bmt2)
## 
##   n= 137, number of events= 81 
## 
##              coef  exp(coef)   se(coef)      z Pr(>|z|)    
## group2  0.2540084  1.2891826  0.4342118  0.585   0.5586    
## group3  0.5280287  1.6955865  0.4197516  1.258   0.2084    
## ta     -0.0017962  0.9982054  0.0003736 -4.808 1.52e-06 ***
## tc     -0.0060005  0.9940175  0.0010512 -5.708 1.14e-08 ***
## tp     -0.0005709  0.9994293  0.0012622 -0.452   0.6510    
## dc1    -3.5832062  0.0277865  0.4938320 -7.256 3.99e-13 ***
## dp1    -1.2266468  0.2932744  0.4832251 -2.538   0.0111 *  
## z1      0.0109077  1.0109674  0.0230293  0.474   0.6358    
## z2     -0.0155304  0.9845895  0.0200202 -0.776   0.4379    
## z31     0.0162146  1.0163468  0.2480277  0.065   0.9479    
## z41     0.0334753  1.0340419  0.2722755  0.123   0.9021    
## z51     0.0792485  1.0824733  0.2856339  0.277   0.7814    
## z61    -0.0702048  0.9322029  0.2557463 -0.275   0.7837    
## z7      0.0000440  1.0000440  0.0004235  0.104   0.9173    
## z81     0.4618598  1.5870228  0.3386689  1.364   0.1726    
## z101    0.5145619  1.6729055  0.3330479  1.545   0.1223    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## group2   1.28918     0.7757   0.55044   3.01937
## group3   1.69559     0.5898   0.74478   3.86023
## ta       0.99821     1.0018   0.99747   0.99894
## tc       0.99402     1.0060   0.99197   0.99607
## tp       0.99943     1.0006   0.99696   1.00190
## dc1      0.02779    35.9887   0.01056   0.07314
## dp1      0.29327     3.4098   0.11375   0.75613
## z1       1.01097     0.9892   0.96635   1.05764
## z2       0.98459     1.0157   0.94670   1.02399
## z31      1.01635     0.9839   0.62506   1.65258
## z41      1.03404     0.9671   0.60642   1.76319
## z51      1.08247     0.9238   0.61842   1.89474
## z61      0.93220     1.0727   0.56470   1.53887
## z7       1.00004     1.0000   0.99921   1.00087
## z81      1.58702     0.6301   0.81716   3.08218
## z101     1.67291     0.5978   0.87092   3.21338
## 
## Concordance= 0.926  (se = 0.034 )
## Rsquare= 0.806   (max possible= 0.995 )
## Likelihood ratio test= 224.6  on 16 df,   p=0
## Wald test            = 107.5  on 16 df,   p=1.332e-15
## Score (logrank) test = 211.7  on 16 df,   p=0
cox_fit_bmt <- survfit(cox_bmt)
# plot(cox_fit_bmt)

Random Forests Model

To give some idea of the scope of R’s capabilities to work with time to event data, I use the ranger() function to fit a Random Forests Ensemble model to the data. Note that ranger() builds a model for each observation in the data set. The next block of code builds the model using the same variables used the the Cox model above, and each of the 137 survival curves computed for the bmt data set, along with a curve of average values.

# ranger model

# ranger model
r_fit_bmt <- ranger(form,
                    data = bmt2,
                    importance = "permutation",
                    seed = 1234)

# Average the survival models
death_times <- r_fit_bmt$unique.death.times
surv_prob <- data.frame(r_fit_bmt$survival)
avg_prob <- sapply(surv_prob,mean)

# Plot the survival models for each patient
plot(r_fit_bmt$unique.death.times,r_fit_bmt$survival[1,], type = "l", 
     ylim = c(0,1),
     col = "red",
     xlab = "death times",
     ylab = "survival",
     main = "Patient Survival Curves")

for(n in c(2:137)){
  lines(r_fit_bmt$unique.death.times, r_fit_bmt$survival[n,], type = "l", col = "red")
}
lines(death_times, avg_prob, lwd = 2)
legend(100, 0.2, legend = c('Averages - black'))

Here, we show the ranking of variable importance computed by the permutation method, which is ranger()’s default for survival data. Note that ta, tc, and dc are the same top three variables flagged in the Cox model.

Also listed is a measure of prediction error calculated from Harrell’s c-index. This index is defined as “… the proportion of all usable patient pairs in which the predictions and outcomes are concordant” (cf. [8] p 370), where predictions for pairs are concordant if predicted survival times are larger for patients who lived longer. Note that Harrell’s c-index may be thought of as a generalization of finding the are under an ROC curve. (For binary outcomes Harrell’s c-index reduces to the Wilcoxon-Mann-Whitney statistic which, in turn, is equivalent to computing the area under the ROC curve.)

vi <- data.frame(sort(round(r_fit_bmt$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)
##    importance
## ta     0.1259
## tc     0.0688
## dc     0.0190
## tp     0.0117
## dp     0.0092
## z2     0.0046
cat("Prediction Error = 1 - Harrell's c-index = ", r_fit_bmt$prediction.error)
## Prediction Error = 1 - Harrell's c-index =  0.09304771

Finally, we plot the survival curves computed for all three models on the same graph. Note that the “ad hoc” curve of average survival curves computed by the ranger model tracks the Kaplan-Meier curve fairly well.

# Set up for ggplot
km <- rep("KM", length(fit1_bmt$time))
km_df <- data.frame(fit1_bmt$time,fit1_bmt$surv,km)
names(km_df) <- c("Time","Surv","Model")

cox <- rep("Cox",length(cox_fit_bmt$time))
cox_df <- data.frame(cox_fit_bmt$time,cox_fit_bmt$surv,cox)
names(cox_df) <- c("Time","Surv","Model")

rf <- rep("RF",length(r_fit_bmt$unique.death.times))
rf_df <- data.frame(r_fit_bmt$unique.death.times,avg_prob,rf)
names(rf_df) <- c("Time","Surv","Model")
plot_df <- rbind(km_df,cox_df,rf_df)

p <- ggplot(plot_df, aes(x = Time, y = Surv, color = Model))
p + geom_line() + ggtitle("Comparison of Survival Curves") 

For a very nice exposition of the sort of predictive survival analysis modeling that can be done with ranger, be sure to have a look at Manuel Amunategui’s post and video.

This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform which has been under continuous development and improvement for nearly twenty years. The use of the Surv() function shows how open source code allows generations of developers to build on the work of their predecessors. The ranger packages provides a practical example of how R can incorporate fast C++ code and adapt to the world of machine learning applications, and the incidental use of options such as Hall-Wellner Confidence bands and Harrell’s c-index gives some idea of the statistical depth that underlies almost everything R.

References

For convenience, I have collected the references used throughout the post here.

[1] Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference. Cambridge University Press, 2nd ed. p11
[2] Andersen, P.K., Keiding, N. (1998) Survival analysis (overview) Encyclopedia of Biostatistics 6. Wiley, p 4452-4461
[3] Kaplan, E.L. & Meier, P. (1958). Non-parametric estimation from incomplete observations, J American Stats Assn. 53, 457–481, 562–563.
[4] Cox, D.R. (1972). Regression models and life-tables (with discussion), Journal of the Royal Statistical Society (B) 34, 187–220.
[5] Diez, David. Survival Analysis in R. OpenIntro [6] Klein, John P and Moeschberger, Melvin L. (1997) Survival Analysis Techniques for Censored and Truncated Data, Springer.
[7] Wright, Marvin & Ziegler, Andreas. (2017) * [ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R] (https://www.jstatsoft.org/article/view/v077i01)*, JSS Vol 77, Issue 1.
[8] Harrell, Frank, Lee, Kerry & Mark, Daniel. (1996) Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15, 361-387
[9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R

To leave a comment for the author, please follow the link and comment on their blog: R Views.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.