Survival Analysis with R
With roots dating back to at least 1662, when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, Survival Analysis is one of the oldest subfields of Statistics [1]. Basic life-table methods, including techniques for dealing with censored data, were known before 1700 [2]. In the early eighteenth century, the old masters, de Moivre working on annuities and Daniel Bernoulli studying competing risks for his work on smallpox inoculation, developed the foundations of time-to-event modeling [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing, Medicine, and many other application areas. So, it is not surprising that the R Task View on Survival Analysis, a curated, organized and annotated list of relevant R packages and functions, is formidable.
Looking at the Task View on a small screen is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. (I would be remiss not to mention that we all owe a great deal of gratitude to Arthur Allignol and Aurélien Latouche, the task view maintainers.) Newcomers, people either new to R or new to survival analysis or both, must find it overwhelming. So, it is with newcomers in mind that I offer the following slim trajectory through the task view that relies on just a few packages: survival, KMsurv, OIsurv and ranger.
The survival package, which began life as an S package in the late ’90s, is the cornerstone of the entire R Survival Analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R.
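To get a feel for this data structure before turning to the real data below, here is a minimal sketch with made-up follow-up times; the variable names and values are purely illustrative and are not part of the bmt analysis.

library(survival)

# Hypothetical follow-up times (in days) and event indicators:
# status = 1 means the event was observed, status = 0 means the time is censored.
times  <- c(5, 12, 20, 31, 31)
status <- c(1, 0, 1, 1, 0)

y <- Surv(times, status)
y   # censored times print with a trailing "+", e.g. 12+ and 31+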
KMsurv contains some interesting data sets from John Klein and Melvin Moeschberger’s classic text, Survival Analysis: Techniques for Censored and Truncated Data.
My main reason for selecting the OIsurv package was to draw attention to the very helpful guide to Survival Analysis in R produced by the folks at OpenIntro.
ranger might be the surprise in my very short list of survival packages. The ranger() function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But it was news to me that ranger() also builds survival models. Benchmarks indicate that ranger() is suitable for building time-to-event models with the large, high-dimensional data sets important to internet marketing applications. Since ranger() uses standard Surv() survival objects, it’s an ideal tool for getting acquainted with survival analysis in this machine learning age.
Load and Transform the Data
This first block of code loads the required packages along with the bone marrow transplant data frame from the KMsurv package. I chose this data set because it has a number of covariates and no missing values. The only data preparation needed is to turn the appropriate variables into factors.
library(survival)
library(dplyr)
library(OIsurv) # Automatically loads KMsurv
library(ranger)
library(ggplot2)

#------------
data(bmt)
# sapply(bmt, class)

# Prepare new data frame for modeling
bmt2 <- select(bmt, -c(t2, d2, d3))
bmt2 <- mutate(bmt2,
               group = as.factor(group),
               d1 = as.factor(d1),
               da = as.factor(da),
               dc = as.factor(dc),
               dp = as.factor(dp),
               z3 = as.factor(z3),
               z4 = as.factor(z4),
               z5 = as.factor(z5),
               z6 = as.factor(z6),
               z8 = as.factor(z8),
               z9 = as.factor(z9),
               z10 = as.factor(z10))
head(bmt2)

## group t1 d1 ta da tc dc tp dp z1 z2 z3 z4 z5 z6 z7 z8 z9 z10
## 1 1 2081 0 67 1 121 1 13 1 26 33 1 0 1 1 98 0 1 0
## 2 1 1602 0 1602 0 139 1 18 1 21 37 1 1 0 0 1720 0 1 0
## 3 1 1496 0 1496 0 307 1 12 1 26 35 1 1 1 0 127 0 1 0
## 4 1 1462 0 70 1 95 1 13 1 17 21 0 1 0 0 168 0 1 0
## 5 1 1433 0 1433 0 236 1 12 1 32 36 1 1 1 1 93 0 1 0
## 6 1 1377 0 1377 0 123 1 12 1 22 31 1 1 1 1 2187 0 1 0
Kaplan-Meier Analysis
The first thing to do is to use Surv() to build the standard survival object. The variable t1 records the time to death or the censored time; d1 indicates that the patient died (d1 = 1) or that the patient survived until the end of the study (d1 = 0). Note that a “+” after the time in the printout of y_bmt indicates censoring. The formula y_bmt ~ 1 instructs the survfit() function to fit a model with intercept only, and produces the Kaplan-Meier estimate.
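For readers new to the method, the Kaplan-Meier estimate is the familiar product-limit estimator: if d_i events occur among the n_i patients still at risk at observed event time t_i, then

\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)

Censored observations shrink the later risk sets n_i but contribute no factor of their own.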
The confBands() function from the OIsurv package estimates Hall-Wellner confidence bands. These are large-sample, simultaneous estimates and are plotted in red. They contrast with the point-wise confidence bands rendered as black dashed lines.
# Kaplan-Meier Survival Curve
y_bmt <- Surv(bmt$t1, bmt$d1)
y_bmt

## [1] 2081+ 1602+ 1496+ 1462+ 1433+ 1377+ 1330+ 996+ 226+ 1199+ 1111+
## [12] 530+ 1182+ 1167+ 418 417 276 156 781 172 487 716
## [23] 194 371 526 122 1279 110 243 86 466 262 162
## [34] 262 1 107 269 350 2569+ 2506+ 2409+ 2218+ 1857+ 1829+
## [45] 1562+ 1470+ 1363+ 1030+ 860+ 1258+ 2246+ 1870+ 1799+ 1709+ 1674+
## [56] 1568+ 1527+ 1324+ 957+ 932+ 847+ 848+ 1850+ 1843+ 1535+ 1447+
## [67] 1384+ 414 2204 1063 481 105 641 390 288 522 79
## [78] 1156 583 48 431 1074 393 10 53 80 35 1499+
## [89] 704 653 222 1356+ 2640+ 2430+ 2252+ 2140+ 2133+ 1238+ 1631+
## [100] 2024+ 1345+ 1136+ 845+ 491 162 1298 121 2 62 265
## [111] 547 341 318 195 469 93 515 183 105 128 164
## [122] 129 122 80 677 73 168 74 16 248 732 105
## [133] 392 63 97 153 363

fit1_bmt <- survfit(y_bmt ~ 1)
summary(fit1_bmt)

## Call: survfit(formula = y_bmt ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 137 1 0.993 0.00727 0.979 1.000
## 2 136 1 0.985 0.01025 0.966 1.000
## 10 135 1 0.978 0.01250 0.954 1.000
## 16 134 1 0.971 0.01438 0.943 0.999
## 35 133 1 0.964 0.01602 0.933 0.995
## 48 132 1 0.956 0.01748 0.923 0.991
## 53 131 1 0.949 0.01881 0.913 0.987
## 62 130 1 0.942 0.02003 0.903 0.982
## 63 129 1 0.934 0.02117 0.894 0.977
## 73 128 1 0.927 0.02222 0.884 0.972
## 74 127 1 0.920 0.02322 0.875 0.966
## 79 126 1 0.912 0.02415 0.866 0.961
## 80 125 2 0.898 0.02588 0.848 0.950
## 86 123 1 0.891 0.02668 0.840 0.944
## 93 122 1 0.883 0.02744 0.831 0.939
## 97 121 1 0.876 0.02817 0.822 0.933
## 105 120 3 0.854 0.03017 0.797 0.915
## 107 117 1 0.847 0.03078 0.788 0.909
## 110 116 1 0.839 0.03137 0.780 0.903
## 121 115 1 0.832 0.03193 0.772 0.897
## 122 114 2 0.818 0.03300 0.755 0.885
## 128 112 1 0.810 0.03350 0.747 0.879
## 129 111 1 0.803 0.03399 0.739 0.872
## 153 110 1 0.796 0.03445 0.731 0.866
## 156 109 1 0.788 0.03490 0.723 0.860
## 162 108 2 0.774 0.03575 0.707 0.847
## 164 106 1 0.766 0.03615 0.699 0.841
## 168 105 1 0.759 0.03653 0.691 0.834
## 172 104 1 0.752 0.03690 0.683 0.828
## 183 103 1 0.745 0.03726 0.675 0.821
## 194 102 1 0.737 0.03760 0.667 0.815
## 195 101 1 0.730 0.03793 0.659 0.808
## 222 100 1 0.723 0.03825 0.651 0.802
## 243 98 1 0.715 0.03856 0.644 0.795
## 248 97 1 0.708 0.03886 0.636 0.788
## 262 96 2 0.693 0.03943 0.620 0.775
## 265 94 1 0.686 0.03969 0.612 0.768
## 269 93 1 0.678 0.03995 0.604 0.761
## 276 92 1 0.671 0.04019 0.597 0.755
## 288 91 1 0.664 0.04042 0.589 0.748
## 318 90 1 0.656 0.04063 0.581 0.741
## 341 89 1 0.649 0.04084 0.574 0.734
## 350 88 1 0.642 0.04104 0.566 0.727
## 363 87 1 0.634 0.04122 0.558 0.720
## 371 86 1 0.627 0.04140 0.551 0.713
## 390 85 1 0.619 0.04156 0.543 0.706
## 392 84 1 0.612 0.04172 0.535 0.699
## 393 83 1 0.605 0.04186 0.528 0.693
## 414 82 1 0.597 0.04199 0.520 0.686
## 417 81 1 0.590 0.04212 0.513 0.679
## 418 80 1 0.583 0.04223 0.505 0.671
## 431 79 1 0.575 0.04234 0.498 0.664
## 466 78 1 0.568 0.04243 0.490 0.657
## 469 77 1 0.560 0.04252 0.483 0.650
## 481 76 1 0.553 0.04259 0.476 0.643
## 487 75 1 0.546 0.04266 0.468 0.636
## 491 74 1 0.538 0.04271 0.461 0.629
## 515 73 1 0.531 0.04276 0.453 0.622
## 522 72 1 0.524 0.04280 0.446 0.615
## 526 71 1 0.516 0.04282 0.439 0.607
## 547 69 1 0.509 0.04285 0.431 0.600
## 583 68 1 0.501 0.04287 0.424 0.593
## 641 67 1 0.494 0.04288 0.416 0.585
## 653 66 1 0.486 0.04288 0.409 0.578
## 677 65 1 0.479 0.04286 0.402 0.571
## 704 64 1 0.471 0.04284 0.394 0.563
## 716 63 1 0.464 0.04281 0.387 0.556
## 732 62 1 0.456 0.04277 0.380 0.548
## 781 61 1 0.449 0.04272 0.372 0.541
## 1063 52 1 0.440 0.04276 0.364 0.533
## 1074 51 1 0.432 0.04278 0.355 0.524
## 1156 48 1 0.423 0.04282 0.346 0.515
## 1279 42 1 0.413 0.04297 0.336 0.506
## 1298 41 1 0.402 0.04308 0.326 0.496
## 2204 9 1 0.358 0.05696 0.262 0.489

cb <- confBands(y_bmt, type = "hall")
plot(fit1_bmt, main = 'Kaplan-Meier Plot with confidence bands')
lines(cb, col = "red", lty = 3)
legend(1000, 0.99, legend = c('K-M survival estimate', 'pointwise intervals', 'Hall-Wellner conf bands'), lty = 1:3)
Cox Proportional Hazards Model
Next, I fit a Cox Proportional Hazards model, which makes use of several of the variables contained in the data set. I don’t pretend to have the best model here, or even a very good one. My intention is just to show the survival package’s coxph() function, and show how the covariates enter the formula. Note, however, that this model does achieve an R² value of 0.8, and that three variables (ta, tc, and dc1) show up as significant.
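For reference, the Cox model assumes that a patient with covariate vector x has hazard

h(t \mid x) = h_0(t) \, \exp(x^\top \beta)

where the baseline hazard h_0(t) is left unspecified and the coefficients \beta (the coef column below, with exp(coef) giving hazard ratios) are estimated by partial likelihood.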
# Fit Cox Model
form <- formula(y_bmt ~ group + ta + tc + tp + dc + dp + z1 + z2 + z3 + z4 + z5 + z6 + z7 + z8 + z10)
cox_bmt <- coxph(form, data = bmt2)
summary(cox_bmt)

## Call:
## coxph(formula = form, data = bmt2)
##
## n= 137, number of events= 81
##
## coef exp(coef) se(coef) z Pr(>|z|)
## group2 0.2540084 1.2891826 0.4342118 0.585 0.5586
## group3 0.5280287 1.6955865 0.4197516 1.258 0.2084
## ta -0.0017962 0.9982054 0.0003736 -4.808 1.52e-06 ***
## tc -0.0060005 0.9940175 0.0010512 -5.708 1.14e-08 ***
## tp -0.0005709 0.9994293 0.0012622 -0.452 0.6510
## dc1 -3.5832062 0.0277865 0.4938320 -7.256 3.99e-13 ***
## dp1 -1.2266468 0.2932744 0.4832251 -2.538 0.0111 *
## z1 0.0109077 1.0109674 0.0230293 0.474 0.6358
## z2 -0.0155304 0.9845895 0.0200202 -0.776 0.4379
## z31 0.0162146 1.0163468 0.2480277 0.065 0.9479
## z41 0.0334753 1.0340419 0.2722755 0.123 0.9021
## z51 0.0792485 1.0824733 0.2856339 0.277 0.7814
## z61 -0.0702048 0.9322029 0.2557463 -0.275 0.7837
## z7 0.0000440 1.0000440 0.0004235 0.104 0.9173
## z81 0.4618598 1.5870228 0.3386689 1.364 0.1726
## z101 0.5145619 1.6729055 0.3330479 1.545 0.1223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## group2 1.28918 0.7757 0.55044 3.01937
## group3 1.69559 0.5898 0.74478 3.86023
## ta 0.99821 1.0018 0.99747 0.99894
## tc 0.99402 1.0060 0.99197 0.99607
## tp 0.99943 1.0006 0.99696 1.00190
## dc1 0.02779 35.9887 0.01056 0.07314
## dp1 0.29327 3.4098 0.11375 0.75613
## z1 1.01097 0.9892 0.96635 1.05764
## z2 0.98459 1.0157 0.94670 1.02399
## z31 1.01635 0.9839 0.62506 1.65258
## z41 1.03404 0.9671 0.60642 1.76319
## z51 1.08247 0.9238 0.61842 1.89474
## z61 0.93220 1.0727 0.56470 1.53887
## z7 1.00004 1.0000 0.99921 1.00087
## z81 1.58702 0.6301 0.81716 3.08218
## z101 1.67291 0.5978 0.87092 3.21338
##
## Concordance= 0.926 (se = 0.034 )
## Rsquare= 0.806 (max possible= 0.995 )
## Likelihood ratio test= 224.6 on 16 df, p=0
## Wald test = 107.5 on 16 df, p=1.332e-15
## Score (logrank) test = 211.7 on 16 df, p=0

cox_fit_bmt <- survfit(cox_bmt)
# plot(cox_fit_bmt)
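A natural follow-up, not part of the original analysis above, is to check the proportional hazards assumption behind coxph(). A minimal sketch using the survival package’s cox.zph(), which tests the assumption with scaled Schoenfeld residuals:

# Check the proportional hazards assumption for the fitted Cox model
# (exploratory only; output not shown)
zph_bmt <- cox.zph(cox_bmt)
zph_bmt          # per-covariate and global tests; small p-values flag possible violations
# plot(zph_bmt)  # residual plots for a visual check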
Random Forests Model
To give some idea of the scope of R’s capabilities to work with time-to-event data, I use the ranger() function to fit a Random Forests ensemble model to the data. Note that ranger() computes a survival curve for each observation in the data set. The next block of code builds the model using the same variables used in the Cox model above, and plots each of the 137 survival curves computed for the bmt data set, along with a curve of the average survival probabilities.
# ranger model
r_fit_bmt <- ranger(form,
                    data = bmt2,
                    importance = "permutation",
                    seed = 1234)

# Average the survival models
death_times <- r_fit_bmt$unique.death.times
surv_prob <- data.frame(r_fit_bmt$survival)
avg_prob <- sapply(surv_prob, mean)

# Plot the survival models for each patient
plot(r_fit_bmt$unique.death.times, r_fit_bmt$survival[1,],
     type = "l",
     ylim = c(0,1),
     col = "red",
     xlab = "death times",
     ylab = "survival",
     main = "Patient Survival Curves")

for (n in c(2:137)) {
  lines(r_fit_bmt$unique.death.times, r_fit_bmt$survival[n,], type = "l", col = "red")
}

lines(death_times, avg_prob, lwd = 2)
legend(100, 0.2, legend = c('Averages - black'))
Here, we show the ranking of variable importance computed by the permutation method (specified via importance = "permutation" in the ranger() call above). Note that ta, tc, and dc are the same top three variables flagged in the Cox model.
Also listed is a measure of prediction error calculated from Harrell’s c-index. This index is defined as “… the proportion of all usable patient pairs in which the predictions and outcomes are concordant” (cf. [8] p 370), where predictions for a pair are concordant if the predicted survival time is larger for the patient who lived longer. Note that Harrell’s c-index may be thought of as a generalization of finding the area under an ROC curve. (For binary outcomes Harrell’s c-index reduces to the Wilcoxon-Mann-Whitney statistic which, in turn, is equivalent to computing the area under the ROC curve.)
vi <- data.frame(sort(round(r_fit_bmt$variable.importance, 4), decreasing = TRUE))
names(vi) <- "importance"
head(vi)

## importance
## ta 0.1259
## tc 0.0688
## dc 0.0190
## tp 0.0117
## dp 0.0092
## z2 0.0046

cat("Prediction Error = 1 - Harrell's c-index = ", r_fit_bmt$prediction.error)

## Prediction Error = 1 - Harrell's c-index =  0.09304771
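To make the definition of the c-index concrete, here is a small sketch that counts concordant usable pairs directly. The times, statuses, and risk scores are made up for illustration and have nothing to do with the bmt data.

# Toy illustration of Harrell's c-index (made-up data, not the bmt analysis)
time   <- c(2, 4, 6, 8)          # follow-up times
status <- c(1, 1, 0, 1)          # 1 = event observed, 0 = censored
risk   <- c(0.9, 0.3, 0.2, 0.4)  # hypothetical model risk scores (higher = predicted earlier failure)

usable <- 0
concordant <- 0
for (i in 1:(length(time) - 1)) {
  for (j in (i + 1):length(time)) {
    # Put the subject with the earlier observed time first
    a <- if (time[i] <= time[j]) i else j
    b <- if (time[i] <= time[j]) j else i
    # A pair is usable only if the earlier time is an actual event
    if (status[a] == 1 && time[a] < time[b]) {
      usable <- usable + 1
      if (risk[a] > risk[b]) concordant <- concordant + 1
      if (risk[a] == risk[b]) concordant <- concordant + 0.5  # tied predictions count as half
    }
  }
}
concordant / usable   # c-index for this toy example: 4/5 = 0.8

In practice one would use a packaged implementation rather than this loop; ranger reports 1 minus the c-index as its prediction error above.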
Finally, we plot the survival curves computed for all three models on the same graph. Note that the “ad hoc” average of the survival curves computed by the ranger model tracks the Kaplan-Meier curve fairly well.
# Set up for ggplot
km <- rep("KM", length(fit1_bmt$time))
km_df <- data.frame(fit1_bmt$time, fit1_bmt$surv, km)
names(km_df) <- c("Time", "Surv", "Model")

cox <- rep("Cox", length(cox_fit_bmt$time))
cox_df <- data.frame(cox_fit_bmt$time, cox_fit_bmt$surv, cox)
names(cox_df) <- c("Time", "Surv", "Model")

rf <- rep("RF", length(r_fit_bmt$unique.death.times))
rf_df <- data.frame(r_fit_bmt$unique.death.times, avg_prob, rf)
names(rf_df) <- c("Time", "Surv", "Model")

plot_df <- rbind(km_df, cox_df, rf_df)

p <- ggplot(plot_df, aes(x = Time, y = Surv, color = Model))
p + geom_line() + ggtitle("Comparison of Survival Curves")
For a very nice exposition of the sort of predictive survival analysis modeling that can be done with ranger, be sure to have a look at Manuel Amunategui’s post and video.
This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform, which has been under continuous development and improvement for nearly twenty years. The use of the Surv() function shows how open source code allows generations of developers to build on the work of their predecessors. The ranger package provides a practical example of how R can incorporate fast C++ code and adapt to the world of machine learning applications, and the incidental use of options such as Hall-Wellner confidence bands and Harrell’s c-index gives some idea of the statistical depth that underlies almost everything in R.
References
For convenience, I have collected the references used throughout the post here.
[1] Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability Induction and Statistical Inference. Cambridge University Press, 2nd ed. p11
[2] Andersen, P.K., Keiding, N. (1998) Survival analysis (overview) Encyclopedia of Biostatistics 6. Wiley, p 4452-4461
[3] Kaplan, E.L. & Meier, P. (1958). Non-parametric estimation from incomplete observations, J American Stats Assn. 53, 457–481, 562–563.
[4] Cox, D.R. (1972). Regression models and life-tables (with discussion), Journal of the Royal Statistical Society (B) 34, 187–220.
[5] Diez, David. Survival Analysis in R. OpenIntro.
[6] Klein, John P. & Moeschberger, Melvin L. (1997) Survival Analysis: Techniques for Censored and Truncated Data. Springer.
[7] Wright, Marvin & Ziegler, Andreas. (2017) ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, Vol 77, Issue 1. https://www.jstatsoft.org/article/view/v077i01
[8] Harrell, Frank, Lee, Kerry & Mark, Daniel. (1996) Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15, 361-387
[9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R