Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
In a recent blog post we introduced 7 Interactive Bioinformatics Plots Made in Python and R.
Here I introduced 7 Interactive Plots from the Pharmaceutical Industry using the plotly R package. These plots are essential for any survival analysis study, where there is interest in time-to-events as often seen in the Pharmaceutical industry. For example, in clinical trials for determining drug efficacy.
Install Packages
We first install the required packages using the pacman R package which will install (if not already present), load, and update packages:
library(pacman) pacman::p_load(plotly) pacman::p_load(GGally) pacman::p_load(survival) pacman::p_load(cowplot) pacman::p_load(broom) pacman::p_load_current_gh("sahirbhatnagar/casebase") pacman::p_load(Epi) "%ni%" <- Negate("%in%")
1. Kaplan-Meier Curve
The Kaplan-Meier estimator, also known as the product limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data e.g. measure the fraction of subjects living for a certain amount of time after treatment. In clinical trials, the effect of an intervention is assessed by measuring the number of subjects survived or saved after that intervention over a period of time [ref].
It is standard practice to plot the Kaplan-Meier (KM) curve as a first step in your analysis. We use the lung
data from the survival R package and the GGally R package to plot the KM curve with confidence bands (note that the red ticks represent events, which in this example are deaths):
data(lung, package = "survival") sf_lung <- survival::survfit(survival::Surv(time, status) ~ 1, data = lung) p1 <- GGally::ggsurv(sf_lung, main = "Kaplan-Meier Curve for the NCCTG Lung Cancer Data") plotly::ggplotly(p1)
2. Stratified Kaplan-Meier Curve with Log-rank test
Often we want to test if there is a difference between two or more survival curves. In this example, we want to see is there is a difference in survival time between men and women. We use the survival::survdiff
function to perform a log-rank test and annotate the plot with the corresponding p-value:
lung <- transform(lung, sex = factor(sex, levels = 1:2, labels = c("Male","Female"))) sf_sex <- survival::survfit(Surv(time, status) ~ sex, data = lung) pl_sex <- GGally::ggsurv(sf_sex, main = "Kaplan-Meier Curve for the NCCTG Lung Cancer Data Stratified by Sex") log_rank_sex <- survival::survdiff(Surv(time, status) ~ sex, data = lung) pl_sex_annotated <- pl_sex + ggplot2::geom_text(aes(label = sprintf("log-rank test p-value: %0.2g", pchisq(log_rank_sex$chisq, df = 1, lower.tail = F)), x = 750, y = 0.9)) plotly::ggplotly(pl_sex_annotated)
3. Lexis Plot
Lexis diagrams provide a graphical representation of the relationships between demographic events in time and persons at risk and they also assist in calculating rates. Every demographic event is characterized by two numbers: the time (e.g., year) at which it occurs and the age (or other duration measure) of the person to whom it occurs [ref]. Time is on the x-axis and age is on the y-axis. A line segment is drawn for each individual in the cohort representing how long they were under observation, while each point on the plot represents an event. No points are plotted if a subject did not experience an event. We use the Epi R package for the Epi::Lexis.diagram
function to create an appropriately formatted dataset and the nickel
data which is a cohort of nickel smelters in South Wales.
data(nickel, package = "Epi") attach(nickel) LL <- Lexis.diagram( age=c(10,100), date=c(1900,1990), entry.age=age1st, exit.age=ageout, birth.date=dob, fail=(icd %in% c(162,163)), lwd.life=1, cex.fail=0.8, col.fail=c("green","red") ) LL[nickel$icd %in% c(162,163),"cause"] <- "lung" LL[nickel$icd %in% c(160),"cause"] <- "nasal" LL[nickel$icd %ni% c(160,162,163), "cause"] <- "other" lex_plot <- ggplot(LL, aes(x=entry.date, xend=exit.date, y=entry.age, yend=exit.age)) + xlab("Calendar time") + ylab("Age") + labs(title = "Lexis Diagram of Nickel Smelting Workers in South Wales")+ scale_y_continuous(breaks = seq(10,100,10)) + scale_x_continuous(breaks = seq(1900,1990,10)) + geom_segment(size=.4, colour="grey") + geom_point(aes(x = exit.date, y = exit.age, color = cause), data = LL[LL$cause %in% c("lung","nasal"),]) + scale_color_brewer(palette = "Set1") + theme(legend.position = "bottom") + theme(legend.title = element_blank()) + background_grid(major = "xy", minor = "xy",colour.major = "grey") ggplotly(lex_plot)
4. Cox-Snell Residuals Plot
In any regression analysis, it is important to verify that the modeling assumptions are reasonable. We can do this by looking at the residuals. In parametric survival models we can plot each covariate against the Cox-Snell (CS) residuals (note that CS residuals can also be used for semi-parametric models such as the Cox model). I have provided a dataset in this GitHub Gist which contains survival times for leukemia patients. The times are in weeks from diagnosis and there are two covariates: white blood cell count (wbc) at diagnosis and a binary covariate AG that indicates a positive or negative (positve=1, negative=0) test related to white blood cell characteristics.
I first build a Weibull regression model with suitable covariates using the survival::survreg
function. Then I calculate the CS residuals and plot them against each covariate. The error distribution should follow an exponential distribution with mean 1, i.e., the points should be randomly scattered around the yintercept=1 line if the assumptions are appropriate.
leuk <- read.table("https://gist.githubusercontent.com/sahirbhatnagar/0026f614c55d75521662c06db92e9332/raw/2d8f77fe76c07746fb67593214460e306f907a24/leuk", header = TRUE) # check if weibull distribution is appropriate #Test interaction term fit <- survreg(Surv(time, status) ~ AG + wbc + log(wbc)+ wbc*AG, data = leuk, x = T) #Cox-Snell Residuals for failure/censored observations cox_snell_residuals <- as.numeric(exp((fit$y[,1] - fit$x %*% fit$coefficients)/fit$scale) + 1 - leuk$status) #Cox-Snell Residuals: Leukemia Data plotly::subplot( plot_ly(x = log(leuk$wbc), y = cox_snell_residuals, name = "log White Blood Cell Count"), plot_ly(x = leuk$wbc, y = cox_snell_residuals, name = "White Blood Cell Count"), plot_ly(x = leuk$wbc*leuk$AG, y = cox_snell_residuals, name = "AG*White Blood Cell Count"), plot_ly(x = leuk$AG, y = cox_snell_residuals, name = "AG"), nrows = 2 ) %>% layout(title = 'Cox-Snell Residuals for Weibull Regression Model of Leukemia Data')
5. Cox Model Coefficient Plot
In a Cox Regression analysis, the resulting coefficients are of interest to determine the magnitude of effect of that predictor on survival time. A coefficient plot makes it easy to see these effects. We use the survival::coxph
function to fit a Cox model, and the broom R package to extract the hazard ratios and 95% confidence intervals:
cfit <- survival::coxph(Surv(time, status) ~ age + sex + inst + wt.loss + log(ph.karno) + meal.cal, data = lung) d <- broom::tidy(cfit, exponentiate = TRUE) %>% arrange(desc(estimate)) %>% mutate(term = factor(term, levels = term)) plot_ly(d, x = ~estimate, y = ~term) %>% add_markers(error_x = ~list(value = std.error)) %>% layout(title = 'Cox Model Hazard Ratio Estimates and 95% CI for Lung Data')
6. Population Time Plots
In order to try and visualize the incidence density of an event, we can look at a population-time plots (somewhat similar to waterfall plots): on the X-axis we have time, and on the Y-axis we have the size of the risk set at a particular time point. Failure times associated to the event of interest can then be highlighted on the plot using red dots. We use the casebase R package by Sahir Bhatnagar and Maxime Turgeon to create these plots.
We can right away draw a few conclusions from this plot: first of all, we get a sense of how quickly the size of the risk set changes over time. We also see that the incidence density is non-constant: most events occur before time 20. Finally, we also see that the risk set keeps shrinking after the last event has occured; this could be due to either censoring or the competing event.
bmt <- read.csv("https://raw.githubusercontent.com/sahirbhatnagar/casebase/master/inst/extdata/bmtcrr.csv") obj <- popTime(bmt, time = "ftime") p6 <- plot(obj) p6 %>% layout(title = 'Population Time Plot for Stem Cell Transplant Data')
7. Stratified Population Time Plots
We can also create stratified Population time plots. This allows us to see if there is differential follow-up between two groups (in this example its disease type ALL vs. AML), and also if the incidence density is somewhat different between both groups as indicated by the distribution of the red dots. We can also clearly see that the AML group has more follow up time than the ALL group (based on the amount of grey area).
obj <- popTime(bmt, exposure = "D", time = "ftime") p7 <- plot(obj) ggplotly(p7) %>% layout(title = 'Disease Type Stratified Population Time Plot for Stem Cell Transplant Data')
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.