Cox proportional hazards with correlated variables
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
1 Introduction
Cox proportional hazards (CoxPH) model is a common approach to study the occurrence of an event as a factor of time. In medical studies, CoxPH is used to model patient survival based on disease type, gene expression, or treatment with a new drug. But CoxPH can be also used to analyze real-world business data. CoxPH analysis enables you to answer questions like:
- Scientific questions:
- Do female patients live longer than males?
- Does drug A increase the overall patient survival time?
- Does the high levels of Gene A significantly correlate with the survival outcome?
- Business questions:
- Do women customers spend more time in my store than men?
- Which of my employees retain clients for the longest time?
- Does the age of my clients correlate with their long term brand loyalty?
- What is the risk of failure in the production line with the new machine compared to the old one?
One of the most powerful features of CoxPH analysis is that it allows including multiple predictors in the model, and these predictors (variables) can have different types. For instance, you can analyze continuous variables (age, gene expression, etc) along with categorical variables (male-vs-female, drug-vs-control, etc). You can also model interactions between two predictors to study complex relationships between predictors (this could be useful if you are trying to understand whether or not a drug can work function synergistically in males, for instance).
I’m in the middle of analyzing a skin cancer data set right now, and I’m trying to understand which genes correlate with the survival outcome. There are several other variables that correlate with my genes of interest (also known as problem of co-linearity), and these tend to have a bias for long-vs-short term survival. Let’s simulate a situation like this to understand how to interpret CoxPH results when there is co-linear variables.
2 Simulate data
We will create a dataset with 500 observations which have survival times (time
variable) between 50 and 1000 days. The death
variable indicates whether or not the patient is dead at the time of measurement. 1
indicates the patient has passed, and the 0
indicates patient is still alive at that time point, but we don’t have any further information. This is called censoring, and it can happen if the patient visited the hospital at some point (ie. alive) but didn’t came back, for instance.
gene
variable in the data frame shows the expression levels of an imaginary gene that is associated with increased death.
Code
# Set seed for reproducibility set.seed(123) # Number of observations n <- 500 # Max time for survival measurement maxtime <- 1000 # Simulate data frame. # Sort the death and gene variables to create easy correlation # ie higher gene expressors are dying more df <- data.frame(death = sort(rbinom(n, 1, 0.75)), # 0(25%)-1(75%) distribution time = runif(n=n, min = 50, max=maxtime), gene = sort(rnorm(n, mean=4, sd=1))+rnorm(n, mean=0, sd=1))
Let’s add some positively and negatively correlated genes now. I will do that by adding some noise to the original gene
variable. I created 4 different correlated variables:
gene_high_cor
: High positive correlationgene_low_cor
: Low positive correlationgene_inv_high_cor
: High negative (inverse) correlationgene_inv_high_cor
: Low negative correlation
Code
# Create high/low correlated genes df$gene_high_cor <- df$gene + rnorm(n, mean=0, sd=1) df$gene_low_cor <- df$gene + rnorm(n, mean=0, sd=3) # Create genes with high/low inverse correlation df$gene_inv_high_cor <- - 1* (df$gene + rnorm(n, mean=0, sd=1)) df$gene_inv_low_cor <- -1 * (df$gene + rnorm(n, mean=0, sd=3)) cor(df[, grepl("gene", colnames(df))])
gene gene_high_cor gene_low_cor gene_inv_high_cor gene 1.0000000 0.7871349 0.4808360 -0.8273054 gene_high_cor 0.7871349 1.0000000 0.3757030 -0.6695418 gene_low_cor 0.4808360 0.3757030 1.0000000 -0.3654133 gene_inv_high_cor -0.8273054 -0.6695418 -0.3654133 1.0000000 gene_inv_low_cor -0.4356404 -0.3487226 -0.2231452 0.3734353 gene_inv_low_cor gene -0.4356404 gene_high_cor -0.3487226 gene_low_cor -0.2231452 gene_inv_high_cor 0.3734353 gene_inv_low_cor 1.0000000
Code
head(df, 3)
death time gene gene_high_cor gene_low_cor gene_inv_high_cor 1 0 385.9258 0.1944266 -0.62656010 -1.3403846 0.4843810 2 0 398.1194 0.3583453 0.05108805 1.0691589 -0.9326580 3 0 322.7451 1.4741020 0.57200395 -0.1506656 -0.7695874 gene_inv_low_cor 1 0.2564958 2 0.6249261 3 2.8703939
Code
tail(df, 3)
death time gene gene_high_cor gene_low_cor gene_inv_high_cor 498 1 421.9238 7.594208 7.871655 6.963697 -7.165028 499 1 724.1009 7.773971 8.310827 12.127814 -8.962983 500 1 153.3829 6.528585 6.068099 8.453239 -7.362879 gene_inv_low_cor 498 -5.396508 499 -3.626073 500 -12.044975
Code
plot(df)
3 Univariable CoxPH analysis
Next step is using CoxPH. Details about the fit is given under specific columns:
coef
: Regression coefficients in the log odds scale. Positive values indicate increased risk of death.exp(coef)
: Hazard ratio (HR) which is equal to the exponentiated coefficient (). Above 1 indicates increased risk (HR=1.2 means 20% higher risk; whereas HR=0.8 means 20% reduced risk).se(coef)
: Standard error of the coefficient estimatez
: Wald-statistic which is equal tocoef/se(coef)
p
: P-value of the estimatesLikelihood ratio test
: Indicates the global p-values for rejecting the null hypothesis (ie. all of the coefficients in the model is zero) . You can see other statistics about the model by callingsummary()
function on the CoxPH results.
Code
library(survival) coxph(Surv(time,death) ~ gene, data = df)
Call: coxph(formula = Surv(time, death) ~ gene, data = df) coef exp(coef) se(coef) z p gene 0.21486 1.23969 0.03554 6.045 1.49e-09 Likelihood ratio test=35.84 on 1 df, p=2.138e-09 n= 500, number of events= 381
Code
coxph(Surv(time,death) ~ gene_high_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene_high_cor, data = df) coef exp(coef) se(coef) z p gene_high_cor 0.15365 1.16609 0.03209 4.788 1.68e-06 Likelihood ratio test=22.76 on 1 df, p=1.839e-06 n= 500, number of events= 381
Code
coxph(Surv(time,death) ~ gene_low_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene_low_cor, data = df) coef exp(coef) se(coef) z p gene_low_cor 0.05600 1.05760 0.01629 3.437 0.000589 Likelihood ratio test=11.72 on 1 df, p=0.0006199 n= 500, number of events= 381
Code
coxph(Surv(time,death) ~ gene_inv_high_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene_inv_high_cor, data = df) coef exp(coef) se(coef) z p gene_inv_high_cor -0.12229 0.88489 0.02815 -4.344 1.4e-05 Likelihood ratio test=18.66 on 1 df, p=1.56e-05 n= 500, number of events= 381
Code
coxph(Surv(time,death) ~ gene_inv_low_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene_inv_low_cor, data = df) coef exp(coef) se(coef) z p gene_inv_low_cor -0.06808 0.93418 0.01635 -4.165 3.12e-05 Likelihood ratio test=17.45 on 1 df, p=2.946e-05 n= 500, number of events= 381
4 Multivariable CoxPH analysis
After seeing how the result looks like above, next I’d like to try survival modeling using other correlated predictors alone or in combination with each other.
4.1 gene + gene_high_cor
Code
# covariable in combo coxph(Surv(time,death) ~ gene + gene_high_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene + gene_high_cor, data = df) coef exp(coef) se(coef) z p gene 0.20482 1.22731 0.05652 3.624 0.00029 gene_high_cor 0.01150 1.01157 0.05038 0.228 0.81939 Likelihood ratio test=35.9 on 2 df, p=1.604e-08 n= 500, number of events= 381
4.2 gene + gene_low_cor
Code
# covariable in combo coxph(Surv(time,death) ~ gene + gene_low_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene + gene_low_cor, data = df) coef exp(coef) se(coef) z p gene 0.20134 1.22304 0.04078 4.938 7.91e-07 gene_low_cor 0.01230 1.01237 0.01840 0.668 0.504 Likelihood ratio test=36.29 on 2 df, p=1.317e-08 n= 500, number of events= 381
4.3 gene_high_cor + gene_low_cor
Code
# covariables in combo coxph(Surv(time,death) ~ gene_high_cor + gene_low_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene_high_cor + gene_low_cor, data = df) coef exp(coef) se(coef) z p gene_high_cor 0.13049 1.13939 0.03431 3.803 0.000143 gene_low_cor 0.03223 1.03276 0.01738 1.855 0.063579 Likelihood ratio test=26.19 on 2 df, p=2.056e-06 n= 500, number of events= 381
4.4 gene + gene_inv_high_cor
Code
# covariable in combo coxph(Surv(time,death) ~ gene + gene_inv_high_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene + gene_inv_high_cor, data = df) coef exp(coef) se(coef) z p gene 0.27951 1.32248 0.06439 4.341 1.42e-05 gene_inv_high_cor 0.06172 1.06366 0.05146 1.199 0.23 Likelihood ratio test=37.29 on 2 df, p=7.998e-09 n= 500, number of events= 381
4.5 gene + gene_inv_low_cor
Code
# covariable in combo coxph(Surv(time,death) ~ gene + gene_inv_low_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene + gene_inv_low_cor, data = df) coef exp(coef) se(coef) z p gene 0.18481 1.20299 0.04044 4.570 4.88e-06 gene_inv_low_cor -0.02845 0.97195 0.01842 -1.544 0.123 Likelihood ratio test=38.23 on 2 df, p=4.983e-09 n= 500, number of events= 381
4.6 gene_inv_high_cor + gene_inv_low_cor
Code
# covariables in combo coxph(Surv(time,death) ~ gene_inv_high_cor + gene_inv_low_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene_inv_high_cor + gene_inv_low_cor, data = df) coef exp(coef) se(coef) z p gene_inv_high_cor -0.09271 0.91146 0.03033 -3.057 0.00224 gene_inv_low_cor -0.04936 0.95184 0.01745 -2.829 0.00467 Likelihood ratio test=26.68 on 2 df, p=1.606e-06 n= 500, number of events= 381
4.7 gene_high_cor + gene_inv_high_cor
Code
# covariables in combo coxph(Surv(time,death) ~ gene_high_cor + gene_inv_high_cor, data = df)
Call: coxph(formula = Surv(time, death) ~ gene_high_cor + gene_inv_high_cor, data = df) coef exp(coef) se(coef) z p gene_high_cor 0.10934 1.11554 0.04239 2.579 0.0099 gene_inv_high_cor -0.05931 0.94241 0.03714 -1.597 0.1103 Likelihood ratio test=25.31 on 2 df, p=3.194e-06 n= 500, number of events= 381
4.8 All covariables together
Code
formula_pred <- formula( paste("Surv(time,death) ~", paste( grep("gene", colnames(df), value =T), collapse = " + ") ) ) # covariables in combo coxph(formula_pred, data = df)
Call: coxph(formula = formula_pred, data = df) coef exp(coef) se(coef) z p gene 0.217201 1.242594 0.085836 2.530 0.0114 gene_high_cor 0.015962 1.016091 0.050472 0.316 0.7518 gene_low_cor 0.009508 1.009553 0.018484 0.514 0.6070 gene_inv_high_cor 0.053296 1.054742 0.052402 1.017 0.3091 gene_inv_low_cor -0.027322 0.973048 0.018442 -1.481 0.1385 Likelihood ratio test=39.8 on 5 df, p=1.635e-07 n= 500, number of events= 381
4.9 All covariables made random and modelled together
You will see why I’m doing this in the next section.
Code
# make all covariables random df[, 4:ncol(df)] <- rnorm(n*(ncol(df)-3), mean = 20, sd = 10) # covariables in combo coxph(formula_pred, data = df)
Call: coxph(formula = formula_pred, data = df) coef exp(coef) se(coef) z p gene 2.172e-01 1.243e+00 3.580e-02 6.069 1.29e-09 gene_high_cor -5.234e-03 9.948e-01 5.211e-03 -1.004 0.315 gene_low_cor -9.997e-05 9.999e-01 5.430e-03 -0.018 0.985 gene_inv_high_cor -6.686e-03 9.933e-01 4.841e-03 -1.381 0.167 gene_inv_low_cor 3.418e-03 1.003e+00 5.076e-03 0.673 0.501 Likelihood ratio test=39.02 on 5 df, p=2.35e-07 n= 500, number of events= 381
5 Take home point
Not unexpectedly, when there are correlated variables in the survival model, the coefficients of individual predictors can change depending on what enters the model. Let’s try to digest this a bit. In our simulated example, variable gene
is closely associated with the survival outcome (ie. patients with high gene
are dying more ). The other covariables are generated so that they are correlated with the variable gene
, but there is no mathematical relationship between these covariables and the survival outcome. In other words, higher levels of gene
go in parallel with higher chances for observing death = 1
. After capturing this relationship, the addition of random noise to gene
to obtain gene_high_cor
(and other covariables) does not explain the changes vital status any further. You might have two correlated variables gene
and gene_high_cor
for example, but only one of them may be associated with the survival outcome whereas the other one is just a derivative of the first one.
This is why you see a “significant” HR value of 1.16609
in the univariable model involving gene_high_cor
in Section 3 (keep in mind that this covariable is positively correlated with the variable of interest, gene
); but this significance goes away when you model gene_high_cor
together with gene
in Section 4.1. The same can be said for multivariable analyses involving other covariables.
Finally, when you model all predictors together in Section 4.8, you see that the HR values of all the covariables converge to 1 (ie. no effect on survival) whereas HR value of gene
is 1.242594
. However, when you look at the p-value, albeit it is still significant (as defined by < 0.05), you’ll see that it is higher now compared to univariable analysis. This may be happening because our covariables could still explain vital status changes at least somewhat. To support this notion, when we made all covariables random and performed survival modeling in Section 4.9, the coefficient and the p-value of gene
looked quite similar to the univariable model in Section 3. I hope this post helped you wraped your mind around how CoxPH works when there are correlated variables involved.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.