Site icon R-bloggers

Cox proportional hazards with correlated variables

[This article was first published on Dr. Atakan Ekiz, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< !-- Google Tag Manager (noscript) --> < !-- End Google Tag Manager (noscript) -->

Image by Andreas (Pixabay)
< section id="introduction" class="level2" data-number="1">

1 Introduction

Cox proportional hazards (CoxPH) model is a common approach to study the occurrence of an event as a factor of time. In medical studies, CoxPH is used to model patient survival based on disease type, gene expression, or treatment with a new drug. But CoxPH can be also used to analyze real-world business data. CoxPH analysis enables you to answer questions like:

One of the most powerful features of CoxPH analysis is that it allows including multiple predictors in the model, and these predictors (variables) can have different types. For instance, you can analyze continuous variables (age, gene expression, etc) along with categorical variables (male-vs-female, drug-vs-control, etc). You can also model interactions between two predictors to study complex relationships between predictors (this could be useful if you are trying to understand whether or not a drug can work function synergistically in males, for instance).

I’m in the middle of analyzing a skin cancer data set right now, and I’m trying to understand which genes correlate with the survival outcome. There are several other variables that correlate with my genes of interest (also known as problem of co-linearity), and these tend to have a bias for long-vs-short term survival. Let’s simulate a situation like this to understand how to interpret CoxPH results when there is co-linear variables.

< section id="simulate-data" class="level2" data-number="2">

2 Simulate data

We will create a dataset with 500 observations which have survival times (time variable) between 50 and 1000 days. The death variable indicates whether or not the patient is dead at the time of measurement. 1 indicates the patient has passed, and the 0 indicates patient is still alive at that time point, but we don’t have any further information. This is called censoring, and it can happen if the patient visited the hospital at some point (ie. alive) but didn’t came back, for instance.

gene variable in the data frame shows the expression levels of an imaginary gene that is associated with increased death.

< details open=""> < summary>Code
# Set seed for reproducibility
set.seed(123)

# Number of observations
n <- 500

# Max time for survival measurement
maxtime <- 1000


# Simulate data frame. 
# Sort the death and gene variables to create easy correlation
# ie higher gene expressors are dying more 
df <- data.frame(death = sort(rbinom(n, 1, 0.75)), # 0(25%)-1(75%) distribution
                 time = runif(n=n, min = 50, max=maxtime),
                 gene = sort(rnorm(n, mean=4, sd=1))+rnorm(n, mean=0, sd=1))

Let’s add some positively and negatively correlated genes now. I will do that by adding some noise to the original gene variable. I created 4 different correlated variables:

< details open=""> < summary>Code
# Create high/low correlated genes
df$gene_high_cor <- df$gene + rnorm(n, mean=0, sd=1)
df$gene_low_cor <- df$gene + rnorm(n, mean=0, sd=3)

# Create genes with high/low inverse correlation 
df$gene_inv_high_cor <- - 1* (df$gene + rnorm(n, mean=0, sd=1))
df$gene_inv_low_cor <- -1 * (df$gene + rnorm(n, mean=0, sd=3))


cor(df[, grepl("gene", colnames(df))])
                        gene gene_high_cor gene_low_cor gene_inv_high_cor
gene               1.0000000     0.7871349    0.4808360        -0.8273054
gene_high_cor      0.7871349     1.0000000    0.3757030        -0.6695418
gene_low_cor       0.4808360     0.3757030    1.0000000        -0.3654133
gene_inv_high_cor -0.8273054    -0.6695418   -0.3654133         1.0000000
gene_inv_low_cor  -0.4356404    -0.3487226   -0.2231452         0.3734353
                  gene_inv_low_cor
gene                    -0.4356404
gene_high_cor           -0.3487226
gene_low_cor            -0.2231452
gene_inv_high_cor        0.3734353
gene_inv_low_cor         1.0000000
< details open=""> < summary>Code
head(df, 3)
  death     time      gene gene_high_cor gene_low_cor gene_inv_high_cor
1     0 385.9258 0.1944266   -0.62656010   -1.3403846         0.4843810
2     0 398.1194 0.3583453    0.05108805    1.0691589        -0.9326580
3     0 322.7451 1.4741020    0.57200395   -0.1506656        -0.7695874
  gene_inv_low_cor
1        0.2564958
2        0.6249261
3        2.8703939
< details open=""> < summary>Code
tail(df, 3)
    death     time     gene gene_high_cor gene_low_cor gene_inv_high_cor
498     1 421.9238 7.594208      7.871655     6.963697         -7.165028
499     1 724.1009 7.773971      8.310827    12.127814         -8.962983
500     1 153.3829 6.528585      6.068099     8.453239         -7.362879
    gene_inv_low_cor
498        -5.396508
499        -3.626073
500       -12.044975
< details open=""> < summary>Code
plot(df)

< section id="sec-univar" class="level2" data-number="3">

3 Univariable CoxPH analysis

Next step is using CoxPH. Details about the fit is given under specific columns:

< details open=""> < summary>Code
library(survival)

coxph(Surv(time,death) ~ gene, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene, data = df)

        coef exp(coef) se(coef)     z        p
gene 0.21486   1.23969  0.03554 6.045 1.49e-09

Likelihood ratio test=35.84  on 1 df, p=2.138e-09
n= 500, number of events= 381 
< details open=""> < summary>Code
coxph(Surv(time,death) ~ gene_high_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene_high_cor, data = df)

                 coef exp(coef) se(coef)     z        p
gene_high_cor 0.15365   1.16609  0.03209 4.788 1.68e-06

Likelihood ratio test=22.76  on 1 df, p=1.839e-06
n= 500, number of events= 381 
< details open=""> < summary>Code
coxph(Surv(time,death) ~ gene_low_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene_low_cor, data = df)

                coef exp(coef) se(coef)     z        p
gene_low_cor 0.05600   1.05760  0.01629 3.437 0.000589

Likelihood ratio test=11.72  on 1 df, p=0.0006199
n= 500, number of events= 381 
< details open=""> < summary>Code
coxph(Surv(time,death) ~ gene_inv_high_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene_inv_high_cor, data = df)

                      coef exp(coef) se(coef)      z       p
gene_inv_high_cor -0.12229   0.88489  0.02815 -4.344 1.4e-05

Likelihood ratio test=18.66  on 1 df, p=1.56e-05
n= 500, number of events= 381 
< details open=""> < summary>Code
coxph(Surv(time,death) ~ gene_inv_low_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene_inv_low_cor, data = df)

                     coef exp(coef) se(coef)      z        p
gene_inv_low_cor -0.06808   0.93418  0.01635 -4.165 3.12e-05

Likelihood ratio test=17.45  on 1 df, p=2.946e-05
n= 500, number of events= 381 
< section id="multivariable-coxph-analysis" class="level2" data-number="4">

4 Multivariable CoxPH analysis

After seeing how the result looks like above, next I’d like to try survival modeling using other correlated predictors alone or in combination with each other.

< section id="sec-covar1" class="level3" data-number="4.1">

4.1 gene + gene_high_cor

< details open=""> < summary>Code
# covariable in combo
coxph(Surv(time,death) ~ gene + gene_high_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene + gene_high_cor, data = df)

                 coef exp(coef) se(coef)     z       p
gene          0.20482   1.22731  0.05652 3.624 0.00029
gene_high_cor 0.01150   1.01157  0.05038 0.228 0.81939

Likelihood ratio test=35.9  on 2 df, p=1.604e-08
n= 500, number of events= 381 
< section id="sec-covar2" class="level3" data-number="4.2">

4.2 gene + gene_low_cor

< details open=""> < summary>Code
# covariable in combo
coxph(Surv(time,death) ~ gene + gene_low_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene + gene_low_cor, data = df)

                coef exp(coef) se(coef)     z        p
gene         0.20134   1.22304  0.04078 4.938 7.91e-07
gene_low_cor 0.01230   1.01237  0.01840 0.668    0.504

Likelihood ratio test=36.29  on 2 df, p=1.317e-08
n= 500, number of events= 381 
< section id="sec-covar3" class="level3" data-number="4.3">

4.3 gene_high_cor + gene_low_cor

< details open=""> < summary>Code
# covariables in combo
coxph(Surv(time,death) ~ gene_high_cor + gene_low_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene_high_cor + gene_low_cor, 
    data = df)

                 coef exp(coef) se(coef)     z        p
gene_high_cor 0.13049   1.13939  0.03431 3.803 0.000143
gene_low_cor  0.03223   1.03276  0.01738 1.855 0.063579

Likelihood ratio test=26.19  on 2 df, p=2.056e-06
n= 500, number of events= 381 
< section id="sec-covar4" class="level3" data-number="4.4">

4.4 gene + gene_inv_high_cor

< details open=""> < summary>Code
# covariable in combo
coxph(Surv(time,death) ~ gene + gene_inv_high_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene + gene_inv_high_cor, 
    data = df)

                     coef exp(coef) se(coef)     z        p
gene              0.27951   1.32248  0.06439 4.341 1.42e-05
gene_inv_high_cor 0.06172   1.06366  0.05146 1.199     0.23

Likelihood ratio test=37.29  on 2 df, p=7.998e-09
n= 500, number of events= 381 
< section id="sec-covar5" class="level3" data-number="4.5">

4.5 gene + gene_inv_low_cor

< details open=""> < summary>Code
# covariable in combo
coxph(Surv(time,death) ~ gene + gene_inv_low_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene + gene_inv_low_cor, 
    data = df)

                     coef exp(coef) se(coef)      z        p
gene              0.18481   1.20299  0.04044  4.570 4.88e-06
gene_inv_low_cor -0.02845   0.97195  0.01842 -1.544    0.123

Likelihood ratio test=38.23  on 2 df, p=4.983e-09
n= 500, number of events= 381 
< section id="sec-covar6" class="level3" data-number="4.6">

4.6 gene_inv_high_cor + gene_inv_low_cor

< details open=""> < summary>Code
# covariables in combo
coxph(Surv(time,death) ~ gene_inv_high_cor + gene_inv_low_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene_inv_high_cor + gene_inv_low_cor, 
    data = df)

                      coef exp(coef) se(coef)      z       p
gene_inv_high_cor -0.09271   0.91146  0.03033 -3.057 0.00224
gene_inv_low_cor  -0.04936   0.95184  0.01745 -2.829 0.00467

Likelihood ratio test=26.68  on 2 df, p=1.606e-06
n= 500, number of events= 381 
< section id="sec-covar7" class="level3" data-number="4.7">

4.7 gene_high_cor + gene_inv_high_cor

< details open=""> < summary>Code
# covariables in combo
coxph(Surv(time,death) ~ gene_high_cor + gene_inv_high_cor, data = df)
Call:
coxph(formula = Surv(time, death) ~ gene_high_cor + gene_inv_high_cor, 
    data = df)

                      coef exp(coef) se(coef)      z      p
gene_high_cor      0.10934   1.11554  0.04239  2.579 0.0099
gene_inv_high_cor -0.05931   0.94241  0.03714 -1.597 0.1103

Likelihood ratio test=25.31  on 2 df, p=3.194e-06
n= 500, number of events= 381 
< section id="sec-covar8" class="level3" data-number="4.8">

4.8 All covariables together

< details open=""> < summary>Code
formula_pred <- formula(
  paste("Surv(time,death) ~",
        paste(
          grep("gene", colnames(df), value  =T),
          collapse = " + ")
        )
  )

# covariables in combo
coxph(formula_pred, data = df)
Call:
coxph(formula = formula_pred, data = df)

                       coef exp(coef)  se(coef)      z      p
gene               0.217201  1.242594  0.085836  2.530 0.0114
gene_high_cor      0.015962  1.016091  0.050472  0.316 0.7518
gene_low_cor       0.009508  1.009553  0.018484  0.514 0.6070
gene_inv_high_cor  0.053296  1.054742  0.052402  1.017 0.3091
gene_inv_low_cor  -0.027322  0.973048  0.018442 -1.481 0.1385

Likelihood ratio test=39.8  on 5 df, p=1.635e-07
n= 500, number of events= 381 
< section id="sec-random-covar" class="level3" data-number="4.9">

4.9 All covariables made random and modelled together

Note

You will see why I’m doing this in the next section.

< details open=""> < summary>Code
# make all covariables random
df[, 4:ncol(df)] <- rnorm(n*(ncol(df)-3), mean = 20, sd = 10)

# covariables in combo
coxph(formula_pred, data = df)
Call:
coxph(formula = formula_pred, data = df)

                        coef  exp(coef)   se(coef)      z        p
gene               2.172e-01  1.243e+00  3.580e-02  6.069 1.29e-09
gene_high_cor     -5.234e-03  9.948e-01  5.211e-03 -1.004    0.315
gene_low_cor      -9.997e-05  9.999e-01  5.430e-03 -0.018    0.985
gene_inv_high_cor -6.686e-03  9.933e-01  4.841e-03 -1.381    0.167
gene_inv_low_cor   3.418e-03  1.003e+00  5.076e-03  0.673    0.501

Likelihood ratio test=39.02  on 5 df, p=2.35e-07
n= 500, number of events= 381 
< section id="take-home-point" class="level2" data-number="5">

5 Take home point

Not unexpectedly, when there are correlated variables in the survival model, the coefficients of individual predictors can change depending on what enters the model. Let’s try to digest this a bit. In our simulated example, variable gene is closely associated with the survival outcome (ie. patients with high gene are dying more ). The other covariables are generated so that they are correlated with the variable gene, but there is no mathematical relationship between these covariables and the survival outcome. In other words, higher levels of gene go in parallel with higher chances for observing death = 1. After capturing this relationship, the addition of random noise to gene to obtain gene_high_cor (and other covariables) does not explain the changes vital status any further. You might have two correlated variables gene and gene_high_cor for example, but only one of them may be associated with the survival outcome whereas the other one is just a derivative of the first one.

This is why you see a “significant” HR value of 1.16609 in the univariable model involving gene_high_cor in Section 3 (keep in mind that this covariable is positively correlated with the variable of interest, gene); but this significance goes away when you model gene_high_cor together with gene in Section 4.1. The same can be said for multivariable analyses involving other covariables.

Finally, when you model all predictors together in Section 4.8, you see that the HR values of all the covariables converge to 1 (ie. no effect on survival) whereas HR value of gene is 1.242594. However, when you look at the p-value, albeit it is still significant (as defined by < 0.05), you’ll see that it is higher now compared to univariable analysis. This may be happening because our covariables could still explain vital status changes at least somewhat. To support this notion, when we made all covariables random and performed survival modeling in Section 4.9, the coefficient and the p-value of gene looked quite similar to the univariable model in Section 3. I hope this post helped you wraped your mind around how CoxPH works when there are correlated variables involved.

Back to top
To leave a comment for the author, please follow the link and comment on their blog: Dr. Atakan Ekiz.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version