What Happens If Our Model Adjustment Includes A Collider?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Beware of what we adjust. As we have demonstrated, adjusting for a collider variable can lead to a false estimate in your analysis. If a collider is included in your model, relying solely on AIC/BIC for model selection may provide misleading results and give you a false sense of achievement.
Let’s DAG out The True Causal Model
# Loading libraries library(ipw) library(tidyverse) library(dagitty) library(DT) # Creating the actual/known DAG dag2 <- dagitty('dag { bb="0,0,1,1" W [adjusted,pos="0.422,0.723"] X [exposure,pos="0.234,0.502"] Y [outcome,pos="0.486,0.498"] Z [pos="0.232,0.264"] collider [pos="0.181,0.767"] W -> X W -> Y X -> Y X -> collider Y -> collider Z -> W Z -> X } ') plot(dag2)
X: Treatment/Exposure variable/node.
Y: Outcome variable/node.
Z: Something… not sure what this is called 🤣.
W: Confounder.
collider: Collider, V-structure.
Let’s find out what is our minimal adjustment for total effect
adjust <- adjustmentSets(dag2) adjust ## { W }
Perfect! We only need to adjust W. Now we know which node is correct to adjust, let’s simulate data!
set.seed(1) # set number of observations n <- 100 z <- rnorm(n) w <- 0.6*z + rnorm(n) x <- 0.5*z + 0.2*w + rnorm(n) y <- 0.5*x + 0.4*w collider <- -0.4*x + -0.4*y # turning y and x into binary/categorical x1=ifelse(x>mean(x),1,0) y1=ifelse(y>mean(y),1,0) # create a dataframe for the simulated data df <- tibble(z=z,w=w,y=y1,x=x1, collider=collider) datatable(df)
The reason we converted x
and y
into binary variables was to replicate a question that is often encountered in our research. In many cases, both the treatment and outcome are binary. Hence, the transformation. However, it is possible to keep the variables as they are and use linear regression instead of logistic regression, which I’ll be using shortly. Additionally, I used the mean
to binarize both variables into 1
or 0
because they were approximately normally distributed, and I wanted to balance them. You can experiment with different threshold values if needed.
Simple Model (y ~ x)
model_c <- glm(y~x, data=df) summary(model_c) ## ## Call: ## glm(formula = y ~ x, data = df) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -0.8148 -0.1739 0.1852 0.1852 0.8261 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.17391 0.05721 3.040 0.00304 ** ## x 0.64090 0.07786 8.232 8.11e-13 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 0.15058) ## ## Null deviance: 24.960 on 99 degrees of freedom ## Residual deviance: 14.757 on 98 degrees of freedom ## AIC: 98.441 ## ## Number of Fisher Scoring iterations: 2
Note that our true x
coefficient
is 0.5
. Our current naive model shows 0.6409018
Correct Model (y ~ x + w) ✅
model_cz <- glm(y~x + w, data=df) summary(model_cz) ## ## Call: ## glm(formula = y ~ x + w, data = df) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -0.6073 -0.2124 -0.0082 0.2211 0.5647 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.25054 0.04420 5.668 1.48e-07 *** ## x 0.48667 0.06158 7.903 4.32e-12 *** ## w 0.24174 0.02808 8.609 1.34e-13 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 0.08623743) ## ## Null deviance: 24.960 on 99 degrees of freedom ## Residual deviance: 8.365 on 97 degrees of freedom ## AIC: 43.677 ## ## Number of Fisher Scoring iterations: 2
Wow, even the correct model models x
coefficient of 0.4866734. Not bad with an n
of 100
Alright, What About Everything Including Collider (y ~ x + z + w + collider) ❌
model_czwall <- glm(y~x + z + w + collider, data=df) summary(model_czwall) ## ## Call: ## glm(formula = y ~ x + z + w + collider, data = df) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -0.53681 -0.23608 -0.00639 0.20049 0.50698 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.354932 0.054370 6.528 3.25e-09 *** ## x 0.273625 0.090835 3.012 0.00332 ** ## z 0.004099 0.039375 0.104 0.91731 ## w 0.192690 0.032518 5.926 4.97e-08 *** ## collider -0.198703 0.066765 -2.976 0.00370 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 0.07997983) ## ## Null deviance: 24.9600 on 99 degrees of freedom ## Residual deviance: 7.5981 on 95 degrees of freedom ## AIC: 38.06 ## ## Number of Fisher Scoring iterations: 2
😱 x
is now 0.2736252. Not good!
Let’s Visualize All Models
load("df_model.rda") df_model |> mutate(color = case_when( str_detect(formula, "collider") ~ 1, TRUE ~ 0 )) |> ggplot(aes(x=formula,y=estimate,ymin=lower,ymax=upper)) + geom_point(aes(color=as.factor(color))) + geom_linerange(aes(color=as.factor(color))) + scale_color_manual(values = c("grey","red")) + coord_flip() + geom_hline(yintercept = 0.5) + theme_minimal() + theme(legend.position = "none")
The red lines represent models with colliders adjusted. It’s important to observe that none of these models contain the true value within their 95% confidence intervals. Adjusting for colliders can lead to biased estimates, particularly when the colliders directly affect both the treatment and outcome variables. Careful consideration should be given to the inclusion of colliders in the analysis to avoid potential distortions in the results.”
Let’s check IPW
ipw <- ipwpoint( exposure = x, family = "binomial", link = "logit", denominator = ~ w, data = as.data.frame(df)) model_cipw <- glm(y ~ x, data = df |> mutate(ipw=ipw$ipw.weights), weights = ipw) summary(model_cipw) ## ## Call: ## glm(formula = y ~ x, data = mutate(df, ipw = ipw$ipw.weights), ## weights = ipw) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.3954 -0.3649 0.3050 0.3581 1.5342 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.25974 0.06371 4.077 9.29e-05 *** ## x 0.46462 0.08946 5.194 1.12e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 0.3980486) ## ## Null deviance: 49.746 on 99 degrees of freedom ## Residual deviance: 39.009 on 98 degrees of freedom ## AIC: 131.05 ## ## Number of Fisher Scoring iterations: 2
x
is now 0.4646218. Quite similar to before. What if we just try a collider?
ipw <- ipwpoint( exposure = x, family = "binomial", link = "logit", denominator = ~ collider, data = as.data.frame(df)) model_cipw2 <- glm(y ~ x, data = df |> mutate(ipw=ipw$ipw.weights), weights = ipw) summary(model_cipw2) ## ## Call: ## glm(formula = y ~ x, data = mutate(df, ipw = ipw$ipw.weights), ## weights = ipw) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -1.9717 -0.3421 0.3829 0.3835 1.9563 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.34073 0.07161 4.758 6.73e-06 *** ## x 0.27633 0.09741 2.837 0.00554 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for gaussian family taken to be 0.315229) ## ## Null deviance: 33.429 on 99 degrees of freedom ## Residual deviance: 30.892 on 98 degrees of freedom ## AIC: 156.91 ## ## Number of Fisher Scoring iterations: 2
Youza! x
is now 0.2763335 with collider
. NOT GOOD !!!
Some clinical examples of adjusting for colliders that lead to d-connection include situations where the treatment and outcome have a common cause that is also adjusted for, such as when the outcome is mortality and the collider is the recurrence of a medical condition. In this scenario, adjusting for the common cause (recurrence of the condition) could lead to d-connection because both the treatment and mortality can directly affect the recurrence of the medical condition (where recurrence would likely decrease when mortality occurs).”
What If We Just Use Stepwise Regression?
datatable(df_model |> select(formula,aic,bic), options = list(order = list(list(2,"asc"))))
Wow, when sorted in ascending order, the first 2 lowest AIC values are associated with models that include colliders! This is surprising! It highlights the importance of caution when using stepwise regression or automated model selection techniques, especially if you are uncertain about the underlying causal model. Blindly relying on these methods without understanding the causal relationships can lead to misleading results.
Lessons learnt
- Having a well-defined Causal Estimand is crucial! It requires careful consideration of the clinical context and the specific question you want to address.
- Blindly adjusting for all available variables can be dangerous, as it may lead to spurious correlations. Selecting variables to adjust for should be guided by domain knowledge and causal reasoning.
- If you’re interested in accessing the code for all of the models click here
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on twitter, GitHub or Mastodon
- if you would like collaborate please feel free to contact me
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.