Front-door adjustment: a superhero method for handling unobserved confounding by using mediators (if present) to estimate causal effects accurately.
Let’s DAG an estimand!
library(dagitty)

dag <- dagitty('dag {
bb="0,0,1,1"
U [pos="0.402,0.320"]
X [exposure,pos="0.176,0.539"]
Y [outcome,pos="0.614,0.539"]
Z [pos="0.409,0.539"]
U -> X
U -> Y
X -> Z
Z -> Y
}')

plot(dag)
You can easily draw a DAG on dagitty, copy the generated code, slot it into dag { code }, and voila!
- X is our treatment variable.
- Y is our outcome variable.
- U is an unmeasured confounder.
- Z is our mediator.
Assume We Know The Truth
library(tidyverse)
library(DT)

# pseudorandomization
set.seed(1)

# Set sample size
n <- 1000

# Set noise for each node
ux <- rnorm(n)
uy <- rnorm(n)
uz <- rnorm(n)
uu <- rnorm(n)

# Set each node's equation
u <- uu
x <- 0.4*u + ux
z <- 0.6*x + uz
y <- -0.5*z + uy + 0.6*u

# Create a table
df <- tibble(x=x, y=y, z=z, u=u)

# For easier viewing on blog
datatable(df)
The reason for assigning ux, uy, uz, and uu is to introduce some randomness into the equations. The magic happens when we write out the equations for u, x, z, and y. This is what we mean by “knowing the truth”: we know exactly what the coefficients of the equations are. For example, how much would z increase if x increased by 1 unit? The answer is 0.6, because that is the coefficient we set. We also know the exact values of the u variable, but we’ll assume that we don’t, and we’ll check whether front-door adjustment can help us.
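In fact, since we wrote the equations ourselves, we can read off the true total effect of x on y directly: x only reaches y through z, so in a linear model like this the total effect is simply the product of the two path coefficients.

# True total effect of x on y implied by our structural equations:
# x -> z has coefficient 0.6, z -> y has coefficient -0.5,
# and x has no other causal path to y
0.6 * -0.5

## [1] -0.3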
The Right Model ✅
In order for us to know what a wrong model is, we have to know what the right model is first. Since we know u, let’s model it and find the correct coefficient of x, holding u constant.
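We can also ask dagitty directly for a valid backdoor adjustment set (a quick check reusing the dag object from earlier; since we never marked U as latent in the DAG code, dagitty is allowed to propose it):

# dagitty suggests adjusting for U to identify the effect of X on Y
adjustmentSets(dag, exposure = "X", outcome = "Y")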
As the dagitty output above shows, if we have the luxury of adjusting for u, we essentially d-separate x and y along the backdoor path and can estimate the causal coefficient. Let’s put it into a linear regression model.
correct_model <- lm(y ~ x + u, df)
summary(correct_model)

## 
## Call:
## lm(formula = y ~ x + u, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4698 -0.7571 -0.0529  0.7869  3.8100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02431    0.03641  -0.668    0.505    
## x           -0.31841    0.03520  -9.045   <2e-16 ***
## u            0.61805    0.03810  16.220   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.151 on 997 degrees of freedom
## Multiple R-squared:  0.2142, Adjusted R-squared:  0.2126 
## F-statistic: 135.8 on 2 and 997 DF,  p-value: < 2.2e-16
Looking at the coefficient for x, the correct estimate is -0.3184144.
The Wrong Model ❌
Now, let’s look at the wrong model. Let’s say, naively, we fit just y ~ x. What would happen?
model_no_front <- lm(y ~ x, df)
summary(model_no_front)

## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1379 -0.8724 -0.0593  0.9068  4.0849 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.01287    0.04091  -0.315  0.75319   
## x           -0.09505    0.03641  -2.611  0.00917 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.294 on 998 degrees of freedom
## Multiple R-squared:  0.006784, Adjusted R-squared:  0.005789 
## F-statistic: 6.817 on 1 and 998 DF,  p-value: 0.009167
Did you notice that the x coefficient is -0.0950511?
Let’s look at another wrong model: what happens if we control for z?
model <- lm(y ~ z + x, df)
summary(model)

## 
## Call:
## lm(formula = y ~ z + x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0450 -0.8006  0.0104  0.8316  3.9043 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.005363   0.037782  -0.142    0.887    
## z           -0.483048   0.036701 -13.162  < 2e-16 ***
## x            0.216663   0.041125   5.268 1.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 997 degrees of freedom
## Multiple R-squared:  0.1538, Adjusted R-squared:  0.1521 
## F-statistic: 90.61 on 2 and 997 DF,  p-value: < 2.2e-16
Wait, what? Now x is 0.2166626??? The correct coefficient is -0.3184144, as we calculated previously. What is going on here? 🤪 Because z is a mediator, adjusting for it blocks the only causal path x -> z -> y, so the association that remains flows entirely through the unobserved confounder u. Here is a screenshot of wrongly adjusting for z.
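We can also inspect the paths in code rather than in a screenshot. This is a quick sketch using dagitty’s paths() helper, which lists every path between two variables and flags whether each one is open given a conditioning set:

# All paths from X to Y, unconditionally: both the causal path
# X -> Z -> Y and the biasing path X <- U -> Y are open
paths(dag, from = "X", to = "Y")

# Conditioning on Z closes the causal path and leaves
# only the biasing path through U open
paths(dag, from = "X", to = "Y", Z = "Z")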
What do we do? 🤷‍♂️
In Comes The Front-door Adjustment! If we’re lucky… 🍀
Let’s look at the equations needed to calculate the front-door adjustment. It is divided into three steps.

1. Estimate Z under do(X)

The game of do-calculus is figuring out how to get rid of the do operator. Here, no adjustment is needed:

\(p(Z|do(X)) = p(Z|X)\)
Why is the above identity true? Let’s go back to the DAG and change the outcome to z: the only backdoor path from x to z runs through the collider y. Can’t see it? Here you go: z -> y <- u -> x. Because y is a collider on this path, the path is already blocked without any adjustment. Sometimes writing the path out really helps.
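dagitty agrees that no adjustment is needed here; an empty adjustment set means \(p(Z|do(X))\) is just \(p(Z|X)\):

# The only backdoor path from X to Z is blocked by the collider Y,
# so the empty set is a valid adjustment set
adjustmentSets(dag, exposure = "X", outcome = "Z")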
# 1. Estimate Z under do(X)
model2 <- lm(z ~ x, df)
summary(model2)

## 
## Call:
## lm(formula = z ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5756 -0.6448 -0.0196  0.7013  2.7987 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01553    0.03258   0.477    0.634    
## x            0.64531    0.02900  22.254   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.03 on 998 degrees of freedom
## Multiple R-squared:  0.3317, Adjusted R-squared:  0.331 
## F-statistic: 495.2 on 1 and 998 DF,  p-value: < 2.2e-16
The x coefficient is 0.6453056.
2. Estimate Y under do(Z)

\(p(Y|do(Z)) = \sum_x p(Y|Z,x)\,p(x)\)
Why is the above identity true? Let’s go back to the DAG and change the outcome to y and the exposure to z. There is one backdoor path from z to y: y <- u -> x -> z. What do we need to adjust for to block it? x!!! The treatment itself, how uncanny!!!
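Again, a quick dagitty check confirms it:

# X blocks the backdoor path Z <- X <- U -> Y
adjustmentSets(dag, exposure = "Z", outcome = "Y")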
Let’s check out this model
model <- lm(y ~ z + x, df)
summary(model)

## 
## Call:
## lm(formula = y ~ z + x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0450 -0.8006  0.0104  0.8316  3.9043 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.005363   0.037782  -0.142    0.887    
## z           -0.483048   0.036701 -13.162  < 2e-16 ***
## x            0.216663   0.041125   5.268 1.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 997 degrees of freedom
## Multiple R-squared:  0.1538, Adjusted R-squared:  0.1521 
## F-statistic: 90.61 on 2 and 997 DF,  p-value: < 2.2e-16
The z coefficient is -0.4830482.
3. Put it together: x coefficient * z coefficient

The true x coefficient in relation to y is essentially the multiplication of the two estimates above!

\(p(Y|do(X)) = \sum_z p(Y|do(z))\,p(z|do(X)) = \sum_z p(z|X) \sum_{x'} p(Y|z,x')\,p(x')\)

In our linear model, this whole formula collapses to the product of the two regression coefficients.
Let’s see if this is true with our code and data.
# correct estimated coefficient
model$coefficients[["z"]] * model2$coefficients[["x"]]

## [1] -0.3117137
Our estimated x coefficient is -0.3117137, whereas our true x coefficient is -0.3184144, a difference of only about 0.0067. Not too shabby at all!!! Notice that we did not need the u data at all to estimate this. Awesomeness!!!
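As a sanity check, because we own the data-generating process, we can also estimate the interventional effect by brute force: re-run the simulation with x set by hand, which breaks the u -> x arrow. Here is a sketch reusing the structural equations from above (simulate_do is just a helper name made up for this post):

# Simulate do(X = x0): x is set by us, so u no longer influences it
simulate_do <- function(x0, n = 1e5) {
  u <- rnorm(n)
  z <- 0.6 * x0 + rnorm(n)
  y <- -0.5 * z + rnorm(n) + 0.6 * u
  mean(y)
}

# Average causal effect of one extra unit of x; hovers around -0.3
simulate_do(1) - simulate_do(0)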
Yes, it’s not going to be exactly the true estimate, but it’s close enough! You can mess around with n by increasing it, and you will get closer to the true estimate. Depending on your set.seed, the estimates (including the benchmark from the correct model) will come out differently too; give seed 123 a try.
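For example, here is a sketch that wraps the simulation and both front-door regressions into one function so you can play with n and the seed (front_door_estimate is a made-up name for this sketch):

front_door_estimate <- function(n, seed = 1) {
  set.seed(seed)
  u <- rnorm(n)
  x <- 0.4 * u + rnorm(n)
  z <- 0.6 * x + rnorm(n)
  y <- -0.5 * z + rnorm(n) + 0.6 * u
  df <- tibble(x = x, y = y, z = z)
  # front-door estimate: coef of x in z ~ x times coef of z in y ~ z + x
  coef(lm(z ~ x, df))[["x"]] * coef(lm(y ~ z + x, df))[["z"]]
}

front_door_estimate(1e6)        # converges toward the true effect of -0.3
front_door_estimate(1000, 123)  # try a different seed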
Lessons learnt
- Front-door adjustment is another great tool when an unobserved confounder is a problem
- Learning how to estimate the front-door formula from scratch is very helpful for understanding how the whole thing works
- To assess whether a method is beneficial, we can simulate equations and data and test the theory
- There are other packages, such as DoWhy, that will calculate all of this for you without the hassle of deriving your own formula
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on Twitter, GitHub, or Mastodon
- if you would like to collaborate, please feel free to contact me