Econometrics never ceases to surprise me. I just realized an interesting feature of the omitted variable bias. Consider the following model: we want to estimate the causal effect $\beta$ of $x$ on $y$. However, we have an unobserved confounder $z$ that affects both $x$ and $y$. If we don’t add the confounder $z$ as a control variable in the regression of $y$ on $x$, the OLS estimator of $\beta$ will be biased. That is the so-called omitted variable bias.
Let’s simulate a data set and illustrate the omitted variable bias:
n = 10000
alpha = beta = gamma = 1
z = rnorm(n,0,1)
eps.x = rnorm(n,0,1)
eps.y = rnorm(n,0,1)
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y

# Estimate short regression with z omitted
coef(lm(y~x))[2]
##        x 
## 1.486573
While the true causal effect $\beta$ is equal to 1, our OLS estimate with $z$ omitted is around 1.5. It thus has a positive bias of roughly 0.5.
Let’s see what happens if we increase the impact of the confounder $z$ on $x$, say to $\alpha = 1000$.
alpha = 1000
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
coef(lm(y~x))[2]
##        x 
## 1.000983
The bias is almost gone!
This result surprised me at first. I previously had the following intuition: an omitted variable is only a problem if it affects both $y$ and $x$. Thus the omitted variable bias probably becomes worse the more strongly the confounder $z$ affects $y$ or $x$. While this intuition is correct for small $\alpha$, it is wrong once $\alpha$ is sufficiently large.
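Before deriving this analytically, here is a quick numerical sketch of my own, reusing the simulated variables from above: re-estimate the short regression on a grid of values for $\alpha$ and plot the realized bias.

# Re-estimate the short regression for a grid of alpha values,
# reusing z, eps.x, eps.y, beta and gamma from the simulation above
alpha.grid = seq(0, 10, by = 0.5)
sim.bias = sapply(alpha.grid, function(a) {
  x = a*z + eps.x
  y = beta*x + gamma*z + eps.y
  coef(lm(y~x))[2] - beta
})
plot(alpha.grid, sim.bias)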
For our simulation, we can derive the following analytic formula for the (asymptotic) bias of the OLS estimator $\hat \beta$ in the short regression:
\[asy. \; Bias(\hat \beta) = \gamma\alpha\frac{Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_x)}\]
Let’s plot the bias for different values of $\alpha$:
Var.z = Var.eps.x = 1
alpha = seq(0,10,by=0.1)
asym.bias = gamma*alpha * Var.z / (alpha^2*Var.z+Var.eps.x)
plot(alpha,asym.bias)
For small $\alpha$, the bias of $\hat \beta$ at first increases quickly in $\alpha$. But once $\alpha$ is larger than 1, the bias decreases in $\alpha$ and slowly converges back to 0.
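Where exactly is the peak? Setting the derivative of the bias formula with respect to $\alpha$ to zero (a small calculation of my own) gives $\alpha^* = \sqrt{Var(\varepsilon_x)/Var(z)}$ and a maximal bias of $\frac{\gamma}{2}\sqrt{Var(z)/Var(\varepsilon_x)}$:

# Peak of the analytic bias curve
alpha.peak = sqrt(Var.eps.x / Var.z)          # here: 1
max.bias = gamma/2 * sqrt(Var.z / Var.eps.x)  # here: 0.5, matching the simulated bias above
c(alpha.peak, max.bias)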
Intuitively, if $\alpha$ is large, the explanatory variable $x$ has a lot of variation and the confounder affects $y$ mainly through $x$. The larger $\alpha$ is, the less important the direct effect of $z$ on $y$ becomes relative to the effect through $x$. The direct effect of $z$ on $y$ thus biases the OLS estimator $\hat \beta$ of the short regression less and less.
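One way to make this intuition concrete (my own sketch): as $\alpha$ grows, $x$ and $z$ become almost perfectly correlated, so the regressor $x$ itself absorbs nearly all variation of the confounder.

# Correlation between x and z for a small and a large alpha
cor(1*z + eps.x, z)     # alpha = 1: roughly 0.71
cor(1000*z + eps.x, z)  # alpha = 1000: essentially 1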
Typical presentation of the omitted variable bias formula
Note that the omitted variable bias formula is usually presented as follows:
\[Bias(\hat \beta) = \gamma \hat \delta\]
where $\hat \delta$ is the OLS estimate from the linear regression
\[z = const + \delta x + u\]
(This bias formula is derived under the assumption that $x$ and $z$ are fixed, which allows us to compute the exact bias, not only the asymptotic bias.)
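We can verify this version of the formula numerically (my own check, regenerating the data with the original $\alpha = 1$). The realized estimation error of the short regression should be close to $\gamma \hat \delta$, up to sampling noise from $\varepsilon_y$:

# Check: realized bias of the short regression vs. gamma * delta.hat
alpha = 1
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
delta.hat = coef(lm(z ~ x))[2]
gamma * delta.hat           # roughly 0.49 ...
coef(lm(y ~ x))[2] - beta   # ... and nearly the same value here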
If we solve the regression equation above for $x$, we can write it as
\[x=\tilde{const} + \frac{1}{\delta} z + \tilde u\]
suggesting $\alpha \approx \frac{1}{\delta}$ and thus an approximate bias of $\frac{\gamma}{\alpha}$. (This argument is only suggestive, not fully correct: the effects of swapping $y$ and $x$ in a simple linear regression can be a bit surprising, see my previous post.)
If we look at our earlier formula for the asymptotic bias and consider the limit of no exogenous variation in $x$, i.e. $Var(\varepsilon_x) = 0$, we indeed get
\[\lim_{Var(\varepsilon_x)\rightarrow 0 } asy. \; Bias(\hat \beta) = \frac{\gamma}{\alpha}\]
However, the presence of exogenous variation in $x$ makes the bias formula more complicated. In particular, it has the effect that the bias increases in $\alpha$ as long as $\alpha$ is still small.
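A quick sketch of this limit case (my addition): if we remove all exogenous variation from $x$, then $y = (\beta + \frac{\gamma}{\alpha})x + \varepsilon_y$ holds exactly, and the short regression estimates $\beta + \frac{\gamma}{\alpha}$:

# Limit case: no exogenous variation in x
alpha = 2
x = alpha*z                  # Var(eps.x) = 0
y = beta*x + gamma*z + eps.y
coef(lm(y~x))[2]             # roughly beta + gamma/alpha = 1.5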
Appendix: Derivation of the asymptotic bias formula
Here is just a short derivation of the first asymptotic bias formula. We estimate a simple regression (just one explanatory variable):
\[y=const+\beta x+\eta\]
For example, the introductory textbook by Wooldridge shows in the chapter on OLS asymptotics that, under relatively weak assumptions, the asymptotic bias of the OLS estimator $\hat{\beta}$ in such a simple regression is given by
\[asy.\; Bias(\hat{\beta})=\frac{Cov(x,\eta)}{Var(x)}\]
In our simulation, the error term of the short regression is given by
\[\eta=\gamma z+\varepsilon_{y}\]
and $x$ is given by
\[x=\alpha z+\varepsilon_{x}\]
where $\varepsilon_{y}$ and $\varepsilon_{x}$ are iid errors. We thus have
\[Cov(x,\eta)=\alpha\gamma Var(z)\]
and
\[Var(x)=\alpha^{2}Var(z)+Var(\varepsilon_{x})\]
Hence we get the asymptotic bias formula
\[asy.\; Bias(\hat{\beta})=\alpha\gamma\frac{Var(z)}{\alpha^{2}Var(z)+Var(\varepsilon_{x})}\]
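As a final sanity check (my own addition), we can compare the empirical moments and the realized bias with this formula, e.g. for $\alpha = 3$, where it predicts a bias of $3/(9+1) = 0.3$:

# Compare empirical moments and realized bias with the analytic formula
alpha = 3
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
eta = gamma*z + eps.y                            # error term of the short regression
cov(x,eta) / var(x)                              # empirical Cov(x,eta)/Var(x)
alpha*gamma*Var.z / (alpha^2*Var.z+Var.eps.x)    # analytic formula: 0.3
coef(lm(y~x))[2] - beta                          # realized bias of the short regression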