An Interesting Aspect of the Omitted Variable Bias
Econometrics never ceases to surprise me. I just realized an interesting feature of the omitted variable bias. Consider the following model:

$$y = \beta x + \gamma z + \varepsilon_y$$

$$x = \alpha z + \varepsilon_x$$

Assume we want to estimate the causal effect $\beta$ of $x$ on $y$. However, we have an unobserved confounder $z$ that affects both $x$ and $y$. If we don't add the confounder $z$ as a control variable in the regression of $y$ on $x$, the OLS estimator of $\beta$ will be biased. That is the so-called omitted variable bias.
Let’s simulate a data set and illustrate the omitted variable bias:
n = 10000
alpha = beta = gamma = 1
z = rnorm(n,0,1)
eps.x = rnorm(n,0,1)
eps.y = rnorm(n,0,1)
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y

# Estimate short regression with z omitted
coef(lm(y~x))[2]
##        x 
## 1.486573
While the true causal effect $\beta$ is equal to 1, our OLS estimator where we omit $z$ is around 1.5. This means it has a positive bias of roughly 0.5.
Let's see what happens if we increase the impact of the confounder $z$ on $x$, say to $\alpha = 1000$.
alpha = 1000
x = alpha*z + eps.x
y = beta*x + gamma*z + eps.y
coef(lm(y~x))[2]
##        x 
## 1.000983
The bias is almost gone!
This result surprised me at first. I previously had the following intuition: an omitted variable is only a problem if it affects both $y$ and $x$. Thus the omitted variable bias probably becomes worse if the confounder $z$ affects $y$ or $x$ more strongly. While this intuition is correct for small $\alpha$, it is wrong once $\alpha$ is sufficiently large.
For our simulation, we can derive the following analytic formula for the (asymptotic) bias of the OLS estimator $\hat \beta$ in the short regression:
$$\text{asy.Bias}(\hat\beta) = \frac{\gamma \alpha \, \mathrm{Var}(z)}{\alpha^2 \mathrm{Var}(z) + \mathrm{Var}(\varepsilon_x)}$$
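As a quick numeric check (a small sketch of mine, not from the original derivation; the helper function ovb is hypothetical and reuses gamma = 1 from the simulation), the formula reproduces both simulated estimates:

# evaluate the asymptotic bias formula at the two alpha values used above
ovb = function(alpha, Var.z = 1, Var.eps.x = 1) {
  gamma * alpha * Var.z / (alpha^2 * Var.z + Var.eps.x)
}
ovb(1)     # 0.5: consistent with the estimate of about 1.49 for a true beta of 1
ovb(1000)  # about 0.001: consistent with the estimate of about 1.001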
Let’s plot the bias for different values of $\alpha$:
Var.z = Var.eps.x = 1
alpha = seq(0,10,by=0.1)
asym.bias = gamma*alpha * Var.z / (alpha^2*Var.z+Var.eps.x)
plot(alpha,asym.bias)
For small $\alpha$, the bias of $\hat\beta$ first quickly increases in $\alpha$. But it decreases in $\alpha$ once $\alpha$ is larger than 1 (in general, the bias peaks at $\alpha = \sqrt{\mathrm{Var}(\varepsilon_x)/\mathrm{Var}(z)}$, which equals 1 here). Indeed, the bias then slowly converges back to 0.
Intuitively, if $\alpha$ is large, the explanatory variable $x$ has a lot of variation and the confounder affects $y$ mainly through $x$. The larger $\alpha$ is, the less important the direct effect of $z$ on $y$ becomes relative to the effect that runs through $x$. The direct effect of $z$ on $y$ therefore biases the OLS estimator $\hat\beta$ of the short regression less and less.
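To put this intuition into numbers: the share of the variance of $x$ that is due to $z$ is $\frac{\alpha^2 \mathrm{Var}(z)}{\alpha^2 \mathrm{Var}(z) + \mathrm{Var}(\varepsilon_x)}$ (standard variance algebra, not a formula from the post). A minimal sketch, reusing Var.z and Var.eps.x from above:

alpha = c(0.5, 1, 10, 1000)
# share of Var(x) explained by the confounder z
share.z = alpha^2*Var.z / (alpha^2*Var.z + Var.eps.x)
round(share.z, 6)
## 0.200000 0.500000 0.990099 0.999999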
Typical presentation of the omitted variable bias formula
Note that the omitted variable bias formula is usually presented as follows:
$$\text{Bias}(\hat\beta) = \gamma \hat\delta$$

where $\hat\delta$ is the OLS estimate of $\delta$ in the linear regression

$$z = \text{const} + \delta x + u$$

(This bias formula is derived under the assumption that $x$ and $z$ are fixed, which allows one to compute the bias itself, not only the asymptotic bias.) If we solve the equation above for $x$, we can write it as
$$x = \widetilde{\text{const}} + \frac{1}{\delta} z + \tilde u$$

suggesting $\alpha \approx \frac{1}{\delta}$ and thus an approximate bias of $\frac{\gamma}{\alpha}$. (This argument is merely suggestive, not fully correct: the effects of swapping $y$ and $x$ in a simple linear regression can be a bit surprising; see my previous post.)
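For the simulated data, the decomposition underlying this bias formula can be checked numerically: by OLS algebra the in-sample identity $\hat\beta_{\text{short}} = \hat\beta_{\text{long}} + \hat\gamma \hat\delta$ holds exactly. A minimal sketch, reusing x, y and z from the simulation above:

long  = coef(lm(y ~ x + z))   # long regression including the confounder
aux   = coef(lm(z ~ x))       # auxiliary regression of z on x
short = coef(lm(y ~ x))       # short regression omitting z
short["x"] - long["x"]        # in-sample difference between short and long estimate
long["z"] * aux["x"]          # gamma.hat * delta.hat: identical by OLS algebra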
If we look at our previous formula for the asymptotic bias and consider the limit of no exogenous variation in $x$, i.e. $\mathrm{Var}(\varepsilon_x) \to 0$, we indeed get
$$\lim_{\mathrm{Var}(\varepsilon_x) \to 0} \text{asy.Bias}(\hat\beta) = \frac{\gamma}{\alpha}$$

However, the presence of exogenous variation in $x$ makes the bias formula more complicated. In particular, it has the effect that as long as $\alpha$ is still small, the bias increases in $\alpha$.
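A quick simulation sketch of this limiting case, setting the exogenous noise in x to zero and reusing z, beta, gamma and eps.y from above ($\alpha = 4$ is an arbitrary value I picked for illustration):

alpha = 4
x = alpha*z                   # Var(eps.x) = 0: x varies only through z
y = beta*x + gamma*z + eps.y
coef(lm(y~x))[2]              # close to beta + gamma/alpha = 1.25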
Appendix: Derivation of the asymptotic bias formula
Here is just a short derivation of the first asymptotic bias formula. We estimate a simple regression (just one explanatory variable):
$$y = \text{const} + \beta x + \eta$$

For example, the introductory textbook by Wooldridge shows in the chapter on OLS asymptotics that under relatively weak assumptions the asymptotic bias of the OLS estimator $\hat\beta$ in such a simple regression is given by
$$\text{asy.Bias}(\hat\beta) = \frac{\mathrm{Cov}(x, \eta)}{\mathrm{Var}(x)}$$

In our simulation, the error term of the short regression is given by
$$\eta = \gamma z + \varepsilon_y$$

and $x$ is given by
$$x = \alpha z + \varepsilon_x$$

where $\varepsilon_y$ and $\varepsilon_x$ are iid errors. We thus have
$$\mathrm{Cov}(x, \eta) = \alpha \gamma \, \mathrm{Var}(z)$$

and
$$\mathrm{Var}(x) = \alpha^2 \, \mathrm{Var}(z) + \mathrm{Var}(\varepsilon_x)$$

Hence we get the asymptotic bias formula
$$\text{asy.Bias}(\hat\beta) = \frac{\alpha \gamma \, \mathrm{Var}(z)}{\alpha^2 \, \mathrm{Var}(z) + \mathrm{Var}(\varepsilon_x)}$$
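As a final sanity check (a sketch of mine, reusing the draws z, eps.x, eps.y from the first simulation with $\alpha = 1$), the sample analogues reproduce this formula:

alpha = 1
x = alpha*z + eps.x
eta = gamma*z + eps.y                                # error term of the short regression
cov(x, eta) / var(x)                                 # sample analogue of asy.Bias, approx. 0.5
alpha*gamma*var(z) / (alpha^2*var(z) + var(eps.x))   # same formula with sample variances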