Can adding an explanatory variable make previously insignificant ones significant?
You are doing regression analysis with lots of variables. You find one of them has a huge p-value. “Drop it,” a voice screams.
Not so fast. In this post, I show that there exists a certain category of explanatory variables (formally known as suppressors), the inclusion of which increases the explanatory power of other existing variables, so much so that insignificant ones might end up being significant.
Problem Formulation
Find $x_1$, $x_2$ and $y$ such that $x_1$ is insignificant under the specification:
\[y = \alpha + \beta_1 x_1 + \epsilon\]
but becomes significant in the presence of $x_2$:
\[y = \alpha' + \beta_1' x_1 + \beta_2' x_2 + \epsilon'\]
By definition, $x_2$ is the suppressor we are looking for.
A Stylized Example
One possible way to construct such a case is:
    n <- 100
    x1 <- rnorm(n, 0, 0.01)       # random variable drawn from a normal distribution
    x2 <- runif(n, 0, 10)         # random variable drawn from a uniform distribution
    epsilon <- rnorm(n, 0, 0.001) # small noise term
    y <- 3 + 1 * x1 + 1 * x2 + epsilon
Not surprisingly, $x_1$ by itself has negligible explanatory power:
y ~ x1 | Estimate | Std. Error | t value | p-value |
---|---|---|---|---|
(Intercept) | 8.0420 | 0.2835 | 28.368 | <2e-16 |
x1 | 32.3034 | 30.2206 | 1.069 | 0.288 |
However, things change dramatically once we bring $x_2$ into the equation:
y ~ x1 + x2 | Estimate | Std. Error | t value | p-value |
---|---|---|---|---|
(Intercept) | 3.000e+00 | 2.018e-04 | 14866.98 | <2e-16 |
x1 | 9.867e-01 | 1.052e-02 | 93.81 | <2e-16 |
x2 | 1.000e+00 | 3.497e-05 | 28599.16 | <2e-16 |
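The two fits above can be reproduced with a short sketch along these lines. The seed and the names `fit_short`, `fit_full`, `p_short`, `p_full`, `sigma_short` and `sigma_full` are assumptions introduced here for illustration; the original post does not state a seed, so the exact numbers will differ from the tables.

```r
# Sketch only: the seed is an assumption added for reproducibility,
# so estimates will not match the tables in the post exactly.
set.seed(1)

n <- 100
x1 <- rnorm(n, 0, 0.01)
x2 <- runif(n, 0, 10)
epsilon <- rnorm(n, 0, 0.001)
y <- 3 + 1 * x1 + 1 * x2 + epsilon

# Fit the model without and with the suppressor x2
fit_short <- lm(y ~ x1)
fit_full  <- lm(y ~ x1 + x2)

# p-value of x1 under each specification
p_short <- summary(fit_short)$coefficients["x1", "Pr(>|t|)"]
p_full  <- summary(fit_full)$coefficients["x1", "Pr(>|t|)"]

# Residual standard error under each specification
sigma_short <- summary(fit_short)$sigma
sigma_full  <- summary(fit_full)$sigma

print(c(p_short = p_short, p_full = p_full))
print(c(sigma_short = sigma_short, sigma_full = sigma_full))
```

Comparing the two residual standard errors side by side is a quick way to see how much noise $x_2$ removes before $x_1$ is tested.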
Discussion
The example works because $x_2$ completely outshines $x_1$ in its ability to explain the variation in $y$. In the absence of $x_2$, $x_1$’s usefulness is simply overwhelmed by the huge residuals.
In contrast, after controlling for $x_2$ (which already does an excellent job of explaining $y$), the small but non-zero remainder of $y$ is almost entirely driven by $x_1$.
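A rough back-of-the-envelope check makes this concrete. It uses the textbook OLS standard-error formula and assumes that $x_1$ and $x_2$ are essentially uncorrelated in the sample (they are drawn independently), so the figures are approximations rather than exact fitted values. Here $\hat\sigma$ denotes the residual standard error and $\mathrm{sd}(x_1)$ the standard deviation of $x_1$:
\[\mathrm{se}(\hat\beta_1) \approx \frac{\hat\sigma}{\sqrt{\sum_i (x_{1i} - \bar{x}_1)^2}} \approx \frac{\hat\sigma}{\sqrt{n}\,\mathrm{sd}(x_1)} = \frac{\hat\sigma}{\sqrt{100} \times 0.01} = 10\,\hat\sigma\]
Without $x_2$, the residual is dominated by $x_2$ itself, whose standard deviation is $10/\sqrt{12} \approx 2.89$:
\[\mathrm{se}(\hat\beta_1) \approx 10 \times 2.89 \approx 29\]
With $x_2$ included, only $\epsilon$ is left over, with standard deviation $0.001$:
\[\mathrm{se}(\hat\beta_1) \approx 10 \times 0.001 = 0.01\]
Both figures are of the same order as the reported standard errors (30.2206 and 1.052e-02). A true coefficient of 1 is invisible against a standard error of about 29, but sits roughly 100 standard errors away from zero when the standard error is about 0.01.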
Implications
The existence of suppressors calls for caution when we delete explanatory variables solely on the basis of their insignificance, although it can also be argued that omitting such an important regressor in the first place ($x_2$ in this case) is the greater sin.