Suppressor: The Midas Touch

Wenyao

2 years ago

[This article was first published on Wenyao, and kindly contributed to R-bloggers].


Can adding an explanatory variable make previously insignificant ones significant?

You are doing regression analysis with lots of variables. You find one of them has a huge p-value. “Drop it,” a voice screams.

Not so fast. In this post, I show that there exists a certain category of explanatory variables (formally known as suppressors) whose inclusion increases the explanatory power of other existing variables, so much so that insignificant ones might end up being significant.

Problem Formulation

Find $x_1$, $x_2$ and $y$ such that $x_1$ is insignificant according to the specification:

\[y = \alpha + \beta_1 x_1 + \epsilon\]

but becomes significant in the presence of $x_2$:

\[y = \alpha ^ \prime + \beta_1 ^ \prime x_1 + \beta_{ 2 } ^ \prime x_2 + \epsilon ^ \prime\]

By definition, $x_2$ is the suppressor we are looking for.

A Stylized Example

One possible way to construct such a case would be:

n <- 100
x1 <- rnorm(n, 0, 0.01) # random variable drawn from normal distribution
x2 <- runif(n, 0, 10) # random variable drawn from uniform distribution
epsilon <- rnorm(n, 0, 0.001)
y <- 3 + 1 * x1 + 1 * x2 + epsilon

Not surprisingly, $x_1$ by itself has negligible explanatory power:

y ~ x1	Estimate	Std. Error	t value	p-value
(Intercept)	8.0420	0.2835	28.368	<2e-16
x1	32.3034	30.2206	1.069	0.288

However, things change dramatically once we bring $x_2$ into the equation:

y ~ x1 + x2	Estimate	Std. Error	t value	p-value
(Intercept)	3.000e+00	2.018e-04	14866.98	<2e-16
x1	9.867e-01	1.052e-02	93.81	<2e-16
x2	1.000e+00	3.497e-05	28599.16	<2e-16
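
The comparison can be sketched end to end as follows. This is a minimal sketch, not the author's exact script: the seed is an arbitrary assumption (the post does not state one), and the names `fit_alone`, `fit_full`, `p_alone`, `p_full`, `sigma_alone` and `sigma_full` are introduced here for illustration. The numbers will therefore differ from the tables above, and the p-value of $x_1$ in the $x_1$-only model depends on the particular draw.

```r
set.seed(42) # arbitrary seed (an assumption; the original post does not give one)

n <- 100
x1 <- rnorm(n, 0, 0.01)       # tiny-variance regressor
x2 <- runif(n, 0, 10)         # dominant regressor (the suppressor)
epsilon <- rnorm(n, 0, 0.001) # noise
y <- 3 + 1 * x1 + 1 * x2 + epsilon

# Fit both specifications
fit_alone <- lm(y ~ x1)
fit_full  <- lm(y ~ x1 + x2)

# p-value of x1 under each specification
p_alone <- summary(fit_alone)$coefficients["x1", "Pr(>|t|)"]
p_full  <- summary(fit_full)$coefficients["x1", "Pr(>|t|)"]

# Residual standard error under each specification
sigma_alone <- summary(fit_alone)$sigma
sigma_full  <- summary(fit_full)$sigma

print(c(p_alone = p_alone, p_full = p_full))
print(c(sigma_alone = sigma_alone, sigma_full = sigma_full))
```

Comparing the two residual standard errors side by side shows how much of the variation in $y$ is left unexplained when $x_2$ is omitted.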

Discussion

The reason behind this odd-looking result is that $x_2$ completely outshines $x_1$ in its ability to explain the variation in $y$. In the absence of $x_2$, $x_1$’s usefulness is simply overwhelmed by the huge residuals.

In contrast, after controlling for $x_2$ (which already does an excellent job of explaining $y$), the small but non-zero remaining variation in $y$ is almost entirely driven by $x_1$.
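
A rough back-of-the-envelope check makes this concrete. It assumes the textbook OLS standard-error formula and that $x_1$ and $x_2$ are nearly uncorrelated in the sample (they are drawn independently), so that adding $x_2$ barely changes the denominator. With $x_1$ drawn with standard deviation $0.01$ and $n = 100$:

\[\operatorname{se}(\hat\beta_1) \approx \frac{\hat\sigma}{\sqrt{\sum_i (x_{1i} - \bar{x}_1)^2}} \approx \frac{\hat\sigma}{0.01 \sqrt{n - 1}} \approx 10 \, \hat\sigma\]

Without $x_2$, the residuals absorb all of $x_2$'s variation, so $\hat\sigma$ is roughly the standard deviation of a uniform variable on $[0, 10]$, that is $10 / \sqrt{12} \approx 2.89$, which gives a standard error of about $29$, close to the reported $30.22$. With $x_2$ included, $\hat\sigma \approx 0.001$ and the standard error falls to about $0.01$, close to the reported $1.052 \times 10^{-2}$. Both estimates of $\beta_1$ are compatible with the true value of $1$; what changes is the precision, which improves by a factor of roughly $3000$.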

Implications

The existence of suppressors calls for caution when we delete explanatory variables solely on the basis of their insignificance, although it can also be argued that omitting such an important regressor in the first place ($x_2$ in this case) is the greater sin.
