Two ways that correlation and stepwise regression can give different results
[Expanding on my recent answer on Cross Validated, aka stats.stackexchange.com]
In general, a correlation test is used to assess the association between two variables (y and z). However, if there is a third variable (x) that might be related to z or y, it makes sense to use stepwise regression (or partial correlation) instead. There are two quite different situations in which correlation and stepwise regression will produce different results. Here are some examples using made-up data.
Case 1. The correlation between z and y is driven entirely by both variables' association with x.

set.seed(1)
x <- rnorm(100)
z <- x + rnorm(100)
y <- x + rnorm(100, sd = 0.1)
dat.1 <- data.frame(x = x, y = y, z = z)
cor.test(~y + z, data = dat.1)
##
##  Pearson's product-moment correlation
##
## data:  y and z
## t = 9.058, df = 98, p-value = 1.332e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5518 0.7694
## sample estimates:
##   cor
## 0.675

anova(lm(y ~ x, data = dat.1), lm(y ~ x + z, data = dat.1))
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ x + z
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1     98 1.06
## 2     97 1.06  1    0.0026 0.24   0.63

In this case, the correlation test showed an association between z and y, but that association was really just a by-product of both variables' association with x. The stepwise regression revealed that z made no independent contribution to y after x was already included in the model.
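The partial correlation alternative mentioned above reaches the same conclusion here. Here is a minimal base R sketch (assuming dat.1 from the code above; r.y and r.z are just illustrative names): the correlation between the residuals of y and z, each regressed on x, is the partial correlation of y and z controlling for x. One caveat: cor.test on residuals uses n - 2 rather than n - 3 degrees of freedom, so the p-value is very slightly off, but the estimate itself is the partial correlation.

r.y <- resid(lm(y ~ x, data = dat.1))  # y with the variance due to x removed
r.z <- resid(lm(z ~ x, data = dat.1))  # z with the variance due to x removed
cor.test(r.y, r.z)  # near zero, matching the stepwise regression result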
The second, perhaps less obvious, case is when the relationship between z and y is masked by variance in y due to x. In other words, x and z are completely unrelated and both are related to y, but the variance in y due to x is very large. Here is an example of this situation:
Case 2. z and x are completely unrelated, both x and z contribute to y, but the variance of x is much larger than the variance of z (its standard deviation is 10 times larger, so its variance is 100 times larger).
set.seed(1)
x <- rnorm(100, sd = 10)
z <- rnorm(100)
y <- x + z + rnorm(100, sd = 0.1)
dat.2 <- data.frame(x = x, z = z, y = y)
cor.test(~y + z, data = dat.2)
##
##  Pearson's product-moment correlation
##
## data:  y and z
## t = 1.04, df = 98, p-value = 0.3009
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09387  0.29484
## sample estimates:
##    cor
## 0.1045

anova(lm(y ~ x, data = dat.2), lm(y ~ x + z, data = dat.2))
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ x + z
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1     98 90.9
## 2     97  1.1  1      89.9 8254 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this situation, the simple (bivariate) correlation between y and z did not reach significance, but their relationship emerged in the stepwise regression once the variance due to x was accounted for (in this simple case, partial correlation would also work). The bottom line is that when you are considering the relationship between an outcome and a predictor, and you know the outcome has an important relationship with some third variable, stepwise regression (or partial correlation) can (1) make sure an observed association is not due to the third variable and (2) reveal an association that would otherwise be masked by the third variable.
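To illustrate the partial correlation route for Case 2 as well, here is the same residual-based sketch (assuming dat.2 from the code above; r.y2 and r.z2 are just illustrative names). Removing the variance due to x unmasks the y-z relationship:

r.y2 <- resid(lm(y ~ x, data = dat.2))  # y with the variance due to x removed
r.z2 <- resid(lm(z ~ x, data = dat.2))  # z residualized on x (nearly unchanged, since z and x are unrelated)
cor.test(r.y2, r.z2)  # now clearly significant, matching the stepwise regression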