Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Confounding and collinearity
Introduction
In this blog article, I will discuss about the bias introduced in estimation of coefficient of a given explanatory variable due to the presence of confounding factors. After that, I will try to demonstrate about the effect of variable collinearity on estimation of coefficient.
Underlying structure
Let us assume that we know about the underlying relationship between the explanatory variable (E), outcome variable (O), confounder ( C ), variable correlated with E (VE) and variable correlated with O (VO).
Let us assume that the real underlying relation between various variables are as given below:
O = 2.E + 2.C + 2.VO + 0.VE + error E, C and VE are correlated with each other with correlation coef of 0.7 each. All the random variables are normally distributed, with mean 0 and sd 5. error is distributed as standard normal distribution.
The directed acyclic graph depicts the relationship between the variables.
Samples for simulation
library(MASS) library(dplyr) ## ## Attaching package: 'dplyr' ## ## The following objects are masked from 'package:Hmisc': ## ## src, summarize ## ## The following objects are masked from 'package:lubridate': ## ## intersect, setdiff, union ## ## The following object is masked from 'package:MASS': ## ## select ## ## The following objects are masked from 'package:stats': ## ## filter, lag ## ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union set.seed(100) sam <- mvrnorm(n = 10000, mu = c(0, 0, 0), Sigma = matrix(c(5, 4, 4, 4, 5, 4, 4, 4, 5), 3, 3, byrow = T)) df <- data.frame(E = sam[, 1], C = sam[, 2], VE = sam[, 3]) df <- mutate(df, VO = rnorm(10000, 0, 5), O = 2 * E + 2 * C + 2 * VO + rnorm(10000))
Scenarios
Scenario 1 (Only E as covariate, typical univariate analysis)
mod.E <- lm(O ~ E, data = df) summary(mod.E) ## ## Call: ## lm(formula = O ~ E, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -40.47 -7.05 -0.05 7.01 45.04 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.0069 0.1034 0.07 0.95 ## E 3.6420 0.0465 78.40 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 10.3 on 9998 degrees of freedom ## Multiple R-squared: 0.381, Adjusted R-squared: 0.381 ## F-statistic: 6.15e+03 on 1 and 9998 DF, p-value: <2e-16 confint(mod.E) ## 2.5 % 97.5 % ## (Intercept) -0.1959 0.2097 ## E 3.5509 3.7330
The model overestimates coefficient of E, its value is the sum of coef of E and C. This type of bias is known as bias due to confounding and occurs when confounders are not included as co-variates.
Scenario 2 (E and C as covariates)
This model is incorrect as it does not include the other covariate, VO.
mod.EC <- lm(O ~ E + C, data = df) summary(mod.EC) ## ## Call: ## lm(formula = O ~ E + C, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -35.99 -6.75 -0.11 6.71 43.32 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.0194 0.0998 -0.19 0.85 ## E 2.0293 0.0741 27.40 <2e-16 *** ## C 2.0231 0.0740 27.35 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.98 on 9997 degrees of freedom ## Multiple R-squared: 0.424, Adjusted R-squared: 0.424 ## F-statistic: 3.68e+03 on 2 and 9997 DF, p-value: <2e-16 confint(mod.EC) ## 2.5 % 97.5 % ## (Intercept) -0.215 0.1762 ## E 1.884 2.1745 ## C 1.878 2.1681
Due to addition of C, the estimate of E is unbiased. Exclusion of VO has not influenced the coefficient of E.
Scenario 3 (E and VO as covariates)
E and VO are not correlated. C has been excluded from the model.
mod.EVO <- lm(O ~ E + VO, data = df) summary(mod.EVO) ## ## Call: ## lm(formula = O ~ E + VO, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -9.638 -1.938 0.005 1.954 12.780 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.03893 0.02884 1.35 0.18 ## E 3.59175 0.01295 277.27 <2e-16 *** ## VO 2.00167 0.00581 344.39 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.88 on 9997 degrees of freedom ## Multiple R-squared: 0.952, Adjusted R-squared: 0.952 ## F-statistic: 9.88e+04 on 2 and 9997 DF, p-value: <2e-16 confint(mod.EVO) ## 2.5 % 97.5 % ## (Intercept) -0.01761 0.09546 ## E 3.56636 3.61714 ## VO 1.99028 2.01307
There is again overestimation of coefficient of E (as in scenario 1), due to exclusion of C. Estimate of VO is unbiased as it has no correlation with either E, VO or VE.
Scenario 4 (E and VE as covariates)
E and VE are correlated, but VE is not associated with O.
mod.EVE <- lm(O ~ E + VE, data = df) summary(mod.EVE) ## ## Call: ## lm(formula = O ~ E + VE, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -41.55 -7.00 -0.02 6.93 43.46 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.0106 0.1026 0.1 0.92 ## E 2.8584 0.0761 37.6 <2e-16 *** ## VE 0.9857 0.0761 12.9 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 10.3 on 9997 degrees of freedom ## Multiple R-squared: 0.391, Adjusted R-squared: 0.391 ## F-statistic: 3.21e+03 on 2 and 9997 DF, p-value: <2e-16 confint(mod.EVE) ## 2.5 % 97.5 % ## (Intercept) -0.1905 0.2117 ## E 2.7093 3.0075 ## VE 0.8364 1.1350
Estimates of both E and VE are biased. But coefficient of E is less biased than scenario 1. Reason for biasness of coefficient of VE is unknown to me.
Scenario 5 (E, VE, VO, C as covariates) (Complete model)
mod.comp <- lm(O ~ E + C + VO + VE, data = df) summary(mod.comp) ## ## Call: ## lm(formula = O ~ E + C + VO + VE, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.434 -0.677 -0.014 0.678 3.616 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.01297 0.01000 1.30 0.19 ## E 1.98794 0.00826 240.62 <2e-16 *** ## C 1.99993 0.00830 240.81 <2e-16 *** ## VO 2.00033 0.00202 992.51 <2e-16 *** ## VE 0.01214 0.00831 1.46 0.14 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1 on 9995 degrees of freedom ## Multiple R-squared: 0.994, Adjusted R-squared: 0.994 ## F-statistic: 4.29e+05 on 4 and 9995 DF, p-value: <2e-16 confint(mod.comp) ## 2.5 % 97.5 % ## (Intercept) -0.006635 0.03258 ## E 1.971745 2.00413 ## C 1.983651 2.01621 ## VO 1.996381 2.00428 ## VE -0.004159 0.02844
Unbiased coefficients of all the covariates.
Scenario 6 (E, VE, C as covariates)
mod.EVEC <- lm(O ~ E + VE + C, data = df) summary(mod.EVEC) ## ## Call: ## lm(formula = O ~ E + VE + C, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -36.18 -6.74 -0.08 6.71 43.21 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.0185 0.0998 -0.19 0.85 ## E 1.9892 0.0824 24.13 <2e-16 *** ## VE 0.0920 0.0830 1.11 0.27 ## C 1.9817 0.0829 23.92 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 9.98 on 9996 degrees of freedom ## Multiple R-squared: 0.424, Adjusted R-squared: 0.424 ## F-statistic: 2.45e+03 on 3 and 9996 DF, p-value: <2e-16 confint(mod.EVEC) ## 2.5 % 97.5 % ## (Intercept) -0.21411 0.1771 ## E 1.82758 2.1507 ## VE -0.07061 0.2546 ## C 1.81930 2.1441
Addition of C has removed the biasedness from the coefficients of E and VE (see Scenario 4).
Scenario 7 (E, VO and C as covariates)
mod.EVOC <- lm(O ~ E + VO + C, data = df) summary(mod.EVOC) ## ## Call: ## lm(formula = O ~ E + VO + C, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.427 -0.676 -0.014 0.680 3.641 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.01285 0.01000 1.29 0.2 ## E 1.99323 0.00742 268.50 <2e-16 *** ## VO 2.00036 0.00202 992.52 <2e-16 *** ## C 2.00539 0.00741 270.45 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1 on 9996 degrees of freedom ## Multiple R-squared: 0.994, Adjusted R-squared: 0.994 ## F-statistic: 5.72e+05 on 3 and 9996 DF, p-value: <2e-16 confint(mod.EVOC) ## 2.5 % 97.5 % ## (Intercept) -0.006751 0.03246 ## E 1.978682 2.00779 ## VO 1.996410 2.00431 ## C 1.990857 2.01993
Performance of this model is almost the same as Scenario 5
Performance of different scenarios
Models | Complete (E, C, VO, VE) | E, C, VE | E, C, VO | E, C | E, VE | E, VO | E only |
---|---|---|---|---|---|---|---|
E (2) | 1.9879, 0.0083 | 1.9892, 0.0824 | 1.9932, 0.0074 | 2.0293, 0.0741 | 2.8584, 0.0761 | 3.5918, 0.013 | 3.642, 0.0465 |
C (2) | 1.9999, 0.0083 | 1.9817, 0.0829 | 2.0054, 0.0074 | 2.0231, 0.074 | - | - | - |
VE (0) | 0.0121, 0.0083 | 0.092, 0.083 | - | - | 0.9857, 0.0761 | - | - |
VO (2) | 2.0003, 0.002 | - | 2.0004, 0.002 | - | - | 2.0017, 0.0058 | - |
E (2): means covariate E and its actual value.
–,–: coefficient, standard error.
Conclusions
Models with C, will have unbiased estimate of E. Models with correlated variables (E, C, VE) has larger SE due to near collinearity.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.