Linear Regression and ANOVA shaken and stirred (Part 2)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In the first part of this entry I did show some mtcars
examples, where am
can be useful to explain ANOVA as its observations are defined as:
Now I’ll show another example to continue the last example from Part 1 and I’ll move to something involved more variables.
ANOVA with one dummy variable
Consider a model where the outcome is mpg
and the design matrix is \(X = (\vec{1} \: \vec{x}_2)\).
From the last entry let
This will lead to this estimate:
Fitting the model gives:
y = mtcars$mpg x1 = mtcars$am x2 = ifelse(x1 == 1, 0, 1) fit = lm(y ~ x2) summary(fit) Call: lm(formula = y ~ x2) Residuals: Min 1Q Median 3Q Max -9.3923 -3.0923 -0.2974 3.2439 9.5077 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 24.392 1.360 17.941 < 2e-16 *** x2 -7.245 1.764 -4.106 0.000285 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 4.902 on 30 degrees of freedom Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385 F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
So to see the relationship between the estimates and the group means I need additional steps:
x0 = rep(1,length(y)) X = cbind(x0,x2) beta = solve(t(X)%*%X) %*% (t(X)%*%y) beta [,1] x0 24.392308 x2 -7.244939
I did obtain the same estimates with lm
command so now I calculate the group means:
x1 = ifelse(x1 == 0, NA, x1) x2 = ifelse(x2 == 0, NA, x2) m1 = mean(y*x1, na.rm = TRUE) m2 = mean(y*x2, na.rm = TRUE) beta0 = m1 beta2 = m2-m1 beta0;beta2 [1] 24.39231 [1] -7.244939
In this case this means that the slope for the two groups is the same but the intercept is different, and therefore exists a negative effect of automatic transmission on miles per gallon in average terms.
Again I'll verify the equivalency between lm
and aov
in this particular case:
y = mtcars$mpg x1 = mtcars$am x2 = ifelse(x1 == 1, 0, 1) fit2 = aov(y ~ x2) fit2$coefficients (Intercept) x2 24.392308 -7.244939
I can calculate the residuals by hand:
mean_mpg = mean(mtcars$mpg) fitted_mpg = fit3$coefficients[1] + fit3$coefficients[2]*mtcars$am observed_mpg = mtcars$mpg TSS = sum((observed_mpg - mean_mpg)^2) ESS = sum((fitted_mpg - mean_mpg)^2) RSS = sum((observed_mpg - fitted_mpg)^2) TSS;ESS;RSS [1] 1126.047 [1] 405.1506 [1] 720.8966
Here its verified that \(TSS = ESS + RSS\) but aside from that I can extract information from aov
:
summary(fit2) Df Sum Sq Mean Sq F value Pr(>F) x2 1 405.2 405.2 16.86 0.000285 *** Residuals 30 720.9 24.0 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And check that, as expected, \(ESS\) is the variance explained by x2
.
I also can run ANOVA over lm
with:
anova(fit) Analysis of Variance Table Response: y Df Sum Sq Mean Sq F value Pr(>F) x2 1 405.15 405.15 16.86 0.000285 *** Residuals 30 720.90 24.03 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The table provides information on the effect of x2
over y
. In this case the null hypothesis is rejected because of the large F-value and the associated p-values.
Considering a 0.05 significance threshold I can say, with 95% of confidence, that the regression slope is statistically different from zero or that there is a difference in group means between automatic and manual transmission.
ANOVA with three dummy variables
Now let's explore something more complex than am
. Reading the documentation I wonder if cyl
has an impact on mpg
so I explore that variable:
str(mtcars) 'data.frame': 32 obs. of 11 variables: $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... $ disp: num 160 160 108 258 360 ... $ hp : num 110 110 93 110 175 105 245 62 95 123 ... $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... $ wt : num 2.62 2.88 2.32 3.21 3.44 ... $ qsec: num 16.5 17 18.6 19.4 17 ... $ vs : num 0 0 1 1 0 1 0 1 1 1 ... $ am : num 1 1 1 0 0 0 0 0 0 0 ... $ gear: num 4 4 4 3 3 3 3 4 4 4 ... $ carb: num 4 4 1 1 2 1 4 2 2 4 ... unique(mtcars$cyl) [1] 6 4 8
One (wrong) possibility is to write:
y = mtcars$mpg x1 = mtcars$cyl; x1 = ifelse(x1 == 4, 1, 0) x2 = mtcars$cyl; x2 = ifelse(x1 == 6, 1, 0) x3 = mtcars$cyl; x3 = ifelse(x1 == 8, 1, 0) fit = lm(y ~ x1 + x2 + x3) summary(fit) Call: lm(formula = y ~ x1 + x2 + x3) Residuals: Min 1Q Median 3Q Max -6.2476 -2.2846 -0.4556 2.6774 7.2364 Coefficients: (2 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 16.6476 0.7987 20.844 < 2e-16 *** x1 10.0160 1.3622 7.353 3.44e-08 *** x2 NA NA NA NA x3 NA NA NA NA --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 3.66 on 30 degrees of freedom Multiple R-squared: 0.6431, Adjusted R-squared: 0.6312 F-statistic: 54.06 on 1 and 30 DF, p-value: 3.436e-08
Here the NAs
mean there are variables that are linearly related to the other variables (e.g. the variable pointed with NA
is an average of one or more of the rest of the variables, like \(x_2 = 2x_1 + x_3\) or another linear combination), then there's no unique solution to the regression without dropping variables.
But R has a command called as.factor()
that is useful in these cases and also can save you some lines of code in other cases:
fit2 = lm(mpg ~ as.factor(cyl), data = mtcars) summary(fit2) Call: lm(formula = mpg ~ as.factor(cyl), data = mtcars) Residuals: Min 1Q Median 3Q Max -5.2636 -1.8357 0.0286 1.3893 7.2364 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 26.6636 0.9718 27.437 < 2e-16 *** as.factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 *** as.factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 3.223 on 29 degrees of freedom Multiple R-squared: 0.7325, Adjusted R-squared: 0.714 F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
The aov
version of this is:
fit3 = aov(mpg ~ as.factor(cyl), data = mtcars) TukeyHSD(fit3) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = mpg ~ as.factor(cyl), data = mtcars) $`as.factor(cyl)` diff lwr upr p adj 6-4 -6.920779 -10.769350 -3.0722086 0.0003424 8-4 -11.563636 -14.770779 -8.3564942 0.0000000 8-6 -4.642857 -8.327583 -0.9581313 0.0112287
As I said many times in this entry, ANOVA is linear regression. Interpreting the coefficients is up to you.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.