Linear Regression and ANOVA shaken and stirred


Most of the time, Linear Regression and ANOVA are understood as separate concepts. The truth is that they are closely related: ANOVA is a particular case of Linear Regression.

Even worse, it is quite common for students to memorize equations and tests instead of trying to understand the Linear Algebra and Statistics concepts that can keep them away from misleading results, but that is material for another entry.

Most textbooks present econometric concepts through algebraic steps and do emphasise the relationship between Ordinary Least Squares, Maximum Likelihood and other methods for obtaining estimates in Linear Regression.

Here I present a combination of a little algebra and R commands to try to clarify some of these concepts.

Linear Regression

Let \(\renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\R}{\mathbb{R}} \vec{y} \in \R^n\) be the outcome and \(X \in \mathbb{R}^{n\times (p+1)}\) be the design matrix (an intercept column plus \(p\) predictors) in the context of a general model with intercept:

$$\vec{y} = X\vec{\beta} + \vec{e}$$

In R notation, a particular case of this would be the model:

lm(mpg ~ wt + cyl, data = mtcars)

Call:
lm(formula = mpg ~ wt + cyl, data = mtcars)

Coefficients:
(Intercept)           wt          cyl  
     39.686       -3.191       -1.508  

So the algebraic expressions

$$ \underset{n\times 1}{\vec{y}} = \begin{pmatrix}y_1 \cr y_2 \cr \vdots \cr y_n\end{pmatrix} \text{ and } \underset{n\times (p+1)}{X} = \begin{pmatrix}1 & x_{11} & \cdots & x_{1p} \cr 1 & x_{21} & \cdots & x_{2p} \cr \vdots & \vdots & \ddots & \vdots \cr 1 & x_{n1} & \cdots & x_{np}\end{pmatrix} = (\vec{1} \: \vec{x}_1 \: \ldots \: \vec{x}_p) $$

are equivalent to this in R notation:

y = mtcars$mpg
x0 = rep(1, length(y))
x1 = mtcars$wt
x2 = mtcars$cyl
X = cbind(x0,x1,x2)
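
As a quick side check of my own (not part of the original post), the hand-made design matrix should match the one R builds internally from the formula:

# my addition: compare the hand-made design matrix with the one
# produced by model.matrix() from the formula
X_lm = model.matrix(~ wt + cyl, data = mtcars)
all.equal(X, X_lm, check.attributes = FALSE)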

In linear models the aim is to minimize the error term by choosing \(\hat{\vec{\beta}}\). One possibility is to minimize the squared error by solving this optimization problem:

$$ \begin{equation} \label{min} \displaystyle \min_{\vec{\beta}} S = \|\vec{y} - X\vec{\beta}\|^2 \end{equation} $$

Textbooks such as Baltagi's discuss how to solve \(\eqref{min}\) and other equivalent approaches that result in this optimal estimator:

$$ \begin{equation} \label{beta} \hat{\vec{\beta}} = (X^tX)^{-1} X^t\vec{y} \end{equation} $$

Because of equation \(\eqref{beta}\), in R it is the same to use lm or to perform the matrix multiplication directly:

fit = lm(y ~ x1 + x2)
coefficients(fit)

(Intercept)          x1          x2 
  39.686261   -3.190972   -1.507795 

beta = solve(t(X)%*%X) %*% (t(X)%*%y)
beta

        [,1]
x0 39.686261
x1 -3.190972
x2 -1.507795
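
A short aside of mine, not in the original post: explicitly inverting \(X^tX\) works here, but in practice it is numerically safer to solve the normal equations directly or to use a QR decomposition. Both give the same estimates:

# my addition: equivalent but more numerically stable alternatives
solve(crossprod(X), crossprod(X, y))  # solves (X'X) b = X'y without forming the inverse
qr.solve(X, y)                        # least-squares solution via the QR decomposition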

With one independent variable and an intercept, that is \(y_i = \beta_0 + \beta_1 x_{i1} + e_i\), equation \(\eqref{beta}\) reduces to:

$$ \begin{equation} \label{beta2} \hat{\beta}_1 = cor(\vec{y},\vec{x}) \cdot \frac{sd(\vec{y})}{sd(\vec{x})} \text{ and } \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \end{equation} $$

Equation \(\eqref{beta2}\) can be verified with R commands:

#install.packages("HistData")
require(HistData)
y = Galton$child
x = Galton$parent
beta1 = cor(y, x) *  sd(y) / sd(x)
beta0 = mean(y) - beta1 * mean(x)

c(beta0, beta1)

[1] 23.9415302  0.6462906

#comparing with lm results
lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
    23.9415       0.6463  

Another possibility in linear models is to center the outcome and the design matrix, rewriting each variable as a deviation from its mean. That only alters the intercept, not the slope coefficients.

So, for a model like \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + e_i\) I can write the equivalent model:

$$y_i - \bar{y} = \beta_0 + \beta_1 (x_{i1} - \bar{x}_1) + \beta_2 (x_{i2} - \bar{x}_2) + e_i$$

Verifying in R that the slope coefficients do not change:

new_y = mtcars$mpg - mean(mtcars$mpg)
new_x1 = mtcars$wt - mean(mtcars$wt)
new_x2 = mtcars$cyl - mean(mtcars$cyl)

fit2 = lm(new_y ~ new_x1 + new_x2)
coefficients(fit2)

  (Intercept)        new_x1        new_x2 
 4.710277e-16 -3.190972e+00 -1.507795e+00 

new_X = cbind(x0,new_x1,new_x2)
new_beta = solve(t(new_X)%*%new_X) %*% (t(new_X)%*%new_y)
new_beta

                [,1]
x0      6.879624e-16
new_x1 -3.190972e+00
new_x2 -1.507795e+00
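
A brief aside of my own: after centering, every column has (numerically) zero mean, which is why the fitted intercept collapses to zero up to floating-point error:

# my addition: the centered variables have essentially zero means
colMeans(cbind(new_y, new_x1, new_x2))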

Here the intercept is close to zero, so I look at the model summary for more information about its significance:

summary(fit2)

Call:
lm(formula = new_y ~ new_x1 + new_x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2893 -1.5512 -0.4684  1.5743  6.1004 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.065e-16  4.539e-01   0.000 1.000000    
new_x1      -3.191e+00  7.569e-01  -4.216 0.000222 ***
new_x2      -1.508e+00  4.147e-01  -3.636 0.001064 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared:  0.8302,    Adjusted R-squared:  0.8185 
F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

In this particular case I should drop the intercept because it is not significant (after centering it is zero by construction), so I write:

fit3 = lm(new_y ~ new_x1 + new_x2 - 1)
coefficients(fit3)

   new_x1    new_x2 
-3.190972 -1.507795 

new_X = cbind(new_x1,new_x2)
new_beta = solve(t(new_X)%*%new_X) %*% (t(new_X)%*%new_y)
new_beta

            [,1]
new_x1 -3.190972
new_x2 -1.507795

ANOVA

The total sum of squares is defined as the sum of the explained and the residual (or unexplained) sums of squares:

$$TSS = ESS + RSS = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - \bar{y})^2 $$

To perform an analysis of variance, note that \(TSS\) has \(n-1\) degrees of freedom, \(ESS\) has \(p\) degrees of freedom and \(RSS\) has \(n-p-1\) degrees of freedom. The F-statistic, which under the null hypothesis follows an \(F(p, n-p-1)\) distribution, is

$$ F = \frac{ESS/p}{RSS/(n-p-1)} $$

This statistic tests the null hypothesis that all slope coefficients are zero, that is \(\beta_1 = \ldots = \beta_p = 0\).
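
The decomposition and the F-statistic can be checked directly in R. This is my own verification sketch, using the mpg ~ wt + cyl model fitted earlier:

# my addition: verify TSS = ESS + RSS and recompute F by hand
fit = lm(mpg ~ wt + cyl, data = mtcars)
y = mtcars$mpg
n = length(y); p = 2

ESS = sum((fitted(fit) - mean(y))^2)
RSS = sum(residuals(fit)^2)
TSS = sum((y - mean(y))^2)
all.equal(ESS + RSS, TSS)

# should match the 70.91 on 2 and 29 DF reported by summary(fit)
(ESS / p) / (RSS / (n - p - 1))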

The term analysis of variance refers to models with categorical predictors, so ANOVA is a particular case of the linear model built around the statistical test just described and the differences in group means.

In the mtcars dataset, am is useful to explain ANOVA because its observations are defined as:

$$ x_i = \begin{cases}1 &\text{ if car } i \text{ is manual} \cr 0 &\text{ if car } i \text{ is automatic}\end{cases} $$
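
As a quick addition of my own, the group sizes will be useful later when interpreting the entries of \((X^tX)^{-1}\):

# my addition: 19 automatic (am = 0) and 13 manual (am = 1) cars
table(mtcars$am)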

Now consider a model where the outcome is mpg and the design matrix is \(X = (\vec{x}_1 \: \vec{x}_2)\), with the terms defined in this way:

y = mtcars$mpg
x1 = mtcars$am 
x2 = ifelse(x1 == 1, 0, 1)

The estimates without intercept would be:

fit = lm(y ~ x1 + x2 - 1)
summary(fit)

Call:
lm(formula = y ~ x1 + x2 - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3923 -3.0923 -0.2974  3.2439  9.5077 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
x1   24.392      1.360   17.94  < 2e-16 ***
x2   17.147      1.125   15.25 1.13e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.9487,    Adjusted R-squared:  0.9452 
F-statistic: 277.2 on 2 and 30 DF,  p-value: < 2.2e-16

Taking \(\eqref{beta}\) and replacing the terms of this particular case results in the estimate:

$$ \hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_1 \cr \bar{y}_2 \end{bmatrix} $$

where \(\bar{y}_1\) and \(\bar{y}_2\) are the group means.

This can be verified with R commands:

# mask each group's observations (this trick assumes no mpg value is exactly zero)
y1 = y*x1; y1 = ifelse(y1 == 0, NA, y1)
y2 = y*x2; y2 = ifelse(y2 == 0, NA, y2)

mean(y1, na.rm = TRUE)

[1] 24.39231

mean(y2, na.rm = TRUE)

[1] 17.14737
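
An equivalent and more direct check, my addition rather than the approach above, is to average mpg within each transmission group with tapply:

# my addition: mean mpg by transmission (0 = automatic, 1 = manual)
tapply(mtcars$mpg, mtcars$am, mean)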

If you are not convinced of this result you can write down the algebra or use R commands. I'll do the latter with the notation \(U = (X^tX)^{-1}\) and \(V = X^t\vec{y}\):

X = cbind(x1,x2)
U = solve(t(X)%*%X)
V = t(X)%*%y

U;V;U%*%V

           x1         x2
x1 0.07692308 0.00000000
x2 0.00000000 0.05263158

    [,1]
x1 317.1
x2 325.8

       [,1]
x1 24.39231
x2 17.14737

The entries of \(U\) are just one over the number of observations in each group, and the entries of \(V\) are the sums of the mpg observations in each group, so the entries of \(UV\) are the group means:

u11 = 1/sum(x1)
u22 = 1/sum(x2)

v11 = sum(y1, na.rm = TRUE)
v21 = sum(y2, na.rm = TRUE)

u11;u22

[1] 0.07692308

[1] 0.05263158

v11;v21

[1] 317.1

[1] 325.8

u11*v11;u22*v21

[1] 24.39231

[1] 17.14737

Aside from the algebra, I'll now show the equivalence between lm and aov, the command used to perform an analysis of variance:

y = mtcars$mpg
x1 = mtcars$am
x2 = ifelse(x1 == 1, 0, 1)

fit2 = aov(y ~ x1 + x2 - 1)

fit2$coefficients

      x1       x2 
24.39231 17.14737 

summary(fit2)

          Df Sum Sq Mean Sq F value   Pr(>F)    
x1         1   7735    7735   321.9  < 2e-16 ***
x2         1   5587    5587   232.5 1.13e-15 ***
Residuals 30    721      24                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table shows that there is evidence against the null hypothesis, so the difference in group means is statistically significant.
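
As a side note of mine: since aov is a wrapper around lm, the same table can be obtained from the corresponding lm fit:

# my addition: the sequential ANOVA table from the equivalent lm fit
anova(lm(y ~ x1 + x2 - 1))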

Changing the design matrix to \(X = (\vec{1} \: \vec{x}_1)\) will lead to the estimate:

$$ \hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_2 \cr \bar{y}_1 - \bar{y}_2 \end{bmatrix} $$

Estimating the model results in:

y = mtcars$mpg
x1 = mtcars$am

fit = lm(y ~ x1)
summary(fit)

Call:
lm(formula = y ~ x1)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3923 -3.0923 -0.2974  3.2439  9.5077 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   17.147      1.125  15.247 1.13e-15 ***
x1             7.245      1.764   4.106 0.000285 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.3598,    Adjusted R-squared:  0.3385 
F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

So to see the relationship between the estimates and the group means I need additional steps:

x0 = rep(1,length(y))
X = cbind(x0,x1)
beta = solve(t(X)%*%X) %*% (t(X)%*%y)
beta

        [,1]
x0 17.147368
x1  7.244939

I obtained the same estimates as with the lm command, so now I calculate the group means:

x2 = ifelse(x1 == 1, 0, 1)

x1 = ifelse(x1 == 0, NA, x1)
x2 = ifelse(x2 == 0, NA, x2)

m1 = mean(y*x1, na.rm = TRUE)
m2 = mean(y*x2, na.rm = TRUE)

beta0 = m2
beta1 = m1-m2

beta0;beta1

[1] 17.14737

[1] 7.244939

In this case the model gives both groups the same slope but different intercepts, so on average there is a positive effect of manual transmission on miles per gallon.
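
As an extra check of my own, the test on the x1 coefficient is equivalent to a pooled two-sample t-test of mpg between transmission groups:

# my addition: pooled (equal-variance) t-test; the t value (up to sign)
# and the p-value 0.000285 match the x1 row of summary(fit)
t.test(mpg ~ am, data = mtcars, var.equal = TRUE)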

Again I'll verify the equivalence between lm and aov in this particular case:

y = mtcars$mpg
x1 = mtcars$am
x2 = ifelse(x1 == 1, 0, 1)

fit2 = aov(y ~ x1)
fit2$coefficients

(Intercept)          x1 
  17.147368    7.244939 

summary(fit2)

            Df Sum Sq Mean Sq F value   Pr(>F)    
x1           1  405.2   405.2   16.86 0.000285 ***
Residuals   30  720.9    24.0                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table shows that there is evidence against the null hypothesis, so the difference in group means is statistically significant.
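
A note of my own: declaring am as a factor makes R build this dummy coding automatically, so the hand-made x1 is not strictly needed:

# my addition: same parametrisation, intercept = automatic-group mean
# and factor(am)1 = manual minus automatic difference
lm(mpg ~ factor(am), data = mtcars)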

If I now change the design matrix to \(X = (\vec{1} \: \vec{x}_2)\), it leads to this estimate:

$$ \hat{\vec{\beta}} = \begin{bmatrix}\bar{y}_1 \cr \bar{y}_2 - \bar{y}_1\end{bmatrix} $$

Estimating the model gives:

y = mtcars$mpg
x1 = mtcars$am
x2 = ifelse(x1 == 1, 0, 1)

fit = lm(y ~ x2)
summary(fit)

Call:
lm(formula = y ~ x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3923 -3.0923 -0.2974  3.2439  9.5077 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   24.392      1.360  17.941  < 2e-16 ***
x2            -7.245      1.764  -4.106 0.000285 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.3598,    Adjusted R-squared:  0.3385 
F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

So to see the relationship between the estimates and the group means I need additional steps:

x0 = rep(1,length(y))
X = cbind(x0,x2)
beta = solve(t(X)%*%X) %*% (t(X)%*%y)
beta

        [,1]
x0 24.392308
x2 -7.244939

I obtained the same estimates as with the lm command, so now I calculate the group means:

x1 = ifelse(x1 == 0, NA, x1)
x2 = ifelse(x2 == 0, NA, x2)

m1 = mean(y*x1, na.rm = TRUE)
m2 = mean(y*x2, na.rm = TRUE)

beta0 = m1
beta2 = m2-m1

beta0;beta2

[1] 24.39231

[1] -7.244939

In this case the model gives both groups the same slope but different intercepts, so on average there is a negative effect of automatic transmission on miles per gallon.

Again I'll verify the equivalence between lm and aov in this particular case:

y = mtcars$mpg
x1 = mtcars$am
x2 = ifelse(x1 == 1, 0, 1)

fit2 = aov(y ~ x2)
fit2$coefficients

(Intercept)          x2 
  24.392308   -7.244939 

summary(fit2)

            Df Sum Sq Mean Sq F value   Pr(>F)    
x2           1  405.2   405.2   16.86 0.000285 ***
Residuals   30  720.9    24.0                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table shows that there is evidence against the null hypothesis, so the difference in group means is statistically significant.
