
How to: one-way ANOVA by hand

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers.]
    Introduction

    An ANOVA is a statistical test used to compare a quantitative variable between groups, to determine if there is a statistically significant difference between several population means. In practice, it is usually used to compare three or more groups. However, in theory, it can also be done with only two groups.1

    In a previous post, we showed how to perform a one-way ANOVA in R. In this post, we illustrate how to conduct a one-way ANOVA by hand, via what is usually called an “ANOVA table”.

    Data and hypotheses

    To illustrate the method, suppose we take a sample of 12 students, divided equally into three classes (A, B and C), and we observe their age. Here is the sample:

    • class A: 24, 31, 26, 23
    • class B: 24, 21, 19, 24
    • class C: 15, 21, 18, 18

    We are interested in comparing the population means between classes.

    Remember that the null hypothesis of the ANOVA is that all means are equal (i.e., age is not significantly different between classes), whereas the alternative hypothesis is that at least one mean differs from the others (i.e., age is significantly different in at least one class). Formally, we have:

    • \(H_0: \mu_A = \mu_B = \mu_C\)
    • \(H_1:\) at least one mean is different

    ANOVA by hand

    As mentioned above, we are going to build an ANOVA table to conclude the test.

    Overall and group means

    We first need to compute the mean age by class (referred to as the group means):

    • class A: \(\frac{24 + 31 + 26 + 23}{4} = 26\)
    • class B: \(\frac{24 + 21 + 19 + 24}{4} = 22\)
    • class C: \(\frac{15 + 21 + 18 + 18}{4} = 18\)

    and the mean age for the whole sample (referred to as the overall mean):

    \[\frac{24 + 31 + 26 + 23 + 24 + 21 + 19 + 24 + 15 + 21 + 18 + 18}{12} = 22\]
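    These means can be checked directly in R (a minimal sketch; the two vectors below simply encode the sample given above):

    ```r
    # ages of the 12 students, entered class by class
    age <- c(24, 31, 26, 23,   # class A
             24, 21, 19, 24,   # class B
             15, 21, 18, 18)   # class C
    class <- rep(c("A", "B", "C"), each = 4)

    tapply(age, class, mean)  # group means: A = 26, B = 22, C = 18
    mean(age)                 # overall mean: 22
    ```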

    SSR and SSE

    We then need to compute the sum of squares regression (SSR) and the sum of squares error (SSE).

    The SSR is computed by taking, for each group, the squared difference between the group mean and the overall mean, multiplied by the number of observations in the group:

    • class A: \(4 \times (26 - 22)^2 = 64\)
    • class B: \(4 \times (22 - 22)^2 = 0\)
    • class C: \(4 \times (18 - 22)^2 = 64\)

    and then taking the sum of these quantities:

    \[64+0+64 = 128 = SSR\]

    The SSE is computed by taking the squared difference between each observation and its group mean:

    • class A: \((24-26)^2 = 4\), \((31-26)^2 = 25\), \((26-26)^2 = 0\), \((23-26)^2 = 9\)
    • class B: \((24-22)^2 = 4\), \((21-22)^2 = 1\), \((19-22)^2 = 9\), \((24-22)^2 = 4\)
    • class C: \((15-18)^2 = 9\), \((21-18)^2 = 9\), \((18-18)^2 = 0\), \((18-18)^2 = 0\)

    and then taking the sum of these quantities:

    \[\begin{equation} \begin{split} & 4+25+0+9+4+1+9+4 \\ & +9+9+0+0 = 74 = SSE \end{split} \end{equation} \]

    For those interested in computing the sum of squares total (SST), it is simply the sum of SSR and SSE, that is,

    \[\begin{equation} \begin{split} SST &= SSR + SSE\\ &= 128 + 74 \\ & =202 \end{split} \end{equation} \]
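    These sums of squares can be verified in R (a sketch reusing the age and class vectors encoding the sample; ave() returns each observation's group mean):

    ```r
    age <- c(24, 31, 26, 23, 24, 21, 19, 24, 15, 21, 18, 18)
    class <- rep(c("A", "B", "C"), each = 4)

    group_means <- ave(age, class)            # each observation's group mean
    SSR <- sum((group_means - mean(age))^2)   # 128
    SSE <- sum((age - group_means)^2)         # 74
    SST <- SSR + SSE                          # 202
    ```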

    ANOVA table

    The ANOVA table looks as follows (we leave it empty and we are going to fill it in step by step):

                    Df   Sum.of.Sq.   Mean.Sq.   F.value
       Regression
       Error

    We start to build the ANOVA table by plugging the SSR and SSE values found above into the table (in the “Sum.of.Sq.” column):

                    Df   Sum.of.Sq.   Mean.Sq.   F.value
       Regression           128
       Error                 74

    The “Df” column corresponds to the degrees of freedom, and is computed as follows:

    • for the regression row: number of groups – 1 = 3 – 1 = 2
    • for the error row: number of observations – number of groups = 12 – 3 = 9

    With this information, the ANOVA table becomes:

                    Df   Sum.of.Sq.   Mean.Sq.   F.value
       Regression    2      128
       Error         9       74

    The “Mean.Sq.” column corresponds to the mean square and is equal to the sum of squares divided by the degrees of freedom, i.e., the “Sum.of.Sq.” column divided by the “Df” column:

                    Df   Sum.of.Sq.   Mean.Sq.   F.value
       Regression    2      128        64.000
       Error         9       74         8.222

    Finally, the F-value corresponds to the ratio of the two mean squares, so \(\frac{64}{8.222} = 7.78\):

                    Df   Sum.of.Sq.   Mean.Sq.   F.value
       Regression    2      128        64.000     7.78
       Error         9       74         8.222

    This F-value gives the test statistic (also referred to as \(F_{obs}\)), which needs to be compared with the critical value found in the Fisher table to conclude the test.

    We find the critical value in the Fisher table based on the degrees of freedom (those used in the ANOVA table) and on the significance level. Suppose we take a significance level \(\alpha = 0.05\); the critical value is then the entry of the Fisher table for 2 and 9 degrees of freedom.
    So we have

    \[F_{2; 9; 0.05} = 4.26\]

    If you are interested in finding this value with R, it can be obtained with the qf() function, where 0.95 corresponds to \(1 - \alpha\):

    qf(0.95, 2, 9)
    ## [1] 4.256495

    Conclusion of the test

    The rejection rule is as follows:

    • \(F_{obs} > F_{2; 9; 0.05} \Rightarrow\) we reject the null hypothesis
    • \(F_{obs} \le F_{2; 9; 0.05} \Rightarrow\) we do not reject the null hypothesis

    In our case,

    \[F_{obs} = 7.78 > F_{2; 9; 0.05} = 4.26\]

    \(\Rightarrow\) We reject the null hypothesis that all means are equal. In other words, it means that at least one class is different from the other two in terms of age.2

    To verify our results, here is the ANOVA table using R:

    ##             Df Sum Sq Mean Sq F value Pr(>F)  
    ## class        2    128   64.00   7.784 0.0109 *
    ## Residuals    9     74    8.22                 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
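    This output can be reproduced as follows (a sketch; dat is just the sample entered as a data frame, with the names age and class chosen for illustration):

    ```r
    # enter the sample as a data frame: 4 students per class
    dat <- data.frame(
      age = c(24, 31, 26, 23, 24, 21, 19, 24, 15, 21, 18, 18),
      class = rep(c("A", "B", "C"), each = 4)
    )

    # fit the one-way ANOVA and print the ANOVA table
    res <- aov(age ~ class, data = dat)
    summary(res)
    ```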

    We found the same results by hand, but note that in R, the \(p\)-value is computed instead of comparing the \(F_{obs}\) with the critical value. The \(p\)-value can easily be found in R based on the \(F_{obs}\) and the degrees of freedom:

    pf(7.78, 2, 9,
      lower.tail = FALSE
    )
    ## [1] 0.010916

    Conclusion

    Thanks for reading.

    I hope this article helped you to conduct a one-way ANOVA by hand. See this tutorial if you want to learn how to do it in R.

    As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.


    1. In that case, a Student’s t-test is usually preferred over an ANOVA, although both tests will lead to the exact same conclusions.↩︎

    2. Remember that an ANOVA cannot tell you which group is different from the others in terms of the quantitative dependent variable, nor whether they are all different or if only one is different. To answer this question, post-hoc tests are required. This is beyond the scope of the present post, but it can easily be done in R (see this tutorial).↩︎
