Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
One-way ANOVA is a test used to assess whether there is a statistically significant difference between the mean of groups.
There is 1 response numeric variable and 1 explanatory categorical variable with more than 1 level.
ANOVA considers the probability of observing the sample ratio of explained variance to unexplained variance (i.e. the F statistic)… if the null hypothesis is true that all population means are equal.
The higher this ratio of explained to unexplained variance is (i.e. the F statistic), then the lower the probability of observing this if our null hypothesis was correct.
You can perform a one-way ANOVA very easily in R using the aov
function etc. But what fun would that be?!
On a serious note, it is really helpful to understand how the algorithm of the test works to calculate it ‘by hand’.
First, let’s load some libraries, and create some sample data.
library(tidyverse) library(tidymodels) library(simplevis) cat_var <- rep(c(8, 16, 32, 64), each = 8) num_var <- c(1.4, 2.0, 3.2, 1.4, 2.3, 4.0, 5.0, 4.7, 3.2, 6.8, 5.0, 2.5, 6.1, 4.8, 4.6, 4.2, 6.2, 3.1, 3.2, 4.0, 4.5, 6.4, 4.4, 4.1, 5.8, 6.6, 6.5, 5.9, 5.9, 3.0, 5.9, 5.6) data <- tibble(cat_var, num_var) %>% mutate(cat_var = as.factor(cat_var)) head(data, n = 3) ## # A tibble: 3 x 2 ## cat_var num_var ## <fct> <dbl> ## 1 8 1.4 ## 2 8 2 ## 3 8 3.2
Next, let’s visualise the data.
ggplot_boxplot(data, cat_var, num_var, y_zero = T, title = NULL, x_title = "Catagorical explanatory variable", y_title = "Numeric response variable")
I’m going to use a fixed effects means model for simplicity (Model I).
Yij = μi + Eij
The hypotheses are as follow.
H0: all population means are equal
H1: not all population means are equal
We will use a significance level of p = 0.05.
Note the mean square error (MSE) is used to quantify variance, and this is obtained by dividing the sum of squared error (SSE) by the degrees of freedom (df).
- Calculate SSE, df and MSE for if H1 is TRUE (i.e. complete model)
complete <- data %>% group_by(cat_var) %>% mutate(fit = mean(num_var)) %>% #in complete model, fit is the group mean mutate(error = fit - num_var) %>% mutate(sq_error = error ^ 2) %>% ungroup() head(complete, 3) ## # A tibble: 3 x 5 ## cat_var num_var fit error sq_error ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 8 1.4 3 1.6 2.56 ## 2 8 2 3 1 1 ## 3 8 3.2 3 -0.200 0.0400 (complete <- complete %>% summarise(n = n(), sse = sum(sq_error)) %>% mutate(p = length(unique(data$cat_var))) %>% mutate(df = n - p) %>% select(sse, df, n, p) %>% mutate(mse = sse / df) # mse = sigma squared ) ## # A tibble: 1 x 5 ## sse df n p mse ## <dbl> <int> <int> <int> <dbl> ## 1 47.8 28 32 4 1.71
- Calculate SSE, df and MSE for if H0 is TRUE (i.e. reduced model)
reduced <- data %>% mutate(fit = mean(num_var)) %>% #in reduced model, fit is the overall mean mutate(error = fit - num_var) %>% mutate(sq_error = error ^ 2) %>% ungroup() head(reduced, 3) ## # A tibble: 3 x 5 ## cat_var num_var fit error sq_error ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 8 1.4 4.45 3.05 9.28 ## 2 8 2 4.45 2.45 5.99 ## 3 8 3.2 4.45 1.25 1.55 (reduced <- reduced %>% summarise(n = n(), sse = sum(sq_error)) %>% # sse = sum square error mutate(p = 1) %>% mutate(df = n - p) %>% select(sse, df, n, p) ) ## # A tibble: 1 x 4 ## sse df n p ## <dbl> <dbl> <int> <dbl> ## 1 76.4 31 32 1
- Calculate the SSE, df and MSE explained by the complete model.
(explained_sse <- reduced$sse - complete$sse) ## [1] 28.67094 (explained_df <- reduced$df - complete$df) ## [1] 3 (explained_mse <- explained_sse / explained_df) ## [1] 9.556979
- Calculate the ratio of variance (i.e. MSE) the complete model has explained to the variance (i.e. MSE) that is left unexplained. Then determine the probability of this statistic, if the null hypothesis was true.
(f <- explained_mse / complete$mse) ## [1] 5.601893 pf(f, explained_df, complete$df, lower.tail = F) ## [1] 0.003871857
There was a p < 0.05, so we will reject the null hypothesis, and accept that the population group means are not the same.
Let’s check whether we got it right… We did!
anova <- aov(num_var ~ cat_var, data) summary(anova) ## Df Sum Sq Mean Sq F value Pr(>F) ## cat_var 3 28.67 9.557 5.602 0.00387 ** ## Residuals 28 47.77 1.706 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Of course, we should also check the assumptions:
- independence of observations
- constant variation between treatments
- approximately normally distributed within treatments.
We can do the last 2 visually, and the 1st by understanding the study design.
data <- augment(anova) ggplot_boxplot(data, cat_var, .resid, y_balance = TRUE, title = "Residual error across groups", x_title = "Group", y_title = "Residual error")
ggplot(data) + geom_qq(aes(sample = .resid)) + geom_qq_line(aes(sample = .resid)) + labs(title = "Normal Q-Q plot", x = "Theoretical normal distribution quantile", y = "Residual error quantile") + theme_point()
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.