One-way Analysis of Variance (ANOVA)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Analysis of Variance (ANOVA) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data. The base case is the one-way ANOVA which is an extension of two-sample t test for independent groups covering situations where there are more than two groups being compared.
In one-way ANOVA the data is sub-divided into groups based on a single classification factor and the standard terminology used to describe the set of factor levels is treatment even though this might not always have meaning for the particular application. There is variation in the measurements taken on the individual components of the data set and ANOVA investigates whether this variation can be explained by the grouping introduced by the classification factor.
As an example we consider one of the data sets available with R relating to an experiment into plant growth. The purpose of the experiment was to compare the yields on the plants for a control group and two treatments of interest. The response variable was a measurement taken on the dried weight of the plants.
The first step in the investigation is to take a copy of the data frame so that we can make some adjustments as necessary while leaving the original data alone. We use the factor function to re-define the labels of the group variables that will appear in the output and graphs:
plant.df = PlantGrowth plant.df$group = factor(plant.df$group, labels = c("Control", "Treatment 1", "Treatment 2"))
The labels argument is a list of names corresponding to the levels of the group factor variable.
A boxplot of the distributions of the dried weights for the three competing groups is created using the ggplot package:
require(ggplot2) ggplot(plant.df, aes(x = group, y = weight)) + geom_boxplot(fill = "grey80", colour = "blue") + scale_x_discrete() + xlab("Treatment Group") + ylab("Dried weight of plants")
The geom_boxplot() option is used to specify background and outline colours for the boxes. The axis labels are created with the xlab() and ylab() options. The plot that is produce looks like this:
Initial inspection of the data suggests that there are differences in the dried weight for the two treatments but it is not so clear cut to determine whether the treatments are different to the control group. To investigate these differences we fit the one-way ANOVA model using the lm function and look at the parameter estimates and standard errors for the treatment effects. The function call is:
plant.mod1 = lm(weight ~ group, data = plant.df)
We save the model fitted to the data in an object so that we can undertake various actions to study the goodness of the fit to the data and other model assumptions. The standard summary of a lm object is used to produce the following output:
> summary(plant.mod1) Call: lm(formula = weight ~ group, data = plant.df) Residuals: Min 1Q Median 3Q Max -1.0710 -0.4180 -0.0060 0.2627 1.3690 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.0320 0.1971 25.527 <2e-16 *** groupTreatment 1 -0.3710 0.2788 -1.331 0.1944 groupTreatment 2 0.4940 0.2788 1.772 0.0877 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.6234 on 27 degrees of freedom Multiple R-squared: 0.2641, Adjusted R-squared: 0.2096 F-statistic: 4.846 on 2 and 27 DF, p-value: 0.01591
The model output indicates some evidence of a difference in the average growth for the 2nd treatment compared to the control group. An analysis of variance table for this model can be produced via the anova command:
> anova(plant.mod1) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) group 2 3.7663 1.8832 4.8461 0.01591 * Residuals 27 10.4921 0.3886 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This table confirms that there are differences between the groups which were highlighted in the model summary. The function confint is used to calculate confidence intervals on the treatment parameters, by default 95% confidence intervals:
> confint(plant.mod1) 2.5 % 97.5 % (Intercept) 4.62752600 5.4364740 groupTreatment 1 -0.94301261 0.2010126 groupTreatment 2 -0.07801261 1.0660126
The model residuals can be plotted against the fitted values to investigate the model assumptions. First we create a data frame with the fitted values, residuals and treatment identifiers:
plant.mod = data.frame(Fitted = fitted(plant.mod1), Residuals = resid(plant.mod1), Treatment = plant.df$group)
and then produce the plot:
ggplot(plant.mod, aes(Fitted, Residuals, colour = Treatment)) + geom_point()
which produces this graph:
We can see that there is no major problem with the diagnostic plot but some evidence of different variabilities in the spread of the residuals for the three treatment groups.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.