ANOVA in R: afex may be the solution you are looking for
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Prelude: When you start with R
and try to estimate a standard ANOVA , which is relatively simple in commercial software like SPSS, R
kind of sucks. Especially for unbalanced designs or designs with repeated-measures replicating the results from such software in base R
may require considerable effort. For a newcomer (and even an old timer) this can be somewhat off-putting. After I had gained experience developing my first package and was once again struggling with R
and ANOVA I had enough and decided to develop afex
. If you know this feeling, afex
is also for you.
A new version of afex
(0.18-0) has been accepted on CRAN a few days ago. This version only fixes a small bug that was introduced in the last version. aov_ez
did not work with more than one covariate
(thanks to tkerwin
for reporting this bug).
I want to use this opportunity to introduce one of the main functionalities of afex
. It provides a set of functions that make calculating ANOVAs easy. In the default settings, afex
automatically uses appropriate orthogonal contrasts for factors, transforms numerical variables into factors, uses so-called Type III sums of squares, and allows for any number of factors including repeated-measures (or within-subjects) factors and mixed/split-plot designs. Together this guarantees that the ANOVA results correspond to the results obtained from commercial statistical packages such as SPSS
or SAS
. On top of this, the ANOVA object returned by afex (of class afex_aov
) can be directly used for follow-up or post-hoc tests/contrasts using the lsmeans
package .
Example Data
Let me illustrate how to calculate an ANOVA with a simple example. We use data courtesy of Andrew Heathcote and colleagues . The data are lexical decision and word naming latencies for 300 words and 300 nonwords from 45 participants. Here we only look at three factors:
task
is a between subjects (or independent-samples) factor: 25 participants worked on the lexical decision task (lexdec
; i.e., participants had to make a binary decision whether or not the presented string is a word or nonword) and 20 participants on the naming task (naming
; i.e., participant had to say the presented string out loud).stimulus
is a repeated-measures or within-subjects factor that codes whether a presented string was aword
ornonword
.length
is also a repeated-measures factor that gives the number of characters of the presented strings with three levels: 3, 4, and 5.
The dependent variable is the response latency or response time for each presented string. More specifically, as is common in the literature we analyze the log of the response times, log_rt
. After excluding erroneous responses each participants responded to between 135 and 150 words and between 124 and 150 nonwords. To use this data in an ANOVA one needs to aggregate the data such that only one observation per participant and cell of the design (i.e., combination of all factors) remains. As we will see, afex
does this automatically for us (this is one of the features I blatantly stole from ez
).
library(afex) data("fhch2010") # load data (comes with afex) mean(!fhch2010$correct) # error rate # [1] 0.01981546 fhch <- droplevels(fhch2010[ fhch2010$correct,]) # remove errors str(fhch2010) # structure of the data # 'data.frame': 13222 obs. of 10 variables: # $ id : Factor w/ 45 levels "N1","N12","N13",..: 1 1 1 1 1 1 1 1 ... # $ task : Factor w/ 2 levels "naming","lexdec": 1 1 1 1 1 1 1 1 1 1 ... # $ stimulus : Factor w/ 2 levels "word","nonword": 1 1 1 2 2 1 2 2 1 2 ... # $ density : Factor w/ 2 levels "low","high": 2 1 1 2 1 2 1 1 1 1 ... # $ frequency: Factor w/ 2 levels "low","high": 1 2 2 2 2 2 1 2 1 2 ... # $ length : Factor w/ 3 levels "4","5","6": 3 3 2 2 1 1 3 2 1 3 ... # $ item : Factor w/ 600 levels "abide","acts",..: 363 121 ... # $ rt : num 1.091 0.876 0.71 1.21 0.843 ... # $ log_rt : num 0.0871 -0.1324 -0.3425 0.1906 -0.1708 ... # $ correct : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
We first load the data and remove the roughly 2% errors. The structure of the data.frame
(obtained via str()
) shows us that the data has a few more factors than discussed here. To specify our ANOVA we first use function aov_car()
which works very similar to base R aov()
, but as all afex
functions uses car::Anova()
(read as function Anova()
from package car
) as the backend for calculating the ANOVA.
Specifying an ANOVA
(a1 <- aov_car(log_rt ~ task*length*stimulus + Error(id/(length*stimulus)), fhch)) # Contrasts set to contr.sum for the following variables: task # Anova Table (Type 3 tests) # # Response: log_rt # Effect df MSE F ges p.value # 1 task 1, 43 0.23 13.38 *** .22 .0007 # 2 length 1.83, 78.64 0.00 18.55 *** .008 <.0001 # 3 task:length 1.83, 78.64 0.00 1.02 .0004 .36 # 4 stimulus 1, 43 0.01 173.25 *** .17 <.0001 # 5 task:stimulus 1, 43 0.01 87.56 *** .10 <.0001 # 6 length:stimulus 1.70, 72.97 0.00 1.91 .0007 .16 # 7 task:length:stimulus 1.70, 72.97 0.00 1.21 .0005 .30 # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1 # # Sphericity correction method: GG # Warning message: # More than one observation per cell, aggregating the data using mean (i.e, fun_aggregate = mean)!
The printed output is an ANOVA table that could basically be copied to a manuscript as is. One sees the terms in column Effect
, the degrees of freedoms (df
), the mean-squared error (MSE
, I would probably remove this column in a manuscript), the F-value (F
, which also contains the significance stars), and the p-value (p.value
). The only somewhat uncommon column is ges
which provides generalized eta-squared, 'the recommended effect size statistics for repeated measure designs' . The standard output also reports Greenhouse-Geisser (GG
) corrected df for repeated-measures factors with more than two levels (to account for possible violations of sphericity). Note that these corrected df are not integers.
We can also see a warning notifying us that afex
has detected that each participant and cell of the design provides more than one observation which are then automatically aggregated using mean
. The warning serves the purpose to notify the user in case this was not intended (i.e., when there should be only one observation per participant and cell of the design). The warning can be suppressed via specifying fun_aggregate = mean
explicitly in the call to aov_car
.
The formula passed to aov_car
basically needs to be the same as for standard aov
with a few differences:
- It must have an
Error
term specifying the column containing the participant (or unit of observation) identifier (e.g., minimally+Error(id)
). This is necessary to allow the automatic aggregation even in designs without repeated-measures factor. - Repeated-measures factors only need to be defined in the
Error
term and do not need to be enclosed by parentheses. Consequently, the following call produces the same ANOVA:aov_car(log_rt ~ task + Error(id/length*stimulus), fhch)
In addition to aov_car
, afex
provides two further function for calculating ANOVAs. These function produce the same output but differ in the way how to specify the ANOVA.
aov_ez
allows the ANOVA specification not via aformula
but viacharacter
vectors (and is similar toez::ezANOVA()
):aov_ez(id = "id", dv = "log_rt", fhch, between = "task", within = c("length", "stimulus"))
aov_4
requires a formula for which theid
and repeated-measures factors need to be specified as inlme4::lmer()
(with the same simplification that repeated-measures factors only need to be specified in the random part):aov_4(log_rt ~ task + (length*stimulus|id), fhch) aov_4(log_rt ~ task*length*stimulus + (length*stimulus|id), fhch)
Follow-up Tests
A common requirement after the omnibus test provided by the ANOVA is some-sort of follow-up analysis. For this purpose, afex
is fully integrated with lsmeans
.
For example, assume we are interested in the significant task:stimulus
interaction. As a first step we might want to investigate the marginal means of these two factors:
lsmeans(a1, c("stimulus","task")) # NOTE: Results may be misleading due to involvement in interactions # stimulus task lsmean SE df lower.CL upper.CL # word naming -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435 # nonword naming -0.02687619 0.04250050 48.46 -0.11230839 0.05855602 # word lexdec 0.00331642 0.04224522 47.37 -0.08165241 0.08828525 # nonword lexdec 0.05640801 0.04224522 47.37 -0.02856083 0.14137684 # # Results are averaged over the levels of: length # Confidence level used: 0.95
From this we can see naming trials seems to be generally slower (as a reminder, the dv is log-transformed RT in seconds, so values below 0 correspond to RTs bewteen 0 and 1), It also appears that the difference between word
and nonword
trials is larger in the naming
task then in the lexdec
task. We test this with the following code using a few different lsmeans
function. We first use lsmeans
again, but this time using task
as the conditioning variable specified in by
. Then we use pairs()
for obtaining all pairwise comparisons within each conditioning strata (i.e., level of task
). This provides us already with the correct tests, but does not control for the family-wise error rate across both tests. To get those, we simply update()
the returned results and remove the conditioning by setting by=NULL
. In the call to update we can already specify the method for error control and we specify 'holm'
, because it is uniformly more powerful than Bonferroni.
# set up conditional marginal means: (ls1 <- lsmeans(a1, c("stimulus"), by="task")) # task = naming: # stimulus lsmean SE df lower.CL upper.CL # word -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435 # nonword -0.02687619 0.04250050 48.46 -0.11230839 0.05855602 # # task = lexdec: # stimulus lsmean SE df lower.CL upper.CL # word 0.00331642 0.04224522 47.37 -0.08165241 0.08828525 # nonword 0.05640801 0.04224522 47.37 -0.02856083 0.14137684 # # Results are averaged over the levels of: length # Confidence level used: 0.95 update(pairs(ls1), by=NULL, adjust = "holm") # contrast task estimate SE df t.ratio p.value # word - nonword naming -0.31424037 0.02080113 43 -15.107 <.0001 # word - nonword lexdec -0.05309159 0.01860509 43 -2.854 0.0066 # # Results are averaged over the levels of: length # P value adjustment: holm method for 2 tests
Hmm. These results show that the stimulus
effects in both task
conditions are independently significant. Obviously, the difference between them must also be significant then, or?
pairs(update(pairs(ls1), by=NULL)) # contrast estimate SE df t.ratio p.value # wrd-nnwrd,naming - wrd-nnwrd,lexdec -0.2611488 0.02790764 43 -9.358 <.0001
They obviously are. As a reminder, the interaction is testing exactly this, the difference of the difference. And we can actually recover the F-value of the interaction using lsmeans
alone by invoking yet another of its functions, test(..., joint=TRUE)
.
test(pairs(update(pairs(ls1), by=NULL)), joint=TRUE) # df1 df2 F p.value # 1 43 87.565 <.0001
These last two example were perhaps not particularly interesting from a statistical point of view, but show an important ability of lsmeans
. Any set of estimated marginal means produced by lsmeans
, including any sort of (custom) contrasts, can be used again for further tests or calculating new sets of marginal means. And with test()
we can even obtain joint F-tests over several parameters using joint=TRUE
. lsmeans
is extremely powerful and one of my most frequently used packages that basically performs all tests following an omnibus test (and in its latest version it directly interfaces with rstanarm
so it can now also be used for a lot of Bayesian stuff, but this is the topic of another blog post).
Finally, lsmeans
can also be used directly for plotting by envoking lsmip
:
lsmip(a1, task ~ stimulus)
Note that lsmip
does not add error bars to the estimated marginal means, but only plots the point estimates. There are mainly two reasons for this. First, as soon as repeated-measures factors are involved, it is difficult to decide which error bars to plot. Standard error bars based on the standard error of the mean are not appropriate for within-subjects comparisons. For those, one would need to use a within-subject intervals (see also here or here). Especially for plots as the current one with both independent-samples and repeated-measures factors (i.e., mixed within-between designs or split-plot designs) no error bar will allow comparisons across both dimensions. Second, only 'if the SE [i.e., standard error] of the mean is exactly 1/2 the SE of the difference of two means -- which is almost never the case -- it would be appropriate to use overlapping confidence intervals to test comparisons of means' (lsmeans author Russel Lenth, the link provides an alternative).
We can also use lsmeans
in combination with lattice
to plot the results on the unconstrained scale (i.e., after back-transforming tha data from the log scale to the original scale of response time in seconds). The plot is not shown here.
lsm1 <- summary(lsmeans(a1, c("stimulus","task"))) lsm1$lsmean <- exp(lsm1$lsmean) require(lattice) xyplot(lsmean ~ stimulus, lsm1, group = task, type = "b", auto.key = list(space = "right"))
Summary
afex
provides a set of functions that make specifying standard ANOVAs for an arbitrary number of between-subjects (i.e., independent-sample) or within-subjects (i.e., repeated-measures) factors easy:aov_car()
,aov_ez()
, andaov_4()
.- In its default settings, the
afex
ANOVA functions replicate the results of commercial statistical packages such as SPSS or SAS (using orthogonal contrasts and Type III sums of squares). - Fitted ANOVA models can be passed to
lsmeans
for follow-up tests, custom contrast tests, and plotting. - For specific questions visit the new afex support forum: afex.singmann.science (I think we just need someone to ask the first ANOVA question to get the ball rolling).
- For more examples see the vignette or here (blog post by Ulf Mertens) or download the full example R script used here.
As a caveat, let me end this post with some cautionary remarks from Douglas Bates (fortunes::fortune(184)
) who explains why ANOVA in R
is supposed to not be the same as in other software packages (i.e., he justifies why it 'sucks'):
You must realize that R is written by experts in statistics and statistical computing who, despite popular opinion, do not believe that everything in SAS and SPSS is worth copying. Some things done in such packages, which trace their roots back to the days of punched cards and magnetic tape when fitting a single linear model may take several days because your first 5 attempts failed due to syntax errors in the JCL or the SAS code, still reflect the approach of "give me every possible statistic that could be calculated from this model, whether or not it makes sense". The approach taken in R is different. The underlying assumption is that the useR is thinking about the analysis while doing it.
-- Douglas Bates (in reply to the suggestion to include type III sums of squares and lsmeans in base R to make it more similar to SAS or SPSS)
R-help (March 2007)
Maybe he is right, but maybe what I have described here is useful to some degree.
References
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.