[This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This post was originally part of my previous post about linear models. However, I later decided to split it into several texts because it was effectively too long and complex to navigate.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
If you struggle to follow the code in this page please refer to this post (for example for the necessary packages): Linear Models (lm, ANOVA and ANCOVA) in Agriculture
Linear Mixed-Effects Models
This class of models are used to account for more than one source of random variation. For example, assume we have a dataset where again we are trying to model yield as a function of nitrogen level. However, this time the data were collected in many different farms. In this case would need to be consider a cluster and the model would need to take this clustering into account. Another common set of experiments where linear mixed-effects models are used is repeated measures where time provide an additional source of correlation between measures. For these models we do not need to worry about the assumptions from previous models, since these are very robust against all of them. For example, for unbalanced design with blocking, probably these methods should be used instead of the standard ANOVA.< o:p>
At the beginning on this tutorial we explored the equation that supports linear model:< o:p>
This equation can be seen as split into two components, the fixed and random effects. For fixed effect we refer to those variables we are using to explain the model. These may be factorial (in ANOVA), continuous or a mixed of the two (ANCOVA) and they can also be the blocks used in our design. The other component in the equation is the random effect, which provides a level of uncertainty that it is difficult to account in the model. For example, when we work with yield we might see differences between plants grown from similar soils and conditions. These may be related to the seeds or to other factors and are part of the within-subject variation that we cannot explain.< o:p>
There are times however where in the data there are multiple sources of random variation. For example, data may be clustered in separate field or separate farms. This will provide an additional source of random variation that needs to be taken into account in the model. To do so the standard equation can be amended in the following way:< o:p>
This is referred to as a random intercept model, where the random variation is split into a cluster specific variation u and the standard error term. A more complex form, that is normally used for repeated measures is the random slope and intercept model:< o:p>
Where we add a new source of random variation v related to time T.< o:p>
Random Intercept Model for Clustered Data
In the following examples we will use the function lme in the package nlme, so please install and/or load the package first. For the dataset please look at the previous post.
Just to explain the syntax to use linear mixed-effects model in R for cluster data, we will assume that the factorial variable rep in our dataset describe some clusters in the data. To fit a mixed-effects model we are going to use the function lme from the package nlme. This function can work with unbalanced designs:< o:p>
Just to explain the syntax to use linear mixed-effects model in R for cluster data, we will assume that the factorial variable rep in our dataset describe some clusters in the data. To fit a mixed-effects model we are going to use the function lme from the package nlme. This function can work with unbalanced designs:< o:p>
lme1 = lme(yield ~ nf + bv * topo, random= ~1|rep, data=dat)
The syntax is very similar to all the models we fitted before, with a general formula describing our target variable yield and all the treatments, which are the fixed effects of the model. Then we have the option random, which allows us to include an additional random component for the clustering factor rep. In this case the ~1 indicates that the random effect will be associated with the intercept.
< o:p>
Once again we can use the function summary to explore our results:
> summary(lme1) Linear mixed-effects model fit by REML Data: dat AIC BIC logLik 27648.36 27740.46 -13809.18 Random effects: Formula: ~1 | rep (Intercept) Residual StdDev: 0.798407 13.3573 Fixed effects: yield ~ nf + bv * topo Value Std.Error DF t-value p-value (Intercept) 327.3304 14.782524 3428 22.143068 0 nfN1 3.9643 0.788049 3428 5.030561 0 nfN2 5.2340 0.790104 3428 6.624449 0 nfN3 5.4498 0.789084 3428 6.906496 0 nfN4 7.5286 0.789551 3428 9.535320 0 nfN5 7.7254 0.789111 3428 9.789976 0 bv -1.4685 0.085507 3428 -17.173569 0 topoHT -233.3675 17.143956 3428 -13.612232 0 topoLO -251.9750 20.967003 3428 -12.017693 0 topoW -146.4066 16.968453 3428 -8.628162 0 bv:topoHT 1.1945 0.097696 3428 12.226279 0 bv:topoLO 1.4961 0.123424 3428 12.121624 0 bv:topoW 0.7873 0.097865 3428 8.044485 0
We can also use the function Anovato display the ANOVA table:< o:p>
> Anova(lme2, type=c("III")) Analysis of Deviance Table (Type III tests) Response: yield Chisq Df Pr(>Chisq) (Intercept) 752.25 1 < 2.2e-16 *** nf 155.57 5 < 2.2e-16 *** bv 291.49 1 < 2.2e-16 *** topo 236.52 3 < 2.2e-16 *** year 797.13 1 < 2.2e-16 *** bv:topo 210.38 3 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We might be interested in understanding if fitting a more complex model provides any advantage in terms of accuracy, compared with a model where no additional random effect is included. To do so we can compare this new model with mod6, which we created with the glsfunction and includes the same treatment structure. We can do that with the function anova, specifying both models:< o:p>
> anova(lme1, mod6) Model df AIC BIC logLik Test L.Ratio p-value lme1 1 15 27648.36 27740.46 -13809.18 mod6 2 14 27651.21 27737.18 -13811.61 1 vs 2 4.857329 0.0275
As you can see there is a decrease in AIC for the model fitted with lme, and the difference is significant (p-value below 0.05). Therefore this new model where clustering is accounted for is better than the one without an additional random effect, even though only slightly. In this case we would need to decide if fitting a more complex model (which is probably more difficult to explain to readers) is the best way.< o:p>
Another way to assess the accuracy of GLS and Mixed Effects Models is through the use of pseudo R-Squared, which are indexes that can be interpreted as the normal R-Squared but calculated in different ways, since in these more complex models we do not calculate the sum of squares.
There are two important functions for this, both included in the package MuMIn:
Another way to assess the accuracy of GLS and Mixed Effects Models is through the use of pseudo R-Squared, which are indexes that can be interpreted as the normal R-Squared but calculated in different ways, since in these more complex models we do not calculate the sum of squares.
There are two important functions for this, both included in the package MuMIn:
library(MuMIn) > r.squaredLR(mod6) [1] 0.5469906 attr(,"adj.r.squared") [1] 0.5470721 > r.squaredGLMM(lme1) R2m R2c 0.5459845 0.5476009
The first function r.squaredLR can be used for GLS models and provides both and R-Squared and an Adjusted R-Squared. The second function, r.squaredGLMM, is specific for mixed-effects models and provides two measures: R2m and R2c. The first reports the R2 of the model with just fixed effects, while the second the R2 of the full model.
In this case we can see again that the R2 are similar between models, and most importantly R2c is only slightly different compared to R2m, which means that including random effects does not improve the accuracy.
In this case we can see again that the R2 are similar between models, and most importantly R2c is only slightly different compared to R2m, which means that including random effects does not improve the accuracy.
Random Intercept and Slope for repeated measures
If we collected data at several time steps we are looking at a repeated measures analysis. The code to create such a model is the following:< o:p>
lme2 = lme(yield ~ nf + bv * topo + year, random= ~year|rep, data=dat) summary(lme2) Anova(lme2, type=c("III"))
The syntax is very similar to what we wrote before, except that now the random component includes both time and clusters. Again we can use summary and Anovato get more info about the model. We can also use again the function anova to compare this with the previous model:< o:p>
> anova(lme1, lme2) Model df AIC BIC logLik Test L.Ratio p-value lme1 1 15 27648.36 27740.46 -13809.18 lme2 2 18 26938.83 27049.35 -13451.42 1 vs 2 715.5247 <.0001 Warning message: In anova.lme(lme1, lme2) : fitted objects with different fixed effects. REML comparisons are not meaningful.
From this output it is clear that the new model is better that the one before and their difference in highly significant.< o:p>
We can extract only the effects for the random components using the function ranef:< o:p>
> ranef(lme2) (Intercept) year R1 -0.3468601 -1.189799e-07 R2 -0.5681688 -1.973702e-07 R3 0.9150289 3.163501e-07
This tells us the changes in yield for each cluster and time step. We can also do the same for the fixed effects, and this will return the coefficients of the model:< o:p>
> fixef(lme2) (Intercept) nfN1 nfN2 nfN3 nfN4 -1.133614e+04 3.918006e+00 5.132136e+00 5.368513e+00 7.464542e+00 nfN5 bv topoHT topoLO topoW 7.639337e+00 -1.318391e+00 -2.049979e+02 -2.321431e+02 -1.136168e+02 year bv:topoHT bv:topoLO bv:topoW 5.818826e+00 1.027686e+00 1.383705e+00 5.998379e-01
To have an idea of their confidence interval we can use the function intervals:< o:p>
> intervals(lme2, which = "fixed") Approximate 95% confidence intervals Fixed effects: lower est. upper (Intercept) -1.214651e+04 -1.133614e+04 -1.052576e+04 nfN1 2.526139e+00 3.918006e+00 5.309873e+00 nfN2 3.736625e+00 5.132136e+00 6.527648e+00 nfN3 3.974809e+00 5.368513e+00 6.762216e+00 nfN4 6.070018e+00 7.464542e+00 8.859065e+00 nfN5 6.245584e+00 7.639337e+00 9.033089e+00 bv -1.469793e+00 -1.318391e+00 -1.166989e+00 topoHT -2.353450e+02 -2.049979e+02 -1.746508e+02 topoLO -2.692026e+02 -2.321431e+02 -1.950836e+02 topoW -1.436741e+02 -1.136168e+02 -8.355954e+01 year 5.414742e+00 5.818826e+00 6.222911e+00 bv:topoHT 8.547273e-01 1.027686e+00 1.200644e+00 bv:topoLO 1.165563e+00 1.383705e+00 1.601846e+00 bv:topoW 4.264933e-01 5.998379e-01 7.731826e-01 attr(,"label") [1] "Fixed effects:"
Syntax with lme4
Another popular package to perform mixed-effects models we could also use the package lme4 and the function lmer.
For example, to fit the model with random intercept (what we called lme1) we would use the following syntax in lme4:
> lmer1 = lmer(yield ~ nf + bv * topo + (1|rep), data=dat) > > summary(lmer1) Linear mixed model fit by REML ['lmerMod'] Formula: yield ~ nf + bv * topo + (1 | rep) Data: dat REML criterion at convergence: 27618.4 Scaled residuals: Min 1Q Median 3Q Max -3.4267 -0.7767 -0.1109 0.7196 3.6892 Random effects: Groups Name Variance Std.Dev. rep (Intercept) 0.6375 0.7984 Residual 178.4174 13.3573 Number of obs: 3443, groups: rep, 3 Fixed effects: Estimate Std. Error t value (Intercept) 327.33043 14.78252 22.143 nfN1 3.96433 0.78805 5.031 nfN2 5.23400 0.79010 6.624 nfN3 5.44980 0.78908 6.906 nfN4 7.52862 0.78955 9.535 nfN5 7.72537 0.78911 9.790 bv -1.46846 0.08551 -17.174 topoHT -233.36750 17.14396 -13.612 topoLO -251.97500 20.96700 -12.018 topoW -146.40655 16.96845 -8.628 bv:topoHT 1.19446 0.09770 12.226 bv:topoLO 1.49609 0.12342 12.122 bv:topoW 0.78727 0.09786 8.044 Correlation matrix not shown by default, as p = 13 > 12. Use print(x, correlation=TRUE) or vcov(x) if you need it
There are several differences between nlme and lme4 and I am not sure which is actually better. What I found is that probably lme4 is the most popular, but nlme is used for example to fit generalized addictive mixed effects models in the package mgcv. For this reason probably the best thing would be to know how to use both packages.
As you can see from the summary above, in this table there are no p-values, so it is a bit difficult to know which levels are significant for the model. We can solve this by installing and/or loading the package lmerTest. If we load lmerTest and run again the same function we would obtain the following summary table:
> lmer1 = lmer(yield ~ nf + bv * topo + (1|rep), data=dat) > > summary(lmer1) Linear mixed model fit by REML t-tests use Satterthwaite approximations to degrees of freedom [ lmerMod] Formula: yield ~ nf + bv * topo + (1 | rep) Data: dat REML criterion at convergence: 27618.4 Scaled residuals: Min 1Q Median 3Q Max -3.4267 -0.7767 -0.1109 0.7196 3.6892 Random effects: Groups Name Variance Std.Dev. rep (Intercept) 0.6375 0.7984 Residual 178.4174 13.3573 Number of obs: 3443, groups: rep, 3 Fixed effects: Estimate Std. Error df t value Pr(>|t|) (Intercept) 327.33043 14.78252 3411.00000 22.143 < 2e-16 *** nfN1 3.96433 0.78805 3428.00000 5.031 5.14e-07 *** nfN2 5.23400 0.79010 3428.00000 6.624 4.03e-11 *** nfN3 5.44980 0.78908 3428.00000 6.906 5.90e-12 *** nfN4 7.52862 0.78955 3428.00000 9.535 < 2e-16 *** nfN5 7.72537 0.78911 3428.00000 9.790 < 2e-16 *** bv -1.46846 0.08551 3428.00000 -17.174 < 2e-16 *** topoHT -233.36750 17.14396 3429.00000 -13.612 < 2e-16 *** topoLO -251.97500 20.96700 3430.00000 -12.018 < 2e-16 *** topoW -146.40655 16.96845 3430.00000 -8.628 < 2e-16 *** bv:topoHT 1.19446 0.09770 3429.00000 12.226 < 2e-16 *** bv:topoLO 1.49609 0.12342 3430.00000 12.122 < 2e-16 *** bv:topoW 0.78727 0.09786 3430.00000 8.044 1.33e-15 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Correlation matrix not shown by default, as p = 13 > 12. Use print(x, correlation=TRUE) or vcov(x) if you need it
As you can see now the p-values are showing and we can assess the significance for each term.
Clearly, all the functions we used above for the function lme can be used also with the package lme4 and lmerTest. For example, we can produce the anova table with the function anova or compute the R Squared with the function r.squaredGLMM.
When we are dealing with random slope and intercept we would use the following syntax:
lmer2 = lmer(yield ~ nf + bv * topo + year + (year|rep), data=dat)
To leave a comment for the author, please follow the link and comment on their blog: R tutorial for Spatial Statistics.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.