My datasets tend to have random factors. I try to stick to general models whenever I can to avoid dealing with both random factors and complex error distributions (not always possible). I am compiling some notes here to avoid visiting the same R-forums every time I work in a new project. Those are some notes to self, but may be they are useful to someone else. Note that those are oversimplified notes, they assume you know your stats and your R and only want a cheat sheet. Also, they may contain errors/discrepancies with your philosophy (please notify me if you find errors, so I can update it!)
Only piece of advice. Plot everything and understand your data.
I’ll designate:
Response variable –this is what we want to model as y
.
Predictors –This is what we think is affecting the response variable as x
, z
when continuos, and A
, B
when discrete with levels denoted by A1
, A2
, etc…
Models abbreviated as m
.
Let’s start assuming normal error distributions:
1) Easy linear model with one predictor:
m <- lm(y ~ x)
then check residuals. You want to see no pattern in the following plot:
scatter.smooth(residuals(m)~fitted(m))
and a straight line here
qqplot(residuals(m))
qqline(residuals(m))
more diagnostics here: http://www.statmethods.net/stats/rdiagnostics.html
2) Let’s say you have also a Covariate z
m <- lm(y ~ z + x)
summary(m)
coef(m)
confint(m)
If you want the anova table
anova(m)
But the covariable should be added first because order matters. The result will be the variance explained by x after accounting for the variance due to A (that’s call a Type 1 error)
3) Otherwise what you actually have is not a covariable, but two predictors:
m <- lm(y ~ z + x)
summary(m) #same here
library(car)
Anova(m, Type = "II")
If there is indeed no interaction, then type II is statistically more powerful than type III
If the interaction is significant type III is more appropriate
m <- lm(y ~ z * x)
summary(m)
library(car)
Anova(m, Type = "III")
note: when your design is classic balanced ANOVA design, it doesn’t matter as all types gives you the same answer. More info here: http://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
To plot interactions:
interaction.plot(A, B, y)
#only works with A and B factors, so it
rarely works for me
Or customize it by discretizing one of the variables (e.g. cut(z)
) and fit lines for the two/three levels of z using the coefficients
4) How? By understanding the output, tricky specially when covariate is a factor.
m <- lm(y ~ A * x)
summary(m)
- (Intercept) Value is the intercept for A1 (v1)
- A2 value (v2) is the difference in the intercept between A1 and A2 (the intercept for A2 is calculated as v1-v2)
- x Value is the slope for x given A1 (v3)
- x:A2 Value (v4) is the difference in the slope for x between A1 and A2 (the slope of x for A2 is calculated as v3-v4)
5) Select models (by AIC
or compare two models anova
I know confusing name)
AIC(m)
library(MuMIn)
AICc(m) #for small sample size
anova(m1,m2)
But see below with random factors
6) Until now we wanted to see if there were differences among levels of A.
A included as fixed factor -> “the expectation of the distribution of effects”
lm(x ~ A + z)
But if we want to see if there are general trends regardless of A
, but we want to control for the variability due to A
. –> “average effect in the population”, we use:
a) random intercept model (we expect the trend to be equal, but some plants will differ on the intercept)
library(nlme)
lme(y ~ x, random = ~1 | A)
b) random slope model (we expect the trend to be dependent on species identity)
lme(y ~ x, random = ~1 + x | A)
In addition, now we can extract the random effects
m$coefficients
fixed and random factors, so we can see the random intercepts (and slopes) as well as
intervals(m)
7) To compare models in nlme
With different radom structures you should use method="REML"
as an option. To compare models with different fix effects structure using anova
use ML
, but give final p-values with REML
8) With complex models you can drege them to find the best ones
library(MuMIn)
m_set = dredge(m_full)
(top_m = get.models(m_set, subset = delta<3))
and compute averages the regular way without shrinkage:
model.avg(top_m)$avg.model
See also names
and str(model.avg(top_m))
Or estimates with shrinkage (zero method) – this makes the most difference for the less important predictors since they will get averaged with a bunch of zeros, probably. pmodel.avg(top_m)$coef.shrinkage
9) But what if residuals are bad?
Don’t panic yet. Some times is because variance is different across groups/values, you can investigate that by:
boxplot(A, residuals(m, type = "norm"))
plot(A, residuals(m, type = "norm"))
and fix it by incuding a variance factor in the models weights=varIdent(form=~1|A)
(for factors) or weights= varPower(form=~x)
for continous variables (you can also try VarExp
or varConstPower
)
And to check if it worked:
plot(residuals(m, type = "normalized") ~ fitted(m))
The ref is Cleasby IR, Nakagawa S. 2011. Neglected biological patterns in the residuals. Behavioral Ecology and Sociobiology, 65(12), 2361-2372.
note that lm
can not use the weights option, but I think, but you can use gls
, instead.
10) Is also recomended to center scale(x, scale = FALSE)
or scale scale(x)
predictors.
Reference: Schielzeth, H. (2010), Simple means to improve the interpretability of regression coefficients. Methods in Ecology and Evolution, 1: 103–113.
11) You should also worry about co-linearity
library(car)
vif(m)
Should be lower than 2.5, and you should be very concerned when is over 10
or for lme’s load this little function from here (https://github.com/aufrank/R-hacks/blob/master/mer-utils.R)
vif.mer <- function (fit) {
## adapted from rms::vif
v <- vcov(fit)
nam <- names(fixef(fit))
## exclude intercepts
ns <- sum(1 * (nam == "Intercept" | nam == "(Intercept)"))
if (ns > 0) {
v <- v[-(1:ns), -(1:ns), drop = FALSE]
nam <- nam[-(1:ns)] }
d <- diag(v)^0.5
v <- diag(solve(v/(d %o% d)))
names(v) <- nam v }
vif.mer(m)
12) and you may want to get R2 for glm also…
Based in Nakagawa, S., Schielzeth, H. (2013), A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4: 133–142.
library(MuMIn)
r.squaredGLMM(m)
——————————————————————————————————
More details and plotting here: https://www.zoology.ubc.ca/~schluter/R/fit-model/
For GLMM’s (generalized) there is a whole new set of tricky things, maybe will be another gist.