Longitudinal analysis: autocorrelation makes a difference

Luis

10 years ago

[This article was first published on Quantum Forest » rblogs, and kindly contributed to R-bloggers.]


Back to posting after a long weekend and more than enough rugby coverage to last a few years. Anyway, back to linear models, where we usually assume normality, independence and homogeneous variances. In most statistics courses we live in a fantasy world where we meet all of the assumptions, but in real life—and trees and forests are no exceptions—there are plenty of occasions when we can badly deviate from one or more assumptions. In this post I present a simple example, where we have a number of clones (genetically identical copies of a tree), which had between 2 and 4 cores extracted, and each core was assessed for acoustic velocity (we care about it because it is inversely related to longitudinal shrinkage and its square is proportional to wood stiffness) every two millimeters. This small dataset is only a pilot for a much larger study currently underway.

At this stage I will ignore any relationship between the clones and focus on the core assessments. Let’s think for a moment: we have five replicates (which restrict the randomization) and four clones (A, B, C and D). We have (mostly) 2 to 4 cores (cylindrical pieces of wood covering from tree pith to cambium) within each tree, and we have longitudinal assessments for each core. I would expect that, at the very least, successive assessments within a core are not independent; that is, assessments that are closer together are more similar than those that are farther apart. What does the data look like? The trellis plot shows trees using a Clone:Rep notation:

library(lattice)
# One panel per tree, one line per core within each tree
xyplot(velocity ~ distance | Tree, groups = Core,
       data = cd, type = 'l')

Incidentally, cores from Clone C in replicate four were damaged, so I dropped them from this example (real life is unbalanced as well!). Just in case, distance is in mm from the tree pith and velocity in m/s. Now we will fit an analysis that totally ignores any relationship between the successive assessments:

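library(nlme)
# Random intercepts for replicate, tree within replicate and core within tree;
# residuals within a core are treated as independent
lm1a = lme(velocity ~ Clone*distance,
           random = ~ 1|Rep/Tree/Core, data = cd)
summary(lm1a)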

Linear mixed-effects model fit by REML
 Data: cd
      AIC      BIC   logLik
  34456.8 34526.28 -17216.4

Random effects:
 Formula: ~1 | Rep
        (Intercept)
StdDev:    120.3721

 Formula: ~1 | Tree %in% Rep
        (Intercept)
StdDev:    77.69231

 Formula: ~1 | Core %in% Tree %in% Rep
        (Intercept) Residual
StdDev:    264.6254 285.9208

Fixed effects: velocity ~ Clone * distance
                     Value Std.Error   DF  t-value p-value
(Intercept)       3274.654 102.66291 2379 31.89715  0.0000
CloneB             537.829 127.93871   11  4.20380  0.0015
CloneC             209.945 137.10691   11  1.53125  0.1539
CloneD             293.840 124.08420   11  2.36807  0.0373
distance            14.220   0.28607 2379 49.70873  0.0000
CloneB:distance     -0.748   0.44852 2379 -1.66660  0.0957
CloneC:distance     -0.140   0.45274 2379 -0.30977  0.7568
CloneD:distance      3.091   0.47002 2379  6.57573  0.0000

anova(lm1a)
               numDF denDF  F-value p-value
(Intercept)        1  2379 3847.011  <.0001
Clone              3    11    4.054  0.0363
distance           1  2379 7689.144  <.0001
Clone:distance     3  2379   22.468  <.0001
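Before changing the model, it is worth checking whether the residuals of a fit like lm1a really behave as if they were independent. A minimal sketch follows, assuming nlme's bundled Ovary data as a stand-in because the cd data frame is not distributed with this post; the model formula is the one used in the nlme help pages, not the clone model. ACF() computes the empirical autocorrelation of the within-group residuals at increasing lags. With the post's own objects the equivalent call would be ACF(lm1a, maxLag = 10), assuming the rows are sorted by distance within each core.

```r
library(nlme)

# Stand-in data: follicle counts over time for several mares (ships with nlme).
# It replaces the post's `cd` data frame, which is not publicly available.
fit_indep <- lme(follicles ~ sin(2*pi*Time) + cos(2*pi*Time),
                 data = Ovary,
                 random = pdDiag(~ sin(2*pi*Time)))

# Empirical autocorrelation of the within-mare residuals, lags 0 to 10.
# Lags count successive rows within a group, so rows must be in time order.
acf_indep <- ACF(fit_indep, maxLag = 10)
print(acf_indep)

# Bars extending beyond the critical-value lines suggest correlation
# that the independence model ignores
plot(acf_indep, alpha = 0.01)
```

Large autocorrelations at short lags are a hint that a correlation structure should be added to the model.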

Incidentally, our assessment setup looks like this. The nice thing about having good technicians (Nigel made the tool frame), collaborating with other departments (Electrical Engineering, where Michael and students designed the electronics and software for signal processing) and other universities (Uni of Auckland, where Paul, who cored the trees and ran the machine, works) is that one gets involved in really cool projects.

What happens if we actually allow for an autoregressive process?

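# Same model, but residuals within a core follow a continuous-time AR(1)
# process indexed by distance; value = 0.8 is only the starting value for Phi
lm1b = lme(velocity ~ Clone*distance,
           random = ~ 1|Rep/Tree/Core, data = cd,
           correlation = corCAR1(value = 0.8,
                                 form = ~ distance | Rep/Tree/Core))
summary(lm1b)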

Linear mixed-effects model fit by REML
 Data: cd
       AIC      BIC    logLik
  29843.45 29918.72 -14908.73

Random effects:
 Formula: ~1 | Rep
        (Intercept)
StdDev:     60.8209

 Formula: ~1 | Tree %in% Rep
        (Intercept)
StdDev:    125.3225

 Formula: ~1 | Core %in% Tree %in% Rep
        (Intercept) Residual
StdDev:   0.3674224 405.2818

Correlation Structure: Continuous AR(1)
 Formula: ~distance | Rep/Tree/Core
 Parameter estimate(s):
      Phi
0.9803545
Fixed effects: velocity ~ Clone * distance
                   Value Std.Error   DF   t-value p-value
(Intercept)     3297.517 127.98953 2379 25.763960  0.0000
CloneB           377.290 183.16795   11  2.059804  0.0639
CloneC           174.986 195.21327   11  0.896383  0.3892
CloneD           317.581 178.01710   11  1.783994  0.1020
distance          15.209   1.26593 2379 12.013979  0.0000
CloneB:distance    0.931   1.94652 2379  0.478342  0.6325
CloneC:distance   -0.678   2.00308 2379 -0.338629  0.7349
CloneD:distance    2.677   1.95269 2379  1.371135  0.1705

anova(lm1b)
               numDF denDF  F-value p-value
(Intercept)        1  2379 5676.580  <.0001
Clone              3    11    2.483  0.1152
distance           1  2379  492.957  <.0001
Clone:distance     3  2379    0.963  0.4094
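Because both fits share the same fixed effects and were estimated by REML, they can be compared directly through their likelihoods; with the post's objects that would be anova(lm1a, lm1b). The sketch below shows the pattern on nlme's bundled Ovary data, used here only as a stand-in for cd (the model formula comes from the nlme help pages, not from this study, and the object names are made up for the example).

```r
library(nlme)

# Baseline: residuals assumed independent within each mare
fit_indep <- lme(follicles ~ sin(2*pi*Time) + cos(2*pi*Time),
                 data = Ovary,
                 random = pdDiag(~ sin(2*pi*Time)))

# Same model plus a continuous-time AR(1) correlation indexed by Time
fit_car1 <- update(fit_indep, correlation = corCAR1(form = ~ Time))

# Likelihood ratio test plus AIC/BIC; the models differ by one parameter (Phi)
cmp <- anova(fit_indep, fit_car1)
print(cmp)
```

In the output above, the AIC drops from 34456.8 to 29843.45 for a single extra parameter, which points in the same direction.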

In ASReml-R the first model (the one without autocorrelation) would look like this (same results, but many times faster):

as1a = asreml(velocity ~ Clone*distance,
              random = ~ Rep + Tree/Core,
              data = cd)
summary(as1a)
anova(as1a)

# I need to sort out my code for ar(1) and update to
# the latest version of asreml-r

Oops! What happened to the significance of Clone and its interaction with distance? Clone goes from p = 0.036 to p = 0.115, and the interaction from p < 0.0001 to p = 0.409: the perils of ignoring the independence assumption. But wait, isn’t an AR(1) process too simplistic to model the autocorrelation (as pointed out by D.J. Keenan when criticizing IPCC’s models and discussing Richard Muller’s new BEST project models)? In this case, probably not: we have a mostly increasing response, we have a clue about the processes driving the change, and there is far less noise than in climate data.
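To see what the fitted correlation implies, recall that under a continuous AR(1) structure the correlation between two residuals from the same core is Phi raised to the distance separating them, in the units of the covariate. A small sketch, assuming distance is recorded in mm and taking Phi from the summary(lm1b) output above; the separations other than the 2 mm sampling step are arbitrary values chosen for illustration:

```r
phi <- 0.9803545                   # Phi reported by summary(lm1b)
separation_mm <- c(2, 10, 20, 50)  # 2 mm is the sampling step; the rest are illustrative

# corCAR1 correlation between residuals `separation_mm` apart in the same core
implied_cor <- phi^separation_mm
print(data.frame(separation_mm, implied_cor = round(implied_cor, 3)))
```

Successive assessments therefore have an implied residual correlation of about 0.96, which goes some way towards explaining why the standard errors change so much once the correlation is modelled.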

Could we improve upon this model? Sure! We could add heterogeneous variances, explore non-linearities, take into account the genetic relationship between the trees, run the whole thing in asreml (so it is faster), etc. Nevertheless, at this point you can get an idea of some of the issues (or should I call them niceties?) involved in the analysis of experiments.
