Dealing with correlation in designed field experiments: part II

R on Fixing the bridge between biologists and statisticians

23 hours ago

[This article was first published on R on Fixing the bridge between biologists and statisticians, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

With field experiments, studying the correlation between the observed traits may not be an easy task. For example, we can consider a genotype experiment, laid out in randomised complete blocks, with 27 wheat genotypes and three replicates, where several traits were recorded, including yield (Yield) and weight of thousand kernels (TKW). We might be interested in studying the correlation between those two traits, but we would need to face two fundamental problems:

the concept of correlation in such a setting is not unique, as we might either consider the correlation between the plot measurements, or the correlation between the residuals or the correlation between genotype means or the correlation between block means;
the experimental units are not independent, but they are grouped by genotype and block, which invalidate all inferences based on the independence assumption.

I have dealt with these two problems (particularly the first one) in a previous post, where I gave a solution based on traditional methods of data analyses.

In this post, I would like to present a more advanced solution, that was advocated by Hans-Peter Piepho in a relatively recent manuscript (Piepho, 2018). Such a solution is based on mixed models and it was implemented in SAS, by using PROC MIXED. I spent a few hours ‘transporting’ those models in R, which turned out to be a difficult task, although, in the end, I seem to have came to an acceptable solution, which I would like to share here.

First of all, we can load the ‘WheatQuality.csv’ dataset, that is available is available on an Internet repository; it consists of 81 records (plots) of 6 variables, i.e. the Genotype and Block factors, as well as the four responses height, TKW, weight per hectolitre and yield. The code below loads the necessary packages, the data and transforms the numeric variable ‘Block’ into a factor.

rm(list = ls())
library(dplyr)
library(sommer)
library(nlme)
#
# Loading data
dataset <- read.csv("https://www.casaonofri.it/_datasets/WheatQuality.csv") |>
  mutate(Block = factor(Block))
head(dataset)
##     Genotype Block Height  TKW Whectol Yield
## 1 arcobaleno     1     90 44.5    83.2 64.40
## 2 arcobaleno     2     90 42.8    82.2 60.58
## 3 arcobaleno     3     88 42.7    83.1 59.42
## 4       baio     1     80 40.6    81.8 51.93
## 5       baio     2     75 42.7    81.3 51.34
## 6       baio     3     76 41.1    81.1 47.78

Piepho (2018) showed that, for an experiment like this one, all the correlation coefficients can be estimated by coding a multi-response mixed model, as follows:

\[ Y_{ijk} = \mu_i + \beta_{ik} + \tau_{ij} + \epsilon_{ijk}\]

where \(Y_{ijk}\) is the response for the trait \(i\), the genotype \(j\) and the block \(k\), \(\mu_i\) is the mean for the trait \(i\), \(\beta_{ik}\) is the effect of the block \(k\) and trait \(i\), \(\tau_{ij}\) is the effect of genotype \(j\) for the trait \(i\) and \(\epsilon_{ijk}\) is the residual for the trait \(i\), the genotype \(j\) and the block \(k\).

In the above model, the residuals \(\epsilon_{ijk}\) need to be normally distributed and heteroscedastic, with trait-specific variances. Furthermore, residuals belonging to the same plot (the two observed values for the two traits) need to be correlated (correlation of errors).

Hans-Peter Piepho, in his paper, put forward the idea that the ‘genotype’ and ‘block’ effects for the two traits can be taken as random, which makes sense, because, for this application, we are mainly interested in variances and covariances. Both random effects (for the genotype and for the block terms) need to be heteroscedastic (trait-specific variance components) and there must be a correlation between the two traits.

It should be noted that, for other applications, the genotype and block effects (especially the former) might be better regarded as fixed, but we will not pursue such an idea in this post.

Fitting a bivariate model

To the best of my knowledge, there is no way to fit such a complex model with the ‘nlme’ package and related ‘lme()’ function (I’ll gave a hint later on, for a simpler model). In a previous post at this link, I have given a solution based on the ‘asreml’ package (Butler et al., 2018), but this is a paid option. In more recent times I have discovered the ‘sommer’ package (Covarrubias-Pazaran, 2016), which seems to be very useful and it is suitable to deal with the requirements of this post. The key function of ‘sommer’ is mmer(), and, in order to fit the above model, we need to specify the following components.

The response variables. When we set a bivariate model with ‘sommer’, we can ‘cbind()’ Yield and TKW.
The fixed model, that does not contain any effects, but the intercept (by default, the means for the two effects are separately estimated, as in Piepho, 2018).
The random model, that is composed by the ‘genotype’ and ‘block’ effects. For both, I specified a general unstructured variance covariance matrix, so that we can estimate two different variance components (one per trait) and one covariance component. The resulting coding is ‘~ vsr(usr(Genotype)) + vsr(usr(Block))’.
The residual structure, where the two observations in the same plot are heteroscedastic and correlated. This structure is fitted by default and it does not require any specific coding.

The model call is:

mod.bimix <- mmer(cbind(Yield, TKW) ~ 1,
                   random = ~ vsr(usr(Genotype)) + vsr(usr(Block)),
                   data = dataset,
                  verbose = FALSE, dateWarning = FALSE)
bimix.obj <- summary(mod.bimix)
coefs <- bimix.obj$varcomp
coefs
##                           VarComp  VarCompSE     Zratio Constraint
## u:Genotype.Yield-Yield 77.6342608 22.0978545  3.5132035   Positive
## u:Genotype.Yield-TKW   38.8320973 15.0930429  2.5728475   Unconstr
## u:Genotype.TKW-TKW     53.8613303 15.3539585  3.5079768   Positive
## u:Block.Yield-Yield     3.7104682  3.9363372  0.9426195   Positive
## u:Block.Yield-TKW      -0.2428322  1.9074202 -0.1273092   Unconstr
## u:Block.TKW-TKW         1.6684549  1.8343512  0.9095613   Positive
## units.Yield-Yield       6.0939217  1.1951482  5.0988836   Positive
## units.Yield-TKW         0.1635821  0.7242898  0.2258518   Unconstr
## units.TKW-TKW           4.4718011  0.8770118  5.0989065   Positive

The box above shows the results about the variance-covariance parameters. In order to get the correlations, I used the function vpredict() and specified the necessary combinations of variance-covariance parameters. It is necessary to remember that estimates, in ‘sommer’, are named as V1, V2, … Vn, according to their ordering in model output.

# Correlation between genotype means
vpredict(mod.bimix, rg ~ V2 / (sqrt(V1)*sqrt(V3) ))
##     Estimate        SE
## rg 0.6005174 0.1306699
# Correlation between block means
vpredict(mod.bimix, rt ~ V5 / (sqrt(V4)*sqrt(V6) ))
##       Estimate        SE
## rt -0.09759658 0.7571256
vpredict(mod.bimix, rt ~ V8 / (sqrt(V7)*sqrt(V9) ))
##      Estimate        SE
## rt 0.03133619 0.1385421

We see that the estimates are very close to those obtained by using the Pearson’s correlation coefficients (see my previous post). The advantage of this mixed model solution is that we can also test hypotheses in a relatively reliable way. For example, I tested the hypothesis that residuals are not correlated by:

coding a reduced model where residuals are heteroscedastic and independent, and
comparing this reduced model with the complete model by way of a REML-based Likelihood Ratio Test (LRT).

Removing the correlation of residuals is easily done, by using the code below.

mod.bimix2 <- mmer(cbind(Yield, TKW) ~ 1,
                   random = ~ vsr(usr(Genotype)) + vsr(usr(Block)),
                   rcov = ~ vsr(units, Gtc = diag(2)),
                   data = dataset,
                  verbose = FALSE, dateWarning = FALSE)
# Likelihood ratio test
LRT <- anova(mod.bimix, mod.bimix2)
## Likelihood ratio test for mixed models
## ==============================================================
##      Df       AIC       BIC    loLik   Chisq ChiDf PrChisq
## mod1 15 -62.41068 -56.23549 33.20534                      
## mod2 14 -62.35960 -56.18441 33.17980 0.05108     1 0.8212 
## ==============================================================
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Likewise, I tried to further reduce the model to test the significance of the correlation between block means and genotype means.

# Removing correlation between block means
mod.bimix3 <- mmer(cbind(Yield, TKW) ~ 1,
                   random = ~ vsr(usr(Genotype)) + vsr(Block, Gtc = diag(2)),
                   rcov = ~ vsr(units, Gtc = diag(2)),
                   data = dataset,
                  verbose = FALSE, dateWarning = FALSE)
# Likelihood ratio test
LRT <- anova(mod.bimix, mod.bimix3)
## Likelihood ratio test for mixed models
## ==============================================================
##      Df       AIC       BIC    loLik   Chisq ChiDf  PrChisq
## mod1 15 -62.41068 -56.23549 33.20534                       
## mod2 13 -62.34401 -56.16882 33.17200 0.06667     2 0.96722 
## ==============================================================
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Removing correlation between genotype means
mod.bimix4 <- mmer(cbind(Yield, TKW) ~ 1,
                   random = ~ vsr(Genotype, Gtc = diag(2)) + vsr(Block, Gtc = diag(2)),
                   rcov = ~ vsr(units, Gtc = diag(2)),
                   data = dataset,
                  verbose = FALSE, dateWarning = FALSE)
# Likelihood ratio test
LRT <- anova(mod.bimix, mod.bimix4)
## Likelihood ratio test for mixed models
## ==============================================================
##      Df       AIC       BIC    loLik    Chisq ChiDf  PrChisq
## mod1 15 -62.41068 -56.23549 33.20534                        
## mod2 12 -51.42519 -45.25000 27.71260 10.98549     3 0.0118 *
## ==============================================================
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that only the genotype means are significantly correlated to one another.

A solution with ‘lme()’

Considering that the block means are not correlated, if we were willing to take the ‘block’ effect as fixed, we could fit a bivariate mixed model also with the ‘nlme’ package and the function lme() (Pinheiro and Bates, 2000). However, we should cast the model as a univariate model.

To this aim, the two variables (Yield and TKW) need to be piled up and a new factor (‘Trait’) needs to be introduced to identify the observations for the two traits. Another factor is also necessary to identify the different plots, i.e. the observational units. To perform such a restructuring, I used the melt() function in the ‘reshape2’ package (Wickham, 2007) and assigned the name ‘Y’ to the response variable, that is now composed by the two original variables Yield and TKW, one on top of the other.

dataset$Plot <- 1:81
mdataset <- reshape2::melt(dataset[,c(-3,-5)], variable.name= "Trait", value.name="Y", id=c("Genotype", "Block", "Plot"))
mdataset$Block <- factor(mdataset$Block)
head(mdataset)
##     Genotype Block Plot Trait    Y
## 1 arcobaleno     1    1   TKW 44.5
## 2 arcobaleno     2    2   TKW 42.8
## 3 arcobaleno     3    3   TKW 42.7
## 4       baio     1    4   TKW 40.6
## 5       baio     2    5   TKW 42.7
## 6       baio     3    6   TKW 41.1
tail(mdataset)
##     Genotype Block Plot Trait     Y
## 157  vesuvio     1   76 Yield 54.37
## 158  vesuvio     2   77 Yield 55.02
## 159  vesuvio     3   78 Yield 53.41
## 160 vitromax     1   79 Yield 54.39
## 161 vitromax     2   80 Yield 50.97
## 162 vitromax     3   81 Yield 48.83

The fixed model is:

Y ~ Trait*Block

In order to introduce a totally unstructured variance-covariance matrix for the random effect, I used the ‘pdMat’ construct:

random = list(Genotype = pdSymm(~Trait - 1))

Relating to the residuals, heteroscedasticity can be included by using the ‘weight()’ argument and the ‘varIdent’ variance function, which allows a different standard deviation per trait:

weight = varIdent(form = ~1|Trait)

I also allowed the residuals to be correlated within each plot, by using the ‘correlation’ argument and specifying a general ‘corSymm()’ correlation form. Plots are nested within genotypes, thus I used a nesting operator, as follows:

correlation = corSymm(form = ~1|Genotype/Plot)

The final model call is:

mod <- lme(Y ~ Trait*Block, 
                 random = list(Genotype = pdSymm(~Trait - 1)),
                 weight = varIdent(form=~1|Trait), 
                 correlation = corCompSymm(form=~1|Genotype/Plot),
                 data = mdataset)

Retrieving the variance-covariance matrices needs some effort, as the function ‘getVarCov()’ does not appear to work properly with this multi-stratum model. First of all, we can retrieve the variance-covariance matrix for the genotype random effect (G) and the corresponding correlation matrix.

#Variance structure for random effects
obj <- mod
G <- matrix( as.numeric(getVarCov(obj)), 2, 2 )
G
##          [,1]     [,2]
## [1,] 53.86081 38.83246
## [2,] 38.83246 77.63485
cov2cor(G)
##           [,1]      [,2]
## [1,] 1.0000000 0.6005237
## [2,] 0.6005237 1.0000000

Second, we can retrieve the ‘conditional’ variance-covariance matrix (R), that describes the correlation of errors:

#Conditional variance-covariance matrix (residual error)
V <- corMatrix(obj$modelStruct$corStruct)[[1]] #Correlation for residuals
sds <- 1/varWeights(obj$modelStruct$varStruct)[1:2]
sds <- obj$sigma * sds #Standard deviations for residuals (one per trait)
R <- t(V * sds) * sds #Going from correlation to covariance
R
##           [,1]      [,2]
## [1,] 4.4718234 0.1635375
## [2,] 0.1635375 6.0939251
cov2cor(R)
##            [,1]       [,2]
## [1,] 1.00000000 0.03132756
## [2,] 0.03132756 1.00000000

The total correlation matrix is simply obtained as the sum of G and R:

Tr <- G + R
cov2cor(Tr)
##           [,1]      [,2]
## [1,] 1.0000000 0.5579906
## [2,] 0.5579906 1.0000000

We see that the same results can be obtained by using ‘sommer’ and regarding the block effect as fixed, although the coding is below is much neater!

mod.bimix5 <- mmer(cbind(Yield, TKW) ~ Block,
                   random = ~ vsr(usr(Genotype)),
                   data = dataset,
                  verbose = FALSE, dateWarning = FALSE)
mod.bimix5$sigma
## $`u:Genotype`
##          Yield      TKW
## Yield 77.63425 38.83209
## TKW   38.83209 53.86133
## 
## $units
##           Yield       TKW
## Yield 6.0939217 0.1635824
## TKW   0.1635824 4.4718011

Hope this was useful… should you have any better solutions, I’d be happy to learn them; please, drop me a line at the address below. Thanks for reading and happy coding!

Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
andrea.onofri@unipg.it

Follow @onofriandreapg

References

Butler, D., Cullis, B.R., Gilmour, A., Gogel, B., Thomson, R., 2018. ASReml-r reference manual – version 4. UK.
Covarrubias-Pazaran, G., 2016. Genome-Assisted Prediction of Quantitative Traits Using the R Package sommer. PLOS ONE 11, e0156744. https://doi.org/10.1371/journal.pone.0156744
Piepho, H.-P., 2018. Allowing for the structure of a designed experiment when estimating and testing trait correlations. The Journal of Agricultural Science 156, 59–70.
Pinheiro, J.C., Bates, D.M., 2000. Mixed-effects models in s and s-plus. Springer-Verlag Inc., New York.
Wickham, H., 2007. Reshaping data with the reshape package. Journal of Statistical Software 21.

To leave a comment for the author, please follow the link and comment on their blog: R on Fixing the bridge between biologists and statisticians.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Fitting a bivariate model

A solution with ‘lme()’

References

Related