Site icon R-bloggers

R: Nutrient intake data, mixed methods analysis for males

[This article was first published on R in the Antipodes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
To recap:
The code below is my implementation of the mixtran sas macro for amount, where the nutrient is consumed daily. This means that a bunch of the macro code is irrelevant for the current analysis, as we don’t need to estimate daily consumption probabilities, or apply them in the analysis, because probability of consumption = 1.

A boxcox analysis is performed to identify the best transformation to normality prior to undertaking the mixed methods analysis. The data contains age as single year, so I am interested in the effect of analysing age as both
  1. age bands used for nutrient reference values – there are standard age bands used for the reference values, so it makes sense to create age factors using these bands and then analyse age as a factor based on these bands, and
  2. a continuous variable, although because the method below is linear it assumes a linear effect of age on nutrient consumption and this is probably an unrealistic assumption
The R code is below. Code lines mentioned relate to the line in the mixtran macro linked above. Note that this method is using a linear mixed effects model on the data, the SAS macro uses a nonlinear mixed effects model, but I am currently unsuccessful in duplicating that exact model into R. The data that I have is much less complex from a covariates perspective compared to the data that was analysed in SAS, so at this point I am not too worried about the change in this part of the method.
#The data equivalency with the SAS macro data:
#SAS macro variable        SAS                R
#subject                seqn            RespondentID
#response                add_sug            IntakeAmt
#repeat                    drddaycd        IntakeDay
#covars_amt                {Age groups}    AgeFactor        Note: no dummies as AgeFactor is an ordered factor
#replicate_var            rndw1            SampleWeight    Note: used as a frequency variable (“replicate” in SAS), used as weight in R
#seq                    seq2            IntakeDay        Note: no dummy as IntakeDay is an ordered factor
#                        indwt            SampleWeight

#ANALYSIS OF MALE DATA – STEP 1 OF 2
#MIXTRAN
#Delete any unused RespondentID levels that are related to the female subjects
Male.Data$RespondentID <- Male.Data$RespondentID[,drop=TRUE]
#row 419 reduce the data to one record per person to calculate weights
Male.Subset <- subset(Male.Data,!duplicated(RespondentID), select=c(RespondentID, SampleWeight, IntakeAmt, AgeFactor))
#row 446 add the total weights  to the data frame
Male.Subset$TotWts <- sum(Male.Subset$SampleWeight)
#row 455 create the subgroup weights, these are all by age only
Male.Grpwt <- aggregate(Male.Subset$SampleWeight,by= data.frame(Male.Subset$AgeFactor),sum)
names(Male.Grpwt)[names(Male.Grpwt)==”Male.Subset.AgeFactor”] <- “AgeFactor”
names(Male.Grpwt)[names(Male.Grpwt)==”x”] <- “GrpWts”
Male.Subset <- merge(Male.Subset,Male.Grpwt, by=c(“AgeFactor”))
#row 472 get the number of persons in total and by subgroup if given
#the SAS macro produces a dataset of 1 row per agegroup, with agegrp, num subjects in age group, total num subjects output
count <- function(x) {
  length(na.omit(x))
}
Numsub <- aggregate(Male.Subset$RespondentID,by=data.frame(Male.Subset$AgeFactor),count)
names(Numsub)[names(Numsub)==”Male.Subset.AgeFactor”] <- “AgeFactor”
names(Numsub)[names(Numsub)==”x”] <- “NumSubjects”
Numsub$TotSubjects <- sum(Numsub$NumSubjects)
#row 492 merge numsub to _persons
Male.Subset <- merge(Male.Subset,Numsub, by=c(“AgeFactor”))
Male.Subset <- Male.Subset[order(Male.Subset$RespondentID),]
#END OF GENERAL SET UP
#convert proc transreg, from row 803
#NOTE: CODE ONLY USED TO IDENTIFY THE LAMBDA VALUE BETWEEN 0.05 AND 1 ASSOCIATED WITH THE MAXIMUM LOG LIKELIHOOD
#fitting a linear model with a BoxCox transformation, on the data that includes replicates
library(MASS)
Male.Boxcox <- as.data.frame(with(Male.Data, {
    boxcox(IntakeAmt ~ AgeFactor + IntakeDay,
    lambda= seq(0.05, 1, 0.05), plotit=FALSE)
    }))   
#locate the row containing the largest log likelihood
which.max(Male.Boxcox$y)
Lambda.Value <- Male.Boxcox[Male.Boxcox$y == max(Male.Boxcox$y),1 ]
#row 818 add boxcox values to Male.Data
Male.Data$BoxCoxXY <- (Male.Data$IntakeAmt^Lambda.Value-1)/Lambda.Value
#clean up the R workspace by deleting data frames no longer required
objects()
rm(Imported.Data, Long.Data, Male.Boxcox, Male.Grpwt, Numsub)
#row 840 through nlmixed starting at row 996, do all the data preparation in one hit
#working through nlmixed lines now, prework on variable construction
#not sure why subject is fit as a nonlinear effect, fit using a linear method
#use a model with an intercept for fixed effects
library(lme4)
Male.lme1 <- lmer(BoxCoxXY ~ AgeFactor + IntakeDay + (1|RespondentID),
            data = Male.Data,
            weights = SampleWeight)
summary(Male.lme1)
plot(fitted(Male.lme1), residuals(Male.lme1))
Male.Data$lmefits <- fitted(Male.lme1)
#Compare with using age as a continuous variable
Male.Boxcox.Age <- as.data.frame(with(Male.Data, {
    boxcox(IntakeAmt ~ Age + IntakeDay,
    lambda= seq(0.05, 1, 0.05), plotit=FALSE)
    }))   
which.max(Male.Boxcox.Age$y)
Lambda.Value.Age <- Male.Boxcox.Age[Male.Boxcox.Age$y == max(Male.Boxcox.Age$y),1 ]
Male.Data.Age$BoxCoxXY <- (Male.Data.Age$IntakeAmt^Lambda.Value.Age-1)/Lambda.Value.Age
#Age as continuous fixed effect
Male.lme2 <- lmer(BoxCoxXY ~ Age + IntakeDay + (1|RespondentID),
            data = Male.Data,
            weights = SampleWeight)
summary(Male.lme2)
plot(fitted(Male.lme2), residuals(Male.lme2))
Male.Data$Agefits <- fitted(Male.lme2)


Some highlights for me:
  1. The same transformation was identified as the best for both analyses.
  2. An ANOVA comparison of the two models showed no significant improvement from using age as a continuous variable, so the age band method is retained for subsequent analyses:
anova(Male.lme1,Male.lme2)
Data: Male.Data
Models:
Male.lme2: BoxCoxXY ~ Age + IntakeDay + (1 | RespondentID)
Male.lme1: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)
          Df    AIC     BIC  logLik Chisq Chi Df Pr(>Chisq)
Male.lme2  5 9894.4  9926.4 -4942.2                       
Male.lme1  7 9966.2 10011.1 -4976.1     0      2          1

 The residuals plot for the age band analysis is reasonably nicely behaved given there are >4000 data points, although there could be a slight upwards trend:
The output from the age band lmer analysis is:


Linear mixed model fit by REML
Formula: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)
   Data: Male.Data
  AIC   BIC logLik deviance REMLdev
 9994 10039  -4990     9952    9980
Random effects:
 Groups       Name        Variance Std.Dev.
 RespondentID (Intercept) 0.19408  0.44055
 Residual                 0.37491  0.61230
Number of obs: 4498, groups: RespondentID, 2249

Fixed effects:
                    Estimate Std. Error t value
(Intercept)         13.98016    0.03405   410.6
AgeFactor4to8        0.50572    0.04084    12.4
AgeFactor9to13       0.94329    0.04159    22.7
AgeFactor14to18      1.30654    0.04312    30.3
IntakeDayDay2Intake -0.13871    0.01809    -7.7

Correlation of Fixed Effects:
            (Intr) AgFc48 AgF913 AF1418
AgeFactr4t8 -0.775                    
AgeFctr9t13 -0.761  0.634             
AgFctr14t18 -0.734  0.612  0.601      
IntkDyDy2In -0.266  0.000  0.000  0.000
To leave a comment for the author, please follow the link and comment on their blog: R in the Antipodes.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.