R: Nutrient intake data, mixed methods analysis for males
[This article was first published on R in the Antipodes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
To recap:Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
- we’ve cleaned the data (see earlier posts)
- we’re up to constructing the output data that will be used to estimate the population distribution values for the particular nutrient we are examining (energy)
A boxcox analysis is performed to identify the best transformation to normality prior to undertaking the mixed methods analysis. The data contains age as single year, so I am interested in the effect of analysing age as both
- age bands used for nutrient reference values – there are standard age bands used for the reference values, so it makes sense to create age factors using these bands and then analyse age as a factor based on these bands, and
- a continuous variable, although because the method below is linear it assumes a linear effect of age on nutrient consumption and this is probably an unrealistic assumption
#The data equivalency with the SAS macro data:
#SAS macro variable SAS R
#subject seqn RespondentID
#response add_sug IntakeAmt
#repeat drddaycd IntakeDay
#covars_amt {Age groups} AgeFactor Note: no dummies as AgeFactor is an ordered factor
#replicate_var rndw1 SampleWeight Note: used as a frequency variable (“replicate” in SAS), used as weight in R
#seq seq2 IntakeDay Note: no dummy as IntakeDay is an ordered factor
# indwt SampleWeight
#ANALYSIS OF MALE DATA – STEP 1 OF 2
#MIXTRAN
#Delete any unused RespondentID levels that are related to the female subjects
Male.Data$RespondentID <- Male.Data$RespondentID[,drop=TRUE]
#row 419 reduce the data to one record per person to calculate weights
Male.Subset <- subset(Male.Data,!duplicated(RespondentID), select=c(RespondentID, SampleWeight, IntakeAmt, AgeFactor))
#row 446 add the total weights to the data frame
Male.Subset$TotWts <- sum(Male.Subset$SampleWeight)
#row 455 create the subgroup weights, these are all by age only
Male.Grpwt <- aggregate(Male.Subset$SampleWeight,by= data.frame(Male.Subset$AgeFactor),sum)
names(Male.Grpwt)[names(Male.Grpwt)==”Male.Subset.AgeFactor”] <- "AgeFactor"
names(Male.Grpwt)[names(Male.Grpwt)==”x”] <- "GrpWts"
Male.Subset <- merge(Male.Subset,Male.Grpwt, by=c("AgeFactor"))
#row 472 get the number of persons in total and by subgroup if given
#the SAS macro produces a dataset of 1 row per agegroup, with agegrp, num subjects in age group, total num subjects output
count <- function(x) {
length(na.omit(x))
}
Numsub <- aggregate(Male.Subset$RespondentID,by=data.frame(Male.Subset$AgeFactor),count)
names(Numsub)[names(Numsub)==”Male.Subset.AgeFactor”] <- "AgeFactor"
names(Numsub)[names(Numsub)==”x”] <- "NumSubjects"
Numsub$TotSubjects <- sum(Numsub$NumSubjects)
#row 492 merge numsub to _persons
Male.Subset <- merge(Male.Subset,Numsub, by=c("AgeFactor"))
Male.Subset <- Male.Subset[order(Male.Subset$RespondentID),]
#END OF GENERAL SET UP
#convert proc transreg, from row 803
#NOTE: CODE ONLY USED TO IDENTIFY THE LAMBDA VALUE BETWEEN 0.05 AND 1 ASSOCIATED WITH THE MAXIMUM LOG LIKELIHOOD
#fitting a linear model with a BoxCox transformation, on the data that includes replicates
library(MASS)
Male.Boxcox <- as.data.frame(with(Male.Data, {
boxcox(IntakeAmt ~ AgeFactor + IntakeDay,
lambda= seq(0.05, 1, 0.05), plotit=FALSE)
}))
#locate the row containing the largest log likelihood
which.max(Male.Boxcox$y)
Lambda.Value <- Male.Boxcox[Male.Boxcox$y == max(Male.Boxcox$y),1 ]
#row 818 add boxcox values to Male.Data
Male.Data$BoxCoxXY <- (Male.Data$IntakeAmt^Lambda.Value-1)/Lambda.Value
#clean up the R workspace by deleting data frames no longer required
objects()
rm(Imported.Data, Long.Data, Male.Boxcox, Male.Grpwt, Numsub)
#row 840 through nlmixed starting at row 996, do all the data preparation in one hit
#working through nlmixed lines now, prework on variable construction
#not sure why subject is fit as a nonlinear effect, fit using a linear method
#use a model with an intercept for fixed effects
library(lme4)
Male.lme1 <- lmer(BoxCoxXY ~ AgeFactor + IntakeDay + (1|RespondentID),
data = Male.Data,
weights = SampleWeight)
summary(Male.lme1)
plot(fitted(Male.lme1), residuals(Male.lme1))
Male.Data$lmefits <- fitted(Male.lme1)
#Compare with using age as a continuous variable
Male.Boxcox.Age <- as.data.frame(with(Male.Data, {
boxcox(IntakeAmt ~ Age + IntakeDay,
lambda= seq(0.05, 1, 0.05), plotit=FALSE)
}))
which.max(Male.Boxcox.Age$y)
Lambda.Value.Age <- Male.Boxcox.Age[Male.Boxcox.Age$y == max(Male.Boxcox.Age$y),1 ]
Male.Data.Age$BoxCoxXY <- (Male.Data.Age$IntakeAmt^Lambda.Value.Age-1)/Lambda.Value.Age
#Age as continuous fixed effect
Male.lme2 <- lmer(BoxCoxXY ~ Age + IntakeDay + (1|RespondentID),
data = Male.Data,
weights = SampleWeight)
summary(Male.lme2)
plot(fitted(Male.lme2), residuals(Male.lme2))
Male.Data$Agefits <- fitted(Male.lme2)
Some highlights for me:
- The same transformation was identified as the best for both analyses.
- An ANOVA comparison of the two models showed no significant improvement from using age as a continuous variable, so the age band method is retained for subsequent analyses:
Data: Male.Data
Models:
Male.lme2: BoxCoxXY ~ Age + IntakeDay + (1 | RespondentID)
Male.lme1: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)
Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
Male.lme2 5 9894.4 9926.4 -4942.2
Male.lme1 7 9966.2 10011.1 -4976.1 0 2 1
The residuals plot for the age band analysis is reasonably nicely behaved given there are >4000 data points, although there could be a slight upwards trend:
The output from the age band lmer analysis is:
Linear mixed model fit by REML
Formula: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)
Data: Male.Data
AIC BIC logLik deviance REMLdev
9994 10039 -4990 9952 9980
Random effects:
Groups Name Variance Std.Dev.
RespondentID (Intercept) 0.19408 0.44055
Residual 0.37491 0.61230
Number of obs: 4498, groups: RespondentID, 2249
Fixed effects:
Estimate Std. Error t value
(Intercept) 13.98016 0.03405 410.6
AgeFactor4to8 0.50572 0.04084 12.4
AgeFactor9to13 0.94329 0.04159 22.7
AgeFactor14to18 1.30654 0.04312 30.3
IntakeDayDay2Intake -0.13871 0.01809 -7.7
Correlation of Fixed Effects:
(Intr) AgFc48 AgF913 AF1418
AgeFactr4t8 -0.775
AgeFctr9t13 -0.761 0.634
AgFctr14t18 -0.734 0.612 0.601
IntkDyDy2In -0.266 0.000 0.000 0.000
To leave a comment for the author, please follow the link and comment on their blog: R in the Antipodes.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.