Nutrient intake data, finalising the data in R
[This article was first published on R in the Antipodes, and kindly contributed to R-bloggers].
I run plain R in the normal GUI under Windows 7, which means no bells and whistles, and I find the R GUI somewhat awkward to program in. Thanks to advice I received a number of years ago, I use Notepad++ as my programming environment. It has line numbering, and when you set the programming language through the menu (Language > R), you get colour-coded syntax. It also has the nice feature of emphasizing the current bracket set that you are using, which makes it very easy to see whether you have remembered to close all your brackets: it counts backward from the last open bracket.
We’re finally in R. 🙂 The code below sets up the data sets for nutrient intake analysis, which will be the subject of my next posts. If you’re following along in the SAS macro, the code is the R substitute for the data preparation in the starting macro called “example1_amount_mixtran_distrib.sas” from the Example 1 zip file which is downloadable from this webpage if you don’t want to download the zip immediately.
SAS syntax files, which are identifiable by the .sas as a file extension, can be viewed with any text reader, and I use Notepad++ for that as well. If you open that SAS syntax file, the code below prepares the data for analysis in the mixtran macro, basically down to line 114.
You’ll notice that I comment my code a lot, probably more than most. That is because I have had numerous experiences of coming back to code I wrote 6 months, or a couple of years earlier, and needing to revise it. I have found that what is obvious at the time of programming may not be so obvious as time passes and other programming projects have been completed.
You’ll see the use of the reshape2 package. The data is basically a repeated measures design, as there are two 24-hour recall periods for nutrient intake per person. The data coming in from the .csv file constructed earlier has one row per person, with the nutrient intakes as two variables. For repeated measures, the data analysis later requires one row per intake (i.e. two rows per person). As this is the main data preparation stage, it makes sense to reform the data frame now.
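To make the reshape concrete, here is a minimal sketch on made-up data. The data frame Wide.Toy and its values are hypothetical and only mimic the structure described above (one row per person, two intake columns); the column names are borrowed from the real data set.

```r
library(reshape2)

# Hypothetical toy data: three respondents, one row each (wide format)
Wide.Toy <- data.frame(
  RespondentID = c(101, 102, 103),
  Gender       = c(1, 2, 2),
  Age          = c(12, 15, 4),
  Day1Intake   = c(8500, 9100, 6200),
  Day2Intake   = c(8700, 8800, 6400)
)

# Reshape to long format: one row per respondent per recall day
Long.Toy <- melt(Wide.Toy,
                 id.vars       = c("RespondentID", "Gender", "Age"),
                 measure.vars  = c("Day1Intake", "Day2Intake"),
                 variable.name = "IntakeDay",
                 value.name    = "IntakeAmt")

# Sort so each respondent's two intakes sit together
Long.Toy <- Long.Toy[order(Long.Toy$RespondentID, Long.Toy$IntakeDay), ]

nrow(Long.Toy)                 # twice the number of rows in Wide.Toy
table(Long.Toy$RespondentID)   # every respondent should appear exactly twice
```

The table() call at the end is a cheap check that nobody lost or gained an intake record during the reshape.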
While I cannot supply the data at this point, I will post the head() result from before the melt so you can see the type of data in the data frame.
#This section of code duplicates the SAS code from example1_amount_mixtran_distrib.sas from line 1 through line 114
#Read in the Australian energy data
Imported.Data <- read.csv("foo.csv", header=TRUE)
#check that headers have imported fine
names(Imported.Data)
length(Imported.Data)
nrow(Imported.Data)
#sort the data frame by subject
Imported.Data <- Imported.Data[order(Imported.Data$RespondentID),]
#check sort worked, look at first few observations
head(Imported.Data)
#melt data frame so that each repeated measure (intake) is one row, and
#create factor to indicate whether it’s a day1 or day2 intake.
#remember that reshape2 package must be installed at this point
library(reshape2)
Long.Data <- melt(Imported.Data, id.vars=1:6, measure.vars=c("Day1Intake", "Day2Intake"),
   variable.name="IntakeDay")
names(Long.Data)[names(Long.Data)=="value"] <- "IntakeAmt"
#construct age group factors, lowest age group number = youngest age group
#age groups for analysis are set here (latest edition): http://www.nhmrc.gov.au/guidelines/publications/n35-n36-n37
#ASSUMPTION: no children <1 year old in data
#construct one variable that contains all the age factors
#evaluate from lowest to highest age, evaluation stops when condition is first met
Long.Data$AgeF <- ifelse(Long.Data$Age<=3,1, ifelse(Long.Data$Age<=8,2, ifelse(Long.Data$Age<=13,3,
   ifelse(Long.Data$Age<=18,4, ifelse(Long.Data$Age<=30,5, ifelse(Long.Data$Age<=50,6,
   ifelse(Long.Data$Age<=70,7, ifelse(Long.Data$Age>70,8,NA))))))))
#label the age groups; explicit levels keep the labels aligned even if an age group is absent from the data
Long.Data$AgeFactor <- factor(Long.Data$AgeF, levels=1:8,
   labels=c("1to3","4to8","9to13","14to18","19to30","31to50","51to70","71Plus"))
table(Long.Data$AgeF, Long.Data$AgeFactor)
#Delete AgeF and any unused AgeFactor levels
Long.Data$AgeF <- NULL
Long.Data$AgeFactor <- Long.Data$AgeFactor[,drop=TRUE]
#Make RespondentID into a factor, it should not be treated as numeric
Long.Data$RespondentID <- as.factor(Long.Data$RespondentID)
#males and females are analysed separately, do not need to be specified as factors,
#construct different data frames for each – the code will duplicate the analysis for the second gender
#ASSUMPTION: males = 1 and females = 2
Male.Data <- subset(Long.Data, Gender==1)
Female.Data <- subset(Long.Data, Gender==2)
The result from head(Imported.Data), before the melt, is:
  NutrientID RespondentID Gender Age BodyWeight SampleWeight Day1Intake Day2Intake
1        267       100013      2  15       59.4    0.3335521   8591.535   8747.908
2        267       100020      1  12       51.6    0.4952835  12145.852  13495.798
3        267       100050      2  15       62.1    0.3335521  14202.496  13724.582
4        267       100100      2   4       18.5    0.3563699   8621.690   6218.391
5        267       100128      2   2       13.2    0.1666111   5140.690   6427.673
6        267       100370      2   7       24.9    0.3563699   7418.029  13620.542
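As an aside, the nested ifelse() calls used for the age groups can also be written with base R's cut(). This is only a sketch of an alternative, not the code used in this series, and the Age vector below is made up to hit each group boundary. It relies on the same assumption as the code above (no children under 1 year old); an age of 0 would become NA here rather than falling into the youngest group.

```r
# Hypothetical ages, chosen to hit the group boundaries
Age <- c(1, 3, 4, 8, 9, 13, 14, 18, 19, 30, 31, 50, 51, 70, 71, 85)

# right = TRUE gives intervals of the form (lower, upper],
# which matches the Age <= upper tests in the nested ifelse()
AgeFactor <- cut(Age,
                 breaks = c(0, 3, 8, 13, 18, 30, 50, 70, Inf),
                 labels = c("1to3", "4to8", "9to13", "14to18",
                            "19to30", "31to50", "51to70", "71Plus"),
                 right  = TRUE)

# Cross-check the grouping against the raw ages
table(AgeFactor)
```

cut() keeps all eight levels even when a group is empty, so droplevels() would still be needed to remove unused levels.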