Do the same thing to a bunch of variables with lapply()
[This article was first published on W. Andrew Barr's Paleoecology Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It is extremely common to have a dataframe containing a bunch of variables, and to do the exact same thing to all of these variables.
For instance, lets say we have a dataframe that has a bunch of limb bone measurements of different animals, and we want to see if they are related to a categorical predictor variable after controlling for the body mass of the animal.
set.seed(500)
categories <- factor(rep(c("A","B","C"),33))
BM <- rnorm(99,mean=100,sd=15)
myData<-data.frame(categories = categories,
BM = BM,
var1 = 0.05 * BM + as.numeric(categories) + rnorm(99,sd=1),
var2 = 0.1 * BM + as.numeric(categories) + rnorm(99,sd=1),
var3 = 0.2 * BM + as.numeric(categories) + rnorm(99,sd=1),
var4 = 0.4 * BM + as.numeric(categories) + rnorm(99,sd=1),
var5 = 0.9 * BM + as.numeric(categories) + rnorm(99,sd=1),
var6 = 0.99 * BM + as.numeric(categories) + rnorm(99,sd=1)
)
rm(categories)
rm(BM)
head(myData)
## categories BM var1 var2 var3 var4 var5 var6 ##1 A 114.53 4.769 12.23 23.87 46.29 104.91 114.57 ##2 B 129.48 9.345 14.21 25.78 55.14 117.26 129.48 ##3 C 113.29 9.604 13.26 25.09 48.38 105.23 116.08 ##4 A 100.46 6.600 12.77 22.17 41.95 93.13 100.06 ##5 B 114.24 7.857 13.86 23.52 46.84 104.18 114.92 ##6 C 91.35 7.289 11.64 22.45 41.04 83.08 92.77
Plotting all our variables against body mass – the long way
We have created 6 variables that are all correlated with our 3-level categorical variable. They also have an increasing correlation with body mass, which we can see in a plot. Your first inclination might be to set up a plotting space with room for 6 plots, and then type out each plot command, like so.
par(mfrow=c(2,3))
with(myData,plot(var1~BM,main=”var1″,pch=16,xlab=”BodyMass”,ylab=”var1″))
with(myData,plot(var2~BM,main=”var2″,pch=16,xlab=”BodyMass”,ylab=”var2″))
with(myData,plot(var3~BM,main=”var3″,pch=16,xlab=”BodyMass”,ylab=”var3″))
with(myData,plot(var4~BM,main=”var4″,pch=16,xlab=”BodyMass”,ylab=”var4″))
with(myData,plot(var5~BM,main=”var5″,pch=16,xlab=”BodyMass”,ylab=”var5″))
with(myData,plot(var6~BM,main=”var6″,pch=16,xlab=”BodyMass”,ylab=”var6″))
That’s not too bad with just 6 variables, but would be annoying with 30 variables. And what if we want to change something about the way we are doing the plot? We will have to change each one of the plotting commands….which I am way too lazy to do. Only the variable name changes each time, everything else is exactly the same. We have a clear case here for replacing our 6 plot commands with a single use of `lapply()`. Note: there are reasons (many of them stylistic) to avoiding explicit `for()` loops in R. Here here is a good introduction to using the apply family of R functions.
click to enlarge |
Using lapply()
lapply() goes through an object and applies a function to each piece, and then returns a list of the same length as the original object. In short, there is an implicit for loop that gets written for you. You can use lapply() to iterate over anything: a list, a dataframe (which is just a special type of list) a vector of numbers, a vector of characters…..whatever. In our case, the variables of interest are stored in columns 3 through 8 of our data frame. So we can use lapply() to go through the numbers 3 through 8 and do the same thing each time. The hardest part of using lapply() is writing the function that is to be applied to each piece. We need to write our own function for lapply() to use. In this case, we’ll call it myPlot(). This function takes an index number (corresponding to the number of the column in the dataframe), plots the corresponding column in myData against the body mass column in myData, and then applies the appropriate labels to the axes. Once we have written the function, we can apply it to all of our columns with the single lapply() line of code.
par(mfrow=c(2,3))
myPlot<-function(index) {plot(myData[,index] ~ myData$BM, main=names(myData[index],pch=16,xlab="BodyMass",ylab=names(myData)[index])}
lapply(3:8,FUN=myPlot)
click to enlarge |
## [[1]] ## NULL ## ## [[2]] ## NULL ## ## [[3]] ## NULL ## ## [[4]] ## NULL ## ## [[5]] ## NULL ## ## [[6]] ## NULL
The graphs are identical, and we did it with much less code!! Notice that, besides the plot the output is a bunch of NULL values. This isn’t a mistake….our function myPlot() only plots to the screen, it doesn’t return any values. I this case all we wanted was the side effect of making the plots, but other times we want to return values.
getting something useful from the return value of lapply()
What if we weren’t interested in the plots, but we wanted to do an ANCOVA on each variable, and summarize the results in a readable format? Well, we can do that with lapply() as well. First, we need to create the formulae that describe the ANCOVA model for each variable, then we will use lapply() to loop over each one. And wouldn’t you know…when we look at the effect of body mass in this ANCOVA, it increases in just the way we modeled it! Note that using this method, the names of the variable don’t get preserved, but they are in the order in which they were called, so we could save the results of the lapply() instead of printing them to the screen, and then give then assign them names using the names() function.
formulaeAsText <- paste("var",1:6, " ~ categories + BM", sep="")
formulae <- lapply(formulaeAsText,FUN=formula)
doANCOVAandSummarize <- function(eachFormula) {summary(lm(eachFormula,data=myData))}
lapply(formulae,doANCOVAandSummarize)
[[1]] Call: lm(formula = eachFormula, data = myData) Residuals: Min 1Q Median 3Q Max -2.6854 -0.7945 0.0454 0.7649 2.5183 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.01863 0.69970 1.46 0.15 categoriesB 1.26060 0.26736 4.71 8.3e-06 *** categoriesC 2.18138 0.26615 8.20 1.2e-12 *** BM 0.04727 0.00696 6.79 9.7e-10 *** — Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1 Residual standard error: 1.08 on 95 degrees of freedom Multiple R-squared: 0.566, Adjusted R-squared: 0.552 F-statistic: 41.3 on 3 and 95 DF, p-value: <2e-16 [[2]] Call: lm(formula = eachFormula, data = myData) Residuals: Min 1Q Median 3Q Max -1.9021 -0.6785 0.0142 0.6257 1.8932 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.35198 0.59638 -0.59 0.56 categoriesB 0.94811 0.22788 4.16 7e-05 *** categoriesC 2.31289 0.22685 10.20 <2e-16 *** BM 0.11177 0.00594 18.83 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.919 on 95 degrees of freedom Multiple R-squared: 0.837, Adjusted R-squared: 0.832 F-statistic: 163 on 3 and 95 DF, p-value: <2e-16 [[3]] Call: lm(formula = eachFormula, data = myData) Residuals: Min 1Q Median 3Q Max -3.0073 -0.6261 0.0758 0.6384 2.2393 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.17146 0.61921 1.89 0.0616 . categoriesB 0.63364 0.23661 2.68 0.0087 ** categoriesC 1.55619 0.23553 6.61 2.3e-09 *** BM 0.20117 0.00616 32.64 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.954 on 95 degrees of freedom Multiple R-squared: 0.924, Adjusted R-squared: 0.921 F-statistic: 384 on 3 and 95 DF, p-value: <2e-16 [[4]] Call: lm(formula = eachFormula, data = myData) Residuals: Min 1Q Median 3Q Max -2.6168 -0.6454 -0.0807 0.8429 2.7323 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.64403 0.70493 0.91 0.36323 categoriesB 1.06035 0.26936 3.94 0.00016 *** categoriesC 2.05024 0.26814 7.65 1.7e-11 *** BM 0.40318 0.00702 57.47 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1.09 on 95 degrees of freedom Multiple R-squared: 0.973, Adjusted R-squared: 0.973 F-statistic: 1.16e+03 on 3 and 95 DF, p-value: <2e-16 [[5]] Call: lm(formula = eachFormula, data = myData) Residuals: Min 1Q Median 3Q Max -2.1853 -0.6516 0.0161 0.6685 2.0376 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.49750 0.61923 0.80 0.42374 categoriesB 0.89794 0.23661 3.79 0.00026 *** categoriesC 2.08076 0.23554 8.83 5.1e-14 *** BM 0.90522 0.00616 146.88 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.954 on 95 degrees of freedom Multiple R-squared: 0.996, Adjusted R-squared: 0.996 F-statistic: 7.38e+03 on 3 and 95 DF, p-value: <2e-16 [[6]] Call: lm(formula = eachFormula, data = myData) Residuals: Min 1Q Median 3Q Max -2.3768 -0.6681 0.0591 0.5986 1.7515 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.85294 0.60387 1.41 0.16108 categoriesB 0.88162 0.23074 3.82 0.00024 *** categoriesC 2.42557 0.22970 10.56 < 2e-16 *** BM 0.99072 0.00601 164.85 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.931 on 95 degrees of freedom Multiple R-squared: 0.997, Adjusted R-squared: 0.996 F-statistic: 9.29e+03 on 3 and 95 DF, p-value: <2e-16
To leave a comment for the author, please follow the link and comment on their blog: W. Andrew Barr's Paleoecology Blog.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.