Hello, and welcome to my debut post! Check the About link to see what this blog intends to accomplish. In this article I discuss a general approach to a common problem: splitting a data frame on a grouping variable and then doing further operations on each group. A secondary goal is to provide some exposure to the “apply” family of functions and show how they can help. As an example, R has a built-in data frame called “ChickWeight” that contains 578 rows and 4 columns from an experiment on the effect of diet on the early growth of chicks. To get more details simply type:
    data(ChickWeight)
    ?ChickWeight
In general, when looking at data frames we should be looking for the continuous variables that we want to summarize in terms of some factor or grouping variable. Grouping variables can usually be identified easily because they typically take on a small, fixed number of distinct values. (Note that R has commands that let us “cut” continuous variables into discrete intervals, but that is for another blog post.) So let’s use the sapply function to determine what the grouping variables might be:
    sapply(ChickWeight, function(x) length(unique(x)))
    weight   Time  Chick   Diet
       212     12     50      4
If you don’t know about sapply then don’t freak out. It takes the function we define in the second argument and “applies” it to each column of the data frame (ChickWeight) whose name we provided as the first argument. The “x” in the function is a placeholder for each column of the data frame in turn. The unique function returns only the distinct values that the column assumes. The length function then “counts” how many distinct values there are.
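If it helps, here is a rough sketch of what that sapply call is doing, written out as an explicit loop over the column names. The variable names n.unique and by.sapply are mine, chosen just for this illustration.

```r
# Assumes the built-in ChickWeight data frame from the datasets package
data(ChickWeight)

# Explicit loop: visit each column and count its distinct values
n.unique <- integer(0)
for (col in names(ChickWeight)) {
  n.unique[col] <- length(unique(ChickWeight[[col]]))
}
n.unique

# The sapply version builds the same named vector in one line
by.sapply <- sapply(ChickWeight, function(x) length(unique(x)))
identical(n.unique, by.sapply)
```

The loop and the sapply call visit the columns in the same order, so the only real difference is how much bookkeeping you have to type.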
This can be confusing, I know. However, the important thing is that we can easily observe that Diet and Time look like potential grouping variables since they take on, respectively, 4 and 12 unique values. If we just wanted to know the mean weight of the chickens for each Diet type then we could use the tapply command to get that answer.
    tapply(ChickWeight$weight,ChickWeight$Diet,mean)
      1   2   3   4
    103 123 143 135
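The same call shape works with other summaries too. As a minimal sketch, here is a small helper of my own (weight.range is a made-up name, not something built into R) plus an anonymous function that simply counts the rows in each Diet group.

```r
# Assumes the built-in ChickWeight data frame from the datasets package
data(ChickWeight)

# Hypothetical helper: spread between the heaviest and lightest weight in a group
weight.range <- function(w) max(w) - min(w)

range.by.diet <- tapply(ChickWeight$weight, ChickWeight$Diet, weight.range)
range.by.diet

# An anonymous function works the same way; this one counts rows per Diet
n.by.diet <- tapply(ChickWeight$weight, ChickWeight$Diet, function(w) length(w))
n.by.diet
```

tapply hands each group's weights to the function as a plain vector, so anything that accepts a numeric vector and returns a single value will slot in.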
Note that we could also write our own function in place of “mean” in case we wanted to do something more in depth in terms of summary. But maybe we don’t yet know what it is we want to do with the data, so having it in a grouped format might help us better understand it. Let’s use the split command, which can give us access to the individual groups.
    my.splits = split(ChickWeight, ChickWeight$Diet)
    length(my.splits)
    [1] 4
    names(my.splits)
    [1] "1" "2" "3" "4"
This operation creates a list where each element of my.splits corresponds to one of the individual Diet values (1, 2, 3, or 4). We can now start thinking about how to further investigate each Diet type. To convince yourself that the split command actually worked, let’s take a peek at the first element. All records in the first element relate to Diet #1.
    head(my.splits[[1]])
      weight Time Chick Diet
    1     42    0     1    1
    2     51    2     1    1
    3     59    4     1    1
    4     64    6     1    1
    5     76    8     1    1
    6     93   10     1    1
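Once the data is in a list like this, a quick way to start investigating each Diet is to run a small function over every element. The sketch below is just one way to do that; rows.per.diet and mean.per.diet are names I picked for the example.

```r
# Assumes the built-in ChickWeight data frame from the datasets package
data(ChickWeight)
my.splits <- split(ChickWeight, ChickWeight$Diet)

# How many records landed in each Diet group?
rows.per.diet <- sapply(my.splits, nrow)
rows.per.diet

# Mean weight per group, computed from the split pieces rather than with tapply
mean.per.diet <- sapply(my.splits, function(d) mean(d$weight))
mean.per.diet
```

The second result should match the group means that tapply reports, which is a handy sanity check that the split did what we expected.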
Just for fun let’s use the lapply command to subset each element of my.splits and obtain all the chicks from each group that weigh 40 grams or less. This approach doesn’t take much typing and will conveniently return a list, which we store in a variable called my.results.lapply.
my.results.lapply = lapply(my.splits, subset, weight <= 40)
In the example above we pass additional arguments to the subset command by adding them after the function name in the lapply call. This is the most “R like” way of doing things. Alternatively, we could have defined an anonymous function to do this.
my.results.lapply = lapply(my.splits, function(x) subset(x, weight <= 40) )
Note that we can also define our function in advance. This isn’t substantially different from the example above but it might improve the readability of the code. This is a personal choice and either approach will yield the same result.
    my.func <- function(x) {
        subset(x, weight <= 40)
    }
    my.results.lapply = lapply(my.splits, my.func)
In any case, check out what is happening here. lapply walks over my.splits with a “hidden” subscript and passes each element in turn to the function on the right. This can be confusing to newcomers but is actually a convenience since it accomplishes a for-loop structure “under the hood”. To make it more obvious we could have also done the following, which is kind of like a for-loop approach where you work with the length of the my.splits list using subscripts.
my.results.lapply = lapply(1:length(my.splits), function(x) subset(my.splits[[x]], weight <= 40))
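One thing to keep in mind with the subscript style: lapply takes its result names from whatever it iterates over, so looping over integer positions gives back an unnamed list. Here is a small sketch of the difference (by.element and by.index are illustrative names, and I use seq_along as a common alternative to 1:length).

```r
# Assumes the built-in ChickWeight data frame from the datasets package
data(ChickWeight)
my.splits <- split(ChickWeight, ChickWeight$Diet)

# Iterating over the list itself keeps the Diet names
by.element <- lapply(my.splits, subset, weight <= 40)
names(by.element)

# Iterating over positions drops them
by.index <- lapply(seq_along(my.splits), function(i) subset(my.splits[[i]], weight <= 40))
names(by.index)   # NULL

# Putting the names back makes the two results comparable
names(by.index) <- names(my.splits)
all.equal(by.element, by.index)
```

Whether the list is named matters later on, because rbind uses those names when it builds row names for a combined data frame.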
If we are done with our sub-setting and interrogation then we can repackage the results list back into a data frame by using this construct:
    my.df = do.call(rbind, my.results.lapply)
    my.df
        weight Time Chick Diet
    13      40    0     2    1
    26      39    2     3    1
    195     39    0    18    1
    196     35    2    18    1
    221     40    0    21    2
    269     40    0    25    2
    293     39    0    27    2
    305     39    0    28    2
    317     39    0    29    2
    365     39    0    33    3
    401     39    0    36    3
    519     40    0    46    4
    543     39    0    48    4
    555     40    0    49    4
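If do.call looks mysterious, it simply calls the named function with the list elements as its arguments. The toy example below uses two made-up data frames (a and b are not part of the ChickWeight data) to show that, and also how a named list changes the row names that rbind produces.

```r
# Two small, made-up data frames for illustration only
a <- data.frame(id = 1:2, value = c(10, 20))
b <- data.frame(id = 3:4, value = c(30, 40))
pieces <- list(a, b)

# do.call(rbind, pieces) is the same as typing rbind(a, b) by hand
stacked <- do.call(rbind, pieces)
identical(stacked, rbind(a, b))

# With a named list, rbind prefixes each row name with the element name
named <- do.call(rbind, list(first = a, second = b))
rownames(named)
```

So when the list passed to do.call carries names, expect row names with a prefix; when it does not, the original row names come through untouched.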
So now we have a single data frame with only the data that we want. If you are coming from another programming language then you might be tempted to write a for-loop to do this. You could, though you have to do a little more work to keep track of the results. You have to create your own blank list to stash them in:
    my.results.for = list()
    for (ii in seq_along(my.splits)) {
        my.results.for[[ii]] = subset(my.splits[[ii]], weight <= 40)
    }
    names(my.results.for) = names(my.splits)
    # Compare with a named lapply() result; this should return TRUE
    all.equal(lapply(my.splits, subset, weight <= 40), my.results.for)
So which approach do you use? It depends. The for-loop approach is a more traditional angle. However, lapply takes much less typing and can sometimes perform better, but you have to define your function to work with it and understand that lapply is “silently” passing each element to your function. Once people become accustomed to using lapply they usually stick with it.
Now, not to blow your mind, but we could knock out the problem in one go:
    lapply(split(ChickWeight,ChickWeight$Diet), subset, weight <= 40)
    # OR
    (do.call(rbind, lapply(split(ChickWeight,ChickWeight$Diet), subset, weight <= 40)))
          weight Time Chick Diet
    1.13      40    0     2    1
    1.26      39    2     3    1
    1.195     39    0    18    1
    1.196     35    2    18    1
    2.221     40    0    21    2
    2.269     40    0    25    2
    2.293     39    0    27    2
    2.305     39    0    28    2
    2.317     39    0    29    2
    3.365     39    0    33    3
    3.401     39    0    36    3
    4.519     40    0    46    4
    4.543     39    0    48    4
    4.555     40    0    49    4
Lastly, it occurred to me after writing much of this post that, in this example, which is admittedly somewhat contrived, there is another way to do this. This realization is fairly common in R, especially when you view code written by others. This is the blessing (and curse) of R: there are many ways to do the same thing. In fact there are frequently several functions that very much appear to do the same thing (or could be made to). Anyway, relative to our problem we could first sort the ChickWeight data frame by Diet and then do a subset. The sort will arrange the data by group, and the subsequent subset operation will preserve that order.
    sorted.chickens = ChickWeight[order(ChickWeight$Diet),]
    (sorted.chickens = subset(sorted.chickens, weight <= 40))
        weight Time Chick Diet
    13      40    0     2    1
    26      39    2     3    1
    195     39    0    18    1
    196     35    2    18    1
    221     40    0    21    2
    269     40    0    25    2
    293     39    0    27    2
    305     39    0    28    2
    317     39    0    29    2
    365     39    0    33    3
    401     39    0    36    3
    519     40    0    46    4
    543     39    0    48    4
    555     40    0    49    4
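If you want to convince yourself that the sort-then-subset route and the split route really pick out the same records, a quick check along these lines should do it. The names via.split and via.sort are mine, and I drop the list names with unname so that rbind leaves the row names alone.

```r
# Assumes the built-in ChickWeight data frame from the datasets package
data(ChickWeight)

# Route 1: split by Diet, subset each piece, then stack the pieces
pieces <- lapply(unname(split(ChickWeight, ChickWeight$Diet)), subset, weight <= 40)
via.split <- do.call(rbind, pieces)

# Route 2: sort by Diet, then subset once
via.sort <- subset(ChickWeight[order(ChickWeight$Diet), ], weight <= 40)

# Same records, in the same order
identical(rownames(via.split), rownames(via.sort))
identical(via.split$weight, via.sort$weight)
```

This only compares row names and weights, which is enough to show that both routes select the same rows without getting into differences in other attributes.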
Filed under: R programming apply lapply tapply