A quick primer on split-apply-combine problems
I’ve just answered my hundred billionth question on Stack Overflow that goes something like:
I want to calculate some statistic for lots of different groups.
Although these questions provide a steady stream of easy points, it’s such a common and basic data analysis concept that I thought it would be useful to have a document to refer people to.
First off, you need to get your data in the right format. The canonical form in R is a data frame with one column containing the values to calculate a statistic for and another column containing the group to which each value belongs. A good example is the InsectSprays dataset, built into R.
head(InsectSprays)
  count spray
1    10     A
2     7     A
3    20     A
4    14     A
5    14     A
6    12     A
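If your data starts out with one column per group instead, it needs reshaping into this long form first. Here is a minimal sketch using base R's stack(); the data frame `wide` and its values are toy data made up for illustration, not part of the original example.

```r
# Hypothetical wide data: one column of counts per spray (toy values).
wide <- data.frame(
  A = c(10, 7, 20),
  B = c(11, 17, 21),
  C = c(0, 1, 7)
)

# stack() returns a data frame with columns "values" and "ind",
# where "ind" is a factor naming the column each value came from.
long <- stack(wide)

# Rename to match the layout used by InsectSprays.
names(long) <- c("count", "spray")
head(long)
```

After this step, `long` has one row per observation, which is the shape every function in the rest of this post expects.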
These problems are widely known as split-apply-combine problems after the three steps involved in their solution. Let’s go through it step by step.
First, we split the count column by the spray column.
(count_by_spray <- with(InsectSprays, split(count, spray)))
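To see what the split step actually produces, it can help to inspect the result before applying anything to it. This is just a quick sketch using base R inspection functions on the same object.

```r
count_by_spray <- with(InsectSprays, split(count, spray))

# split() returns a plain list with one element per level of the grouping factor.
is.list(count_by_spray)

# The element names are the spray levels.
names(count_by_spray)

# lengths() gives the number of observations that landed in each group.
lengths(count_by_spray)
```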
Secondly, we apply the statistic to each element of the list. Let’s use the mean here.
(mean_by_spray <- lapply(count_by_spray, mean))
Finally, (if possible) we recombine the list as a vector.
unlist(mean_by_spray)
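The "if possible" caveat matters when the statistic returns more than one value per group, because unlist would then flatten everything into one long vector. One option, sketched here with range as an example statistic of my own choosing, is to bind the pieces into a matrix instead.

```r
count_by_spray <- with(InsectSprays, split(count, spray))

# range() returns two values (minimum and maximum) for each group.
range_by_spray <- lapply(count_by_spray, range)

# Combine the list row-wise: one row per spray, one column per value.
range_matrix <- do.call(rbind, range_by_spray)
colnames(range_matrix) <- c("min", "max")
range_matrix
```

The row names of the result come from the names of the list, so each row stays labelled with its spray.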
This procedure is so common that there are many functions to speed up the process. sapply and vapply do the last two steps together.
sapply(count_by_spray, mean)
vapply(count_by_spray, mean, numeric(1))
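The two functions differ in how strict they are about the shape of the result. The sketch below, again using range as an illustrative statistic that returns two values, shows sapply quietly switching to a matrix while vapply stops with an error when the result does not match the declared template.

```r
count_by_spray <- with(InsectSprays, split(count, spray))

# sapply() guesses the output shape: a named vector for mean...
means <- sapply(count_by_spray, mean)
is.matrix(means)

# ...but a matrix for range, since each group yields two values.
ranges <- sapply(count_by_spray, range)
is.matrix(ranges)

# vapply() checks every result against the template numeric(1),
# so a two-value result is reported as an error rather than reshaped.
res <- tryCatch(
  vapply(count_by_spray, range, numeric(1)),
  error = function(e) "vapply raised an error"
)
res
```

That strictness is why vapply tends to be the safer choice inside functions, where a silent change of shape is harder to spot.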
We can do even better than that, however. tapply, aggregate and by all provide a one-function solution to these S-A-C problems.
with(InsectSprays, tapply(count, spray, mean))
with(InsectSprays, by(count, spray, mean))
aggregate(count ~ spray, InsectSprays, mean)
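The three functions compute the same numbers but hand them back in different containers, which is usually what decides between them. A small sketch comparing the return types (the variable names are mine):

```r
t_res <- with(InsectSprays, tapply(count, spray, mean))
b_res <- with(InsectSprays, by(count, spray, mean))
a_res <- aggregate(count ~ spray, InsectSprays, mean)

# tapply() gives a named one-dimensional array.
is.array(t_res)

# by() gives an object of class "by", which mainly changes how it prints.
inherits(b_res, "by")

# aggregate() gives a data frame with one row per group.
is.data.frame(a_res)

# The group means themselves agree.
all.equal(as.vector(t_res), a_res$count)
```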
The plyr package also provides several solutions, with a choice of output format. ddply takes a data frame and returns another data frame, which is what you’ll want most of the time. dlply takes a data frame and returns the uncombined list, which is useful if you want to do another processing step before combining.
library(plyr)
ddply(InsectSprays, .(spray), summarise, mean.count = mean(count))
dlply(InsectSprays, .(spray), summarise, mean.count = mean(count))
You can read much more on this type of problem and the plyr solution in The Split-Apply-Combine Strategy for Data Analysis, in the Journal of Statistical Software, by the ubiquitous Hadley Wickham.
One tiny variation on the problem is when you want the output statistic vector to have the same length as the original input vectors. For this, there is the ave function (which provides mean as the default function).
with(InsectSprays, ave(count, spray))
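A typical use for the same-length result is comparing each observation with its own group's statistic. The sketch below centres each count on its spray's mean and also swaps in a different statistic; note that FUN has to be passed by name, because ave treats unnamed arguments after the first as grouping variables.

```r
# One group mean per row, aligned with the original data.
group_mean <- with(InsectSprays, ave(count, spray))

# Deviation of each observation from the mean of its own spray.
centred <- InsectSprays$count - group_mean

# A different statistic: FUN must be named.
group_median <- with(InsectSprays, ave(count, spray, FUN = median))

head(data.frame(InsectSprays, group_mean, centred, group_median))
```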
Tagged: apply, combine, plyr, r, split, statistics