Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The best answer seems to be plyr. It automates the strategy described in the paper The split-apply-combine strategy for data analysis, which you'd otherwise have to carry out by hand. The ddply function splits a data frame into subset data frames, performs some function on each subset, and returns the results as a recombined data frame. Here are a few links: A Fast Intro to Plyr for R, Block-processing a data frame with plyr, and Split, apply, and combine in R using PLYR.
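As a rough illustration of what ddply does, here is a minimal sketch. It assumes the plyr package is installed and uses R's built-in iris data set, which is my choice for the example and not something from the links above. The column names mean_len and n are made up for illustration.

```r
library(plyr)  # assumption: plyr is installed, e.g. via install.packages("plyr")

# Split iris by Species, run a function on each sub-data-frame,
# and let ddply glue the per-group results back into one data frame.
means <- ddply(iris, .(Species), function(d) {
  data.frame(mean_len = mean(d$Sepal.Length),  # hypothetical column name
             n        = nrow(d))               # rows in this subset
})

print(means)  # one row per species
```

The function you hand to ddply receives one subset at a time as an ordinary data frame, so anything you can compute on a data frame can go there.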
The split-apply-combine paper is worth reading. It introduces the plyr package and also gives you a nice framework (split-apply-combine) for thinking about a whole class of data-munging problems. A coworker (thanks, Gustavo) pointed out that this is a lot like Google's MapReduce.
Some commands that get you part of the way there are split, by, tapply (nicely explained here), and aggregate. The R wiki has an entry on Performing calculations within sub-sets of a data-frame that uses the reshape package. You could always use sqldf or RSQLite. Several options are discussed here. You can cobble together a fully general process using split, some form of sapply, and unsplit. But that's what plyr does automatically.
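For comparison, here is a sketch of the base R routes mentioned above, again on the built-in iris data (my choice of example data). The variable names are just for illustration.

```r
# Group means three ways, base R only
pieces       <- split(iris$Sepal.Length, iris$Species)  # list of numeric vectors
by_split     <- sapply(pieces, mean)                    # named numeric vector
by_tapply    <- tapply(iris$Sepal.Length, iris$Species, mean)
by_aggregate <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# The cobbled-together general process: split, apply, unsplit.
# Here each measurement is centred on its own species mean.
groups  <- split(iris, iris$Species)
centred <- lapply(groups, function(d) {
  d$centred <- d$Sepal.Length - mean(d$Sepal.Length)
  d
})
result <- unsplit(centred, iris$Species)  # rows return to their original order
```

Note that unsplit needs the same grouping factor you passed to split, which is part of the bookkeeping plyr takes off your hands.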
Side notes: While fooling around with this, I noticed that, for some crazy reason, calling split.data.frame directly on a matrix splits it by row into nice sub-matrices, but plain split has the ugly side effect of reducing matrices to vectors. Also, Google has a style guide for R.
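A small sketch of that matrix quirk, using a made-up 4x3 matrix and a made-up grouping vector with one label per row:

```r
m <- matrix(1:12, nrow = 4, dimnames = list(NULL, c("a", "b", "c")))
g <- c("x", "x", "y", "y")  # hypothetical group label for each row

# Plain split dispatches to the default method, which treats the
# matrix as one long vector and recycles g along it.
as_vectors <- split(m, g)

# Calling the data frame method directly indexes by row,
# so each piece keeps its matrix shape.
as_matrices <- split.data.frame(m, g)

str(as_vectors)
str(as_matrices)
```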
Links
- Mapping R functions to SQL nicely explains the subset, transform and aggregate functions.