Steve Culman on the `plyr` Package
[This article was first published on Noam Ross - R, and kindly contributed to R-bloggers].
At the Davis R Users’ Group yesterday, Steve Culman gave us an introduction to the plyr package and how to use it to manipulate data. Here’s his presentation, and the accompanying demonstration script:
## Some examples using the package plyr
library(plyr)
## Example dataset from ggplot2
library(ggplot2)
data(mpg)
str(mpg)
## Simplify the dataset: keep manufacturer, drv, cty and hwy
data <- mpg[, c(1, 7:9)]
str(data)
## Summarising/aggregating data
ddply(data, .(manufacturer), summarize, avgcty = mean(cty))
## You can apply multiple functions in a single call
ddply(data, .(manufacturer), summarize, avgcty = mean(cty), sdcty = sd(cty), maxhwy = max(hwy))
## You can summarize data by a combination of variables/factors
ddply(data, .(manufacturer, drv), summarize, avgcty = mean(cty), sdcty = sd(cty), maxhwy = max(hwy))
## Note: the package reshape/reshape2 is an elegant alternative for aggregating many variables at one time
## Note the difference between "summarize" and "transform": summarize returns one row per group, while transform keeps every original row and adds the new column
ddply(data, .(drv), summarize, avgcty = mean(cty))
ddply(data, .(drv), transform, avgcty = mean(cty))
## transform is very useful for standardizing/normalizing
ddply(data, .(drv), transform, delta = mean(cty) - cty)
## Now let's use plyr to run a simple loop
## We'll ask the question: Does city mpg differ between car manufacturers, for each class of drivetrain (four-wheel, front-wheel, or rear-wheel drive)? Let's try to automate these ANOVAs and extract the F-statistics and P-values from them.
## Step 1: Create a function to run the ANOVA
model <- function(data) { aov(cty ~ manufacturer, data = data) }
## Step 2: Use plyr to run the model for each drivetrain and create a list (called anova.output) to store the output. In dlply, d means the input is a data frame and l means the output is a list.
anova.output <- dlply(data, .(drv), model)
## Step 3: Create a function that tells R where to find the F-statistic and P-value in each element of the list. The output is somewhat hidden in this example. Don't worry about the messy indexing here; what's important is that it tells R where the F-stats and P-values are stored.
juicy <- function(x) { c(summary(x)[[1]][["F value"]][[1]],
                         summary(x)[[1]][["Pr(>F)"]][[1]]) }
## Step 4: Extract the components of the model output from the list created in the previous step. In ldply, the input is a list and the output is a data frame. Note that since the input is a list, we don't have to supply the 2nd parameter (which variable(s) to split by), as the default is to apply the function to every element of the list.
ldply(anova.output, juicy)
## The data frame shows F-statistics (V1) and P-values (V2) for the ANOVAs by drivetrain.
## We could always condense some of the above steps as well:
anova.output <- dlply(data, .(drv), function(data) aov(cty ~ manufacturer, data = data))
ldply(anova.output, function(x) { c(summary(x)[[1]][["F value"]][[1]], summary(x)[[1]][["Pr(>F)"]][[1]]) })
## Note that plyr provides many shortcuts, such as the functions colwise(), each() and splat(). You can always refer to the original article: http://www.jstatsoft.org/v40/i01/ for more on this.
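The shortcut functions mentioned at the end of the script are not demonstrated in it, so here is a minimal sketch of how they might be used on the same data. The helper names `mean_sd`, `by_drv`, `ratio` and `first_ratio` are made up for illustration and are not part of Steve's script.

```r
library(plyr)
library(ggplot2)  # only needed for the mpg example dataset

# Same four columns as in the script above, selected by name
data <- as.data.frame(mpg[, c("manufacturer", "drv", "cty", "hwy")])

# each() bundles several functions into one that returns a named vector
mean_sd <- each(mean, sd)
mean_sd(data$cty)

# colwise() converts a function into one that works on the columns of a
# data frame; here it is restricted to cty and hwy
by_drv <- ddply(data, .(drv), colwise(mean, .(cty, hwy)))
by_drv

# splat() lets a function take its arguments from a list or data frame row;
# the ... in ratio() absorbs the columns it does not use
ratio <- function(cty, hwy, ...) hwy / cty
first_ratio <- splat(ratio)(data[1, ])
first_ratio
```

Restricting `colwise()` to the numeric columns matters here, because taking the mean of the character column `drv` would only produce `NA` and a warning.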
Steve’s talk is based on Hadley Wickham’s paper, “The Split-Apply-Combine Strategy for Data Analysis,” in the Journal of Statistical Software. Many useful related resources are available on Hadley Wickham’s plyr website.
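The paper describes plyr in terms of a split-apply-combine strategy. As a rough sketch of what that means, the base R code below does by hand what a single `ddply()` call with `summarize` does. The names `pieces`, `summaries`, `by_hand` and `with_plyr` are illustrative only, and this is not how plyr is implemented internally.

```r
library(plyr)
library(ggplot2)  # only needed for the mpg example dataset

data <- as.data.frame(mpg[, c("manufacturer", "drv", "cty", "hwy")])

# Split: one data frame per drivetrain
pieces <- split(data, data$drv)

# Apply: compute a one-row summary for each piece
summaries <- lapply(pieces, function(piece) data.frame(avgcty = mean(piece$cty)))

# Combine: stack the summaries and restore the grouping label
by_hand <- data.frame(drv = names(summaries),
                      do.call(rbind, summaries),
                      row.names = NULL)

# The same three steps in one plyr call
with_plyr <- ddply(data, .(drv), summarize, avgcty = mean(cty))

all.equal(by_hand$avgcty, with_plyr$avgcty)
```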
We had a quick exchange about using plyr for parallel processing. More on that was discussed on the listserv.
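For readers who want to try this, plyr's functions accept a `.parallel` argument that hands the per-group work to whatever foreach backend is registered. The sketch below assumes the doParallel package as that backend and an arbitrary two workers; neither choice comes from the talk or the listserv thread, and `city_summary` is a hypothetical helper.

```r
library(plyr)
library(ggplot2)     # only needed for the mpg example dataset
library(doParallel)  # assumption: one possible foreach backend

data <- as.data.frame(mpg[, c("manufacturer", "drv", "cty", "hwy")])

# Register a backend with two workers (the count is arbitrary)
cl <- makeCluster(2)
registerDoParallel(cl)

# A plain function of one data-frame piece, using only base R,
# so the workers need no extra packages attached
city_summary <- function(piece) {
  data.frame(avgcty = mean(piece$cty), sdcty = sd(piece$cty))
}

serial      <- ddply(data, .(manufacturer), city_summary)
in_parallel <- ddply(data, .(manufacturer), city_summary, .parallel = TRUE)

stopCluster(cl)

# Both runs should give the same summary table
all.equal(serial, in_parallel)
```

On a dataset this small the parallel run is unlikely to be faster, since starting workers and shipping pieces to them has its own cost. The pattern is more likely to pay off when each group's computation is slow.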