Steve Culman on the `plyr` Package

This article was first published on Noam Ross - R, and kindly contributed to R-bloggers.

At the Davis R Users’ Group yesterday, Steve Culman gave us an introduction to the plyr package and how to use it to manipulate data. Here’s his presentation, and the accompanying demonstration script:

## Some examples using the package plyr
library(plyr)
## Example dataset from ggplot2
library(ggplot2)
data(mpg)
str(mpg)
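## Simplify the dataset: keep manufacturer, drivetrain (drv), city mpg (cty) and highway mpg (hwy)
data <- mpg[, c("manufacturer", "drv", "cty", "hwy")]  # same as columns 1 and 7:9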
str(data)
## Summarising/ Aggregating Data
ddply(data, .(manufacturer), summarize, avgcty = mean(cty))
## you can perform multiple functions in a single call
ddply(data, .(manufacturer), summarize, avgcty = mean(cty), sdcty = sd(cty), maxhwy = max(hwy))
## you can summarize data by a combination of variables/factors
ddply(data, .(manufacturer, drv), summarize, avgcty = mean(cty), sdcty = sd(cty), maxhwy = max(hwy))
## note the package reshape/reshape2 is an elegant alternative for aggregating many variables at one time
## note the difference between "summarize" and "transform": summarize returns one row per group, while transform keeps every original row and adds the new column
ddply(data, .(drv), summarize, avgcty = mean(cty))
ddply(data, .(drv), transform, avgcty = mean(cty))
## transform is very useful for standardizing/normalizing within groups
ddply(data, .(drv), transform, delta = mean(cty)-cty)
## Now let's use plyr to run a simple loop
## We'll ask the question: Does city mpg differ between car manufacturers within each drivetrain class (four-wheel, front-wheel, or rear-wheel drive)? Let's automate these ANOVAs and extract the F-statistic and P-value from each one.
## Step 1: Create a function to run the ANOVA
model <- function(data) { aov(cty~manufacturer, data=data) }
## Step 2: Use plyr to run the model for each drivetrain and store the fitted models in a list (called anova.output). In dlply, the d means the input is a data frame and the l means the output is a list.
anova.output <- dlply(data, .(drv), model)
## Step 3: Create a function that tells R where to find the F-statistic and P-value in each model's summary. Don't worry about the messy indexing here; what matters is that it pulls the F-statistic and P-value out of the ANOVA table.
juicy <- function(x) {
  anova.table <- summary(x)[[1]]  # ANOVA table; its first row is the manufacturer term
  c(anova.table[["F value"]][[1]], anova.table[["Pr(>F)"]][[1]])
}
## Step 4: Extract those components from the list created in Step 2. In ldply, the input is a list and the output is a data frame. Because the input is a list, we don't need the 2nd parameter (which variable(s) to split by): the function is applied to every element of the list.
ldply(anova.output, juicy)
## The data frame shows F-statistics (V1) and P-values (V2) for the ANOVAs by drivetrain.
## We could always condense some of the above steps as well:
anova.output <- dlply(data, .(drv), function(data) aov(cty~manufacturer, data=data))
ldply(anova.output, function(x) { c(summary(x)[[1]][["F value"]][[1]], summary(x)[[1]][["Pr(>F)"]][[1]]) })
## Note that plyr provides many shortcuts, such as the functions colwise(), each() and splat(). You can always refer to the original article: http://www.jstatsoft.org/v40/i01/ for more on this.
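
As a rough illustration of two of those shortcuts, here is a minimal sketch that reuses the same four mpg columns as the script. It was not part of Steve's demonstration; the object names `stats`, `by_drv_each` and `by_drv_cols` are made up for this example, and `numcolwise()` is the numeric-only variant of `colwise()`.

```r
library(plyr)
library(ggplot2)  # only needed for the mpg dataset

# Plain data frame with the same four columns used in the script above
data <- as.data.frame(mpg[, c("manufacturer", "drv", "cty", "hwy")])

# each() bundles several functions into one function that returns a named vector
stats <- each(mean, sd)(data$cty)
stats

# Used inside ddply, each() gives one column per function for every drivetrain
by_drv_each <- ddply(data, .(drv), function(piece) each(mean, sd)(piece$cty))
by_drv_each

# numcolwise() turns a function of a vector into one that is applied to
# every numeric column of a data frame (here: cty and hwy)
by_drv_cols <- ddply(data, .(drv), numcolwise(mean))
by_drv_cols
```

Both calls return one row per drivetrain; the first has columns named after the functions passed to `each()`, the second keeps the original column names.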

Steve’s talk is based on this paper by Hadley Wickham in the Journal of Statistical Software. A lot of useful related resources are at Hadley Wickham’s plyr website.
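
That paper describes a split-apply-combine strategy, and it can help to see what a single `ddply` call replaces. The following is a minimal sketch in base R, written for this post rather than taken from the talk; the names `pieces`, `results`, `by_hand` and `with_plyr` are illustrative only.

```r
library(plyr)
library(ggplot2)  # only needed for the mpg dataset

data <- as.data.frame(mpg[, c("manufacturer", "drv", "cty", "hwy")])

# Split: one data frame per drivetrain
pieces <- split(data, data$drv)

# Apply: compute the summary for each piece
results <- lapply(pieces, function(piece) {
  data.frame(drv = piece$drv[1], avgcty = mean(piece$cty))
})

# Combine: stack the per-group results back into one data frame
by_hand <- do.call(rbind, results)
rownames(by_hand) <- NULL

# The same three steps in a single plyr call
with_plyr <- ddply(data, .(drv), summarize, avgcty = mean(cty))

all.equal(by_hand$avgcty, with_plyr$avgcty)
```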

We had a quick exchange about using plyr for parallel processing. More on that was discussed on the group's listserv.
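
For readers who want to try it, plyr functions accept a `.parallel = TRUE` argument that hands the per-group work to whatever `foreach` backend is registered. The sketch below is one possible setup and not necessarily what was discussed on the list: the choice of the `doParallel` backend and of two cores are assumptions, and the names `avg_cty`, `serial` and `parallel` are illustrative.

```r
library(plyr)
library(doParallel)  # assumption: doParallel is installed; other foreach backends also work
library(ggplot2)     # only needed for the mpg dataset

# Register a backend with two workers (an arbitrary choice for this sketch)
registerDoParallel(cores = 2)

data <- as.data.frame(mpg[, c("manufacturer", "drv", "cty", "hwy")])

# A per-group function that relies only on base R
avg_cty <- function(piece) data.frame(avgcty = mean(piece$cty))

serial   <- ddply(data, .(manufacturer), avg_cty)
parallel <- ddply(data, .(manufacturer), avg_cty, .parallel = TRUE)

# Both runs should give the same averages
all.equal(serial$avgcty, parallel$avgcty)

# Release the workers created by registerDoParallel()
stopImplicitCluster()
```

On a dataset this small the parallel run is unlikely to be faster, since starting workers and moving data has its own cost; the pattern only pays off when each group takes real time to process.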
