Using plyr and doMC for quick and easy apply-family functions

Fellgernon Bit - rstats

9 years ago

[This article was first published on Fellgernon Bit - rstats, and kindly contributed to R-bloggers.]

A few weeks back I set aside a little time to actually read what plyr (Wickham, 2011) is about, and I was surprised. The whole idea behind plyr is very simple: extend the apply() family to make common tasks easy. plyr has many functions whose names end in ply, which is short for apply. The two letters before ply identify each function: the first abbreviates the input type and the second the output type. For instance, ddply takes a data.frame as input and returns a data.frame, while ldply takes a list as input and returns a data.frame.
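As a minimal sketch of the naming convention, the snippet below feeds the same list to llply (list in, list out) and ldply (list in, data.frame out). The scores list and the summary columns are made up for illustration.

```r
library(plyr)

# Made-up input: a named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# llply: list input, list output
as_list <- llply(scores, mean)

# ldply: list input, data.frame output (one row per list element)
as_df <- ldply(scores, function(x) {
  data.frame(n = length(x), mean = mean(x))
})

str(as_list)
as_df
```

Only the second letter changes between the two calls, and with it the shape of the result.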

The syntax is pretty straightforward. For example, here are the arguments for ddply:

library(plyr)
args(ddply)
## function (.data, .variables, .fun = NULL, ..., .progress = "none", 
##     .inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL) 
## NULL

What we basically have to specify are

.data, which in general is the name of the input data.frame,
.variables, which is a vector of variable names (note the use of the . function). ddply applies a function to each subset of the data defined by these variables,
.fun, which is the actual function we want to run,
and ..., which are additional arguments passed on to the function we are running.
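To make the role of ... concrete, here is a small sketch in which an extra argument travels through ddply to the function being applied. The data frame df and the helper group_mean are hypothetical and only serve the illustration.

```r
library(plyr)

# Made-up data with one missing value in group A
df <- data.frame(
  group = c("A", "A", "A", "B", "B"),
  value = c(1, 2, NA, 10, 20)
)

# Hypothetical helper: one-row summary of a subset of df
group_mean <- function(d, na.rm = FALSE) {
  data.frame(mean_value = mean(d$value, na.rm = na.rm))
}

# na.rm is not an argument of ddply itself, so it is
# handed on to group_mean through ...
res <- ddply(df, .(group), group_mean, na.rm = TRUE)
res
```

Dropping na.rm = TRUE from the call would make the mean for group A come out as NA, since the helper then falls back to its default.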

From the ddply help page we have the following examples:

dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,
 mean = round(mean(age), 2),
 sd = round(sd(age), 2))
##   group sex  mean    sd
## 1     A   F 40.48 12.72
## 2     A   M 34.48 15.28
## 3     B   F 36.05  9.98
## 4     B   M 38.35  7.97
## 5     C   F 20.04  1.86
## 6     C   M 43.81 10.72

# An example using a formula for .variables
ddply(baseball[1:100, ], ~year, nrow)

##   year V1
## 1 1871  7
## 2 1872 13
## 3 1873 13
## 4 1874 15
## 5 1875 17
## 6 1876 15
## 7 1877 17
## 8 1878  3

# Applying two functions; nrow and ncol
ddply(baseball, .(lg), c("nrow", "ncol"))

##   lg  nrow ncol
## 1       65   22
## 2 AA   171   22
## 3 AL 10007   22
## 4 FL    37   22
## 5 NL 11378   22
## 6 PL    32   22
## 7 UA     9   22

But this is not the end of the story! Something I really liked about plyr is that it can be parallelized via the foreach (Revolution Analytics, 2012) package. I don’t know much about foreach, but what I have learnt is that you need another package such as doMC (Revolution Analytics, 2013) to actually run the code. foreach provides the infrastructure for splitting jobs and communicating in parallel, and packages like doMC tailor it to specific environments, such as running on multiple cores.
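The division of labour between the two packages can be sketched by using foreach directly, without plyr. This is only an illustration: the loop body and the choice of 2 workers are arbitrary, and doMC relies on forking, so it is meant for Unix-like systems.

```r
library(foreach)
library(doMC)

# doMC supplies the backend: register 2 workers (arbitrary choice)
registerDoMC(2)

# foreach describes the loop; %dopar% hands each iteration
# to whichever backend is currently registered
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}
squares

# Name of the backend that ran the loop
getDoParName()
```

Swapping registerDoMC() for another backend's registration function would leave the foreach loop itself unchanged, which is the point of the split.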

Running things in parallel can then be very easy. Basically, you load the packages, specify the number of cores, and run your ply function. Here is a short example:

## Load packages
library(plyr)
library(doMC)

## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

## Specify the number of cores
registerDoMC(4)

## Check how many cores we are using
getDoParWorkers()
## [1] 4

## Run your ply function
ddply(dfx, .(group, sex), summarize,
    mean = round(mean(age), 2),
    sd = round(sd(age), 2),
    .parallel = TRUE)

##   group sex  mean    sd
## 1     A   F 40.48 12.72
## 2     A   M 34.48 15.28
## 3     B   F 36.05  9.98
## 4     B   M 38.35  7.97
## 5     C   F 20.04  1.86
## 6     C   M 43.81 10.72
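A quick sanity check, sketched under assumptions, is to run the same call with and without .parallel = TRUE and compare results and timings. The function slow_square is a hypothetical stand-in for real work, and the cap of 2 workers is arbitrary.

```r
library(plyr)
library(doMC)

# Use at most 2 workers; na.rm guards against detectCores() returning NA
n_cores <- min(2L, parallel::detectCores(), na.rm = TRUE)
registerDoMC(n_cores)

# Hypothetical stand-in for a slow computation on one element
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Same call, serial and parallel, each wrapped in system.time()
t_serial <- system.time(res_serial <- llply(1:8, slow_square))
t_parallel <- system.time(
  res_parallel <- llply(1:8, slow_square, .parallel = TRUE)
)

# The values should agree regardless of how they were computed
all.equal(unlist(res_serial), unlist(res_parallel))
t_serial["elapsed"]
t_parallel["elapsed"]
```

For very small tasks the overhead of forking workers can outweigh the gain, so the parallel version is not guaranteed to be faster.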

In case you are interested, here is a short shell script for knitting an Rmd file on the cluster while requesting the appropriate number of cores for plyr and doMC.

#!/bin/bash 
# To run it in the current working directory
#$ -cwd 
# To get an email after the job is done
#$ -m e 
# To specify that we want 4 cores
#$ -pe local 4
# The name of the job
#$ -N myPlyJob

echo "**** Job starts ****"
date

# Knit your file: assuming it's called FileToKnit.Rmd
Rscript -e "library(knitr); knit2html('FileToKnit.Rmd')"

echo "**** Job ends ****"
date

Let's say that the bash script is named script.sh. Then you can submit it to the cluster queue using

qsub script.sh
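Inside the Rmd file, the number of cores passed to registerDoMC() should match what was requested from the scheduler. One way to avoid hard-coding it is sketched below. It assumes the cluster runs Sun Grid Engine or a compatible scheduler (which the #$ directives and qsub suggest) and that the scheduler exports the NSLOTS environment variable for jobs submitted with -pe.

```r
library(doMC)

# Assumption: the scheduler sets NSLOTS to the number of slots
# granted via "-pe local 4". Fall back to 1 core when it is
# absent, for example in an interactive session.
n_cores <- as.integer(Sys.getenv("NSLOTS", unset = "1"))

registerDoMC(n_cores)
getDoParWorkers()
```

With this in place, changing the -pe line in the shell script is enough to change how many cores the ply functions use.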

This is what I used to re-format a large data.frame in a few minutes on the cluster for the #jhsph753 class homework project.

So, thank you again Hadley Wickham for making awesome R packages!

Citations made with knitcitations (Boettiger, 2013).

Revolution Analytics (2013). doMC: Foreach parallel adaptor for the multicore package. http://CRAN.R-project.org/package=doMC
Revolution Analytics (2012). foreach: Foreach looping construct for R. http://CRAN.R-project.org/package=foreach
Carl Boettiger (2013). knitcitations: Citations for knitr markdown files. https://github.com/cboettig/knitcitations
Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software 40 (1). http://www.jstatsoft.org/v40/i01/
