Running aggregate on wide matrices takes a long time; use nested apply or data.table instead!

The aggregate function can be very useful in R, allowing one to run a function (e.g. mean) within groups of rows, for each column of a matrix or data frame, and to organize the results in an easy-to-read table. However, aggregate becomes slow for very wide matrices and data frames, i.e. when the number of columns is large. In this post I demonstrate the issue and show a couple of nice solutions that, at least for this example, cut the run time down to about 15% of that of aggregate, or even less.
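
To make the grouping concrete, here is a small illustrative example (not part of the benchmark below): a 6-row, 2-column data frame whose rows fall into two groups of three, averaged within each column.

# Illustrative toy example: two groups of three rows, column means within each group
toy = data.frame(a = 1:6, b = c(2, 4, 6, 8, 10, 12))
aggregate(toy, list(Group.1 = rep(1:2, each = 3)), mean)
#  Group.1 a  b
#1       1 2  4
#2       2 5 10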

I first created a wide matrix with 100 rows and 10,000 columns, holding 1,000,000 values randomly drawn from a normal distribution.
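
Since the values are random, the exact numbers shown below will differ from run to run; fixing the RNG seed first (something the original code does not do) makes the example reproducible.

# Optional: fix the random seed so the generated matrix is reproducible (the seed value is arbitrary)
set.seed(123)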

# The necessity to avoid wide matrices (with lots of columns)!
matWide= matrix(rnorm(1e+06),nrow=100, ncol=10000)

# Transform matrix to data frame
dfWide=as.data.frame(matWide)

I used the aggregate function to take the mean within groups of rows (10 groups of 10 rows each), for each column. The aggregate call takes about 4 seconds to run.

t1=Sys.time()
aggRes=aggregate(dfWide, list(rep(1:10, each=10)), mean)
(timeDifAggr=difftime(Sys.time(), t1, units = "secs"))
#Time difference of 3.807029 secs

Here are the first 5 rows and columns of the resulting data frame, along with its dimensions.

aggRes[1:5,1:5]
#  Group.1           V1          V2         V3          V4
#1       1  0.008815372  0.56920407  0.2195522  0.68183883
#2       2  0.046319580  0.07915253  0.2732586  0.30970451
#3       3  0.154718798 -0.09157008 -0.3676212 -0.02970137
#4       4  0.491208585  0.53066464 -0.1407269  0.49633703
#5       5 -0.397868879 -0.09793382  0.4154764 -0.17150871

dim(aggRes)
#[1]    10 10001
 

Then I used a nested 'apply' approach (technically a tapply inside an apply function) to run the same analysis. The apply call returns a 10 × 10,000 matrix with the group labels as row names, which the last two lines below convert back to the same layout that aggregate produces. It took significantly less time: about half a second.

t1=Sys.time()
nestApplyRes=apply(dfWide, 2, function(x){
  return(tapply(x, rep(1:10, each=10), mean))})
nestApplyRes=data.frame(Group.1=rownames(nestApplyRes),
                        nestApplyRes)
(timeDifNest=difftime(Sys.time(), t1, units = "secs"))
#Time difference of 0.5010331 secs

#Check if it provides exactly the same result as aggregate
all(aggRes==nestApplyRes)
#[1] TRUE
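
Note that == compares the floating-point results exactly; it works here because both approaches perform the same arithmetic, but a tolerance-based check is generally safer. A minimal sketch (the column subsetting just drops the grouping column):

# Compare only the numeric columns with a tolerance, ignoring attributes such as dimnames
isTRUE(all.equal(as.matrix(aggRes[, -1]), as.matrix(nestApplyRes[, -1]),
                 check.attributes = FALSE))
# should also return TRUE, given the exact match above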

Finally, I used the data.table package, as has been suggested in several forums. It took even less time to run: about 0.27 seconds.

 
library(data.table)
t1=Sys.time()
#Convert to data.table and compute the group mean of every column (same layout as aggregate)
dtRes <- as.data.table(dfWide)[, lapply(.SD, mean),
                               by = .(Group.1 = rep(1:10, each = 10))]
dtRes=as.data.frame(dtRes)
(timeDifDt=difftime(Sys.time(), t1, units = "secs"))
#Time difference of 0.268255 secs

all(aggRes==dtRes)
#TRUE
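
A single pair of Sys.time() calls measures just one run, so the timings above can fluctuate a bit between runs. For a more robust comparison one could repeat each expression with the microbenchmark package (not used in the original post); a minimal sketch, assuming the package is installed:

# Sketch: repeat each approach 10 times and summarise the run times
library(microbenchmark)
grp = rep(1:10, each = 10)
microbenchmark(
  aggregate   = aggregate(dfWide, list(grp), mean),
  nestedApply = apply(dfWide, 2, function(x) tapply(x, grp, mean)),
  dataTable   = as.data.table(dfWide)[, lapply(.SD, mean), by = .(Group.1 = grp)],
  times = 10)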

I also plotted the run time of each of the three approaches.

 
jpeg("TimeDif.jpg", res=300, width=800, height=800)
par(mar = c(6.5, 2.5, 1.5, 0.5))
barplot(height = as.numeric(c(timeDifAggr, timeDifNest, timeDifDt)),
        names.arg = c("Aggregate", "Nested apply", "Data table"),
        las=2, ylim=c(0,4), col=heat.colors(3), ylab="Sec")
dev.off()

So now I’ll think twice before using the aggregate function 😒.
