Running aggregate on wide matrices takes long; use nested apply or data.table instead!
The aggregate function can be very useful in R, allowing one to run a function (e.g. mean) within groups of rows, for each column of a matrix or data frame, and organize the results in an easy-to-read table. However, the function takes a long time to run on very wide matrices and data frames, where the number of columns is large. In this post I demonstrate the issue and show a couple of nice solutions that, at least for this example, cut the run time down to 15% of the aggregate function's, and even less.
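To make the setup concrete, here is a minimal sketch of what aggregate does on a toy data frame (the toy data and names are illustrative, not part of the benchmark below):

# Mean of each column within the groups defined by 'grp'
toy=data.frame(a=1:6, b=c(2,4,6,8,10,12))
grp=c("x","x","x","y","y","y")
aggregate(toy, by=list(Group=grp), FUN=mean)
#   Group a  b
# 1     x 2  4
# 2     y 5 10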
I first created a wide matrix with 100 rows and 10,000 columns, holding 1,000,000 values randomly drawn from a standard normal distribution.
# The necessity to avoid wide matrices (with lots of columns)!
matWide= matrix(rnorm(1e+06),nrow=100, ncol=10000)
# Transform matrix to data frame
dfWide=as.data.frame(matWide)
I then used the aggregate function to take the mean within groups of rows, for each column, timing the call with Sys.time(). The aggregate function took about 4 seconds to run.
t1=Sys.time()
aggRes=aggregate(dfWide, list(rep(1:10, each=10)), mean)
(timeDifAggr=difftime(Sys.time(), t1, units = "secs"))
#Time difference of 3.807029 secs
Here are the first 5 rows and columns of the resulting data frame, along with its dimensions.
aggRes[1:5,1:5]
# Group.1 V1 V2 V3 V4
#1 1 0.008815372 0.56920407 0.2195522 0.68183883
#2 2 0.046319580 0.07915253 0.2732586 0.30970451
#3 3 0.154718798 -0.09157008 -0.3676212 -0.02970137
#4 4 0.491208585 0.53066464 -0.1407269 0.49633703
#5 5 -0.397868879 -0.09793382 0.4154764 -0.17150871
dim(aggRes)
#[1] 10 10001
Then I used a nested apply approach (technically, a tapply inside an apply function) to run the same analysis. It took significantly less time (about half a second).
t1=Sys.time()
nestApplyRes=apply(dfWide, 2, function(x){
return(tapply(x, rep(1:10, each=10), mean))})
nestApplyRes=data.frame(Group.1=rownames(nestApplyRes),
nestApplyRes)
(timeDifNest=difftime(Sys.time(), t1, units = "secs"))
#Time difference of 0.5010331 secs
#Check if it provides exactly the same result as aggregate
all(aggRes==nestApplyRes)
#[1] TRUE
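As an aside, the same nested-apply idea can be written with vapply, which also checks that every column returns exactly 10 group means. This is a minimal sketch of a variant, not code from the original benchmark:

# vapply iterates over the columns of the data frame and enforces the result shape
grp=rep(1:10, each=10)
vapplyRes=vapply(dfWide, function(x) tapply(x, grp, mean), numeric(10))
dim(vapplyRes)
#[1]    10 10000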
Finally, I used data.table, as has been suggested by a few people in some forums. It took even less time to run: about 0.26 seconds.
library(data.table)
t1=Sys.time()
#Convert to data.table and compute the group means for every column (.SD holds all columns)
dtRes <- as.data.table(dfWide)[, lapply(.SD, mean), by = .(Group.1 = rep(1:10, each = 10))]
dtRes=as.data.frame(dtRes)
(timeDifDt=difftime(Sys.time(), t1, units = "secs"))
#Time difference of 0.268255 secs
all(aggRes==dtRes)
#[1] TRUE
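Timing single runs with Sys.time() is a coarse measurement; repeated timings smooth out run-to-run noise. Here is a minimal sketch using the microbenchmark package (an extra dependency I am assuming here, not used in the original timings):

# Repeat each approach 10 times and summarize the timing distributions
library(microbenchmark)
grp=rep(1:10, each=10)
microbenchmark(
  aggregate  =aggregate(dfWide, list(grp), mean),
  nestedApply=apply(dfWide, 2, function(x) tapply(x, grp, mean)),
  dataTable  =as.data.table(dfWide)[, lapply(.SD, mean), by=.(Group.1=grp)],
  times=10)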
I also plotted the run times of the three approaches:
jpeg("TimeDif.jpg", res=300, width=800, height=800)
par(mar = c(6.5, 2.5, 1.5, 0.5))
barplot(height = as.numeric(c(timeDifAggr, timeDifNest, timeDifDt)),
        names.arg = c("Aggregate", "Nested apply", "Data table"),
        las=2, ylim=c(0,4), col=heat.colors(3), ylab="Sec")
dev.off()
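As a footnote: because the statistic here is a mean, base R's rowsum() offers yet another fast route, summing rows within groups and dividing by the group size. A minimal sketch, assuming the equal group sizes (10 rows per group) used above:

# Group sums via rowsum(), divided by the group size, give the group means
grp=rep(1:10, each=10)
rowsumRes=rowsum(dfWide, group=grp)/10
# Agrees with the aggregate result up to floating-point rounding
max(abs(as.matrix(rowsumRes)-as.matrix(aggRes[,-1])))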
So now I’ll think twice before using the aggregate function 😒.