Avoid apply() function in large datasets
[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When we work with large datasets and need row-wise or column-wise summaries such as the min, max, rank or mean, we should avoid the apply function, because it calls an R function once per row or column and that takes a lot of time. Instead, we can use the matrixStats package, which provides a dedicated function for each of these calculations. Let’s look at a comparison.
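As a rough sketch of what "corresponding functions" means here, the snippet below pairs a few apply calls with their matrixStats counterparts and checks that both give the same answer. The matrix size, the seed and the variable names are illustrative assumptions, not part of the original benchmark.

```r
library(matrixStats)

set.seed(42)                                # arbitrary seed, for reproducibility only
x <- matrix(rnorm(200 * 300), nrow = 200)   # small illustrative matrix: 200 rows, 300 columns

# Row-wise minimum
row_min_apply <- apply(x, 1, min)
row_min_fast  <- rowMins(x)

# Column-wise maximum
col_max_apply <- apply(x, 2, max)
col_max_fast  <- colMaxs(x)

# Row-wise mean
row_mean_apply <- apply(x, 1, mean)
row_mean_fast  <- rowMeans2(x)

# Row-wise ranks: apply() returns the per-row results as columns, hence t().
# ties.method is set explicitly so that it matches the default of base rank().
row_rank_apply <- t(apply(x, 1, rank))
row_rank_fast  <- rowRanks(x, ties.method = "average")

# Both approaches should agree up to floating-point tolerance
stopifnot(
  isTRUE(all.equal(row_min_apply,  row_min_fast)),
  isTRUE(all.equal(col_max_apply,  col_max_fast)),
  isTRUE(all.equal(row_mean_apply, row_mean_fast)),
  isTRUE(all.equal(row_rank_apply, row_rank_fast))
)
```

Checking equality on a small matrix like this is a cheap way to confirm a replacement before swapping it into a long-running script.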
Example of Minimum value per Row
Assume that we want to get the minimum value of each row of a 5000 x 5000 matrix. Let’s compare the performance of the apply function from the base package with the rowMins function from the matrixStats package.
library(matrixStats)
library(microbenchmark)
library(ggplot2)

# 5000 x 5000 matrix of standard normal values
x <- matrix(rnorm(5000 * 5000), ncol = 5000)

# Time both approaches, 100 runs each
tm <- microbenchmark(apply(x, 1, min),
                     rowMins(x),
                     times = 100L)
tm

Unit: milliseconds
             expr      min         lq       mean    median        uq       max neval
 apply(x, 1, min) 981.6283 1034.98050 1078.04485 1065.4163 1107.9962 1327.9284   100
       rowMins(x)  42.1838   43.80065   46.55752   45.2255   47.6249   81.3097   100
As we can see from the output above, the apply function was about 23 times slower than rowMins (a mean of roughly 1078 ms versus 47 ms per call). The violin plot below shows the distribution of the timings:
autoplot(tm)
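The slowdown factor quoted above can also be computed from the benchmark object itself instead of being read off the printed table. The following is a minimal sketch under a few assumptions: it uses a smaller 500 x 500 matrix and only 20 repetitions so that it runs quickly, and the labels apply and rowMins as well as the variable names are chosen here for illustration. The ratio it produces will differ from the figure above, since it depends on the machine and the matrix size.

```r
library(matrixStats)
library(microbenchmark)

set.seed(1)                                  # arbitrary seed, for reproducibility only
x <- matrix(rnorm(500 * 500), ncol = 500)    # smaller matrix than in the article, for speed

# Naming the expressions makes the rows of the summary easy to look up
tm_small <- microbenchmark(
  apply   = apply(x, 1, min),
  rowMins = rowMins(x),
  times   = 20L
)

# summary() returns a data frame with one row per expression;
# both rows are reported in the same time unit, so the ratio is unit-free
res <- summary(tm_small)
median_apply   <- res$median[res$expr == "apply"]
median_rowmins <- res$median[res$expr == "rowMins"]

# How many times slower apply() is than rowMins(), based on median timings
slowdown <- median_apply / median_rowmins
slowdown
```

The median is used here rather than the mean because it is less sensitive to the occasional slow run, such as one interrupted by garbage collection.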
To leave a comment for the author, please follow the link and comment on their blog: R – Predictive Hacks.