Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this post, I talk about loops in R, why they can be slow and when it is okay to use them.
Don’t grow objects
Let us generate a matrix of uniform values (max changing for every column).
gen_grow <- function(n = 1e3, max = 1:500) { mat <- NULL for (m in max) { mat <- cbind(mat, runif(n, max = m)) } mat } set.seed(1) system.time(mat1 <- gen_grow(max = 1:500)) ## user system elapsed ## 0.21 0.17 0.37 system.time(mat2 <- gen_grow(max = 1:2000)) ## user system elapsed ## 3.91 3.00 6.92 gen_sapply <- function(n = 1e3, max = 1:500) { sapply(max, function(m) runif(n, max = m)) } set.seed(1) system.time(mat3 <- gen_sapply(max = 1:500)) ## user system elapsed ## 0.01 0.00 0.01 identical(mat3, mat1) ## [1] TRUE system.time(mat4 <- gen_sapply(max = 1:2000)) ## user system elapsed ## 0.07 0.00 0.07 identical(mat4, mat2) ## [1] TRUE
Wow,
sapply()
is so much faster than loops!
Don’t get this wrong, sapply()
or lapply()
is nothing but a loop internally, so sapply()
shouldn’t be any faster than a loop. Here, the problem is not with the loop, but what we do inside this loop. Indeed, in gen_grow()
, at each iteration of the loop, we reallocate a new matrix with one more column, which takes time.
Imagine you want to climb all those stairs, but you have to climb only stair 1, go to the bottom then climb the first 2 stairs, go to the bottom then climb the first three, and so on until you reach the top. This takes way more time than just climbing all stairs at once. This is basically what happens in function gen_grow()
but instead of climbing more stairs, it allocates more memory, which also takes time.
You have at least two solutions to this problem. The first solution is to pre-allocate the whole result once (if you know its size in advance) and just fill it:
gen_prealloc <- function(n = 1e3, max = 1:500) { mat <- matrix(0, n, length(max)) for (i in seq_along(max)) { mat[, i] <- runif(n, max = max[i]) } mat } set.seed(1) system.time(mat5 <- gen_prealloc(max = 1:500)) ## user system elapsed ## 0.02 0.00 0.01 identical(mat5, mat1) ## [1] TRUE system.time(mat6 <- gen_prealloc(max = 1:2000)) ## user system elapsed ## 0.08 0.00 0.08 identical(mat6, mat2) ## [1] TRUE
Another solution that can be really useful if you don’t know the size of the result is to store the results in a list. A list, as opposed to a vector or a matrix, stores its elements in different places in memory (the elements don’t have to be contiguously stored in memory) so that you can add one element to the list without copying the rest of the list.
gen_list <- function(n = 1e3, max = 1:500) { l <- list() for (i in seq_along(max)) { l[[i]] <- runif(n, max = max[i]) } do.call("cbind", l) } set.seed(1) system.time(mat7 <- gen_list(max = 1:500)) ## user system elapsed ## 0.02 0.00 0.02 identical(mat7, mat1) ## [1] TRUE system.time(mat8 <- gen_list(max = 1:2000)) ## user system elapsed ## 0.06 0.00 0.07 identical(mat8, mat2) ## [1] TRUE
Vectorization, why?
I call vectorized a function that takes vectors as arguments and operate on each element of these vectors in another (compiled) language (such as C++ and Fortran).
So, let me repeat myself: sapply()
is not a vectorized function.
Let’s go back to vectorization, why is it so important in R? As an example, let’s compute the sum of two vectors.
add_loop_prealloc <- function(x, y) { res <- double(length(x)) for (i in seq_along(x)) { res[i] <- x[i] + y[i] } res } add_sapply <- function(x, y) { sapply(seq_along(x), function(i) x[i] + y[i]) } add_vectorized <- `+` N <- 1e5; x <- runif(N); y <- rnorm(N) compiler::enableJIT(0) ## disable just-in-time compilation ## [1] 3 microbenchmark::microbenchmark( LOOP = add_loop_prealloc(x, y), SAPPLY = add_sapply(x, y), VECTORIZED = add_vectorized(x, y) ) ## Unit: microseconds ## expr min lq mean median uq max neval cld ## LOOP 139306.750 169242.364 181354.6624 175967.0830 185381.558 345509.411 100 c ## SAPPLY 117026.874 149404.590 164478.9570 161751.1765 174571.052 369295.404 100 b ## VECTORIZED 94.361 193.025 229.3289 227.2925 274.142 381.746 100 a compiler::enableJIT(3) ## default ## [1] 0 microbenchmark::microbenchmark( LOOP = add_loop_prealloc(x, y), SAPPLY = add_sapply(x, y), VECTORIZED = add_vectorized(x, y) ) ## Unit: microseconds ## expr min lq mean median uq max neval cld ## LOOP 141996.836 164231.0225 185083.712 176481.5930 190328.1715 391608.057 100 c ## SAPPLY 116131.283 148153.5755 169681.157 166047.6990 184191.1330 345131.639 100 b ## VECTORIZED 153.625 206.5995 251.045 235.0735 280.7635 419.821 100 a
Here, the vectorized function is much faster than the two others and the for-loop approach is faster than the sapply
equivalent when just-in-time compilation is enabled. (Edit: for some unknown reason, this result is not true anymore when I render this blog post. Try the code yourself!)
As an interpreted language, for each iteration res[i] <- x[i] + y[i]
, R has to ask:
what is the type of
x[i]
andy[i]
?can I add these two types? what is the type of
x[i] + y[i]
then?can I store this result in
res
or do I need to convert it?
These questions must be answered for each iteration, which takes time. On the contrary, for vectorized functions, these questions must be answered only once, which saves a lot of time. Read more with Noam Ross’s blog post on vectorization.
Conclusion
In this post, I don’t say that you shouldn’t use
lapply()
instead of a for-loop. Indeed, it can be more concise and clearer to uselapply()
, but don’t expect miracles with respect to performance. You should also take a look at package {purrr} that provides shortcuts, consistency and some functions to iterate over rows of a data frame.Loops are slower in R than in C++ because R is an interpreted language (not compiled), even if now there is just-in-time (JIT) compilation in R (>= 3.4) that makes R loops faster (yet, still not as fast). Then, R loops are not that bad if you don’t use too many iterations (let’s say not more than 100,000 iterations).
Beware what you’re doing in the loops because they can be super slow. Use vectorized operations if you can (search for them in available packages such as {matrixStats}). If you can’t, write your own vectorized functions with {Rcpp}. I have an introduction to {Rcpp} there.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.