The other day I was helping to refactor an R package and came across one of the biggest performance blockers there is: dynamically growing matrices. Of course I repeated the mantra "always preallocate your variables", but in this case it is not clear how big (in terms of rows) the final matrix will be. So there is no way around growing the matrix dynamically.
The chosen approach in the package was to use `rbind`, similar to this:

```r
mat <- vector()
while (<condition>) {
  if (<condition>) {
    tmp <- <do calculation>
    mat <- rbind(mat, tmp)
  }
}
```
Disregarding performance, this seems like a sensible approach: just add new rows to the end of the matrix. But a little bit of profiling showed that this was an extreme bottleneck in the function where it appears. So what are viable alternatives? Let's benchmark some solutions. Note that we assume to know `n` here, but in reality we do not know how big the matrix will be at the end.
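To see why repeated `rbind` is so costly: each call allocates a brand-new matrix and copies all existing rows into it, so building `n` rows one at a time does on the order of n²/2 row copies. A quick back-of-the-envelope sketch (mine, not part of the benchmark below):

```r
# Elements copied when growing an n x n matrix row by row with rbind():
# after step i the result has i rows, all freshly written, so the total
# work is sum(1..n) rows of length n each.
n <- 1000
copied <- sum(as.numeric(seq_len(n))) * n
copied  # 500500000 elements touched, versus n^2 = 1e6 with preallocation
```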
As a baseline, we implement a function that preallocates memory.

```r
fmat <- function(n) {
  res <- matrix(NA, n, n)
  for (i in 1:n) {
    res[i, ] <- runif(n)
  }
  res
}
```
The first contender is the `rbind` approach.

```r
frbind <- function(n) {
  res <- vector()
  for (i in 1:n) {
    res <- rbind(res, runif(n))
  }
  res
}
```
For the second approach, we try to reduce the number of `rbind` calls by growing the final matrix in chunks. For `csize = 1` we obtain `frbind()`, and for `csize = n` we have `fmat()`.

```r
fchunks <- function(n, csize = 10) {
  chunk <- matrix(NA, csize, n)
  res <- vector()
  for (i in 1:n) {
    if (i %% csize == 0) {
      chunk[csize, ] <- runif(n)
      res <- rbind(res, chunk)
      chunk <- matrix(NA, csize, n)
    } else {
      chunk[i %% csize, ] <- runif(n)
    }
  }
  res[!is.na(res[, 1]), ]
}
```
The last approach is to grow a list which is converted to a matrix at the end.

```r
flist <- function(n) {
  res <- list()
  for (i in 1:n) {
    res[[length(res) + 1]] <- runif(n)
  }
  do.call(rbind, res)
}
```
```r
n <- 1000
bench <- microbenchmark::microbenchmark(
  fmat(n),
  frbind(n),
  fchunks(n, csize = 10),
  flist(n),
  times = 1,
  unit = "ms"
)
```
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| flist(n) | 41.5171 | 41.5171 | 41.5171 | 41.5171 | 41.5171 | 41.5171 | 1 |
| fmat(n) | 66.3531 | 66.3531 | 66.3531 | 66.3531 | 66.3531 | 66.3531 | 1 |
| fchunks(n, csize = 10) | 692.2926 | 692.2926 | 692.2926 | 692.2926 | 692.2926 | 692.2926 | 1 |
| frbind(n) | 6115.4161 | 6115.4161 | 6115.4161 | 6115.4161 | 6115.4161 | 6115.4161 | 1 |
The performance of the `rbind` approach is terrifyingly bad. The list approach, on the other hand, is extremely efficient: it performs on par with the preallocated matrix approach. I unfortunately lack a deep understanding of the R internals here, but it seems as if dynamically growing a list does not have much (or any) overhead. If I recall correctly, since R 3.4.0 assigning beyond the end of a vector or list over-allocates some extra capacity, so repeated appends no longer copy the whole list on every step.
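A quick sanity check (my own sketch, not part of the original benchmark) that the list pattern yields the expected matrix, with the rows only assembled once by the single `do.call(rbind, ...)` at the end:

```r
set.seed(1)
res <- list()
for (i in 1:5) {
  res[[length(res) + 1]] <- runif(3)  # append one row; earlier rows untouched
}
m <- do.call(rbind, res)  # one rbind over all rows at the end
dim(m)                       # 5 3
identical(m[2, ], res[[2]])  # TRUE: row 2 is exactly the stored vector
```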
Update
Lluís Revilla suggested that `cbind` might be more efficient than `rbind`, given that R stores matrices column-wise.
```r
fcbind <- function(n) {
  res <- vector()
  for (i in 1:n) {
    res <- cbind(res, runif(n))
  }
  t(res)
}
```
```r
n <- 1000
bench <- microbenchmark::microbenchmark(
  fmat(n),
  frbind(n),
  fcbind(n),
  fchunks(n, csize = 10),
  flist(n),
  times = 1,
  unit = "ms"
)
```
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| flist(n) | 37.3641 | 37.3641 | 37.3641 | 37.3641 | 37.3641 | 37.3641 | 1 |
| fmat(n) | 46.4889 | 46.4889 | 46.4889 | 46.4889 | 46.4889 | 46.4889 | 1 |
| fchunks(n, csize = 10) | 565.9169 | 565.9169 | 565.9169 | 565.9169 | 565.9169 | 565.9169 | 1 |
| fcbind(n) | 1330.2777 | 1330.2777 | 1330.2777 | 1330.2777 | 1330.2777 | 1330.2777 | 1 |
| frbind(n) | 6103.8364 | 6103.8364 | 6103.8364 | 6103.8364 | 6103.8364 | 6103.8364 | 1 |
So `cbind` is indeed a lot faster than `rbind`, but still much slower than the list approach.
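Coming back to the original setting, where the number of rows is not known in advance, the while-loop from the top of the post can be rewritten with the list pattern. The stopping rule and the per-row computation below are dummy stand-ins for the package's actual `<condition>` and `<do calculation>`:

```r
set.seed(42)
rows <- list()
i <- 0
while (i < 100) {                          # stand-in for <condition>
  i <- i + 1
  if (i %% 2 == 0) {                       # stand-in for the inner <condition>
    rows[[length(rows) + 1]] <- runif(10)  # stand-in for <do calculation>
  }
}
mat <- do.call(rbind, rows)  # build the matrix once, at the end
dim(mat)  # 50 10
```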
Citation

```bibtex
@online{schoch2024,
  author = {Schoch, David},
  title = {Be Kind Don’t Rbind},
  date = {2024-02-13},
  url = {http://blog.schochastics.net/posts/2024-02-13_be-kind-dont-rbind},
  langid = {en}
}
```