Concatenating a list of data frames

andrew

8 years ago

[This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers.]


It’s something that I do surprisingly often: concatenating a list of data frames into a single (possibly quite enormous) data frame. Until now my naive solution worked pretty well. However, today I needed to deal with a list of over 6 million elements. The result was hours of page thrashing before my R session finally surrendered. I suppose I should be happy that my hard disk survived.

I did a bit of research and found that there are a few solutions which are much (much!) more efficient.

The Problem

Let’s create some test data: a list consisting of 100 000 elements, each of which is a small data frame.

> N <- 100000
> 
> # Preallocate the list rather than growing it one element at a time.
> data <- vector("list", N)
> 
> for (n in seq_len(N)) {
+   data[[n]] <- data.frame(index = n, char = sample(letters, 1), z = runif(1))
+ }
> data[[1]]
  index char        z
1     1    t 0.221784
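
The same kind of test data can also be built without an explicit loop. What follows is only a sketch: the fixed seed, the smaller N and the make_row() helper are my own assumptions for illustration, not part of the original setup.

```r
# Fixed seed (an assumption, not in the original) so reruns give the same data.
set.seed(1)

# Much smaller than the article's 100 000, just to keep the sketch quick.
N_small <- 1000

# make_row() is a hypothetical helper: it returns one single-row data frame.
make_row <- function(n) {
  data.frame(index = n, char = sample(letters, 1), z = runif(1))
}

# lapply() returns a list with one data frame per index.
small <- lapply(seq_len(N_small), make_row)

length(small)    # number of data frames in the list
str(small[[1]])  # structure of the first element
```

Setting a seed also means that the first element looks the same each time the data are regenerated.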

The Naive Solution

My naive solution to the problem was to use a combination of do.call() and rbind(). It gets the job done.

> head(do.call(rbind, data))
  index char          z
1     1    h 0.56891292
2     2    x 0.90331644
3     3    z 0.53675079
4     4    h 0.04587779
5     5    o 0.08608656
6     6    l 0.26410506
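
For anyone unsure what do.call() is doing here: it builds a single call to rbind() in which every element of the list becomes a separate argument. A small sketch with a made-up three-element list (parts is purely illustrative) compares it with writing the call out by hand:

```r
# A tiny, hand-written list standing in for the real data.
parts <- list(
  data.frame(index = 1, char = "a", z = 0.1),
  data.frame(index = 2, char = "b", z = 0.2),
  data.frame(index = 3, char = "c", z = 0.3)
)

# One call to rbind() with all list elements as arguments ...
via_do_call <- do.call(rbind, parts)

# ... which should match spelling out the arguments explicitly.
by_hand <- rbind(parts[[1]], parts[[2]], parts[[3]])

all.equal(via_do_call, by_hand)
```

With 100 000 elements this means a single rbind() call with 100 000 arguments, all of which are handled by R-level code.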

Alternative Solutions #1 and #2

The plyr package presents two options: ldply() and rbind.fill().

> library(plyr)
> 
> head(ldply(data, rbind))
  index char          z
1     1    h 0.56891292
2     2    x 0.90331644
3     3    z 0.53675079
4     4    h 0.04587779
5     5    o 0.08608656
6     6    l 0.26410506
> head(rbind.fill(data))
  index char          z
1     1    h 0.56891292
2     2    x 0.90331644
3     3    z 0.53675079
4     4    h 0.04587779
5     5    o 0.08608656
6     6    l 0.26410506

Both of these also do the job nicely.
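
As its name suggests, rbind.fill() also copes with data frames that do not share all of their columns: missing columns are filled with NA. A minimal sketch, using two made-up data frames (a and b are illustrative only, not part of the test data):

```r
library(plyr)

# Two data frames with only the index column in common.
a <- data.frame(index = 1, char = "a")
b <- data.frame(index = 2, z = 0.5)

# Columns missing from an input are filled with NA in the result.
combined <- rbind.fill(list(a, b))
combined
```

This does not matter for the test data above, where every element has the same three columns, but it is handy when the list is less uniform.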

Alternative Solution #3

Finally, a solution from the data.table package.

> library(data.table)
> 
> head(rbindlist(data))
   index char          z
1:     1    h 0.56891292
2:     2    x 0.90331644
3:     3    z 0.53675079
4:     4    h 0.04587779
5:     5    o 0.08608656
6:     6    l 0.26410506
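
Note the row labels in that output: rbindlist() returns a data.table rather than a plain data frame. The sketch below, on a made-up two-element named list (parts and the "source" column name are my own choices for illustration), shows the idcol argument, which records the list element each row came from, and a conversion back to a data frame.

```r
library(data.table)

# A small named list standing in for the real data.
parts <- list(
  first  = data.frame(index = 1, char = "a", z = 0.1),
  second = data.frame(index = 2, char = "b", z = 0.2)
)

# idcol adds a column identifying the originating list element.
combined <- rbindlist(parts, idcol = "source")
class(combined)

# Convert to a plain data frame if downstream code expects one.
df <- as.data.frame(combined)
class(df)
```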

Benchmarking

All of these alternatives produce the same rows (although rbindlist() returns a data.table rather than a plain data frame). The solution of choice will be the fastest one (and the one causing the least page thrashing!).

> library(rbenchmark)
> 
> benchmark(do.call(rbind, data), ldply(data, rbind), rbind.fill(data), rbindlist(data))
                  test replications  elapsed relative user.self sys.self user.child sys.child
1 do.call(rbind, data)          100 11387.82  668.692  11384.15     1.54         NA        NA
2   ldply(data, rbind)          100  4983.72  292.644   4982.90     0.52         NA        NA
3     rbind.fill(data)          100  1480.46   86.932   1480.23     0.17         NA        NA
4      rbindlist(data)          100    17.03    1.000     16.86     0.17         NA        NA
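
The elapsed column is the total over all 100 replications, so the naive solution alone accounts for more than three hours in that table. A quicker comparison can be sketched by shrinking the list and the number of replications. The list size, the replication count and the reduced set of columns below are assumptions chosen only to keep the run short, and only two of the four approaches are included.

```r
library(rbenchmark)
library(data.table)

# A much smaller list than the article's (an assumption, to keep this quick).
small <- lapply(seq_len(500), function(n) {
  data.frame(index = n, char = sample(letters, 1), z = runif(1))
})

# Fewer replications and fewer columns than the default benchmark() output.
res <- benchmark(
  do.call(rbind, small),
  rbindlist(small),
  replications = 5,
  columns = c("test", "replications", "elapsed", "relative"),
  order = "relative"
)
res
```

Timings from a list this small will not match the table above, so treat the result as a rough check rather than a replacement for the full benchmark.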

Thoughts on Performance

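The naive solution uses the rbind.data.frame() method, which is slow because it checks that the columns in the various data frames match by name and, if they don’t, re-arranges them accordingly. rbindlist(), by contrast, matched columns purely by position when this was written. In more recent data.table releases the default (use.names = "check") still binds by position but is documented to warn when the items' column names are not in the same order, and use.names = TRUE matches by name.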
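
The difference between matching by name and matching by position is easiest to see with two data frames whose columns are in a different order. This is a minimal sketch with made-up inputs (a and b are illustrative); use.names is set explicitly so that the behaviour does not depend on the default of the installed data.table version.

```r
library(data.table)

a <- data.frame(x = 1, y = 2)
b <- data.frame(y = 3, x = 4)  # same columns, different order

# Base R: rbind.data.frame() matches columns by name.
by_name_base <- rbind(a, b)

# data.table: position only, so b's first column (y) lands under x.
by_position <- rbindlist(list(a, b), use.names = FALSE)

# data.table: explicitly match by name.
by_name_dt <- rbindlist(list(a, b), use.names = TRUE)

by_name_base$x  # 1 4
by_position$x   # 1 3
by_name_dt$x    # 1 4
```

Matching by position is cheaper, but it silently mixes columns if the inputs are not laid out identically, so it is worth knowing which mode is in use.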

rbindlist() is implemented in C, while rbind.data.frame() is coded in R.

Both of the plyr solutions are an improvement on the naive solution. However, relative to all of the other solutions, rbindlist() is blisteringly fast: in the benchmark above it was roughly 87 times faster than rbind.fill() and almost 670 times faster than the naive solution. Little wonder that my naive solution bombed out with a list of 6 million data frames. Using rbindlist(), however, it was done before I had finished my cup of coffee.
