Site icon R-bloggers

Union Multiple Data.Frames with Different Column Names

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

On Friday, while working on a project that I needed to union multiple data.frames with different column names, I realized that the base::rbind() function doesn’t take data.frames with different columns names and therefore just quickly drafted a rbind2() function on the fly to get the job done based on the idea of MapReduce that I discussed before (https://statcompute.wordpress.com/2018/09/08/playing-map-and-reduce-in-r-subsetting).

rbind2 <- function(lst) {
  h <- unique(unlist(lapply(lst, names)))
  Reduce(rbind, parallel::mcMap(function(x) {x[, setdiff(h, names(x))] <- NA; return(x)}, lst, mc.cores = length(lst)))
}

On Saturday, when I revisited the problem, I found a very good thread on the stackoverflow (https://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns) discussing various approaches addressing my problem yesterday. Out of curiosity, I did a comparison between the rbind2() and discussed approaches by combining 8 data.frames each with a million records. As shown in the plot, my homebrew rbind2() function is only marginally faster than the gtools::smartbind() function and the dplyr::bind_rows function is the most efficient.

n <- 1000000
d1 <- data.frame(id = 1:n, x1 = 1)
d2 <- data.frame(id = 1:n, x2 = 2)
d3 <- data.frame(id = 1:n, x3 = 3)
d4 <- data.frame(id = 1:n, x4 = 4)
d5 <- data.frame(id = 1:n, x5 = 5)
d6 <- data.frame(id = 1:n, x6 = 6)
d7 <- data.frame(id = 1:n, x7 = 7)
d8 <- data.frame(id = 1:n, x8 = 8)
microbenchmark::microbenchmark(times = 10, 
  "homebrew::rbind2"      = {rbind2(list(d1, d2, d3, d4, d5, d6, d7, d8))},
  "gtools::smartbind"     = {gtools::smartbind(list = list(d1, d2, d3, d4, d5, d6, d7, d8))},
  "dplyr::bind_rows"      = {dplyr::bind_rows(d1, d2, d3, d4, d5, d6, d7, d8)},
  "plyr::rbind.fill"      = {plyr::rbind.fill(d1, d2, d3, d4, d5, d6, d7, d8)},
  "data.table::rbindlist" = {data.table::rbindlist(list(d1, d2, d3, d4, d5, d6, d7, d8), fill = T)}
)

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.