Union Multiple Data.Frames with Different Column Names
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
On Friday, while working on a project that I needed to union multiple data.frames with different column names, I realized that the base::rbind() function doesn’t take data.frames with different columns names and therefore just quickly drafted a rbind2() function on the fly to get the job done based on the idea of MapReduce that I discussed before (https://statcompute.wordpress.com/2018/09/08/playing-map-and-reduce-in-r-subsetting).
rbind2 <- function(lst) { h <- unique(unlist(lapply(lst, names))) Reduce(rbind, parallel::mcMap(function(x) {x[, setdiff(h, names(x))] <- NA; return(x)}, lst, mc.cores = length(lst))) }
On Saturday, when I revisited the problem, I found a very good thread on the stackoverflow (https://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns) discussing various approaches addressing my problem yesterday. Out of curiosity, I did a comparison between the rbind2() and discussed approaches by combining 8 data.frames each with a million records. As shown in the plot, my homebrew rbind2() function is only marginally faster than the gtools::smartbind() function and the dplyr::bind_rows function is the most efficient.
n <- 1000000 d1 <- data.frame(id = 1:n, x1 = 1) d2 <- data.frame(id = 1:n, x2 = 2) d3 <- data.frame(id = 1:n, x3 = 3) d4 <- data.frame(id = 1:n, x4 = 4) d5 <- data.frame(id = 1:n, x5 = 5) d6 <- data.frame(id = 1:n, x6 = 6) d7 <- data.frame(id = 1:n, x7 = 7) d8 <- data.frame(id = 1:n, x8 = 8) microbenchmark::microbenchmark(times = 10, "homebrew::rbind2" = {rbind2(list(d1, d2, d3, d4, d5, d6, d7, d8))}, "gtools::smartbind" = {gtools::smartbind(list = list(d1, d2, d3, d4, d5, d6, d7, d8))}, "dplyr::bind_rows" = {dplyr::bind_rows(d1, d2, d3, d4, d5, d6, d7, d8)}, "plyr::rbind.fill" = {plyr::rbind.fill(d1, d2, d3, d4, d5, d6, d7, d8)}, "data.table::rbindlist" = {data.table::rbindlist(list(d1, d2, d3, d4, d5, d6, d7, d8), fill = T)} )
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.