[This article was first published on Rcrastinate, and kindly contributed to R-bloggers].
Which function rbinds dataframes together fastest?
First competitor: classic rbind in a for loop over a list of dataframes
Second competitor: do.call("rbind", <list of dataframes>)
Third competitor: rbind.fill(<list of dataframes>) from the plyr package
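As a minimal sketch of what the three competitors look like on a toy input: the list `dfs` and its contents are made up for illustration and are not the benchmark data. All three calls should stack the same rows. The plyr call is guarded so the snippet still runs where the package is not installed.

```r
# Toy input: a list of three small dataframes (illustrative only)
dfs <- list(
  data.frame(g = "A", x = 1:2),
  data.frame(g = "B", x = 3:4),
  data.frame(g = "C", x = 5:6)
)

# 1) classic rbind in a for loop, growing the result one piece at a time
loop_res <- data.frame()
for (d in dfs) {
  loop_res <- rbind(loop_res, d)
}

# 2) a single rbind call on all list elements via do.call
docall_res <- do.call("rbind", dfs)

# 3) rbind.fill from plyr, only if the package is available
if (requireNamespace("plyr", quietly = TRUE)) {
  fill_res <- plyr::rbind.fill(dfs)
}
```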
The job:
– rbinding a list of dataframes with 4 columns each, one column is the splitting factor, the other 3 hold normally distributed random data
– the size parameter i in the code below is varied between 20,000; 50,000; 100,000; 200,000; 300,000; 400,000; 500,000 and 600,000, so the original dataframe has i*6 rows (120,000 to 3,600,000)
– the number of levels for the splitting factor (hence the number of list elements after splitting) is varied between 6, 12 and 24 – the total number of rows for the original dataframe is held constant
The machine:
– A blazing fast late 2008 MacBook with a 2 GHz CPU and 4 GB of RAM running Mountain Lion
– 32-bit R using RGui.app for Mac OS X
The results:
rbind.fill is the fastest function for each number of sub-dataframes (no surprises here). The classic rbind in a for loop is massively influenced by the number of sub-dataframes!
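One plausible reading of the for-loop result, which the post itself does not spell out: every rbind call in the loop writes a fresh copy of everything accumulated so far, so more pieces mean more repeated copying even when the total number of rows stays the same. The sketch below is a back-of-the-envelope count, not a measurement. The function name `rows_copied_loop` is hypothetical, and the model ignores all other overhead of rbind.

```r
# Rows written when growing a result piece by piece: after piece j the
# accumulated result holds j * (total_rows / k) rows, and each rbind call
# writes a new copy of that accumulated result.
rows_copied_loop <- function(total_rows, k) {
  piece <- total_rows / k
  sum(piece * seq_len(k))
}

total_rows <- 120000
# Rows written by the loop, relative to the size of the final dataframe
ratio <- sapply(c(6, 12, 24),
                function(k) rows_copied_loop(total_rows, k) / total_rows)
ratio
# 3.5 6.5 12.5
```

Under this simple model the loop writes (k + 1) / 2 times the final number of rows for k equally sized pieces, while a single combined call only has to write them once.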
The code:
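library(plyr)
# One row of timings (in seconds) is appended per dataframe size
time.df <- data.frame()
for (i in c(20000, 50000, 100000, 200000, 300000, 400000, 500000, 600000)) {
  cat(i, "\n")
  # Six factor levels with i rows each, i.e. i*6 rows in total
  df <- data.frame(a = rep(c("A", "B", "C", "D", "E", "F"), i),
                   b = sample(rnorm(i*6), i*6),
                   c = sample(rnorm(i*6), i*6),
                   d = sample(rnorm(i*6), i*6))
  split.df <- split(df, df$a)
  # Competitor 1: classic rbind in a for loop
  t1 <- Sys.time()
  df1 <- data.frame()
  for (subdf in split.df) {
    df1 <- rbind(df1, subdf)
  }
  t2 <- Sys.time()
  # Competitor 2: do.call
  t3 <- Sys.time()
  df2 <- do.call("rbind", split.df)
  t4 <- Sys.time()
  # Competitor 3: rbind.fill from plyr
  t5 <- Sys.time()
  df3 <- rbind.fill(split.df)
  t6 <- Sys.time()
  # Fix the unit to seconds; difftime would otherwise pick it per call
  new.row <- data.frame(n = i*6,
                        classic = as.numeric(difftime(t2, t1, units = "secs")),
                        docall = as.numeric(difftime(t4, t3, units = "secs")),
                        rbindfill = as.numeric(difftime(t6, t5, units = "secs")))
  time.df <- rbind(time.df, new.row)
}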
Adapt the creation procedure of df for the different number of sub-dataframes…
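One way this adaptation could look, as a sketch only: the helper `make_df` and its arguments `total_rows` and `n_levels` are hypothetical names, not part of the original code. It assumes the level labels can simply be taken from `LETTERS` and that the total number of rows divides evenly by the number of levels.

```r
# Hypothetical helper (not from the original post): build a test dataframe
# with `n_levels` groups while keeping the total number of rows fixed.
make_df <- function(total_rows, n_levels) {
  stopifnot(n_levels <= length(LETTERS), total_rows %% n_levels == 0)
  data.frame(a = rep(LETTERS[seq_len(n_levels)], total_rows / n_levels),
             b = rnorm(total_rows),
             c = rnorm(total_rows),
             d = rnorm(total_rows))
}

df6  <- make_df(120000, 6)
df12 <- make_df(120000, 12)
df24 <- make_df(120000, 24)

# Same number of rows, different number of list elements after splitting
sapply(list(df6, df12, df24), nrow)
sapply(list(df6, df12, df24), function(d) length(split(d, d$a)))
```

Swapping `make_df(i*6, 12)` or `make_df(i*6, 24)` in for the creation of df in the loop would then keep n comparable across the three settings.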