Efficiency in Joining Two Data Frames
In R, there are multiple ways to merge two data frames. However, there can be a huge disparity in efficiency among them, so it is worthwhile to benchmark the alternatives and choose the right approach for real-world work.
For a small data frame with 1,000 rows, all six methods shown below work reasonably well, except that the approach with sql.select() is significantly slower than the rest. The generic merge() function in the base package is a very natural choice, with no overhead of loading additional libraries or converting data frames. sqldf() is also attractive in that it might be the most user-friendly option, with a very intuitive SQL syntax.
> n <- 1000
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
>
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
   user  system elapsed
  0.032   0.012   0.064
>
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
   user  system elapsed
  0.028   0.000   0.044
>
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
   user  system elapsed
  0.044   0.004   0.096
>
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
+   "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
   user  system elapsed
  0.168   0.008   0.332
>
> # METHOD 5: PLYR
> system.time(join5 <- plyr::join(ldf, rdf, by = c("id1", "id2"), type = "inner"))
   user  system elapsed
  0.088   0.020   0.152
>
> # METHOD 6: SQL.SELECT
> source("http://sqlselect.googlecode.com/svn/trunk/sql.select.R")
> system.time(join6 <- sql.select("select * from ldf inner join rdf on (`ldf$id1 == rdf$id1 & ldf$id2 == rdf$id2`)"))
   user  system elapsed
 53.775  19.725  73.813
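The timings above only matter if the methods actually agree on the result. As a quick sanity check (a sketch, regenerating the same data as above and comparing two of the methods on their key columns):

```r
# Sanity check (a sketch): verify that merge() and plyr::join() return the
# same matched rows on the 1,000-row data generated as in the post.
n <- 1000
set.seed(2013)
ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE),
                  x1 = rnorm(n), x2 = runif(n))
rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE),
                  y1 = rnorm(n), y2 = runif(n))
j1 <- merge(ldf, rdf, by = c("id1", "id2"))
j5 <- plyr::join(ldf, rdf, by = c("id1", "id2"), type = "inner")
# Sort both results on the join keys first, since row order may differ.
j1 <- j1[order(j1$id1, j1$id2), ]
j5 <- j5[order(j5$id1, j5$id2), ]
stopifnot(nrow(j1) == nrow(j5),
          all(j1$id1 == j5$id1),
          all(j1$id2 == j5$id2))
```

The same pattern extends to the other four results; sorting on the keys before comparison is necessary because each method returns rows in its own order.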
However, when it comes to mid-size data frames with 1,000,000 rows, the story changes. First of all, out of the six methods shown above, the last two fail outright due to insufficient memory in my 32-bit Ubuntu virtual machine. In this case, the data.table package shows a significant advantage once the two data frames are converted to data.tables. In addition, it is interesting that although the ff and sqldf packages are slower than the merge() function for the small data set with 1,000 rows, both of them are slightly faster for the data with 1,000,000 rows.
> n <- 1000 ^ 2
> set.seed(2013)
> ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
> rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))
>
> # METHOD 1: MERGE
> system.time(join1 <- merge(ldf, rdf, by = c("id1", "id2")))
   user  system elapsed
 55.223  12.437  68.054
>
> # METHOD 2: DATA.TABLE
> ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
> rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
> system.time(join2 <- merge(ldt, rdt, by = c("id1", "id2")))
   user  system elapsed
  0.484   0.008   0.492
>
> # METHOD 3: FF
> lff <- ff::as.ffdf(ldf)
> rff <- ff::as.ffdf(rdf)
> system.time(join3 <- merge(lff, rff, by = c("id1", "id2")))
   user  system elapsed
 49.811  13.821  64.004
>
> # METHOD 4: SQLDF
> system.time(join4 <- sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
+   "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2")))
   user  system elapsed
 40.418   1.268  42.076
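As a side note, the timings above call merge() on the keyed data.tables, but data.table also offers a native bracket-join syntax that exploits the keys directly. A minimal sketch (the column layout here is an assumption for illustration, not the benchmark data above):

```r
# Sketch of data.table's native keyed join: with keys set on both tables,
# X[Y] joins Y's rows to X on the key columns.
library(data.table)
n <- 1000
set.seed(2013)
ldt <- data.table(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE),
                  x1 = rnorm(n), key = c("id1", "id2"))
rdt <- data.table(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE),
                  y1 = rnorm(n), key = c("id1", "id2"))
# Inner join on (id1, id2); nomatch = 0 drops the unmatched rows
# that X[Y] would otherwise fill with NA.
join_dt <- ldt[rdt, nomatch = 0]
```

Because the tables are pre-sorted by their keys, this join is a binary search rather than a full scan, which is the main reason data.table pulls ahead at 1,000,000 rows.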