[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers].
In yesterday's post comparing the efficiency of joining two data frames, I overlooked the computing cost of converting data.frames to data.table / ff objects. Today I reran the test, this time counting library loading and data conversion as part of each method. Over 10 replications with the rbenchmark package, the data.table join is roughly 6 to 9 times faster than the other three methods in terms of user time. Although the ff package is claimed to handle large data, its efficiency seems questionable here: it was the slowest of the four.
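Before the full benchmark, here is a minimal sketch of how the conversion cost could be timed separately from the join itself. It is not part of the original test: the reduced `n`, the names `small_l`, `small_r`, `t_convert` and `t_join` are all illustrative assumptions, and the block only runs the timing if data.table happens to be installed.

```r
# Illustrative sketch only: a much smaller n than the real benchmark,
# so that it runs in a moment. All object names here are hypothetical.
n_small <- 10000
set.seed(2013)
small_l <- data.frame(id1 = sample(n_small, n_small),
                      id2 = sample(n_small / 100, n_small, replace = TRUE),
                      x1 = rnorm(n_small))
small_r <- data.frame(id1 = sample(n_small, n_small),
                      id2 = sample(n_small / 100, n_small, replace = TRUE),
                      y1 = rnorm(n_small))

if (requireNamespace("data.table", quietly = TRUE)) {
  # Cost of converting the data.frames to keyed data.tables
  t_convert <- system.time({
    small_lt <- data.table::data.table(small_l, key = c("id1", "id2"))
    small_rt <- data.table::data.table(small_r, key = c("id1", "id2"))
  })
  # Cost of the join alone, on objects that are already converted
  t_join <- system.time(
    small_res <- merge(small_lt, small_rt, by = c("id1", "id2"))
  )
  timings <- rbind(convert = unclass(t_convert), join = unclass(t_join))
  print(timings[, c("user.self", "sys.self", "elapsed")])
}
```

On data this small the timings will be close to zero and noisy, so the split between conversion and join should only be read off a run at realistic size.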
    n <- 1000000
    set.seed(2013)
    ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n))
    rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 100, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n))

    library(rbenchmark)
    benchmark(replications = 10, order = "user.self",
      # GENERIC MERGE() IN BASE PACKAGE
      merge = merge(ldf, rdf, by = c("id1", "id2")),
      # DATA.TABLE PACKAGE
      datatable = {
        ldt <- data.table::data.table(ldf, key = c("id1", "id2"))
        rdt <- data.table::data.table(rdf, key = c("id1", "id2"))
        merge(ldt, rdt, by = c("id1", "id2"))
      },
      # FF PACKAGE
      ff = {
        lff <- ff::as.ffdf(ldf)
        rff <- ff::as.ffdf(rdf)
        merge(lff, rff, by = c("id1", "id2"))
      },
      # SQLDF PACKAGE
      sqldf = sqldf::sqldf(c("create index ldx on ldf(id1, id2)",
                             "select * from main.ldf inner join rdf on ldf.id1 = rdf.id1 and ldf.id2 = rdf.id2"))
    )
    #        test replications elapsed relative user.self sys.self user.child
    # 2 datatable           10  17.923    1.000    16.605    1.432          0
    # 4     sqldf           10 105.002    5.859   102.294    3.345          0
    # 1     merge           10 131.279    7.325   119.139   13.049          0
    # 3        ff           10 187.150   10.442   154.670   33.758          0
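As a quick check on the speed-up, the user-time ratios can be worked out from the `user.self` column of the output above. This is just arithmetic on the reported numbers; the vector name `user_self` is an illustrative choice.

```r
# User times in seconds (10 replications), copied from the benchmark output above
user_self <- c(datatable = 16.605, sqldf = 102.294, merge = 119.139, ff = 154.670)

# Each method's user time relative to the data.table method
ratio <- round(user_self / user_self[["datatable"]], 2)
print(ratio)
```

By this measure sqldf, base merge() and ff take about 6.2, 7.2 and 9.3 times as much user time as data.table, respectively.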