Improve The Efficiency in Joining Data with Index
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When managing big data with R, many people like to use sqldf() package due to its friendly interface or choose data.table() package for its lightening speed. However, very few would pay special attentions to small details that might significantly boost the efficiency of these packages by adding index to the data.frame or data.table.
In my post on 01/29/2013 (http://statcompute.wordpress.com/2013/01/29/another-benchmark-for-joining-two-data-frames), I’ve shown how to effectively join two data.frames / data.tables. However, the example is not intuitive for people to fully understand the benefit of adding index. In the demonstration below, I will compare 2 scenarios, one with the index and the other without, to show the extra efficiency gained by a simple index.
It is also important to note that creating the index in “ldf” would have the effect of adding the data.frame “ldf” from the R workspace to SQLite database. Therefore, in the 2nd “select…” statement, we need to add “main.” in front of “ldf” in order to use the indexed table “ldf” in SQLite instead of the unindexed table “ldf” in the R environment.
As shown in the benchmark table, simply adding an index can significantly reduce the user time with sqldf package and improves somewhat with data.table package.
libs <- c('sqldf', 'data.table', 'rbenchmark') lapply(libs, require, character.only = T) n <- 1000000 set.seed(1) ldf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), x1 = rnorm(n), x2 = runif(n)) rdf <- data.frame(id1 = sample(n, n), id2 = sample(n / 1000, n, replace = TRUE), y1 = rnorm(n), y2 = runif(n)) benchmark(replications = 5, order = "user.self", noindex.sqldf = (sqldf('select * from ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2')), indexed.sqldf = (sqldf(c('create index ldx on ldf(id1, id2)', 'select * from main.ldf as l inner join rdf as r on l.id1 = r.id1 and l.id2 = r.id2'))) ) benchmark(replications = 5, order = "user.self", noindex.table = { ldt <- data.table(ldf) rdt <- data.table(rdf) merge(ldt, rdt, by = c('id1', 'id2')) }, indexed.table = { ldt <- data.table(ldf, key = 'id1,id2') rdt <- data.table(rdf, key = 'id1,id2') merge(ldt, rdt, by = c('id1', 'id2')) } )
SQLDF OUTCOMES
test replications elapsed relative user.self sys.self user.child 2 indexed.sqldf 5 34.774 1.000 34.511 0.244 0 1 noindex.sqldf 5 61.873 1.779 44.918 16.941 0
DATA.TABLE OUTCOMES
test replications elapsed relative user.self sys.self user.child 2 indexed.table 5 6.719 1.000 6.609 0.104 0 1 noindex.table 5 6.777 1.009 6.696 0.076 0
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.