Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Hash Table Performance in R: Part II
In Part I of this series, I explained how R hashed environments are superior to vectors or lists when you need a hash table for your work. I also teased that in this post I would explain the caveats associated with that choice, but I’m saving that for later as I want to share with you the fastest ways of operating on them.
Operations on Hash Tables
There are three main operations you want to perform on hash tables. Given a key (as a string) and a value:
- INSERT will add the key/value pair to the hash table,
- LOOKUP will search the hash table for a key and return a value if it exists,
- and DELETE will delete a value based on the key.
There are other operations of some importance. For instance EXISTS is a special case of LOOKUP which returns true if the key exists and false otherwise. But the main operations are the performance heavyweights so we’ll restrict our discussion to those.
INSERT
So what’s the fastest way to insert a key into an R hashed environment? We could use the assign function which will assign a variable specified as a string along with a value into an environment, but we could also use an indexing construct available to other R objects. Turns out this is not really documented where one would expect. Anyone have a reference?
Using assign to insert
# Our hash table ht is a hashed environment ht <- new.env(); key <- "foo"; value <- 1 assign(key, value, ht, inherits=FALSE) # only look in this environment, not in the parent ls.str(ht) ## foo : num 1
Using the [[ construct
ht <- new.env(); key <- "foo"; value <- 1 ht[[key]] <- value ls.str(ht) ## foo : num 1
$ Doesn’t work as expected
ht <- new.env(); key <- "foo"; value <- 1 ht$key <- value # works, but not what you expect ls.str(ht) ## key : num 1
Uh oh! Our key is “foo” not “key”. R doesn’t expand the variable key to “foo” like we want, so this is not helpful.
We could do this:
ht <- new.env(); key <- "foo"; value <- 1 ht$"foo" <- value # works, but have you seen anything uglier? ls.str(ht) ## foo : num 1
but it’s more likeley that your key will already be assigned to a variable and not expressed as a constant directly in your code.
Total Fail on [
ht <- new.env(); key <- "foo"; value <- 1 ht[key] <- value # error ## Error in ht[key] <- value: object of type 'environment' is not subsettable
So we’ve got two ways to perform lookup, but which is faster?
Our test will compare the call to two similarly designed functions but with one using assign and the other using [[ to perform INSERT.
library(microbenchmark) # on CRAN, install with install.packages('microbenchmark') INSERT_assign <- function(){ ht <- new.env() assign("foo",1,ht,inherits=FALSE) } INSERT_subset <- function(){ ht <- new.env() ht[["foo"]] <- 1 } # Perform a high number of each, 10L^5. Might need more if you have a fast machine microbenchmark(INSERT_assign(), INSERT_subset(), times=10L^5) ## Unit: microseconds ## expr min lq mean median uq max neval ## INSERT_assign() 1.435 1.718 2.204589 1.899 2.138 3073.382 1e+05 ## INSERT_subset() 1.105 1.313 1.670967 1.470 1.663 1331.484 1e+05
Clear winner with [[!
LOOKUP
For constructing this operation, let’s compare the R function get which works with environments to the [[ construct. $ and [ obviously won’t work. We’ll set up our test just like the one in INSERT where the functions perform equivalent work except for the expression we want to test.
# Each returns the value of the key "foo" LOOKUP_get <- function(){ ht <- new.env() ht[["foo"]] <- 1 get("foo",envir=ht,inherits=FALSE) } LOOKUP_subset <- function(){ ht <- new.env() ht[["foo"]] <- 1 ht[["foo"]] } microbenchmark(LOOKUP_get(), LOOKUP_subset(), times=10L^5) ## Unit: microseconds ## expr min lq mean median uq max neval ## LOOKUP_get() 1.835 2.196 2.758611 2.417 2.661 2914.613 1e+05 ## LOOKUP_subset() 1.260 1.518 1.876052 1.687 1.863 1319.268 1e+05
And again, [[ is faster!
DELETE
So how does one delete a key/value pair from R hashed environments? Turns out there’s only one way: use rm as assigning NULL won’t work (like it would with lists):
ht <- new.env(); ht[["foo"]] <- 1 ht[["foo"]] <- NULL ls.str(ht) ## foo : NULL rm(list="foo",envir=ht,inherits=FALSE) ls.str(ht) # Produces nothing
In Conclusion
If you are going to use R hashed environments as a hash table, then it would behoove you to construct the main operations INSERT, LOOKUP, and DELETE with [[ rather than assign and get like so:
# Simple Hash Table Interface for R HASH_TABLE <- function() new.env() INSERT <- function(key, value, ht) ht[[key]] <- value LOOKUP <- function(key, ht) ht[[key]] DELETE <- function(key, ht) rm(list=key,envir=ht,inherits=FALSE)
But! I’ll leave you with one last performance comparison:
ht <- HASH_TABLE() INSERT("key",1L,ht) ls.str(ht) ## key : int 1 microbenchmark( LOOKUP("foo",ht)==1L, ht[["foo"]]==1L, times=10L^5 ) ## Unit: nanoseconds ## expr min lq mean median uq max neval ## LOOKUP("foo", ht) == 1L 410 457 633.9799 551 665 1036167 1e+05 ## ht[["foo"]] == 1L 172 198 258.6643 215 288 34987 1e+05
So is an interface really necessary?
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.