Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Hash Table Performance in R: Part III
In Part I of this series, I explained how R hashed environments are superior to vectors or lists when you’re in need of an associative array. In Part II I explained the three main operations you want to perform on hash tables and how you should implement them for optimal performance.
This time around I want to introduce a package I’ve created that “peaks behind the curtain” of R environments, which I’ll use later to evaluate more performance characteristics.
Under Envestigation
The envestigate package exposes the hash table and its statistics that back the R environment. I consider this an experimental package so I don’t intend to push it to CRAN any time soon. However it properly uses the R C API and was developed the proper way, so it should work on any late model R version across all platforms.
library(devtools) # Make sure you've already installed this install_github('jeffreyhorner/envestigate') library(envestigate)
Hash Table Implementation
Hash tables are commonly implemented as an array of buckets with each bucket containing a linked list. They use a hash function to compute an index into the array of buckets so that the correct value can be found. A perfect hash function ‘hashes’ each key to a unique bucket to store the value, but they are rarely perfect. It is stated that when two keys hash to the same bucket a collision occures, and collisions are resolved by traversing the bucket’s linked list until the correct value is found.
In R, environment hash tables are a vector of lists where each element of the list contains a binding from symbol to value. The envestigate::hash_table function transforms that structure into a list of character vectors where each element contains only the symbol name as a string and ignores the value, thus exposing the internal structure so we can explore:
# e - an environment who's hash table has 5 buckets e <- new.env(size=5L) e[['Tukey']] <- 1 # Values aren't important e[['Fisher']] <- 1 e[['Bayes']] <- 1 e[['Pearson']] <- 1 ht <- envestigate::hash_table(e) # List the bucket contents ht$buckets ## [[1]] ## [1] "Bayes" "Tukey" ## ## [[2]] ## NULL ## ## [[3]] ## NULL ## ## [[4]] ## [1] "Pearson" ## ## [[5]] ## [1] "Fisher"
So we see that the internal hash table of 5 buckets has two empty buckets, bucket 2 and 3, and 3 non-empty buckets with bucket 1 containing two elements.
hash_table function
Let’s explore the ‘hash_table’ object a little further. The envestigate package provides a simple print function for hash table objects:
ht ## $size ## [1] 5 ## ## $num_entries ## [1] 4 ## ## $num_collisions ## [1] 1 ## ## $num_empty_buckets ## [1] 2 ## ## $num_non_empty_buckets ## [1] 3 ## ## $load_factor ## [1] 0.6 ## ## $max_bucket_length ## [1] 2
So
- size – Number of buckets in the hash table, e.g. the size of the hash table.
- num_entries – Total number of entries (or elements or variables) in the hash table.
- num_collisions – Total number of collisions, counted as the sum of 1 – length of each list.
- num_empty_buckets – Total number of buckets that contain zero elements.
- num_non_empty_buckets – Total number of buckets that contain at least one element.
- load_factor – Ratio of non empty buckets to size.
- max_bucket_length – Length of bucket with the most elements.
And there’s more. The following items are also part of the object but are typically too large for a simple print summary. They are:
- buckets – List of the buckets and their associated lists, which we saw above.
- empty_buckets – Vector of indexes for the empty buckets.
- non_empty_buckets – Vector of indexes for non empty buckets.
- bucket_counts – Named intenger vector where names are indexes of and elements are the counts of each bucket.
- bucket_size_dist – Data frame histogram of the number of buckets for each unique bucket size.
A note on the load factor. It is typically defined in the literature as the total number of elements (or entries) in the hash table divided by the size of the hash table. However, computing that value requires a complete scan of the table and this can have performance consequenses for instance when the table needs to be resized. So just using a ratio of non empty buckets to size is a cheaper to compute approximation. This is how R determines when to resize a hash table.
Computing Hash Values
envestigate also exposes the hash function that R uses to compute the hash value for strings, values that are used to map the strings to their buckets. Defining hash functions for computing hash values, a well researched topic, is beyond the scope of this note so I’ll just point to R’s implementation here, authored by Peter J. Weinberger.
The function envestigate::hash_value maps character vectors to integer values:
envestigate::hash_value(c('Tukey','Fisher','Bayes','Pearson')) ## Tukey Fisher Bayes Pearson ## 6013385 80780994 4755395 112761358
So those hash values are pretty meaningless at this point as you can tell non of them are in the 1 to 5 range, the size of our hash table. But they are a consistent, reproducible, one to one mapping from unique string to unique value, and we can use that result to compute the bucket index using the %% or modulo operator:
(hash_value(c('Tukey','Fisher','Bayes','Pearson')) %% ht$size) + 1 ## Tukey Fisher Bayes Pearson ## 1 5 1 4
or use the convenient envestigate::hash_index function:
hash_index(c('Tukey','Fisher','Bayes','Pearson'),ht) ## Tukey Fisher Bayes Pearson ## 1 5 1 4
Now we can compute on R’s internal environment hash tables just like R. For instance looking up a variable in an environment is nearly equivalent to this operation for looking up the bucket for the string ‘Tukey’:
index <- hash_index('Tukey',ht) bucket <- ht$buckets[[ index ]] bucket ## [1] "Bayes" "Tukey" for (i in 1:length(bucket)){ if ('Tukey' == bucket[i]){ cat('Found Tukey in bucket',index,'and list element',i,'!n') } } ## Found Tukey in bucket 1 and list element 2 !
So envestigate away, and share what you come up with!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.