Hash Table Performance in R: Part IV
In the last post I introduced the envestigate package, which exposes the hash table structure and interesting statistics associated with an R environment. Now I want to show you some performance characteristics of the R environment as a hash table.
I’ll be using a synthetic list of distinct words that mimics real-world data scraped from Wikipedia, which you can scope out here.
Hash Table Construction Time
As I alluded to in Part III, while R allows you to set the initial size of an environment with new.env(size=X), it automatically resizes the internal hash table once the load factor reaches a certain threshold (0.85, to be exact). Let’s see how that affects hash table construction time.
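As a rough illustration of that threshold, here is when the first resize of a default-sized environment would trigger. This is a sketch based only on the 0.85 load factor and the default hash size of 29 mentioned in this post; the actual resize logic lives in R's C sources and may differ in detail.

```r
# Sketch: when the first resize of a default environment would trigger,
# assuming resizes are driven only by the 0.85 load-factor threshold
initial_size <- 29        # default internal hash size of new.env()
threshold <- 0.85
# load factor = entries / size, so the resize fires at the smallest
# number of entries n with n / initial_size >= threshold:
first_resize_at <- ceiling(initial_size * threshold)
first_resize_at
# [1] 25
```

So under this assumption, a default environment would already be resized by the 25th insert, which is why the construction run below starts changing size almost immediately.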
The code below creates a hash table and inserts each word from the DISTINCT.1mil set. We accumulate the time it takes to insert words, 1000 at a time, and also record the size of the hash table after every 1000 inserts. This reveals an interesting spike that occurs every time the hash table increases in size. Other spikes happen as well, and I suspect those are due to garbage collection.
library(envestigate)
library(readr); library(dplyr)
library(ggplot2); library(scales)

# Character vector of distinct words
DISTINCT <- read_lines('DISTINCT.1mil')
len_DISTINCT <- length(DISTINCT)

# Accumulate data each ith_insert
ith_insert <- 10^3

# Number of observations
n_observations <- len_DISTINCT / ith_insert

# Collect these items each ith_insert
insert <- numeric(n_observations)
size <- numeric(n_observations)
construct_time <- numeric(n_observations)

# New environment with default hash size of 29L
e <- new.env()

i <- j <- 1L
elapsed_time <- 0.0

while (i <= len_DISTINCT) {
  t1 <- proc.time()['elapsed']
  # Insert word into environment
  e[[DISTINCT[i]]] <- 1
  t2 <- proc.time()['elapsed']

  # t2 - t1 = time to insert one word
  # elapsed_time = accumulated time of inserts
  elapsed_time <- elapsed_time + (t2 - t1)

  # ith_insert occurs here, collect data
  if (i %% ith_insert == 0) {
    ht <- hash_table(e)  # gather size of current hash table
    insert[j] <- i
    size[j] <- ht$size
    construct_time[j] <- elapsed_time
    elapsed_time <- 0.0
    j <- j + 1L
  }
  i <- i + 1L
}

# Our data frame of results
res <- data_frame(insert, size, construct_time)
res

## Source: local data frame [1,000 x 3]
##
##    insert size construct_time
## 1    1000  849          0.006
## 2    2000 1758          0.009
## 3    3000 2530          0.004
## 4    4000 3643          0.005
## 5    5000 4371          0.003
## 6    6000 5245          0.005
## 7    7000 6294          0.007
## 8    8000 7552          0.011
## 9    9000 7552          0.003
## 10  10000 7552          0.003
## ..    ...  ...            ...

# Use a bimodal color scheme to denote when the hash size changes
num_uniq_sizes <- length(unique(res$size))
cols <- rep(hue_pal()(2), ceiling(num_uniq_sizes / 2))

p <- ggplot(res, aes(x = insert, y = construct_time, color = factor(size))) +
  geom_line(size = 2) +
  scale_color_manual(values = cols) +
  guides(col = guide_legend(ncol = 2, title = "Hash Size")) +
  labs(x = "Number of Elements in Environment",
       y = "Construct Time (seconds)") +
  theme(axis.text = element_text(colour = "black")) +
  ggtitle('Environment Construct Time')
p
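If you want to pin down exactly where the resizes landed, you can look for rows of res where the size column changed from the previous row. Here is a self-contained sketch using a miniature of the first ten rows printed above (column names as in this post):

```r
# Miniature of the first ten rows of `res` shown above
res10 <- data.frame(
  insert = seq(1000, 10000, by = 1000),
  size   = c(849, 1758, 2530, 3643, 4371, 5245, 6294, 7552, 7552, 7552)
)
# Rows where the hash size differs from the previous row, i.e. a
# resize happened somewhere within that batch of 1000 inserts
changed <- c(TRUE, diff(res10$size) != 0)
res10$insert[changed]
# [1] 1000 2000 3000 4000 5000 6000 7000 8000
```

On the full res data frame the same two lines give every resize boundary, which is what the color changes in the plot visualize.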
Some interesting takeaways:
- There’s a clear spike in construct time every time the line color changes, signifying a change in the hash size. We presume the hash resize occurs during the spike. The spike cost also appears to grow roughly linearly as the table gets larger.
- Other spikes appear even when the hash size doesn’t change. We presume that’s due to the garbage collector either getting rid of old R objects or allocating more heap space.
- The length of the colored lines along the x axis grows as more inserts occur. The growth might be linear, but I haven’t looked at the R source code to confirm.
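A practical follow-up, given that new.env(size=X) lets you set the initial size: if you know the number of elements up front, pre-sizing the environment past the 0.85 threshold should avoid these resize spikes entirely. The sizing rule below is my assumption based on the threshold discussed above, not something I have verified in the R sources:

```r
# Sketch: pre-size the environment so the 0.85 load factor is never
# reached (assumes resizes are driven only by that threshold)
words <- paste0("w", 1:1000)                  # stand-in for DISTINCT
e <- new.env(size = ceiling(length(words) / 0.85))
for (w in words) e[[w]] <- 1
length(ls(e))
# [1] 1000
```

Rerunning the timing loop with an environment pre-sized this way would be one way to test how much of the construction time the resizes account for.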