Site icon R-bloggers

Hash Table Performance in R: Part IV

[This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the last post I introduced the package envestigate that provides the hash table structure and interesting statistics associated with an R environment. Now I want to show you some performance characteristics of the R environment as a hash table.

I’ll be using a synthetic list of distinct words that mimics real world data scraped from Wikipedia which you can scope out here.

Hash Table Construction Time

As I alluded in Part III, while R allows you to set the size of an environment with new.env(size=X) it will automatically resize the internal hash table once the load factor reaches a certain threshold (.85 to be exact). Let’s see how that affects hash table construction time.

The code below creates a hash table and inserts each word from the DISTINCT.1mil set. We accumulate the time it takes to insert words 1000 at a time. We also note the size of the hash table after every 1000 inserts. This will reveal an interesting spike that occurs every time the hash table increases in size. There are other spikes that happen and I suspect those are due to garbage collection.

library(envestigate)
library(readr); library(dplyr)
library(ggplot2); library(scales)

# Character vector of distinct words
DISTINCT <- read_lines('DISTINCT.1mil')
len_DISTINCT <- length(DISTINCT)

# Accumulate data each ith_insert
ith_insert <- 10^3

# Number of observations
n_observations <- len_DISTINCT/ith_insert

# Collect these items each ith_insert
insert      <- numeric(n_observations)
size           <- numeric(n_observations)
construct_time <- numeric(n_observations)

# New environment with default hash size of 29L
e <- new.env()

i <- j <- 1L
elapsed_time <- 0.0
while(i <= len_DISTINCT){
  t1 <- proc.time()['elapsed']
  # Insert word into environment
  e[[DISTINCT[i]]] <- 1
  t2 <- proc.time()['elapsed']

  # t2 - t1 = time to insert one word
  # elapsed_time = accumulated time of inserts
  elapsed_time <- elapsed_time + (t2 - t1)

  # ith_insert occures here, collect data
  if (i %% ith_insert == 0){
    ht <- hash_table(e) # gather size of current hash table
    insert[j]              <- i
    size[j]                   <- ht$size
    construct_time[j]         <- elapsed_time

    elapsed_time <- 0.0
    j <- j + 1L
  }
  i <- i + 1L
}
# Our data frame of results
res <- data_frame(insert, size, construct_time)
res

## Source: local data frame [1,000 x 3]
## 
##    insert size construct_time
## 1    1000  849          0.006
## 2    2000 1758          0.009
## 3    3000 2530          0.004
## 4    4000 3643          0.005
## 5    5000 4371          0.003
## 6    6000 5245          0.005
## 7    7000 6294          0.007
## 8    8000 7552          0.011
## 9    9000 7552          0.003
## 10  10000 7552          0.003
## ..    ...  ...            ...

# Use a bimodal color scheme to denote when a hash size changes
num_uniq_sizes <- length(unique(res$size))
cols <- rep(hue_pal()(2),ceiling(num_uniq_sizes/2))

p <- ggplot(res, aes(x=insert,y=construct_time,color=factor(size))) +
  geom_line(size=2) + scale_color_manual(values=cols) +
  guides(col=guide_legend(ncol=2,title="Hash Size")) +
  labs(x="Number of Elements in Environment", y="Construct Time (seconds)") +
  theme( axis.text = element_text(colour = "black")) +
    ggtitle('Environment Construct Time')
p

Some interesting takeaways:

To leave a comment for the author, please follow the link and comment on their blog: Jeffrey Horner.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.