Site icon R-bloggers

Timing hash functions with the bench package

[This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This blog post has two goals

What is digest?

The digest package provides a hash function to summarise R objects. Standard hashes are available, such as md5, crc32, sha-1, and sha-256.

The key function in the package is digest() that applies a cryptographical hash function to arbitrary R objects. By default, the objects are internally serialized using md5. For example,

library("digest")
digest(1234)

## [1] "37c3db57937cc950b924e1dccb76f051"

digest(1234, algo = "sha256")

## [1] "01b3680722a3f3a3094c9845956b6b8eba07f0b938e6a0238ed62b8b4065b538"

The digest package is fairly popular and has a large number of reverse dependencies

length(tools::package_dependencies("digest", reverse = TRUE)$digest)

## [1] 186

The number of available hashing algorithms has grown over the years, and as a little side project, we decided to test the speed of the various algorithms. To be clear, I’m not considering any security aspects or the potential of hash clashes, just pure speed.

Timing in R

There are numerous ways of timing R functions. A recent addition to this list is the bench package. The main function bench::mark() has a number of useful features over other timing functions.

To time and compare two functions, we load the relevant packages

library("bench")
library("digest")
library("tidyverse")

then we call the mark() function and compare the md5 with the sha1 hash

value = 1234
mark(check = FALSE, 
     md5 = digest(value, algo = "md5"), 
     sha1 = digest(value, algo = "sha256")) %>%
  select(expression,  median)

## # A tibble: 2 x 2
##   expression   median
##   <bch:expr> <bch:tm>
## 1 md5          50.2µs
## 2 sha1         50.9µs

The resulting tibble object, contains all the timing information. For simplicity, we’ve just selected the expression and median time.

More advanced bench

Of course, it’s more likely that you’ll want to compare more than two things. You can compare as many function calls as you want with mark(), as we’ll demonstrate in the following example. It’s probably more likely that you’ll want to compare these function calls against more than one value. For example, in the digest package there are eight different algorithms. Ranging from the standard md5 to the newer xxhash64 methods. To compare times, we’ll generate n = 20 random character strings of length N = 10,000. This can all be wrapped up in the single function press() function call from the bench package:

N = 1e4
results = suppressMessages(
  bench::press(
    value = replicate(n = 20, paste0(sample(LETTERS, N, replace = TRUE), collapse = "")),
    {
      bench::mark(
        iterations = 1, check = FALSE,
        md5      = digest(value, algo = "md5"),
        sha1     = digest(value, algo = "sha1"),
        crc32    = digest(value, algo = "crc32"),
        sha256   = digest(value, algo = "sha256"),
        sha512   = digest(value, algo = "sha512"),
        xxhash32 = digest(value, algo = "xxhash32"),
        xxhash64 = digest(value, algo = "xxhash64"),
        murmur32 = digest(value, algo = "murmur32")
      )
    }
  )
)

The tibble results contain timing results. But it’s easier to work with relative timings. So we’ll rescale

rel_timings = results %>%
  unnest() %>%
  select(expression, median) %>%
  mutate(expression = names(expression)) %>%
  distinct() %>%
  mutate(median_rel = unclass(median/min(median)))

Then plot the results, ordered by slowed to largest

ggplot(rel_timings) + 
  geom_boxplot(aes(x = fct_reorder(expression, median_rel), 
                   y = median_rel)) + 
  theme_minimal() + 
  ylab("Relative timings") + xlab(NULL) + 
  ggtitle("N = 10,000") + coord_flip() 

The sha256 algorithm is about three times slower than the xxhash32 method. However, it’s worth bearing in mind that although it’s relatively slower, the absolute times are very small

rel_timings %>%
  group_by(expression) %>%
  summarise(median = median(median)) %>%
  arrange(desc(median))

## # A tibble: 8 x 2
##   expression   median
##   <chr>      <bch:tm>
## 1 sha256      171.2µs
## 2 sha1        112.6µs
## 3 md5         109.5µs
## 4 sha512      108.4µs
## 5 crc32          91µs
## 6 xxhash64     85.1µs
## 7 murmur32     82.1µs
## 8 xxhash32     77.6µs

It’s also worth seeing how the results vary according to the size of the character string N.

Regardless of the value of N, the sha256 algorithm is consistently in the slowest.

Conclusion

R is going the way of “tidy” data. Though it wasn’t the focus of this blog post, I think that the bench package is as good as other timing packages out there. Not only that, but it fits in with the whole “tidy” data thing. Two birds, one stone.


The post Timing hash functions with the bench package appeared first on Jumping Rivers.

To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.