Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This blog post has two goals
- Investigate the bench package for timing R functions
- Consequently explore the different algorithms in the digest package using bench
What is digest?
The digest package provides a hash function to summarise R objects. Standard hashes are available, such as md5, crc32, sha-1, and sha-256.
The key function in the package is digest()
that applies a cryptographical hash function to arbitrary R objects. By default, the objects are internally serialized using md5
. For example,
library("digest") digest(1234) ## [1] "37c3db57937cc950b924e1dccb76f051" digest(1234, algo = "sha256") ## [1] "01b3680722a3f3a3094c9845956b6b8eba07f0b938e6a0238ed62b8b4065b538"
The digest package is fairly popular and has a large number of reverse dependencies
length(tools::package_dependencies("digest", reverse = TRUE)$digest) ## [1] 186
The number of available hashing algorithms has grown over the years, and as a little side project, we decided to test the speed of the various algorithms. To be clear, I’m not considering any security aspects or the potential of hash clashes, just pure speed.
Timing in R
There are numerous ways of timing R functions. A recent addition to this list is the bench package. The main function bench::mark() has a number of useful features over other timing functions.
To time and compare two functions, we load the relevant packages
library("bench") library("digest") library("tidyverse")
then we call the mark()
function and compare the md5
with the sha1
hash
value = 1234 mark(check = FALSE, md5 = digest(value, algo = "md5"), sha1 = digest(value, algo = "sha256")) %>% select(expression, median) ## # A tibble: 2 x 2 ## expression median ## <bch:expr> <bch:tm> ## 1 md5 50.2µs ## 2 sha1 50.9µs
The resulting tibble
object, contains all the timing information. For simplicity, we’ve just selected the expression and median time.
More advanced bench
Of course, it’s more likely that you’ll want to compare more than two things. You can compare as many function calls as you want with mark()
, as we’ll demonstrate in the following example. It’s probably more likely that you’ll want to compare these function calls against more than one value. For example, in the digest package there are eight different algorithms. Ranging from the standard md5
to the newer xxhash64
methods. To compare times, we’ll generate n = 20
random character strings of length N = 10,000
. This can all be wrapped up in the single function press()
function call from the bench package:
N = 1e4 results = suppressMessages( bench::press( value = replicate(n = 20, paste0(sample(LETTERS, N, replace = TRUE), collapse = "")), { bench::mark( iterations = 1, check = FALSE, md5 = digest(value, algo = "md5"), sha1 = digest(value, algo = "sha1"), crc32 = digest(value, algo = "crc32"), sha256 = digest(value, algo = "sha256"), sha512 = digest(value, algo = "sha512"), xxhash32 = digest(value, algo = "xxhash32"), xxhash64 = digest(value, algo = "xxhash64"), murmur32 = digest(value, algo = "murmur32") ) } ) )
The tibble results
contain timing results. But it’s easier to work with relative timings. So we’ll rescale
rel_timings = results %>% unnest() %>% select(expression, median) %>% mutate(expression = names(expression)) %>% distinct() %>% mutate(median_rel = unclass(median/min(median)))
Then plot the results, ordered by slowed to largest
ggplot(rel_timings) + geom_boxplot(aes(x = fct_reorder(expression, median_rel), y = median_rel)) + theme_minimal() + ylab("Relative timings") + xlab(NULL) + ggtitle("N = 10,000") + coord_flip()
sha256
algorithm is about three times slower than the xxhash32
method. However, it’s worth bearing in mind that although it’s relatively slower, the absolute times are very small
rel_timings %>% group_by(expression) %>% summarise(median = median(median)) %>% arrange(desc(median)) ## # A tibble: 8 x 2 ## expression median ## <chr> <bch:tm> ## 1 sha256 171.2µs ## 2 sha1 112.6µs ## 3 md5 109.5µs ## 4 sha512 108.4µs ## 5 crc32 91µs ## 6 xxhash64 85.1µs ## 7 murmur32 82.1µs ## 8 xxhash32 77.6µs
It’s also worth seeing how the results vary according to the size of the character string N
.
N
, the sha256
algorithm is consistently in the slowest.
Conclusion
R is going the way of “tidy” data. Though it wasn’t the focus of this blog post, I think that the bench package is as good as other timing packages out there. Not only that, but it fits in with the whole “tidy” data thing. Two birds, one stone.
The post Timing hash functions with the bench package appeared first on Jumping Rivers.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.