hash-2.0.0

Christopher Brown

12 years ago

[This article was first published on Open Data Group » R, and kindly contributed to R-bloggers].


The hash-2.0.0 package has been uploaded to CRAN. This version was developed in conjunction with R-2.11.0 and was refactored for performance. hash-2.0.0 requires R-2.10.0 or later and will not be supported on earlier versions of R. This is a result of recent changes to the language itself.

Importantly, hash-2.0.0 breaks backward compatibility: code written with previous versions of the hash package is not guaranteed to work with this or future versions. This is due to changes made to achieve much higher performance. Assignments and look-ups are faster through direct inheritance from environments, stripping of non-essential customizations, and reliance on core and primitive functions.

Here is a summary of major changes:

- Coercion of keys to valid R names (i.e. non-blank character values) is now the responsibility of the user. The four accessor functions, [, [[, $ and values, no longer do this automatically. An error results if a proper R name is not provided.
- The default for missing keys has changed from NA to NULL. This matches the behavior of lists in R when accessing non-existent elements. (For a more complete discussion, see my previous blog post on the differences between NA and NULL.)
- Custom behavior for accessing non-existent keys has been removed. Access to non-existent keys will always yield NULL. Consistency is often better than customization.
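
The semantics in this summary mirror base R, so they can be illustrated without the package. The following is a minimal sketch using a plain list and a plain environment (the type hashes now inherit from, per this post). It does not call the hash package; the variable names are illustrative, and the package's own error messages may differ.

```r
# Base R only: lists and environments return NULL for missing names,
# while a named atomic vector returns NA.
lst <- list(a = 1)
env <- new.env()
assign("a", 1, envir = env)
vec <- c(a = 1)

missing_in_list <- lst[["b"]]  # NULL
missing_in_env  <- env[["b"]]  # NULL
missing_in_vec  <- vec["b"]    # NA

# Keys must be character values, so coerce explicitly before look-up.
key <- 1L
env[[as.character(key)]] <- "one"
value <- env[[as.character(key)]]

# Using the non-character key directly on an environment raises an error;
# tryCatch turns it into a marker value here.
bad_key <- tryCatch(env[[key]], error = function(e) "error")
```

The explicit `as.character()` call is the step that earlier versions of the package performed on the user's behalf.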

ChangeLog and TODO track many technical details; here I will discuss only the more important changes:

Performance

Included in this version is a demo script that runs benchmarks (demo(hash-benchmarks)). One of the questions that has been repeatedly posed, often in the context of look-up, is: how does this compare to native R named lists and vectors? In other words, how much quicker is accessing a value in a hash or environment than in a list (or vector)? This is a difficult question, and the answer generally depends on the size of the hash or list. My rule of thumb is that look-ups are quicker on lists and vectors with fewer than about 500 elements. Beyond ~500 elements, hashes and environments greatly outperform lists, and the gap widens as the object grows. However, look-ups on all of these objects are very fast when the objects are small ( >120,000 / sec ). So unless you are doing many serial look-ups, hashes are likely the better option.
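
For readers who want a rough feel for the comparison without the package, here is a minimal base R sketch that times name-based look-ups on a named list and on a hashed environment at two sizes. It is not the package's demo script: the function `time_lookups`, the sizes, and the repetition count are all assumptions chosen for illustration, and the timings will vary by machine and R version.

```r
# Rough look-up timing: named list vs. hashed environment.
time_lookups <- function(n, reps = 10000L) {
  keys <- paste0("k", seq_len(n))
  lst  <- setNames(as.list(seq_len(n)), keys)
  env  <- list2env(lst, envir = new.env(hash = TRUE))

  # Random keys to look up, sampled with replacement.
  probe <- sample(keys, reps, replace = TRUE)

  t_list <- system.time(for (k in probe) lst[[k]])[["elapsed"]]
  t_env  <- system.time(for (k in probe) env[[k]])[["elapsed"]]
  c(n = n, list = t_list, env = t_env)
}

set.seed(1)
results <- rbind(time_lookups(100L), time_lookups(10000L))
print(results)
```

Each row of `results` reports the elapsed seconds for one object size, so the list and environment columns can be compared directly.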

I have written previously about hashes in R [1] [2], and will continue to discuss the evolution of R hashes on this blog. Additionally I will be speaking on this and related work at useR!2010 (July 20-23.)
