Fast data lookups in R: dplyr vs data.table


R is a vector-oriented language and most operations in R are optimised for that, but what if you need something less typical? What if you need to find a specific element in a dataset? There are many ways to do that in R, but when your dataset has a few million rows or more, lookups can become extremely slow. This is especially problematic in Shiny apps, where users should get feedback on their actions within a fraction of a second. Otherwise, they may get really frustrated with your app!

In this post I’m going to compare different methods that can be used to improve lookup times in R. In our case, we were able to improve lookup speed 25 times by using data.table indexes.

First, let’s generate a sample dataset with 10 million rows:

library(tibble)

# Random strings of length 8 (requires the stringi package)
random_string_column <- function(n) {
  stringi::stri_rand_strings(n = n, length = 8)
}

random_data_frame <- function(n) tibble(
  col1 = random_string_column(n),
  col2 = random_string_column(n)
)

data <- random_data_frame(10^7)

To compare efficiency of different methods, let’s create a simple benchmarking routine:

# Measure only the user CPU time of the wrapped expression
time <- function(...) {
  time_measurement <- system.time(eval(...))
  time_measurement[["user.self"]]
}

# Evaluate the expression n times and summarise the measured times
benchmark <- function(..., n = 100) {
  times <- replicate(n, ...)
  c(min = min(times), max = max(times), mean = mean(times))
}

These routines benchmark a piece of code by running it multiple times. Each run can include preparation steps that are not part of the measured execution time – we only measure the execution time of the code wrapped in a call to time(...).
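
The benchmarks below also call a helper, select_random_key(), whose definition is not included in the post. A minimal sketch, assuming it simply samples one value from the col1 key column, could look like this:

# Assumed helper (not shown in the original post): pick a random existing key
select_random_key <- function() {
  sample(data$col1, 1)
}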

dplyr::filter and data.table filtering

At Appsilon, we often use Hadley’s dplyr package for data manipulations. To find a specific element in a data frame (or, more precisely, a tibble) one can use dplyr’s filter function. Let’s see how long it takes to find an element with a specific key using dplyr.

library(dplyr)

benchmark({
  key_to_lookup <- select_random_key()
  time(data %>% filter(col1 == key_to_lookup))
})
##    min    max   mean 
## 0.0960 0.2440 0.1124

On average, finding a single element using filter takes 112 ms on my machine. That may not seem like much, but it is in fact very slow – especially if you want to perform several such operations at a time. Why is filter so slow? dplyr knows nothing about our dataset. The only way for it to find matching elements is to go row by row and check whether the criterion is met. When the dataset gets large, this simply has to take time.
Filtering using data.table is similarly slow:

library(data.table)

data_table <- data.table(data)

benchmark({
  key_to_lookup <- select_random_key()
  time(data_table[col1 == key_to_lookup])
})
##     min     max    mean 
## 0.00100 5.99400 0.06194

In the worst case it took almost 6 seconds to find an element!

Built-in operators

You might be thinking: maybe it’s just dplyr and data.table? Maybe filtering the old-school way, with built-in operators, is faster?

benchmark({
  key_to_lookup <- select_random_key()
  time(data[data$col1 == key_to_lookup,])
})
##     min     max    mean 
## 0.09800 0.22300 0.11424

No dice. How about using which for selecting rows?

benchmark({
  key_to_lookup <- select_random_key()
  time(data[which(data$col1 == key_to_lookup),])
})
##     min     max    mean 
## 0.09200 0.15800 0.10471

Still the same. Data frame filtering and which also use a linear search, so, as expected, they can’t be noticeably faster than dplyr::filter.

Hash tables to the rescue?

The problem of efficient lookups is obviously not R-specific. One common approach is to use a hash table. In short, a hash table is a very efficient data structure with constant lookup time. It is implemented in most modern programming languages and is widely used in many areas. If you’d like to read more about how hash tables work, there are plenty of good articles on the topic.

R-bloggers has a great series of articles about hash tables in R: part 1, part 2, part 3. The main conclusion of those articles is that if you need a hash table in R, you can use a data structure built into the language: environments. Environments are used to keep bindings of variables to values and are internally implemented as a hash table. There is even a CRAN package, hash, which wraps environments for general-purpose use.
However, those articles analyse the performance of hash tables with only 30 thousand elements… which is very small. Before we can use a hash table, we need to create it, and building an environment with 10^7 elements is extremely slow: it takes more than an hour. For 10^6 elements it takes about 30 seconds.

# Try it yourself:

environment_source <- setNames(as.list(data$col2), data$col1)
time(as.environment(environment_source))
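
Once such an environment is built, lookups themselves are very fast. Here is a minimal sketch (the small_data and small_env names are ours; we use a smaller subset so that the build step finishes quickly):

# Sketch only: build the environment for a 10^5-row subset, then look up a key
small_data <- head(data, 10^5)
small_env <- as.environment(setNames(as.list(small_data$col2), small_data$col1))
small_env[[small_data$col1[[1]]]]  # returns the matching col2 value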

This makes using environments as hash tables virtually impossible for datasets with a million or more elements. For example, if you wanted to build such a hash table at Shiny app startup, you would have to wait ages before the app actually starts, and such a startup time is not acceptable in commercial solutions. Apparently, environments were never intended for storing that many elements: their basic role is to store the variables created by your R code, and there are not many situations in which you would need that many variables. A workaround could be to build the environment once and load it from a file, but (quite surprisingly) this turns out to be even slower.
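
For reference, the file-based workaround mentioned above could look like the sketch below (the lookup_env name and file name are ours); as noted, loading the serialized environment turns out to be even slower than building it:

# Sketch of the workaround: build the environment once, save it, reload at startup
lookup_env <- as.environment(environment_source)
saveRDS(lookup_env, "lookup_env.rds")
lookup_env <- readRDS("lookup_env.rds")  # reported to be even slower than rebuilding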

I couldn’t find a good hash table package for R that does not use environments underneath - please let me know if you know of one! I can imagine that an efficient hash table package for R will be created in the future, but until then we need other solutions.

Data.table indexes

It’s unfortunate that we can’t use hash tables for large datasets. However, when the column you are searching by is a number or a string, there is another option: ordered indexes from the data.table package.

An index keeps a sorted version of the column, which enables much faster lookups: finding an element in a sorted set with binary search takes logarithmic time, i.e. O(log n). Indexes can be implemented in different ways, but building one usually takes O(n log n) time, which in practice is often fast enough.

Let’s measure the time needed to build the index for our dataset:

time(setkey(data_table, col1))
## 5.645

About 6 seconds - not negligible, but something we can live with. Let’s check the lookup times:

benchmark({
  key_to_lookup <- select_random_key()
  time(data_table[.(key_to_lookup), nomatch = 0L])
})
##     min     max    mean 
## 0.00200 0.01200 0.00455

This is exactly what we wanted - a mean lookup time of 5 milliseconds! Finally, we’ve managed to get efficient lookups in our dataset.
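
If you need to perform such lookups repeatedly, for example inside a Shiny server function, the keyed lookup can be wrapped in a small helper. This is only a sketch; the lookup_col2 name is ours, and it assumes data_table has been keyed by col1 with setkey() as above:

# Hypothetical helper: return the col2 values matching a given key
lookup_col2 <- function(dt, key) {
  dt[.(key), nomatch = 0L]$col2
}

lookup_col2(data_table, data$col1[[1]])

data.table also supports secondary indexes via setindex(), which avoid physically reordering the table; lookups then specify the column with the on argument.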

Summary

dplyr with its filter function will be slow if you just want to find a single element in a dataset. The same goes for classic data frame filtering with built-in R operators and for regular filtering with data.table. Using an environment as a hash table gives you fast lookups, but building it for a large dataset takes ages.

If you need efficient lookups in a large dataset in R, check out data.table’s ordered indexes – this seems to be the best option out there!
