Site icon R-bloggers

Caching in R

[This article was first published on Posts | Joshua Cook, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Caching intermediate objects in R can be an efficient way to avoid re-evaluating long-running computations. The general process is always the same: run the chunk of code once, store the output to disk, and load it up the next time the same chunk is run. There are, of course, multiple packages in R to help with this process, so I’ve decided to outline some of the more popular options below.

One of the most important features of any caching system is its ability to detect if the cache has become “stale,” that is, when the object on disk is no longer valid because the dependencies of the cached object have changed. This feature is specifically discussed in the sections for each caching method, but, briefly, there are systems for cache invalidation in R Markdown, ‘R.cache’, ‘mustashe,’ and ‘ProjectTemplate.’

Options

Here are the options for caching in R that I will discuss below, and each has a link to more information on that specific option:

TL;DR

For my final synopsis on when to use each package, skip to the Conclusion.

Caching a code chunk in R Markdown

R Markdown has a built-in caching feature that can be enabled by setting cache=TRUE in the chunk’s header.

```{r import-df, cache=TRUE}
df <- read_tsv("data-file.tsv")
```

The second time the chunk is run, both the visual output and any objects created are loaded from disk. If you are already using R Markdown for your project or work, this is probably the only caching mechanism you will need.

R Markdown does have a method for detecting cache invalidation, though it is not explicitly supported by ‘knitr.’ The basic idea is to set another chunk option that computes some value that, if it changes, should trigger cache invalidation. For instance, say we are reading in a file from disk and want the chunk to re-run if it changes. We can create a new chunk option called cache.extra and assign it some value to indicate if the file has changed, such as the modification date.

```{r import-df, cache=TRUE, cache.extra=file.mtime("data-file.tsv")}
df <- read_tsv("data-file.tsv")
```

Now if the file is modified, the cache for the code chunk will be invalidated and the code will be re-run.

‘memoise’

The ‘memoise’ package brings in the function memoise(). When a function is “memoised,” the inputs and outputs are remembered so that if a function is passed the same inputs multiple times, the previously computed output can be returned immediately, without re-evaluating the function call. This is an optimization technique from dynamic programming.

The memoise() function is passed a function and returns a new function with the same properties as the original, except it is now memoised (it returns TRUE when passed to is.memoised()). Below is an example where sq(), a simple function that squares its input, is memoised as memo_sq(). A print statement is included in the sq() function to indicate when it has actually been run.

library(memoise)
sq <- function(x) {
print("Computing square of 'x'")
x**2
}
memo_sq <- memoise(sq)

The first time memo_sq(2) is run, the function is evaluated and we see the print statement’s message.

memo_sq(2)

#> [1] "Computing square of 'x'"
#> [1] 4

However, the second time, the result is loaded from disk and we see no message.

memo_sq(2)

#> [1] 4

Optionally, a local directory, AWS S3 bucket, or Google Cloud Storage location can be passed as the location to save the cached data (i.e. paired inputs and outputs). This can be useful for storing the memoised values across multiple R sessions.

As far as I am aware, there is no cache invalidation feature in the ‘memoise’ package. In other words, if I were to change sq() to return the cube of the input, memo_sq() would not be automatically updated or alerted in any way.

sq <- function(x) {
x**3
}
sq(2)

#> [1] 8

memo_sq(2)

#> [1] 4

In fairness, caching is not the intended purpose of memoisation, but it is a practical use case, so I think it is still worth including in this article.

‘R.cache’

The documentation for ‘R.cache’ is limited, but from what I can figure out, it implements memoisation while also linking to dependencies for cache invalidation. Further, and the main distinguishing feature between this package and ‘memoise’, ‘R.cache’ memoises an expression, not just a function.

The primary function of ‘R.cache’ is evalWithMemoization(). It takes an expression to be evaluated, evaluates the expression, and stores both the created object, a in this case, and the expression itself.

suppressPackageStartupMessages(library(R.cache))
evalWithMemoization({
print("Evaluating expression.")
a <- 1
})

#> [1] "Evaluating expression."
#> [1] 1

a

#> [1] 1

Now the second time the expression is evaluated, there is no print message because the result is loaded from disk.

library(R.cache)
evalWithMemoization({
print("Evaluating expression.")
a <- 1
})

#> [1] 1

Dependencies can be declared for the memoised expression by passing one or more objects to the key parameter. For example, the object b is listed as a key for the following expression.

b <- 1
evalWithMemoization(
{
print("Evaluating expression.")
a <- 100 + b
},
key = b
)

#> [1] "Evaluating expression."
#> [1] 101

If b doesn’t change, then the expression is not re-evaluated.

evalWithMemoization(
{
print("Evaluating expression.")
a <- 100 + b
},
key = b
)

#> [1] 101

However, if b changes, then the expression is evaluated again.

b <- 2
evalWithMemoization(
{
print("Evaluating expression.")
a <- 100 + b
},
key = b
)

#> [1] "Evaluating expression."
#> [1] 102

While this package has many desirable features for caching, there are some design choices that I do not like. To begin, I am not a huge fan of this package’s API including the function naming scheme and how the keys are passed after the expression. Further, I do not like how the final result of the expression is automatically returned, I would prefer this be returned invisibly if anything. Also, I don’t like that the default location for the caching directory is /Users/admin/Library/Caches/R/R.cache, I would prefer it be a hidden directory in the project’s root directory. Finally, the evaluated expression is not invariant to stylistic changes to the expression. For instance, if the assignment arrow <- is changed to an =, the expression is re-evaluated.

evalWithMemoization({
print("Evaluating expression.")
a = 1
})

#> [1] "Evaluating expression."
#> [1] 1

For these reasons, I created the ‘mustashe’ package, demonstrated next.

‘mustashe’

I have recently described ‘mustashe’ in two previous posts (an introduction to ‘mustashe’ and ‘mustashe’ Explained), so I will keep the description here brief.

The stash() function takes a name of the stashed value, an expression to evaluate, and any dependencies.

library(mustashe)
x <- 1
stash("y", depends_on = "x", {
print("Calculating 'y'")
y <- x + 1
})

#> Updating stash.
#> [1] "Calculating 'y'"

# Value of `y`
y

#> [1] 2

Just like ‘R.cache,’ if the value of the dependency x changes, then the code is re-evaluated.

# Change the value of a dependency of `y`.
x <- 2
stash("y", depends_on = "x", {
print("Calculating 'y'")
y <- x + 1
})

#> Updating stash.
#> [1] "Calculating 'y'"

However, ‘mustashe’ handles stylistic changes to the expression better than ‘R.cache’. For instance, if the same code was instead typed by a madman, ‘mustashe’ would still not re-run the code chunk.

stash("y", depends_on = "x", {
print( "Calculating 'y'" )
y = x + 1
# Add a new comment!
})

#> Loading stashed object.

Overall, ‘mustashe’ and ‘R.cache’ are very similar, and the main differences are stylistic.

‘DataCache’

I won’t discuss the ‘DataCache’ package extensively because I personally have little use for it. It has already been explained by the author on a previous R-Blogger’s post, ‘Data Caching’, so if you are interested, I recommend reading that article. Also, it is not on CRAN nor actively maintained on GitHub. In general it is intended to periodically load data from an external source. The idea is the the data is dynamic and frequently updated. The ‘DataCache’ package sets a timer for the data and reads in the most recent version at set periods.

‘ProjectTemplate’

The ‘ProjectTemplate’ package is far more than a caching system, rather, it is a data analysis project framework. The caching system is merely a part of it. However, the entire framework must be adopted in order to use its caching system (there is a basic explanation of why in ‘mustashe’ Explained – Why not use ’ProjectTemplate’s cache() function?). For this reason, I will not provide an in depth preview of their system, but just provide the following example. (Note, the API is very similar to that used by ‘mustashe’ because it was the inspiration for that package.)

cache("foo", depends = c("a", "b"), {
x <- loaded_data$name
x <- as.character(x)
c(x[[1]], a, b)
})

Conclusion

Here are my recommendations for what caching system to use, in order of precedence:

  1. If you just want memoisation for its intended purpose (i.e. avoid repetitive calculations), use the ‘memosie’ package.
  2. If using the ‘ProjectTemplate’ framework, then use its built in caching system.
  3. If you are using an R Markdown file, then use the chunk caching feature.
  4. For all other caching needs, choose between ‘mustashe’ and ‘R.cache’ (I prefer using ‘mustashe’, but I am biased).

To leave a comment for the author, please follow the link and comment on their blog: Posts | Joshua Cook.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.