Performance Optimization in R: An Example Using Memoise
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
You can read the original post in its original format on Rtask website by ThinkR here: Performance Optimization in R: An Example Using Memoise
Context
Whether during the (training sessions) I can provide or the (coaching) sessions I can conduct, there is often a moment when I go on a long rant about the power and usefulness of caching. How awesome it is, not so complicated to set up, and super interesting.
The last time I gave a presentation was to talk about ({targets}), but in this article, we’ll focus on a more “simple” package, yet extremely effective. We’ll talk about {memoise}, which will prove to be indispensable for your analyses or your Shiny applications.
{memoise} is a package that contains a function memoise()
, and it is in this function that the main functionality is concentrated.
memoise::memoise()
is a function that takes another function as a parameter and returns a function that will have the particularity of having a memory and not redoing the calculations for nothing. An example? Ok.
Demonstration
Let’s build calcul_long
and its “memoized” version
calcul_long <- function(x){ Sys.sleep(3) x*3 } calcul_long_mem <- memoise::memoise(calcul_long)
If I execute calcul_long
, then I should always wait 3 seconds for the result.
system.time({ calcul_long(x = 2) }) #> user system elapsed #> 0.018 0.005 3.000 system.time({ calcul_long(x = 3) }) #> user system elapsed #> 0.009 0.013 3.000 system.time({ calcul_long(x = 2) }) #> user system elapsed #> 0.017 0.008 3.000
However, the function calcul_long_mem
will only make me wait 3 seconds the first time and will directly return the result the following times (if I ask it exactly the same thing of course).
system.time({ calcul_long_mem(x = 2) }) #> user system elapsed #> 0.016 0.005 3.001 system.time({ calcul_long_mem(x = 3) }) #> user system elapsed #> 0.010 0.013 3.001 system.time({ calcul_long_mem(x = 2) }) #> user system elapsed #> 0.001 0.000 0.000 system.time({ calcul_long_mem(x = 2) }) #> user system elapsed #> 0 0 0 system.time({ calcul_long_mem(x = 2) }) #> user system elapsed #> 0.001 0.000 0.000 system.time({ calcul_long_mem(x = 3) }) #> user system elapsed #> 0 0 0
Convenient, isn’t it?
A Persistent Cache
By default, the memory is temporary, limited to your R session, which is better than nothing, but we can do better and use disk storage to make the memory persistent.
To do this properly, we’ll use rappdirs::user_cache_dir()
which will return a persistent cache folder consistently for all operating systems.
dossier_cache <- rappdirs::user_cache_dir("calcul_long_dir") cd <- cachem::cache_disk(dir = dossier_cache) calcul_long_mem2 <- memoise::memoise(calcul_long, cache = cd)
And voilà!
Now calcul_long_mem2
will store all its knowledge in a disk folder, it will be shared between R sessions, today’s and tomorrow’s.
system.time({ calcul_long_mem2(x = 2) }) #> user system elapsed #> 0.001 0.000 0.001 system.time({ calcul_long_mem2(x = 3) }) #> user system elapsed #> 0.001 0.000 0.000 system.time({ calcul_long_mem2(x = 2) }) #> user system elapsed #> 0 0 0
Operation
If we analyze a little what just happened, then we notice that in the cache folder, .rds files accumulate, which contain the function’s memory.
fs::dir_tree(dossier_cache) #> ~/.cache/calcul_long_dir #> ├── 000c2eaa9359e329edb0fc4a01a8c0c5.rds #> ├── 26a9a952a6f2b8c413604e732771d2e5.rds #> ├── 2d6da9f223e1b062c5681f2e9f942ef8.rds #> ├── 32c041e836202daa0a80ef4c7e269f15.rds #> ├── 77252f4a89517c08a4f5f89de0440705.rds #> ├── 830207659dfd7c44baa0f49d74d443db.rds #> └── 8717feae48bb1499ca487fd679167cd6.rds
Upon the first execution of a function call, a new .rds is created, while for the subsequent ones, the corresponding rds is read and returned to the user.
Implementation
Here’s an implementation example,
Let’s say we have the function ocr_pdf_
which is capable of analyzing a pdf file to return lots of information, it’s rather slow to run and goes through a paid service. Let’s see how to implement in a package an optimized approach, in order to minimize execution time.
#' extract data from pdf #' #' @param file pdf file to parse #' @param apikey token #' #' @return a list #' @noRd ocr_pdf_ <- function(file,apikey){ Sys.sleep(1) file } # memoised `ocr_pdf_` function #' #' @param memoise_cache_dir cache directory #' #' @return a list #' @export #' @importFrom cachem cache_disk #' @importFrom memoise memoise #' @importFrom rappdirs user_cache_dir #' @noRd ocr_pdf_mem <- function(memoise_cache_dir = getOption( "ocr_memoise_dir", default = rappdirs::user_cache_dir("ocr") ) ) { cd <- cachem::cache_disk(dir = memoise_cache_dir) memoise::memoise(ocr_pdf_,cache = cd) } #' extract data from pdf #' @param file pdf file to parse #' @param apikey token #' @export #' @rdname ocr_pdf ocr_pdf <- ocr_pdf_mem() # Clean `ocr_pdf_mem` cache #' #' @param memoise_cache_dir cache directory #' #' @export #' @importFrom rappdirs user_cache_dir #' @rdname ocr_pdf clean_memoise_cache_ocr <- function(memoise_cache_dir = getOption("ocr_memoise_dir", default = rappdirs::user_cache_dir("ocr"))){ message("cleaning ",memoise_cache_dir," folder") unlink(memoise_cache_dir,recursive = TRUE,force = TRUE) }
You will notice that the user can use the function ocr_pdf
and that it will use a persistent cache, and if necessary the function clean_memoise_cache_ocr
will erase this cache. A slightly “finer” user can directly call the function `ocr_pdf_mem` to play with the cache folders.
It’s up to you to be inspired for your use cases.
A More Complex Case
A little treat for the end, a bit more ambitious and exploratory, to give you some ideas.
How to use {memoise} to set up a cache system for Rmd rendering based on rmarkdown::render()
?
At first glance, this is not so trivial since the default behavior of memoise will be problematic:
- In
rmarkdown::render(input = "report.Rmd")
, the different inputs do not change as such since the source path of the Rmd will always be the same, even if its content changes. - We use the function
render()
for its “side effect”, meaning that it does not return anything directly to the user and it generates an HTML on the disk.
To make memoise
detect a change in the content of the source Rmd file, we will generate the hash of the content of this Rmd and pass it as a parameter to the function. There would be a lot to say about the concept of hash, if you are not familiar with this notion, remember that:
- Two identical files will have the same hash,
- The slightest comma added to the Rmd will change the hash,
- And it is (almost) impossible for two different objects to have the same hash.
To generate this hash, we will use the {digest} package like this digest::digest(readLines("report.Rmd"))
.
Here’s a proposed implementation:
First, we’ll make sure that the function returns the HTML content to the user (that’s what we’ll save).
rmd_to_html_ <- function(input,hash=NULL ,... ){ message("rendering") a <- rmarkdown::render(input = input,output_file =tempfile(), ...) out <- readLines(a) invisible(out) } file.copy(from = system.file("rmd","site","index.Rmd",package = "rmarkdown"),to="./a.Rmd") #> [1] FALSE a <- rmd_to_html_(input = "a.Rmd") #> rendering #> #> #> processing file: a.Rmd #> | | | 0% | |................................................................................................................| 100% #> output file: a.knit.md #> /usr/lib/rstudio-server/bin/quarto/bin/tools/pandoc +RTS -K512m -RTS a.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output /tmp/Rtmp1j3QTy/file3ecda11c83384d.html --lua-filter /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --variable bs3=TRUE --section-divs --template /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /tmp/Rtmp1j3QTy/rmarkdown-str3ecda14ba5eeeb.html #> #> Output created: /tmp/Rtmp1j3QTy/file3ecda11c83384d.html # avec cat on peut refabriquer le fichier # cat(a,file = "test.html") # browseURL("test.html")
with
cat()
we can rebuild the html file
cat(a,file = "test.html") browseURL("test.html")
we implement a persistent cache
dossier_cache <- rappdirs::user_cache_dir("rmd_dir") cd <- cachem::cache_disk(dir = dossier_cache) rmd_to_html <- memoise::memoise(rmd_to_html_,cache = cd)
we create a user-friendly function
compile_rmd <- function(input, output = tempfile(fileext = ".html")) { html_content <- rmd_to_html(input = input, hash = digest::digest(readLines(input))) cat(html_content, file = output) output }
Tadaa
file.copy(from = system.file("rmd","site","index.Rmd",package = "rmarkdown"),to="./file1.Rmd") #> [1] FALSE compile_rmd(input = "file1.Rmd") #> [1] "/tmp/Rtmp1j3QTy/file3ecda12abf5b3a.html" compile_rmd(input = "file1.Rmd") # instantané #> [1] "/tmp/Rtmp1j3QTy/file3ecda11c2fd776.html" cat("\n newline \n",file = "file1.Rmd",append = TRUE) compile_rmd(input = "file1.Rmd") # le fichier a changé donc on relance la compilation #> rendering #> #> #> processing file: file1.Rmd #> | | | 0% | |................................................................................................................| 100% #> output file: file1.knit.md #> /usr/lib/rstudio-server/bin/quarto/bin/tools/pandoc +RTS -K512m -RTS file1.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output /tmp/Rtmp1j3QTy/file3ecda12505b85e.html --lua-filter /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmarkdown/lua/latex-div.lua --embed-resources --standalone --variable bs3=TRUE --section-divs --template /home/vincent/R/x86_64-pc-linux-gnu-library/4.2.2/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /tmp/Rtmp1j3QTy/rmarkdown-str3ecda15c2c8b4c.html #> #> Output created: /tmp/Rtmp1j3QTy/file3ecda12505b85e.html #> [1] "/tmp/Rtmp1j3QTy/file3ecda148755619.html" compile_rmd(input = "file1.Rmd") # instantané #> [1] "/tmp/Rtmp1j3QTy/file3ecda167fd745d.html"
Okay, that’s fine, but what about image files that would need to be displayed in the HTML, in case we are not in standalone HTML?
We will rethink all this to manage the handling of physical dependencies, and the simplest will be to move all the files to a folder next to the HTML and adapt the source code.
Here’s a proposed implementation:
compile_rmd2 <- function(input, output = tempfile(fileext = ".html")) { html_content <- rmd_to_html(input = input, hash = digest::digest(readLines(input))) # we prepare the destination folder dossier_destination <- dirname(output) dir.create(file.path(dossier_destination, "images"),showWarnings = FALSE,recursive = TRUE) images_to_handle <- get_images_to_pimp( html_content = html_content, dossier_destination = dirname(output), folder_name = "images" ) # we move the images move_images(images_to_handle) # we edit the html html_content %>% search_and_replace_in_txt( pattern = images_to_handle %>% map_chr("source"), replacement = images_to_handle %>% map_chr("rewrite") ) %>% cat(file = output) output }
with the following tools:
#' @importFrom stringr str_replace_all search_and_replace_in_txt <- function(txt,pattern, replacement){ if (length(pattern)==0){return(txt)} motifs <- setNames(replacement,pattern) str_replace_all(string = txt, motifs) } #' @importFrom purrr map #' @importFrom stringr str_subset #' @importFrom rvest read_html html_nodes html_attr get_images_to_pimp <- function(html_content,dossier_destination=".",folder_name ="images"){ read_html(paste(html_content, sep = "", collapse = "")) %>% html_nodes("img") %>% html_attr("src") %>% stringr::str_subset(negate = TRUE, "^data") %>% unique() %>% map( ~ list( source = .x, destination = file.path(dossier_destination, folder_name, basename(.x)), rewrite = file.path(".", folder_name, basename(.x)) )) } #' @importFrom fs file_copy #' @importFrom purrr map move_images <- function(images_to_handle){ images_to_handle %>% map(~file.copy(.$source,.$destination, overwrite = TRUE)) }
That’s all for today.
Conclusion
And with all this, I hope you’re going to cache like never before!
header source image : https://alterconf.com/speakers/anjana-vakil
This post is better presented on its original ThinkR website here: Performance Optimization in R: An Example Using Memoise
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.