Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
< video width="450" controls> < source src="https://www.brodrigues.co/img/deja_vu.mp4" type="video/mp4">
This blog post is taken from a chapter of my ebook on building reproducible analytical pipelines, which you can read here
If you want to follow along, you can start by downloading the data I use here. This is a smaller dataset made from the one you can get here.
Uncompressed it’ll be a 2.4GB file. Not big data in any sense, but big enough to be annoying to handle without the use of some optimization strategies (I’ve seen such data described as medium sized data before.).
One such strategy is only letting the computations run once the user gives the green light by
clicking on an action button. The next obvious strategy is to use packages that are optimized for
speed. It turns out that the functions we have seen until now (note from the author: the functions
we have seen until now if you’re on of my students that’s sitting in the course where I teach
this), from packages like {dplyr}
and the like, are not the fastest. Their ease of use and
expressiveness come at a speed cost. So we will need to switch to something faster. We will do the
same to read in the data.
This faster solution is the {arrow}
package, which is an interface to the
Arrow software developed by Apache.
The final strategy is to enable caching in the app.
So first, install the {arrow}
package by running install.packages("arrow")
. This will compile
libarrow
from source on Linux and might take some time, so perhaps go grab a coffee. One other
operating systems, I guess that a binary version gets installed.
Before building the app, let me perform a very simple benchmark. The script below reads in the data,
then performs some aggregations. This is done using standard {tidyverse}
functions, but also
using {arrow}
:
start_tidy <- Sys.time() # {vroom} is able to read in larger files than {readr} # I could not get this file into R using readr::read_csv # my RAM would get maxed out air <- vroom::vroom("data/combined") mean_dep_delay <- air |> dplyr::group_by(Year, Month, DayofMonth) |> dplyr::summarise(mean_delay = mean(DepDelay, na.rm = TRUE)) end_tidy <- Sys.time() time_tidy <- end_tidy - start_tidy start_arrow <- Sys.time() air <- arrow::open_dataset("data/combined", format = "csv") mean_dep_delay <- air |> dplyr::group_by(Year, Month, DayofMonth) |> dplyr::summarise(mean_delay = mean(DepDelay, na.rm = TRUE)) end_arrow <- Sys.time() end_tidy - start_tidy end_arrow - start_arrow
The “tidy” approach took 17 seconds, while the arrow approach took 6 seconds. This is an impressive improvement, but put yourself in the shoes of a user who has to wait 6 seconds for each query. That would get very annoying, very quickly. So the other strategy that we will use is to provide some visual cue that computations are running, and then we will go one step further and use caching of results in the Shiny app.
But before we continue, you may be confused by the code above. After all, I told you before that
functions from {dplyr}
and the like were not the fastest, and yet, I am using them in the arrow
approach as well, and they now run almost 3 times as fast. What’s going on? What’s happening here,
is that the air
object that we read using arrow::open_dataset
is not a dataframe, but an arrow
dataset. These are special, and work in a different way. But that’s not what’s important: what’s important
is that the {dplyr}
api can be used to work with these arrow
datasets. This means that functions
from {dplyr}
change the way they work depending on the type of the object their dealing with.
If it’s a good old regular data frame, some C++ code gets called to perform the computations. If it’s
an arrow
dataset, libarrow
and its black magic get called instead to perform the computations.
If you’re familiar with the concept of
polymorphism this is it
(think of +
in Python: 1+1
returns 2
, "a"+"b"
returns "a+b"
. A different computation
gets performed depending on the type of the function’s inputs).
Let’s now build a basic version of the app, only
using {arrow}
functions for speed. This is the global file:
library(arrow) library(dplyr) library(rlang) library(DT) air <- arrow::open_dataset("data/combined", format = "csv")
The ui will be quite simple:
ui <- function(request){ fluidPage( titlePanel("Air On Time data"), sidebarLayout( sidebarPanel( selectizeInput("group_by_selected", "Variables to group by:", choices = c("Year", "Month", "DayofMonth", "Origin", "Dest"), multiple = TRUE, selected = c("Year", "Month"), options = list( plugins = list("remove_button"), create = TRUE, persist = FALSE # keep created choices in dropdown ) ), hr(), selectizeInput("var_to_average", "Select variable to average by groups:", choices = c("ArrDelay", "DepDelay", "Distance"), multiple = FALSE, selected = "DepDelay", ), hr(), actionButton(inputId = "run_aggregation", label = "Click here to run aggregation"), hr(), bookmarkButton() ), mainPanel( DTOutput("result") ) ) ) }
And finally the server:
server <- function(session, input, output) { # Numbers get crunched only when the user clicks on the action button grouped_data <- eventReactive(input$run_aggregation, { air %>% group_by(!!!syms(input$group_by_selected)) %>% summarise(result = mean(!!sym(input$var_to_average), na.rm = TRUE)) %>% as.data.frame() }) output$result <- renderDT({ grouped_data() }) }
Because group_by()
and mean()
expect bare variable names, I convert them from strings to
symbols using rlang::syms()
and rlang::sym()
. The difference between the two is that
rlang::syms()
is required when a list of strings gets passed down to the function (remember
that the user must select several variables to group by), and this is also why !!!
are needed
(to unquote the list of symbols). Finally, the computed data must be converted back to a
data frame using as.data.frame()
. This is actually when the computations happen. {arrow}
collects
all the aggregations but does not perform anything until absolutely required. Let’s see the app
in action:
As you can see, in terms of User Experience (UX) this is quite poor. When the user clicks on the button nothing seems to be going on for several seconds, until the table appears. Then, when the user changes some options and clicks again on the action button, it looks like the app is crashing.
Let’s add some visual cues to indicate to the user that something is happening when the button gets
clicked. For this, we are going to use the {shinycssloaders}
package:
install.packages("shinycssloaders")
and simply change the ui to this (and don’t forget to load {shinycssloaders}
in the global script!):
ui <- function(request){ fluidPage( titlePanel("Air On Time data"), sidebarLayout( sidebarPanel( selectizeInput("group_by_selected", "Variables to group by:", choices = c("Year", "Month", "DayofMonth", "Origin", "Dest"), multiple = TRUE, selected = c("Year", "Month"), options = list( plugins = list("remove_button"), create = TRUE, persist = FALSE # keep created choices in dropdown ) ), hr(), selectizeInput("var_to_average", "Select variable to average by groups:", choices = c("ArrDelay", "DepDelay", "Distance"), multiple = FALSE, selected = "DepDelay", ), hr(), actionButton(inputId = "run_aggregation", label = "Click here to run aggregation"), hr(), bookmarkButton() ), mainPanel( # We add a tabsetPanel with two tabs. The first tab show the plot made using ggplot # the second tab shows the plot using g2r DTOutput("result") |> withSpinner() ) ) ) }
The only difference with before is that now the DTOutput()
right at the end gets passed down
to withSpinner()
. There are several spinners that you can choose, but let’s simply use the
default one. This is how the app looks now:
Now the user gets a visual cue that something is happening. This makes waiting more bearable,
but even better than waiting with a spinner is no waiting at all. For this, we are going to enable caching
of results. There are several ways that you can cache results inside your app. You can enable
the cache on a per-user and per-session basis, or only on a per-user basis. But I think that
in our case here, the ideal caching strategy is to keep the cache persistent, and available
across sessions. This means that each computation done by any user will get cached and available
to any other user. In order to achieve this, you simply have to install the {cachem}
packages
add the following lines to the global script:
shinyOptions(cache = cachem::cache_disk("./app-cache", max_age = Inf))
By setting the max_age
argument to Inf
, the cache will never get pruned. The maximum size
of the cache, by default is 1GB. You can of course increase it.
Now, you must also edit the server file like so:
server <- function(session, input, output) { # Numbers get crunched only when the user clicks on the action button grouped_data <- reactive({ air %>% group_by(!!!syms(input$group_by_selected)) %>% summarise(result = mean(!!sym(input$var_to_average), na.rm = TRUE)) %>% as.data.frame() }) %>% bindCache(input$group_by_selected, input$var_to_average) %>% bindEvent(input$run_aggregation) output$result <- renderDT({ grouped_data() }) }
We’ve had to change eventReactive()
to reactive()
, just like in the app where we don’t use an
action button to run computations (note of the author: in the ebook, there is an example of an app
with this action button. This is what I’m referring to here). Then, we pass the reactive object to
bindCache()
. bindCache()
also takes the inputs
as arguments. These are used to generate cache
keys to retrieve the correct objects from cache. Finally, we pass all this to bindEvent()
. This
function takes the input referencing the action button. This is how we can now bind the
computations to the button once again. Let’s test our app now. You will notice that the first time
we choose certain options, the computations will take time, as before. But if we perform the same
computations again, then the results will be shown instantly:
As you can see, once I go back to a computation that was done in the past, the table appears
instantly. At the end of the video I open a terminal and navigate to the directory of the app,
and show you the cache. There are several .Rds
objects, these are the final data frames that
get computed by the app. If the user wants to rerun a previous computation, the correct data frame
gets retrieved, making it look like the computation happened instantly, and with another added
benefit: as discussed above, the cache is persistent between sessions, so even if the user
closes the browser and comes back later, the cache is still there, and other users will also
benefit from the cache.
Hope you enjoyed! If you found this blog post useful, you might want to follow me on Mastodon or twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!