Site icon R-bloggers

R Shiny and DuckDB: How to Speed Up Your Shiny Apps When Working With Large Datasets

[This article was first published on Tag: r - Appsilon | Enterprise R Shiny Dashboards, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

If there’s one thing with a certain downward trend, it’s got to be people’s attention span. Even Google reports that pages with a load time of 5 seconds increase their probability of bounce by 90%! And that was in 2017!

As an R Shiny developer, you must do everything in your power to stop users leaving your app and increase engagement. One way to do it is to speed up data loading and processing time. Combining R Shiny and DuckDB is an excellent way to go.

In this article, you’ll learn what happens to R when you try to load over 200 million rows of data in your Shiny app, and what are the go-to ways to make it happen in seconds, not minutes.

Looking for an alternative way to process huge amounts of data? Try R dtplyr – A dplyr-like package with a data.table backend.

Table of Contents:

DuckDB vs. Dplyr – How Much Faster is DuckDB On Real Data?

We’ve already written about DuckDB on our blog, so we won’t repeat ourselves here. Today it’s all about hands-on experience.

The goal of this section is to test how much faster DuckDB is when compared to dplyr – a de facto standard analysis tool in R.

Dataset Configuration

If you want to follow along, you’ll need to download monthly Yellow Taxi trip data from 2019 to 2023 in Parquet format. You’ll end up with 60 individual data files, all taking around 3.3 GB of disk space:

Image 1 – Taxi dataset split into multiple parquet files

The Shiny application you’ll build shortly will load the data one year (12 Parquet files) at a time. For that reason, let’s also inspect how many rows of data you’ll be working with:

Image 2 – Row counts for each dataset year

Short answer – a lot. Over 218 million in total, with 2019 alone having more than 84 million of recorded taxi rides. If even the idea of working with this amount of data in R Shiny makes your head spin, well, you’re not alone.

Required R Packages

You’ll need the following packages installed to follow along:

library(arrow)
library(dplyr)
library(lubridate)
library(duckdb)
library(stringr)

If any of these are not installed, simply run the `install.packages(“”)` command from the R console to get you going.

Benchmark Function – Dplyr

Onto the fun stuff now!

The goal of this section is to write a function that reads one year’s worth of Parquet files from disk and performs some sort of aggregation. We’ll keep things simple and only calculate the basic statistics – number of rides, average duration, distance, price, tip amount, and total distance and price – all on a monthly basis.

If you know dplyr, the aggregation shouldn’t look foreign to you. What might look relatively new is the way you need to load the data. Put simply, the data is scattered between 12 Parquet files and you need a dynamic way to get the full file paths. Once you have that, you can use the `arrow::open_dataset()` function to turn the results into an Arrow table, and then into a data frame.

Here’s the full logic for our `get_dataset_for_year_dplyr()` function:

get_dataset_for_year_dplyr <- function(folder_path, file_year) {
  time_start <- Sys.time()

  file_paths <- c()
  # Months are represented as strings in parquet file names
  months <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")

  # Iterate and append full file path to `file_paths`
  for (mnt in months) {
    file_paths <- append(
      file_paths,
      paste0(folder_path, "/yellow_tripdata_", file_year, "-", mnt, ".parquet")
    )
  }

  # Open the dataset for an entire year - 12 parquet files - and aggregate
  res <- as.data.frame(Scanner$create(open_dataset(file_paths))$ToTable()) %>%
    mutate(
      period = format(as.Date(tpep_pickup_datetime), "%Y-%m"),
      trip_duration = as.integer(tpep_dropoff_datetime - tpep_pickup_datetime)
    ) %>%
    filter(grepl(file_year, period)) %>%
    group_by(period) %>%
    summarise(
      num_rides_m = round(n() / 1000000, 2),
      avg_trip_duration = as.integer(mean(trip_duration)),
      avg_trip_distance = round(mean(trip_distance), 2),
      total_trip_distance_m = round(sum(trip_distance) / 1000000, 2),
      avg_trip_price = round(mean(total_amount), 2),
      total_trip_price_m = round(sum(total_amount) / 1000000, 2),
      avg_tip_amount = round(mean(tip_amount), 2)
    ) %>%
    arrange(period)

  time_end <- Sys.time()

  # Return both the aggregated dataset and the runtime
  return(list(
    data = res,
    runtime = as.numeric(time_end - time_start)
  ))
}

The function returns both the aggregated dataset and the time it took to read and parse the data. This second piece of information will be crucial for a comparison later.

Benchmark Function – DuckDB

Let’s now do the same but with DuckDB!

The `get_dataset_for_year_duckdb()` function accomplishes the same as the previous one, but by using DuckDB instead of dplyr. One clear benefit here is that you can use a Glob pattern to specify file locations instead of creating a list in a loop.

DuckDB allows you to write aggregations through dplyr-like functions or through SQL. We’ve opted for the latter, just to introduce some variety:

get_dataset_for_year_duckdb <- function(folder_path, file_year) {
  time_start <- Sys.time()

  # DuckDB can read files from folder by using a glob pattern
  parquet_path <- paste0(folder_path, "/yellow_tripdata_", file_year, "-*.parquet")

  # SQL statement to perform data aggregation
  # String interpolation is used to inject dynamic string parts into the query
  duck_query <- str_interp("
    select
      period,
      round(count(*) / 1000000, 2) as num_rides_m,
      cast(avg(trip_duration) as integer) as avg_trip_duration,
      round(avg(trip_distance), 2) as avg_trip_distance,
      round(sum(trip_distance) / 1000000, 2) as total_trip_distance_m,
      round(avg(total_amount), 2) as avg_trip_price,
      round(sum(total_amount) / 1000000, 2) as total_trip_price_m,
      round(avg(tip_amount), 2) as avg_tip_amount
    from (
      select
        date_part('year', tpep_pickup_datetime) as trip_year,
        strftime(tpep_pickup_datetime, '%Y-%m') as period,
        epoch(tpep_dropoff_datetime - tpep_pickup_datetime) as trip_duration,
        trip_distance,
        total_amount,
        tip_amount
      from parquet_scan('${parquet_path}')
      where trip_year = ${file_year}
    )
    group by period
    order by period
  ")

  # Run the query with DuckDB in memory
  res <- dbGetQuery(
    conn = dbConnect(duckdb()),
    statement = duck_query
  )

  time_end <- Sys.time()

  # Return both the aggregated dataset and the runtime
  return(list(
    data = res,
    runtime = as.numeric(time_end - time_start)
  ))
}

And now, let’s compare the two!

Dplyr vs. DuckDB Benchmark Results

Note: The actual runtimes you’ll get will depend on the hardware you’re running R on. For reference, we’re using a 16” M3 Pro Macbook Pro with 12 CPU cores and 36 GB of RAM. If you can feed more cores to the underlying DuckDB engine, you’ll likely be able to process the data even faster.

Running the `get_dataset_for_year_dplyr()` and `get_dataset_for_year_duckdb()` will return two values – the aggregated data and the runtime. Let’s take a peek into the data first:

Image 3 – Aggregated dataset sample

It’s a nice, condensed, and informative overview of the millions of rows that went into the calculation.

As for the runtimes, the results are interesting:

Image 4 – Dataset aggregation time in seconds

Remember the 5-second rule from the introduction? Imagine what would happen if the user had to wait almost a minute! To make things worse, you’ll probably host your Shiny app on a significantly less powerful server.

Nevertheless, DuckDB is a clear winner – being around 35x faster. Let’s now use it in R Shiny to see how that translates to application load speeds.

Bringing DuckDB to R Shiny – How to Speed Up Slow Shiny Apps

You’ll want to start by creating two files – `helpers.R` and `app.R`. In the prior, simply paste all the package imports and logic for our two data loading functions (Dplyr and DuckDB).

As for the latter, let’s get into it!

Shiny App Code

The application you’ll write now will be quite straightforward, as it will show one table and three charts. The user will be able to select a year from a dropdown menu, and the visuals will be rerendered once the data is loaded. You’ll also see the total loading time. It’s an unnecessary component for a production app, sure, but is here just to illustrate a point.

As for the external packages, you’ll want to make sure you’re comfortable with the following ones:

  • DT: A package used to visualize table data in R and R Shiny. It has built-in search and pagination functionalities that will come in handy.
  • Highcharter: Amazing package for producing interactive and animated visualizations for R and R Shiny.

The app is split into two parts – a sidebar that shows the title, filters, and the runtime in seconds, and a main panel that shows the visuals.

In the `server()` function, you’re free to change the data loading function (Dplyr or DuckDB) to see the performance differences between the two. The code snippet below uses the `get_dataset_for_year_duckdb()` function.

Anyway, here’s the full code snippet:

# Load the file with the helper functions
source("helpers.R", local = TRUE)

library(shiny)
library(DT)
library(highcharter)


ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      tags$h3("Yellow Taxi App"),
      tags$hr(),
      # User can select a year between 2019 and 2023
      selectInput(inputId = "inYear", label = "Year:", choices = c(2019, 2020, 2021, 2022, 2023), selected = 2023),
      tags$hr(),
      tags$h5("Data load time:"),
      textOutput(outputId = "outTextDuration")
    ),
    mainPanel(
      DTOutput(outputId = "outTable"),
      highchartOutput(outputId = "outChartNumRides", height = 500),
      highchartOutput(outputId = "outChartTripDistance", height = 500),
      highchartOutput(outputId = "outChartTripPrice", height = 500)
    )
  )
)


server <- function(input, output) {
  # Use either `get_dataset_for_year_dplyr` or `get_dataset_for_year_duckdb` to get the dataset and the runtime
  data <- reactive({
    get_dataset_for_year_duckdb(
      folder_path = "",
      file_year = input$inYear
    )
  })
  
  # Duration text
  output$outTextDuration <- renderText({
    paste0(round(data()$runtime, 2), " seconds.")
  })
  
  # Table
  output$outTable <- renderDT({
    datatable(
      data = data()$data,
      colnames = c(
        "Period", "Number of rides (Million)", "Avg. Ride Duration (Seconds)", "Avg. Ride Distance", 
        "Total Ride Distance", "Avg. Ride Price", "Total Ride Price (Million)", "Avg. Ride Tip Amount"
      ),
      caption = paste("Yellow Taxi rides summary statistics for", input$inYear, "."),
      filter = "top"
    )
  })
  
  # Chart - Number of rides
  output$outChartNumRides <- renderHighchart({
    hchart(data()$data, "column", hcaes(x = period, y = num_rides_m), color = "#800000", name = "#Rides (million)") |>
      hc_title(text = "Number of rides per month (millions)", align = "left") |>
      hc_xAxis(title = list(text = "Time period")) |>
      hc_yAxis(title = list(text = "Number of rides (millions)"))
  })
  
  # Chart - Trip distance
  output$outChartTripDistance <- renderHighchart({
    hchart(data()$data, "line", hcaes(x = period, y = avg_trip_distance), color = "#0198f9", name = "Distance") |>
      hc_title(text = "Average ride distance in km", align = "left") |>
      hc_xAxis(title = list(text = "Time period")) |>
      hc_yAxis(title = list(text = "Distance (km)"))
  })
  
  # Chart - Trip price
  output$outChartTripPrice <- renderHighchart({
    hchart(data()$data, "line", hcaes(x = period, y = avg_trip_price), color = "#fe6706", name = "Price") |>
      hc_title(text = "Average ride price in $", align = "left") |>
      hc_xAxis(title = list(text = "Time period")) |>
      hc_yAxis(title = list(text = "Price ($)"))
  })
}


shinyApp(ui = ui, server = server)

You can now run the app to see what it looks like:

Image 5 – Shiny application (1)

It seems like the data for 2023 loaded in around 0.5 seconds with DuckDB!

You can see the monthly statistics in a table form, or through three different charts:

Image 6 – Shiny application (2)

Overall, it’s a simple R Shiny application that could benefit from adding more content and tweaking the visuals, but we’ll leave that up to you.

Our task is to compare the runtime differences, so let’s start with dplyr.

Runtime Examination – Dplyr

Using the application version that runs the `get_dataset_for_year_dplyr()` function is a painful experience, to say the least. Cold starting the application took a painfully long 22 seconds:

Image 7 – Dplyr application version (1)

But, things get worse as the data grows in size. For example, loading the 2019 data (84.5 million rows) took almost a minute:

Image 8 – Dplyr application version (2)

That’s a full minute that your user can’t interact with the application. Maybe the wait time is worth it, but we have our doubts.

Runtime Examination – DuckDB

Switching data processing backends from dplyr to DuckDB makes a world of difference. For example, loading the largest subset (2019) takes only 1.5 seconds:

Image 9 – DuckDB application version (1)

2022, which has just under 40 million rows is done loading and parsing in under 1 second:

Image 10 – DuckDB application version (2)

We don’t have much else to say, the numbers speak for themselves.

Runtime Comparison

To remove any chance of “luck” from our runtime comparison, we’ve run the data loading process in R Shiny 5 times for both Dplyr and DuckDB and for every year. Here are the average load time numbers:

Dplyr:

  • 2019 – 56.54 seconds
  • 2020 – 17.22 seconds
  • 2021 – 19.03 seconds
  • 2022 – 26.24 seconds
  • 2023 – 25.69 seconds

DuckDB:

  • 2019 – 1.51 seconds
  • 2020 – 0.98 seconds
  • 2021 – 0.66 seconds
  • 2022 – 0.73 seconds
  • 2023 – 0.61 seconds

In other words, Dplyr takes 0.61 to 0.7 seconds on average to read and parse 1 million rows of Parquet data, while DuckDB takes only 0.016 to 0.04 seconds for the same task.

Summing up R Shiny and DuckDB

To conclude, the technology you choose to read and process data for your Shiny app matters. It’s easy to point fingers at the web framework, but in reality, that’s almost never a bottleneck. You’ve seen just which sorts of performance improvements you can get by switching from Dplyr to DuckDB. It’s not even close!

Now, to get this level of performance from DuckDB you’ll need a powerful CPU with as many cores as you can get. On the other hand, R dplyr is single-threaded, so going for a 16-core CPU configuration in deployment scenarios doesn’t make much sense.

It’s a tradeoff between a fast-running application and a happy wallet. Optimize for what you need.

Your Shiny app works, but what’s next? It’s time to make it beautiful. Read our latest blog post to find out how.

The post appeared first on appsilon.com/blog/.

To leave a comment for the author, please follow the link and comment on their blog: Tag: r - Appsilon | Enterprise R Shiny Dashboards.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version