Benchmarking R-native and Plumber API Approaches for ETL Processes

The Plumber package allows R code to be exposed as a RESTful web service through special decorator comments, introducing a layer of abstraction and easing access from other applications and frameworks. Although the overhead of doing so is fairly low, calling package functions directly within the R context may be more appropriate than Plumber APIs for heavyweight Extract-Transform-Load (ETL) processes. Here, we compare both methods in terms of execution time and server capacity utilization for transferring health records from our hospital information systems to our internal data warehouse instance.
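As a minimal illustration (not taken from our packages), a Plumber file turns a plain R function into a REST endpoint via the #* decorator comments:

# plumber.R -- minimal hypothetical endpoint
#* Return the first n rows of a demo dataset
#* @param n Number of rows to return
#* @serializer unboxedJSON
#* @get /head
function(n = 10) {
  head(mtcars, as.integer(n))
}

# Served with, e.g.: plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)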

Experimental Setup
Our experimental setup comprised a virtual machine instance equipped with 64 GB RAM and an Intel(R) Xeon(R) Gold 6152 CPU clocked at 2.1 GHz with 8 cores.
Both Posit services employed R version 4.2.1 and were granted identical access permissions to the virtual machine's computational resources.

Methodology

We distinguish between two data processing workflows: the R-native approach and the Plumber-API-based approach. Both methodologies were assessed in the context of three operations: a) data extraction from a medical database server (MSSQL Server), b) transformation of the received data to conform to the target format, and c) data transmission to our internal data warehouse (PostgreSQL server). To systematically evaluate performance, the number of extracted rows was incrementally increased by generating SQL statements of the form

SELECT TOP <NROW> * FROM TABLE

with NROW ranging from 10,000 to 10,000,000.
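For example, such statements can be generated as follows (a minimal sketch; the table name is illustrative):

nrows <- as.integer(10^(4:7))  # 10,000 up to 10,000,000
statements <- sprintf("SELECT TOP %d * FROM TABLE", nrows)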
The fundamental distinction between the two approaches lies in how the code for data extraction and sending is encapsulated:

R-native: in this approach, data remains within the R context; our R packages for accessing the database servers are installed directly in a single R session. The required intermediate transformation of the extracted data before sending solely involved renaming.

Plumber API: in this approach, all relevant functionality of our API packages is exposed through Plumber APIs. As a result, data is received and sent via REST calls rather than direct R function calls. Furthermore, we found it necessary to handle NULL and NA values, as well as the conversion of the data to JSON format using jsonlite::toJSON, leading to a more involved data transformation step.

Extracting Data
For extracting the data, we use our database API package, which provides api and request objects, as well as the Plumber endpoint exposed via GET. The following code snippet demonstrates the difference between the two processes. Note the #* @serializer unboxedJSON annotation within the Plumber endpoint definition, shown in the sketch after the snippet.
extract_via_r <- function(database_api, database_req, nrow) {
  # Limit the number of rows and run the query through the package directly
  database_req$set_param("limit", nrow)
  ret <- database_api$make_request(database_req)
  if (ret$m_code > 299) stop("Error in R workflow")
  return(NULL)
}
extract_via_plumber <- function(url, nrow) {
  # Request the rows from the Plumber endpoint via REST
  resp <- httr::GET(sprintf(url, nrow), httr::accept_json(),
                    httr::add_headers(Authorization = Sys.getenv("Service_Key")))
  body <- httr::content(resp)
  if (httr::status_code(resp) > 299) stop("Error in plumber workflow")
  return(NULL)
}
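On the API side, the endpoint might be defined along the following lines (a sketch assuming the same api and request objects are available inside the API package; the route, parameter name, and return field are illustrative):

#* Extract a limited number of rows from the medical database
#* @param limit Maximum number of rows to return
#* @serializer unboxedJSON
#* @get /extract
function(limit) {
  database_req$set_param("limit", as.integer(limit))
  ret <- database_api$make_request(database_req)
  ret$m_body  # hypothetical field holding the extracted rows
}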

Intermediate Steps
In our scenario, the steps required to prepare the extracted data for sending solely involved column renaming in the R-native approach. For the Plumber API, however, more work was required, as R tibbles or data frames cannot be handled by the REST protocol directly:

process_r <- function(dtb_r) {
  # Only column renaming is required in the R-native case
  colnames(dtb_r) <- tolower(colnames(dtb_r))
  return(NULL)
}
process_plumber <- function(dtb_plumber) {
  dtb_plumber <- lapply(dtb_plumber, function(l) {
    l[sapply(l, is.null)] <- NA  # JSON null values arrive as NULL
    l <- unlist(l)               # flatten the per-column lists to vectors
    l[l == "NA"] <- NA           # convert literal "NA" strings back to NA
    l
  })
  names(dtb_plumber) <- tolower(names(dtb_plumber))
  dtb_plumber <- list(body = dtb_plumber)       # wrap as request body
  dtb_plumber <- jsonlite::toJSON(dtb_plumber)  # serialize for sending
  return(NULL)
}
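To illustrate how this helper behaves, consider a small hypothetical payload as it might arrive from a parsed Plumber response:

# Hypothetical parsed response: per-column lists that may contain NULLs
dtb_plumber <- list(
  PatientID = list(1L, 2L, NULL),
  Ward      = list("A", "NA", "B")
)
process_plumber(dtb_plumber)
# Internally, NULL entries and literal "NA" strings become NA, names are
# lower-cased, and the result is wrapped in list(body = ...) and serialized
# with jsonlite::toJSON; the function returns NULL for benchmarking purposes.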

Sending Data
Sending data works in a similar fashion to extracting data. That is, our implemented API packages are either used directly or via Plumber, as in the sketch below.
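A minimal sketch of the Plumber variant, assuming an endpoint that accepts the JSON body prepared above (the URL and authorization header mirror the extraction code):

send_via_plumber <- function(url, json_body) {
  # Transmit the prepared JSON payload to the sending endpoint
  resp <- httr::POST(url,
                     body = json_body,
                     httr::content_type_json(),
                     httr::add_headers(Authorization = Sys.getenv("Service_Key")))
  if (httr::status_code(resp) > 299) stop("Error in plumber workflow")
  return(NULL)
}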

Metrics
Both approaches are evaluated in terms of execution time, measured with system.time(), and the maximum observed CPU load, the latter being an especially important indicator of how much data can be extracted and sent at once. For each fixed number of rows, 10 trials are conducted, and the results are plotted as a jittered beeswarm plot. For assessing the CPU load during the benchmark, we built a separate function that returns a new session object, within which the output of NCmisc::top(CPU = FALSE) is appended to a file every 10 seconds.

get_cpu_load <- function(interval = 10, root, name, nrow) {
  # Spawn a background R session that polls resource usage while the
  # benchmark runs in the main session
  rs_cpu <- callr::r_session$new()
  rs_cpu$call(function(nrow, root, name, interval) {
    # Number the log files so that repeated trials do not overwrite each other
    files <- list.files(root)
    n_files <- sum(stringr::str_detect(files, sprintf("%s_%s_", name, format(nrow, scientific = FALSE))))
    l <- c()
    while (TRUE) {
      ret <- NCmisc::top(CPU = FALSE)
      l <- c(l, ret$RAM$used * 1000000)  # scale the reported RAM usage
      save(l, file = sprintf("%s/%s_%s_%s.rda", root, name, format(nrow, scientific = FALSE), n_files + 1))
      Sys.sleep(interval)
    }
  }, args = list(nrow, root, name, interval))
  return(rs_cpu)
}
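A benchmark trial might then be orchestrated roughly as follows (a sketch; run_trial() is a hypothetical wrapper around the extract/transform/send steps above):

for (nrow in as.integer(10^(4:7))) {
  for (trial in 1:10) {
    rs_cpu <- get_cpu_load(root = "logs", name = "plumber", nrow = nrow)
    elapsed <- system.time(run_trial(nrow))["elapsed"]  # hypothetical wrapper
    rs_cpu$close()  # stop the monitoring session
  }
}
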
Results
Execution time: in the following Figure A), data extraction is observed to be approximately 10 times slower when utilizing the Plumber API compared to the R-native approach across all dataset sizes.



Both approaches demonstrate a linear increase on the logarithmic time scale, i.e., execution time grows rapidly with the number of rows in the original domain. Similar observations are noted in B) when employing the Plumber API, where the required data transformation results in average processing times exceeding one minute for a dataset of 10,000,000 rows. In contrast, the execution time for the R-native case remains negligibly small. Finally, C) portrays the execution time of sending data, illustrating that both approaches yield comparable execution times, converging to approximately 1.7 minutes for the largest dataset. Once more, the execution time grows rapidly with an increasing number of rows.

CPU load: examining the maximum observed CPU load during the receiving, transformation, and sending stages revealed noteworthy differences between the Plumber API and the R-native approach.



This disparity in resource utilization was consistently apparent across all three operations and numbers of extracted rows. Throughout the trials, we observed that the load remained relatively stable for datasets up to 100,000 rows (idle load around 3.5 – 4.5 GB of RAM), irrespective of the chosen approach. However, a substantial increase was evident in larger trials, particularly when utilizing the Plumber API for transmitting 10,000,000 rows. In this scenario, the load approaches 50% of capacity (approximately 30 GB of RAM), signifying a considerable resource demand during the data sending process.

Pros and Cons
R-native approach: The R-native approach results in faster and less CPU-demanding operations, which can be advantageous for heavyweight ETL tasks. Furthermore, R-native code allows direct access to R objects and functions, providing more flexibility for complex data manipulations. However, when changes are made to R packages in the R-native approach, it can be more challenging to propagate those changes to all dependents. In our setup, we would need to publish a new version of a package on our Posit Package Manager and then re-install the new version in every relevant environment. Additionally, functions are limited to use within the R environment and are not easily accessible to external applications or systems.

Plumber API: Conversely, the Plumber API encourages a modular design, with each endpoint representing a specific functionality. This can lead to better code organization, maintainability, and reusability. Furthermore, by using the Plumber API, you can deploy R functions as a web API on various platforms, enabling integration into larger systems. Moreover, the Plumber API allows updating the functionality behind endpoints while keeping the interface (API endpoints) consistent, improving version management and decoupling between clients and server. However, the REST API introduces additional overhead due to network communication and data serialization, which leads to increased latency and CPU demand compared to the R-native approach. Furthermore, exposing an API might raise security concerns, requiring proper authentication, authorization, and input validation to protect data and the server.


