Speed Up Data Analytics and Wrangling With Parquet Files
The Challenge
Data is increasing in value for many organizations — an asset leveraged to help make informed business decisions. Unfortunately, this sentiment has not been the norm throughout the life of most organizations, with vast amounts of data locked in legacy data management systems and designs. The majority of organizations use relational database management systems (RDBMS) like Oracle, Postgres, Microsoft SQL, or MySQL to store and manage their enterprise data. Typically these systems were designed to collect and process data quickly within a transactional data model. While these models are excellent for applications, they can pose challenges for performing business intelligence, data analytics, or predictive analysis. Many organizations are realizing their legacy systems are not sufficient for data analytics initiatives, providing an opportunity for analytics teams to present tangible options to improve their organization’s data analytics infrastructure.
Whether you are engineering data for others to analyze or performing the analysis yourself, reducing data processing time is critically important. In this post, we evaluate the performance of two distinct data storage formats: row-based (CSV) and columnar (parquet). CSV is a tried and tested standard in the data analytics field, while parquet is becoming a viable alternative in many data platforms.
Setup
We performed the analysis for this post on Green Shield Canada's analytics workstation, a shared resource for our analytics team running RStudio Workbench with the following configuration:
| Configuration    | Value     |
|------------------|-----------|
| Operating system | Ubuntu 20 |
| Cores            | 16        |
| CPU speed        | 2.30 GHz  |
| RAM              | 1 TB      |
Load Packages
We use the following packages throughout the post:
# Importing data
library(arrow)
library(readr)

# Data analysis and wrangling
library(dplyr)

# Visualization and styling
library(ggplot2)
library(gt)
Data Sources
We store Green Shield Canada’s source data in a transactional data model within an Oracle database. The purpose of the transactional model is to quickly adjudicate medical claims within Green Shield’s Advantage application, and it has been performing exceptionally well. While a transactional data model provides great performance for transactional applications, its design is less than optimal for data analytics. Green Shield Canada, like many organizations, is undergoing a significant digital transformation with a high emphasis on data analytics. As part of this transformation, an analytical data model will be built from many of the source tables currently stored in Oracle, which requires numerous data wrangling tasks.
Within Green Shield Canada, data is sized based on the following four groups:
- x-small dataset: < 1M rows (day)
- small dataset: 1-10M rows (month)
- medium dataset: 10-100M rows (year)
- large dataset: 100M-1B rows (decade)
The main dataset used within the analysis is Green Shield Canada’s claim history data. This dataset includes various data elements related to the transactional claims submitted by Green Shield’s clients. This dataset is critically important to the organization, providing valuable insights into how the company operates and the service provided to our customers. The following is a table with the characteristics related to the claim history dataset.
Dataset Characteristics: Claim History Data

| Dataset Size Group | Dataset Name        | Number of Rows | Number of Columns | CSV File Size | Parquet File Size |
|--------------------|---------------------|----------------|-------------------|---------------|-------------------|
| x-small            | claim_history_day   | 317,617        | 201               | 281.8 MB      | 38.1 MB           |
| small              | claim_history_month | 5,548,609      | 202               | 4.8 GB        | 711.9 MB          |
| medium             | claim_history_year  | 66,001,292     | 201               | 57.3 GB       | 7.5 GB            |
| large              | claim_history       | 408,197,137    | 201               | 351.5 GB      | 45.1 GB           |
The second dataset used within the analysis is Green Shield Canada’s provider data. This dataset includes various data elements related to the provider network that provides medical services for Green Shield Canada’s customers. The following is a table with the characteristics associated with the provider dataset.
Dataset Characteristics: Provider Data

| Dataset Name | Number of Rows | Number of Columns | CSV File Size | Parquet File Size |
|--------------|----------------|-------------------|---------------|-------------------|
| provider     | 1,077,046      | 18                | 146.1 MB      | 31 MB             |
The Solution
Green Shield Canada has decided to convert data sources used for analytics from row-based sources to a columnar format, specifically Apache Parquet.
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
— Apache Foundation
We leverage the arrow R package to convert our row-based datasets into parquet files. Parquet partitions data into smaller chunks, which improves performance when filtering on the partition columns.
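As an illustration, a minimal sketch of such a conversion might look like the following; the file paths and the partitioning column are assumptions chosen for illustration, not necessarily the ones used in our production pipeline.

# Sketch: convert a CSV extract to a partitioned parquet dataset with arrow
# (the source path, output path, and partitioning column are illustrative assumptions)
library(arrow)
library(dplyr)

open_dataset(sources = "/home/data/CLAIM_HISTORY_DAY.csv", format = "csv") %>%
  write_dataset(path = "/home/data/CLAIM_HISTORY_DAY",
                format = "parquet",
                partitioning = "BNFT_TYPE_CD")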
Parquet file formats have three main benefits for analytical usage:
- Compression: low storage consumption
- Speed: efficiently reads data in less time
- Interoperability: can be read by many different languages
Converting our datasets from row-based (CSV) to columnar (parquet) has significantly reduced the file size. The CSV files range from 4.7 to 7.8 times larger than parquet files.
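As a quick sanity check of those ratios, something like the following compares the on-disk footprint of one CSV file with its parquet counterpart; the paths are assumptions carried over from the examples below.

# Sketch: compare on-disk size of a CSV file and its parquet equivalent
# (paths are illustrative assumptions)
csv_bytes <- file.size("/home/data/CLAIM_HISTORY_DAY.csv")

parquet_files <- list.files("/home/data/CLAIM_HISTORY_DAY",
                            recursive = TRUE, full.names = TRUE)
parquet_bytes <- sum(file.size(parquet_files))

csv_bytes / parquet_bytes  # roughly 7x for the x-small dataset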
We will explore computationally expensive tasks common to both data engineering and data analysis. We will perform four specific tasks on each of the four dataset size groups (x-small, small, medium, and large) produced from our claim history dataset.
- join provider information to claim history
- processed claims volume by benefit type per time interval (i.e., day, month, and/or year)
- processed claims statistics by benefit type per time interval (i.e., day, month, and/or year)
- provider information with processed claims statistics by benefit type per time interval (i.e., day, month, and/or year)
X-Small Data
The x-small data consists of data collected on a single day in January 2021. The dataset has 317,617 rows of data.
CSV Data
The CSV file used in this section was 281.8 MB in size.
# Task 1 - join
start <- Sys.time()

claims_columns <- cols_only(CLAIM_STATUS_TYPE_CD = col_character(),
                            CH_SUBM_PROVIDER_ID = col_double(),
                            BNFT_TYPE_CD = col_character(),
                            CH_REND_AMT = col_double())

provider_columns <- cols_only(provider_id = col_double(),
                              provider_type = col_character(),
                              benefit_description = col_character())

left_join(
  read_csv("/home/data/CLAIM_HISTORY_DAY.csv", col_types = claims_columns) %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC"),
  read_csv("/home/data/PROVIDER.csv", col_types = provider_columns),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id"))

end <- Sys.time()
end - start
The task took 16.006 secs to execute.
# Task 2 - group_by + count
start <- Sys.time()

claims_columns <- cols_only(CLAIM_STATUS_TYPE_CD = col_character(),
                            CH_SUBM_PROVIDER_ID = col_double(),
                            BNFT_TYPE_CD = col_character(),
                            CH_REND_AMT = col_double())

read_csv("/home/data/CLAIM_HISTORY_DAY.csv", col_types = claims_columns) %>%
  filter(CLAIM_STATUS_TYPE_CD == "PC") %>%
  group_by(BNFT_TYPE_CD) %>%
  count() %>%
  ungroup() %>%
  arrange(desc(n))

end <- Sys.time()
end - start
The task took 10.84989 secs to execute.
# Task 3 - group_by + summarize
start <- Sys.time()

claims_columns <- cols_only(CLAIM_STATUS_TYPE_CD = col_character(),
                            CH_SUBM_PROVIDER_ID = col_double(),
                            BNFT_TYPE_CD = col_character(),
                            CH_REND_AMT = col_double())

read_csv("/home/data/CLAIM_HISTORY_DAY.csv", col_types = claims_columns) %>%
  filter(CLAIM_STATUS_TYPE_CD == "PC") %>%
  group_by(BNFT_TYPE_CD) %>%
  summarize(minimum_amount = min(CH_REND_AMT, na.rm = TRUE),
            mean_amount = mean(CH_REND_AMT, na.rm = TRUE),
            max_amount = max(CH_REND_AMT, na.rm = TRUE)) %>%
  ungroup()

end <- Sys.time()
end - start
The task took 11.8559 secs to execute.
# Task 4 - join + group_by + summarize
start <- Sys.time()

claims_columns <- cols_only(CLAIM_STATUS_TYPE_CD = col_character(),
                            CH_SUBM_PROVIDER_ID = col_double(),
                            BNFT_TYPE_CD = col_character(),
                            CH_REND_AMT = col_double())

provider_columns <- cols_only(provider_id = col_double(),
                              provider_type = col_character(),
                              benefit_description = col_character())

left_join(
  read_csv("/home/data/CLAIM_HISTORY_DAY.csv", col_types = claims_columns) %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC"),
  read_csv("/home/data/PROVIDER.csv", col_types = provider_columns),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id")) %>%
  group_by(benefit_description, BNFT_TYPE_CD) %>%
  summarize(minimum_amount = min(CH_REND_AMT, na.rm = TRUE),
            mean_amount = mean(CH_REND_AMT, na.rm = TRUE),
            max_amount = max(CH_REND_AMT, na.rm = TRUE)) %>%
  ungroup()

end <- Sys.time()
end - start
The task took 16.02928 secs to execute.
Parquet Data
The parquet file used in this section was 38.1 MB in size.
# Task 1 - join
start <- Sys.time()

left_join(
  open_dataset(sources = "/home/data/CLAIM_HISTORY_DAY") %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC") %>%
    select(CH_SUBM_PROVIDER_ID, BNFT_TYPE_CD, CH_REND_AMT),
  open_dataset(sources = "/home/data/Provider") %>%
    select(provider_id, provider_type, benefit_description),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id")) %>%
  collect()

end <- Sys.time()
end - start
The task took 1.776429 secs to execute.
# Task 2 - group_by + count
start <- Sys.time()

open_dataset(sources = "/home/data/CLAIM_HISTORY_DAY") %>%
  filter(CLAIM_STATUS_TYPE_CD == "PC") %>%
  select(BNFT_TYPE_CD) %>%
  group_by(BNFT_TYPE_CD) %>%
  count() %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  collect()

end <- Sys.time()
end - start
The task took 0.7456837 secs to execute.
# Task 3 - group_by + summarize
start <- Sys.time()

open_dataset(sources = "/home/data/CLAIM_HISTORY_DAY") %>%
  filter(CLAIM_STATUS_TYPE_CD == "PC") %>%
  select(BNFT_TYPE_CD, CH_REND_AMT) %>%
  group_by(BNFT_TYPE_CD) %>%
  summarize(minimum_amount = min(CH_REND_AMT, na.rm = TRUE),
            mean_amount = mean(CH_REND_AMT, na.rm = TRUE),
            max_amount = max(CH_REND_AMT, na.rm = TRUE)) %>%
  ungroup() %>%
  collect()

end <- Sys.time()
end - start
The task took 0.2979383 secs to execute.
# Task 4 - join + group_by + summarize
start <- Sys.time()

left_join(
  open_dataset(sources = "/home/data/CLAIM_HISTORY_DAY") %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC") %>%
    select(CH_SUBM_PROVIDER_ID, BNFT_TYPE_CD, CH_REND_AMT),
  open_dataset(sources = "/home/data/Provider") %>%
    select(provider_id, provider_type, benefit_description),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id")) %>%
  group_by(benefit_description, BNFT_TYPE_CD) %>%
  summarize(minimum_amount = min(CH_REND_AMT, na.rm = TRUE),
            mean_amount = mean(CH_REND_AMT, na.rm = TRUE),
            max_amount = max(CH_REND_AMT, na.rm = TRUE)) %>%
  ungroup() %>%
  collect()

end <- Sys.time()
end - start
The task took 1.359842 secs to execute.
Small Data
The small data consists of data collected in January 2021. The dataset has 5,548,609 rows of data.
CSV Data
The CSV file used in this section was 4.8 GB in size.
# Task 1 - join
start <- Sys.time()

claims_columns <- cols_only(CLAIM_STATUS_TYPE_CD = col_character(),
                            CH_SUBM_PROVIDER_ID = col_double(),
                            BNFT_TYPE_CD = col_character(),
                            CH_REND_AMT = col_double())

provider_columns <- cols_only(provider_id = col_double(),
                              provider_type = col_character(),
                              benefit_description = col_character())

left_join(
  read_csv("/home/data/CLAIM_HISTORY_MONTH.csv", col_types = claims_columns) %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC"),
  read_csv("/home/data/PROVIDER.csv", col_types = provider_columns),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id"))

end <- Sys.time()
end - start
The task took 3.677011 mins to execute.
Parquet Data
The parquet file used in this section was 711.9 MB in size.
# Task 1 - join
start <- Sys.time()

left_join(
  open_dataset(sources = "/home/data/CLAIM_HISTORY_MONTH") %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC") %>%
    select(CH_SUBM_PROVIDER_ID, BNFT_TYPE_CD, CH_REND_AMT),
  open_dataset(sources = "/home/data/Provider") %>%
    select(provider_id, provider_type, benefit_description),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id")) %>%
  collect()

end <- Sys.time()
end - start
The task took 1.604066 secs to execute.
Medium Data
The medium data consists of data collected over 2021. The dataset has 66,001,292 rows of data.
CSV Data
The CSV file used in this section was 57.3 GB in size.
# Task 1 - join
start <- Sys.time()

claims_columns <- cols_only(CLAIM_STATUS_TYPE_CD = col_character(),
                            CH_SUBM_PROVIDER_ID = col_double(),
                            BNFT_TYPE_CD = col_character(),
                            CH_REND_AMT = col_double())

provider_columns <- cols_only(provider_id = col_double(),
                              provider_type = col_character(),
                              benefit_description = col_character())

left_join(
  read_csv("/home/data/CLAIM_HISTORY_YEAR.csv", col_types = claims_columns) %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC"),
  read_csv("/home/data/PROVIDER.csv", col_types = provider_columns),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id"))

end <- Sys.time()
end - start
The task took 40.19741 mins to execute.
Parquet Data
The parquet file used in this section was 7.5 GB in size.
# Task 1 - join
start <- Sys.time()

left_join(
  open_dataset(sources = "/home/data/CLAIM_HISTORY_YEAR") %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC") %>%
    select(CH_SUBM_PROVIDER_ID, BNFT_TYPE_CD, CH_REND_AMT),
  open_dataset(sources = "/home/data/Provider") %>%
    select(provider_id, provider_type, benefit_description),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id")) %>%
  collect()

end <- Sys.time()
end - start
The task took 4.153103 secs to execute.
Large Data
The large data consists of data collected between 2014 and 2022. The dataset has 408,197,137 rows of data.
CSV Data
The CSV file used in this section was 351.5 GB in size.
# Task 1 - join
start <- Sys.time()

claims_columns <- cols_only(CLAIM_STATUS_TYPE_CD = col_character(),
                            CH_SUBM_PROVIDER_ID = col_double(),
                            BNFT_TYPE_CD = col_character(),
                            CH_REND_AMT = col_double())

provider_columns <- cols_only(provider_id = col_double(),
                              provider_type = col_character(),
                              benefit_description = col_character())

left_join(
  read_csv("/home/data/CLAIM_HISTORY_DECADE.csv", col_types = claims_columns) %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC"),
  read_csv("/home/data/PROVIDER.csv", col_types = provider_columns),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id"))

end <- Sys.time()
end - start
The task did not complete, producing Error: std::bad_alloc.
Parquet Data
The parquet file used in this section was 45.1 GB in size.
# Task 1 - join
start <- Sys.time()

left_join(
  open_dataset(sources = "/home/data/CLAIM_HISTORY_DECADE") %>%
    filter(CLAIM_STATUS_TYPE_CD == "PC") %>%
    select(CH_SUBM_PROVIDER_ID, BNFT_TYPE_CD, CH_REND_AMT),
  open_dataset(sources = "/home/data/Provider") %>%
    select(provider_id, provider_type, benefit_description),
  by = c("CH_SUBM_PROVIDER_ID" = "provider_id")) %>%
  collect()

end <- Sys.time()
end - start
The task took 16.42989 secs to execute.
Our Findings
The results from our analysis were remarkable. Converting our data from a row-based format to columnar parquet dramatically reduced processing time. Processes that used to take tens of minutes to an hour now complete in seconds: a game changer! The parquet format is a low- to no-cost solution that provides immediate analytical improvements for both our data engineering and data analytics teams.
Processing Time
CSV processing time varied from 10.85 seconds to 2,411.84 seconds (40.2 minutes), whereas parquet processing time ranged from 0.3 seconds to 16.43 seconds across all four dataset size groups. Note that the large CSV dataset errored (Error: std::bad_alloc) and did not complete. Error: std::bad_alloc is synonymous with running out of memory; yes, insufficient memory even on our 1 TB workstation!
Improvement Factor
Not only did processing efficiency improve across every data size group, the storage savings on the same datasets are also not to be overlooked. Running common analytical queries faster and with a smaller footprint is an irrefutable win. Optimizing for both size and speed is attainable for any data engineer or analyst, and the benefit to the organization is quantifiable.
The following table shows the improvement factor (the number of times faster parquet is than CSV) for each of the four tasks across the dataset size groups.
Processing Improvements with Parquet Files

| Dataset Size Group | Task                        | CSV Processing Time (in seconds) | Parquet Processing Time (in seconds) | Parquet Improvement Factor |
|--------------------|-----------------------------|----------------------------------|--------------------------------------|----------------------------|
| x-small            | join                        | 16.01                            | 1.78                                 | 9                          |
| x-small            | group_by + count            | 10.85                            | 0.75                                 | 15                         |
| x-small            | group_by + summarize        | 11.86                            | 0.30                                 | 40                         |
| x-small            | join + group_by + summarize | 16.03                            | 1.36                                 | 12                         |
| small              | join                        | 220.62                           | 1.60                                 | 138                        |
| small              | group_by + count            | 189.71                           | 0.30                                 | 629                        |
| small              | group_by + summarize        | 185.72                           | 0.51                                 | 361                        |
| small              | join + group_by + summarize | 206.88                           | 1.13                                 | 184                        |
| medium             | join                        | 2,411.84                         | 4.15                                 | 581                        |
| medium             | group_by + count            | 2,332.85                         | 0.84                                 | 2,763                      |
| medium             | group_by + summarize        | 2,264.25                         | 1.01                                 | 2,241                      |
| medium             | join + group_by + summarize | 2,402.06                         | 3.06                                 | 784                        |
| large              | join                        | NA                               | 16.43                                | NA                         |
| large              | group_by + count            | NA                               | 4.39                                 | NA                         |
| large              | group_by + summarize        | NA                               | 4.44                                 | NA                         |
| large              | join + group_by + summarize | NA                               | 14.93                                | NA                         |

Note: the large CSV dataset did not complete, producing Error: std::bad_alloc.
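As a rough sketch of how a summary like the table above can be assembled, the snippet below computes the improvement factor as the ratio of CSV time to parquet time and styles the result with gt (loaded earlier). The timing values shown are illustrative placeholders, not the full benchmark output.

# Sketch: build an improvement-factor summary table
# (the timings below are illustrative placeholders)
library(dplyr)
library(gt)

benchmark_results <- tibble(
  dataset_size_group = c("x-small", "x-small"),
  task               = c("join", "group_by + count"),
  csv_seconds        = c(16.01, 10.85),
  parquet_seconds    = c(1.78, 0.75)
)

benchmark_results %>%
  mutate(improvement_factor = round(csv_seconds / parquet_seconds)) %>%
  gt() %>%
  tab_header(title = "Processing Improvements with Parquet Files") %>%
  fmt_number(columns = c(csv_seconds, parquet_seconds), decimals = 2)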
Closing Remarks
The time it takes to process data affects everyone along the data journey: data engineers, data analysts, data scientists, decision makers, business users, and clients. Reducing processing time improves the experience for all of them. Parquet files allow analytical teams to significantly reduce their processing time, whether the work is data engineering, modelling, or data analysis. Because parquet does not require all the data to be read into memory prior to analysis, the file format is an option for any organization, regardless of its existing data infrastructure investment.
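To make that last point concrete, here is a minimal sketch (reusing the dataset path from the large-data example above) showing that an arrow query is lazy: nothing is read into memory until collect() is called.

# Sketch: arrow queries are lazy; rows are only materialized at collect()
# (the dataset path is reused from the large-data example above)
library(arrow)
library(dplyr)

lazy_query <- open_dataset(sources = "/home/data/CLAIM_HISTORY_DECADE") %>%
  filter(CLAIM_STATUS_TYPE_CD == "PC") %>%
  group_by(BNFT_TYPE_CD) %>%
  count()

lazy_query           # prints the query description; no rows have been read yet
collect(lazy_query)  # scans the parquet files and materializes the result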
Analytics aims to provide value to the business, and much of the time the focus is on improving model efficiency or adding new technology. Sometimes, though, simple solutions such as changing the data storage format deliver significant, measurable value. Boring, yes, but 1,500-times-faster processing is super!