How fast does a compressed file in?

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

I received an email over the weekend in regards to my last post not containing the reading in of gz compressed file(s) for the benchmarking. While this was not an over site per-se it was a good reminder that people would probably be interested in seeing this as well.

Benchmarking is the process of measuring and comparing the performance of different programs, tools, or configurations in order to identify which one is the most efficient for a specific task. It is a critical step in software development that can help developers identify performance bottlenecks and improve the overall performance of their applications.

In this post I create a square matrix and then convert it to a data.frame (2,000 rows by 2,000 columns) and then saved it as a gz compressed csv file. The benchmark compares different R packages and functions, including base R, data.table, vroom, and readr, and measures their relative speeds based on the time it takes to read in the .csv.gz file.

Here are some pro’s of trying things different ways and properly benchmarking them:

  • Identify the most efficient solution: Benchmarking can help you identify the most efficient solution for a specific task. By measuring the relative speeds of different programs or tools, you can determine which one is the fastest and use it to improve the performance of your application.

  • Optimize resource utilization: Benchmarking can help you optimize resource utilization by identifying programs or tools that consume more resources than others. By choosing the most resource-efficient solution, you can reduce the cost of running your application and improve its scalability.

  • Avoid premature optimization: Benchmarking can help you avoid premature optimization by measuring the performance of different programs or tools before you start optimizing them. By identifying the slowest parts of your application, you can focus your optimization efforts on the most critical areas and avoid wasting time optimizing code that doesn’t need it.

  • Keep up with technology: Benchmarking can help you keep up with technology by comparing the performance of different tools and libraries. By staying up to date with the latest technologies, you can improve the performance of your application and stay ahead of your competitors.

  • Improve code quality: Benchmarking can help you improve the quality of your code by identifying performance bottlenecks and areas for optimization. By optimizing your code, you can improve its maintainability, reliability, and readability.

In conclusion, benchmarking is an essential tool for software developers that can help them identify the most efficient solutions for their applications. By measuring the relative speeds of different programs or tools, developers can optimize resource utilization, avoid premature optimization, keep up with technology, and improve the quality of their code.

Function

The different functions I use in the benchmarking are as follows:

Base R

  • read.csv()
  • read.table()

data.table

  • fread

vroom

  • vroom() with altrep = FALSE
  • vroom() with altrep = TRUE

readr

  • read_csv()

Example

Let’s make a 2,000 by 2,000 matrix, covert to a data.frame and then save it out as a .csv file and then convert to a .gz file.

library(R.utils)

# create a 1000 x 1000 matrix of random numbers
my_matrix <- matrix(rnorm(2000000), nrow = 2000, ncol = 2000) |>
  as.data.frame()

# Make and save gzipped file
write.csv(my_matrix, "my_matrix.csv")
gzip(filename = "my_matrix.csv", destname = "matrix.csv.gz",
     overwrite = FALSE, remove = TRUE)

Ok now that the data is written we can benchmark the read in times from various packages.

Benchmarking

library(rbenchmark)
library(data.table)
library(readr)
library(vroom)
library(dplyr)

n <- 30

benchmark(
  # Base R
  "read.table" = {
    a <- read.table("matrix.csv.gz", sep = ",")
  },
  "read.csv" = {
    b <- read.csv("matrix.csv.gz", sep = ",")
  },
  
  # data.table
  "fread" = {
    c <- fread("matrix.csv.gz", sep = ",")
  },
  
  # vroom
  "vroom alltrep false" = {
    d <- vroom("matrix.csv.gz", delim = ",")
  },
  "vroom alltrep true" = {
    e <- vroom("matrix.csv.gz", delim = ",", altrep = TRUE)
  },
  
  # readr
  "readr" = {
    f <- read_csv("matrix.csv.gz")
  },
  
  # Replications
  replications = n,
  
  # Columns
  columns = c(
    "test","replications","elapsed","relative","user.self","sys.self")
) |>
  arrange(relative)
                 test replications elapsed relative user.self sys.self
1               fread           30   19.44    1.000     13.56     1.59
2  vroom alltrep true           30   22.06    1.135     10.54     2.63
3 vroom alltrep false           30   24.75    1.273     10.22     2.84
4          read.table           30   94.34    4.853     79.02     0.64
5            read.csv           30  143.28    7.370    115.64     0.74
6               readr           30  177.61    9.136     50.37    10.05

Voila!

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)