How fast does a compressed file in Part 2

Steven P. Sanderson II, MPH

13 hours ago

[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

< section id="introduction" class="level1">

Introduction

Yesterday I posted on performing a benchmark on reading in a compressed .csv.gz file of a 2,000 by 2,000 data.frame. It was brought to my attention by someone on Mastadon (@mariviere@fediscience.org – https://fediscience.org/@mariviere) that I should also use {duckdb} and {arrow} so I will perform the same analysis as yesterday but I will also add in the two aforementioned packages.

< section id="function" class="level1">

Function

The functions that we will be using this time around are as follows:

fread() – data.table
duckdb_read_csv() – duckdb
read_csv() – readr
read.csv() – base
read.table() – base
vroom() with altrep = FALSE – vroom
vroom() with altrep = TRUE – vroom
oprn_csv_dataset() – arrow

< section id="example" class="level1">

Example

< section id="make-the-data" class="level2">

Make the Data

Let’s make that dataset again:

library(R.utils)

# create a 1000 x 1000 matrix of random numbers
df <- matrix(rnorm(2000000), nrow = 2000, ncol = 2000) |>
  as.data.frame()

# Make and save gzipped file
write.csv(df, "df.csv")
gzip(
  filename = "df.csv", 
  destname = "df.csv.gz",
  overwrite = FALSE, 
  remove = TRUE
)

< section id="benchmarking" class="level2">

Benchmarking

Time to benchmark

library(rbenchmark)
library(data.table)
library(readr)
library(duckdb)
library(arrow)
library(vroom)
library(dplyr)
library(DBI)

n <- 30

benchmark(
  # Base R
  "read.table" = {
    a <- read.table(
      "df.csv.gz", 
      sep = ",", 
      colClasses = list(numeric = 1:2000)
    )
  },
  "read.csv" = {
    b <- read.csv(
      "df.csv.gz", 
      sep = ",", 
      colClasses = list(numeric = 1:2000)
    )
  },
  
  # data.table
  "fread" = {
    c <- fread(
      "df.csv.gz", 
      sep = ",", 
      colClasses = list(numeric = 1:2000)
    )
  },
  
  # vroom
  "vroom alltrep false" = {
    d <- vroom("df.csv.gz", delim = ",", col_types = "d")
  },
  "vroom alltrep true" = {
    e <- vroom("df.csv.gz", delim = ",", altrep = TRUE, col_types = "d")
  },
  
  # readr
  "readr" = {
    f <- read_csv("df.csv.gz", col_types = "d")
  },
  
  # Arrow
  "arrow" = {
    g <- open_csv_dataset("df.csv.gz")
  },
  
  # DuckDB
  "duckdb" = {
    con <- dbConnect(duckdb())
    h <- duckdb_read_csv(
      conn = con,
      name = "df",
      files = "C:\\Users\\ssanders\\Documents\\GitHub\\steveondata\\posts\\rtip-2023-03-28\\df.csv.gz"
    )
    dbDisconnect(con)
  },
  
  # Replications
  replications = n,
  
  # Columns
  columns = c(
    "test","replications","elapsed","relative","user.self","sys.self")
) |>
  arrange(relative)

                 test replications elapsed relative user.self sys.self
1               arrow           30    3.01    1.000      5.04     0.25
2               fread           30   28.28    9.395     19.56     4.30
3 vroom alltrep false           30   31.89   10.595     26.25    10.75
4  vroom alltrep true           30   33.72   11.203     25.75    10.67
5              duckdb           30   94.09   31.259     90.70     2.77
6               readr           30   98.28   32.651    113.05    45.12
7          read.table           30  109.97   36.535    107.78     1.24
8            read.csv           30  153.79   51.093    152.44     0.56

Important note is the session info on the pc I am using to write this:

sessionInfo()

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_1.1.1       vroom_1.6.1       arrow_11.0.0.3    duckdb_0.7.1-1   
 [5] DBI_1.1.3         readr_2.1.4       data.table_1.14.8 rbenchmark_1.0.0 
 [9] R.utils_2.12.2    R.oo_1.25.0       R.methodsS3_1.8.2

loaded via a namespace (and not attached):
 [1] pillar_1.9.0      compiler_4.2.3    tools_4.2.3       digest_0.6.31    
 [5] bit_4.0.5         jsonlite_1.8.4    evaluate_0.20     lifecycle_1.0.3  
 [9] tibble_3.2.1      pkgconfig_2.0.3   rlang_1.1.0       cli_3.6.1        
[13] rstudioapi_0.14   parallel_4.2.3    yaml_2.3.7        xfun_0.38        
[17] fastmap_1.1.1     knitr_1.42        generics_0.1.3    vctrs_0.6.1      
[21] htmlwidgets_1.6.2 hms_1.1.3         bit64_4.0.5       tidyselect_1.2.0 
[25] glue_1.6.2        R6_2.5.1          fansi_1.0.4       rmarkdown_2.21   
[29] tzdb_0.3.0        purrr_1.0.1       magrittr_2.0.3    htmltools_0.5.5  
[33] assertthat_0.2.1  utf8_1.2.3        crayon_1.5.2

 Sys.info() |> 
   as.data.frame() |> 
   tibble::rownames_to_column() |> 
   as_tibble() |> 
   slice(1,2,3,5)

# A tibble: 4 × 2
  rowname `Sys.info()`
  <chr>   <chr>       
1 sysname Windows     
2 release 10 x64      
3 version build 19045 
4 machine x86-64

 memory.profile() |>
   as.data.frame()

            memory.profile()
NULL                       1
symbol                 24303
pairlist              642504
closure                11189
environment             4009
promise                22963
language              189766
special                   47
builtin                  701
char                 2039073
logical                18866
integer               108132
double                 20060
complex                    5
character             160381
...                       21
any                        0
list                   58500
expression                 5
bytecode               41555
externalptr            12382
weakref                13860
raw                    10113
S4                      1362

 gc()

           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  3363479 179.7    5830931 311.5  5830931 311.5
Vcells 32950395 251.4   81254422 620.0 81254324 620.0

To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.