How fast does a compressed file in Part 2
[This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
Yesterday I posted on performing a benchmark on reading in a compressed .csv.gz
file of a 2,000 by 2,000 data.frame
. It was brought to my attention by someone on Mastadon (@mariviere@fediscience.org – https://fediscience.org/@mariviere) that I should also use {duckdb}
and {arrow}
so I will perform the same analysis as yesterday but I will also add in the two aforementioned packages.
Function
The functions that we will be using this time around are as follows:
fread()
– data.tableduckdb_read_csv()
– duckdbread_csv()
– readrread.csv()
– baseread.table()
– basevroom()
with altrep =FALSE
– vroomvroom()
with altrep =TRUE
– vroomoprn_csv_dataset()
– arrow
Example
Make the Data
Let’s make that dataset again:
library(R.utils) # create a 1000 x 1000 matrix of random numbers df <- matrix(rnorm(2000000), nrow = 2000, ncol = 2000) |> as.data.frame() # Make and save gzipped file write.csv(df, "df.csv") gzip( filename = "df.csv", destname = "df.csv.gz", overwrite = FALSE, remove = TRUE )
Benchmarking
Time to benchmark
library(rbenchmark) library(data.table) library(readr) library(duckdb) library(arrow) library(vroom) library(dplyr) library(DBI) n <- 30 benchmark( # Base R "read.table" = { a <- read.table( "df.csv.gz", sep = ",", colClasses = list(numeric = 1:2000) ) }, "read.csv" = { b <- read.csv( "df.csv.gz", sep = ",", colClasses = list(numeric = 1:2000) ) }, # data.table "fread" = { c <- fread( "df.csv.gz", sep = ",", colClasses = list(numeric = 1:2000) ) }, # vroom "vroom alltrep false" = { d <- vroom("df.csv.gz", delim = ",", col_types = "d") }, "vroom alltrep true" = { e <- vroom("df.csv.gz", delim = ",", altrep = TRUE, col_types = "d") }, # readr "readr" = { f <- read_csv("df.csv.gz", col_types = "d") }, # Arrow "arrow" = { g <- open_csv_dataset("df.csv.gz") }, # DuckDB "duckdb" = { con <- dbConnect(duckdb()) h <- duckdb_read_csv( conn = con, name = "df", files = "C:\\Users\\ssanders\\Documents\\GitHub\\steveondata\\posts\\rtip-2023-03-28\\df.csv.gz" ) dbDisconnect(con) }, # Replications replications = n, # Columns columns = c( "test","replications","elapsed","relative","user.self","sys.self") ) |> arrange(relative)
test replications elapsed relative user.self sys.self 1 arrow 30 3.01 1.000 5.04 0.25 2 fread 30 28.28 9.395 19.56 4.30 3 vroom alltrep false 30 31.89 10.595 26.25 10.75 4 vroom alltrep true 30 33.72 11.203 25.75 10.67 5 duckdb 30 94.09 31.259 90.70 2.77 6 readr 30 98.28 32.651 113.05 45.12 7 read.table 30 109.97 36.535 107.78 1.24 8 read.csv 30 153.79 51.093 152.44 0.56
Important note is the session info on the pc I am using to write this:
sessionInfo()
R version 4.2.3 (2023-03-15 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045) Matrix products: default locale: [1] LC_COLLATE=English_United States.utf8 [2] LC_CTYPE=English_United States.utf8 [3] LC_MONETARY=English_United States.utf8 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.utf8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] dplyr_1.1.1 vroom_1.6.1 arrow_11.0.0.3 duckdb_0.7.1-1 [5] DBI_1.1.3 readr_2.1.4 data.table_1.14.8 rbenchmark_1.0.0 [9] R.utils_2.12.2 R.oo_1.25.0 R.methodsS3_1.8.2 loaded via a namespace (and not attached): [1] pillar_1.9.0 compiler_4.2.3 tools_4.2.3 digest_0.6.31 [5] bit_4.0.5 jsonlite_1.8.4 evaluate_0.20 lifecycle_1.0.3 [9] tibble_3.2.1 pkgconfig_2.0.3 rlang_1.1.0 cli_3.6.1 [13] rstudioapi_0.14 parallel_4.2.3 yaml_2.3.7 xfun_0.38 [17] fastmap_1.1.1 knitr_1.42 generics_0.1.3 vctrs_0.6.1 [21] htmlwidgets_1.6.2 hms_1.1.3 bit64_4.0.5 tidyselect_1.2.0 [25] glue_1.6.2 R6_2.5.1 fansi_1.0.4 rmarkdown_2.21 [29] tzdb_0.3.0 purrr_1.0.1 magrittr_2.0.3 htmltools_0.5.5 [33] assertthat_0.2.1 utf8_1.2.3 crayon_1.5.2
Sys.info() |> as.data.frame() |> tibble::rownames_to_column() |> as_tibble() |> slice(1,2,3,5)
# A tibble: 4 × 2 rowname `Sys.info()` <chr> <chr> 1 sysname Windows 2 release 10 x64 3 version build 19045 4 machine x86-64
memory.profile() |> as.data.frame()
memory.profile() NULL 1 symbol 24303 pairlist 642504 closure 11189 environment 4009 promise 22963 language 189766 special 47 builtin 701 char 2039073 logical 18866 integer 108132 double 20060 complex 5 character 160381 ... 21 any 0 list 58500 expression 5 bytecode 41555 externalptr 12382 weakref 13860 raw 10113 S4 1362
gc()
used (Mb) gc trigger (Mb) max used (Mb) Ncells 3363479 179.7 5830931 311.5 5830931 311.5 Vcells 32950395 251.4 81254422 620.0 81254324 620.0
To leave a comment for the author, please follow the link and comment on their blog: Steve's Data Tips and Tricks.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.