How fast do the files read in?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
I will demonstrate how to generate a 1,000 × 1,000 matrix of random numbers in R and save it in several file formats. I will also show how to get the size of each saved file and benchmark how long each file takes to read with different functions.
Generating a large matrix
To generate a 1,000 × 1,000 matrix of random numbers, we can combine the matrix() and runif() functions. Here’s the code to generate the matrix:
# set seed for reproducibility
set.seed(123)

# number of rows/columns in matrix
n <- 1000

# generate matrix with random uniform values
mat <- matrix(runif(n^2), nrow = n)
This code sets the random number generator seed so that the same random numbers are generated every time the code is run. It then draws a vector of 1,000^2 uniformly distributed random numbers with runif() and arranges them into a matrix with 1,000 rows (and therefore 1,000 columns) using matrix().
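As a quick sanity check on the description above, here is a minimal sketch that uses a much smaller matrix so it runs instantly. The names `k` and `small` and the size 5 are illustrative choices, not part of the original post.

```r
# Illustrative check on a small matrix (size chosen for speed, not from the post)
set.seed(123)
k <- 5
small <- matrix(runif(k^2), nrow = k)

# nrow = k with k^2 values yields a k x k matrix
dim(small)

# runif() draws from the uniform distribution between 0 and 1
range(small)

# Resetting the seed reproduces exactly the same values
set.seed(123)
same_again <- identical(small, matrix(runif(k^2), nrow = k))
same_again
```

The same three checks can be run on the full `mat` object if you want to confirm its shape before saving it.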
Saving the matrix in different file formats
We can save the generated matrix in different file formats using different functions in R. Here are the functions we will use for each file format:
- CSV: write.csv()
- RDS: saveRDS()
- FST: write_fst()
- Arrow: write_feather()
Here’s the code to save the matrix in each file format:
library(fst)
library(arrow)

# Save matrix in different file formats
write.csv(mat, "matrix.csv", row.names = FALSE)
saveRDS(mat, "matrix.rds")
write_fst(as.data.frame(mat), "matrix.fst")
write_feather(as_arrow_table(as.data.frame(mat)), "matrix.arrow")
This code saves the matrix in each file format using the corresponding function, with the file name specified as the second argument. For the FST and Arrow formats, the matrix is first converted to a data frame with as.data.frame().
Getting the file size of each saved object
To get the file size of each saved object, we can use the file.size() function in R. Here’s the code to get the file size of each saved object:
# Get file size of each saved object (bytes converted to MB)
csv_size <- file.size("matrix.csv") / (1024^2)
rds_size <- file.size("matrix.rds") / (1024^2)
fst_size <- file.size("matrix.fst") / (1024^2)
arrow_size <- file.size("matrix.arrow") / (1024^2)

# Print file sizes
print(paste("CSV file size in MB:", format(csv_size, units="auto")))
[1] "CSV file size in MB: 17.17339"
print(paste("RDS file size in MB:", format(rds_size, units="auto")))
[1] "RDS file size in MB: 5.079627"
print(paste("FST file size in MB:", format(fst_size, units="auto")))
[1] "FST file size in MB: 7.700841"
print(paste("Arrow file size in MB:", format(arrow_size, units="auto")))
[1] "Arrow file size in MB: 6.705355"
This code uses the file.size() function to get the size of each file in bytes, divides by 1024^2 to convert to megabytes, and stores each result in a separate variable.
Finally, it prints each size with format(). Note that the units="auto" argument has no effect here: it is only honored by the format() method for object_size objects (such as those returned by object.size()) and is silently ignored for plain numbers. The values are in MB only because of the explicit division by 1024^2.
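If you do want R to pick the unit automatically, one option is to tag the byte count with the "object_size" class so that format() dispatches to the method that understands units = "auto". The sketch below assumes that approach and uses a throwaway 2 MiB temporary file (the names `tmp`, `bytes`, and `size_label` are illustrative) so that it runs without the files from the post.

```r
# Create a 2 MiB throwaway file so the example is self-contained
tmp <- tempfile(fileext = ".bin")
writeBin(raw(2 * 1024^2), tmp)

# file.size() returns the size in bytes
bytes <- file.size(tmp)

# Manual conversion, as in the post: bytes -> MB
bytes / 1024^2

# Let R choose the unit: mark the byte count as an "object_size" so that
# format() uses the method that supports units = "auto"
size_label <- format(structure(bytes, class = "object_size"), units = "auto")
size_label

# Clean up the temporary file
unlink(tmp)
```

The manual division gives you full control over the unit and the number of digits, while the object_size route is convenient when file sizes span several orders of magnitude.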
Benchmarking file read times
To benchmark how long it takes to read in each file, we can use the {rbenchmark} package in R. In this example, we will compare the read times for the CSV file using four different functions: read.csv(), read_csv() from the {readr} package, fread() from the {data.table} package, and vroom() from the {vroom} package. We will also benchmark the read times for the RDS file using readRDS(), the FST file using read_fst(), and the Arrow file using read_feather().
Here’s the code to benchmark the read times:
# Load rbenchmark package
library(rbenchmark)
library(readr)
library(data.table)
library(vroom)
library(dplyr)

# Number of replications per test
n = 30

# Benchmark read times for CSV file
benchmark(
  # CSV File
  "read.csv" = { a <- read.csv("matrix.csv") },
  "read_csv" = { b <- read_csv("matrix.csv") },
  "fread" = { c <- fread("matrix.csv") },
  # Note: vroom() defaults to altrep = TRUE, so this call does not
  # actually disable ALTREP despite its label
  "vroom alltrep false" = { d <- vroom("matrix.csv") },
  "vroom alltrep true" = { dd <- vroom("matrix.csv", altrep = TRUE) },
  # Replications
  replications = n,
  # Columns
  columns = c(
    "test", "replications", "elapsed", "relative", "user.self", "sys.self")
) |>
  arrange(relative)

                 test replications elapsed relative user.self sys.self
1               fread           30    1.35    1.000      0.90     0.20
2  vroom alltrep true           30    6.59    4.881      3.58     1.71
3 vroom alltrep false           30    6.62    4.904      3.43     1.62
4            read.csv           30   33.86   25.081     26.15     0.22
5            read_csv           30   82.39   61.030     20.39     3.47
# RDS File
benchmark(
  # RDS File
  "readRDS" = { e <- readRDS("matrix.rds") },
  "read_rds" = { f <- read_rds("matrix.rds") },
  # Replications
  replications = n,
  # Columns
  columns = c(
    "test", "replications", "elapsed", "relative", "user.self", "sys.self")
) |>
  arrange(relative)

      test replications elapsed relative user.self sys.self
1 read_rds           30    0.95    1.000      0.74     0.01
2  readRDS           30    0.97    1.021      0.74     0.02
# FST / Arrow
benchmark(
  # FST
  "read_fst" = { g <- read_fst("matrix.fst") },
  # Arrow
  "arrow" = { h <- read_feather("matrix.arrow") },
  # Replications
  replications = n,
  # Columns
  columns = c(
    "test", "replications", "elapsed", "relative", "user.self", "sys.self")
) |>
  arrange(relative)

      test replications elapsed relative user.self sys.self
1 read_fst           30    0.21    1.000      0.05     0.12
2    arrow           30    3.00   14.286      1.60     0.11
The benchmark code above loads the {rbenchmark} package and uses the benchmark() function to compare the read times for each file format. Each test names the function used to read the file, and the number of replications is set to 30, so the elapsed column reports the total time for 30 reads.
Conclusion
In this blog post, we demonstrated how to generate a large matrix with random numbers in R, and how to save it in different file formats. We also showed how to get the file size of each saved object, and benchmarked the read times for each file format using different functions.
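This example demonstrates the importance of choosing the appropriate file format and read function for your data. In this run, the CSV file was the largest on disk (about 17.2 MB) and the RDS file the smallest (about 5.1 MB). For the CSV file, fread() took 1.35 seconds for 30 reads, against 33.86 seconds for read.csv() and 82.39 seconds for read_csv(). Among the binary formats, read_fst() was fastest at 0.21 seconds, followed by read_rds() and readRDS() at just under 1 second and read_feather() at 3.00 seconds. These numbers come from one machine and one data set, so results may differ with the size and type of your data and the requirements of your analysis.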
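To see how formats compare on your own data without installing extra packages, a rough sketch along the following lines can help. It uses base R's system.time() instead of {rbenchmark}, a small 100 × 100 matrix, and temporary files. The helper `time_reads()` and all variable names are hypothetical and only for illustration.

```r
# Small illustrative data set written to temporary files
set.seed(123)
m <- matrix(runif(100^2), nrow = 100)
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(m, csv_path, row.names = FALSE)
saveRDS(m, rds_path)

# Hypothetical helper: total elapsed seconds for `reps` runs of a reader
time_reads <- function(reader, path, reps = 5) {
  timing <- system.time(for (i in seq_len(reps)) reader(path))
  unname(timing["elapsed"])
}

# Time each reader on its own file and show the fastest first
timings <- c(
  read.csv = time_reads(read.csv, csv_path),
  readRDS  = time_reads(readRDS, rds_path)
)
sort(timings)

# The RDS round trip returns the matrix exactly as it was saved
round_trip_ok <- identical(readRDS(rds_path), m)
round_trip_ok

# Clean up the temporary files
unlink(c(csv_path, rds_path))
```

Timings on a file this small are noisy, so treat them as a smoke test and increase the matrix size and `reps` before drawing conclusions.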