I recently bought a new computer for home and it came with two drives, one HDD and one SSD. The latter is used for the OS and the former stores all of my personal files. Of all the computers I've had, both at home and at work, this is definitely the fastest. While some of the merit is due to the newer CPU and RAM, the SSD drive can make all the difference in file operations.
My research usually deals with large files from financial markets. Being efficient in reading those files is key to my productivity. Given that, I was very curious to see how much speed I would gain when reading/writing files on the SSD drive instead of the HDD. For that, I wrote a simple function that times a particular operation. The function takes as input the number of rows in the data (1..Inf), the format used to save the file (rds, csv or fst, using either base or readr/fst functions) and the type of drive (HDD or SSD). See next.
bench.fct <- function(N = 2500000, type.file = 'rds-base', type.hd = 'HDD') {
  # Function for timing read and write operations
  #
  # INPUT: N - Number of rows in dataframe to be written and read
  #        type.file - format of output file
  #                    (rds-base, rds-readr, fst, csv-readr, csv-base)
  #        type.hd - where to save (HDD or SSD)
  #
  # OUTPUT: A dataframe with results
  require(tidyverse)
  require(fst)

  my.df <- data_frame(x = runif(N),
                      char.vec = sample(letters, size = N,
                                        replace = TRUE))

  path.file <- switch(type.hd,
                      'SSD' = '~',
                      'HDD' = '/mnt/HDD/')

  my.file <- file.path(path.file,
                       switch(type.file,
                              'rds-base'  = 'temp_rds.rds',
                              'rds-readr' = 'temp_rds.rds',
                              'fst'       = 'temp_fst.fst',
                              'csv-readr' = 'temp_csv.csv',
                              'csv-base'  = 'temp_csv.csv'))

  if (type.file == 'rds-base') {
    time.write <- system.time(saveRDS(my.df, my.file, compress = FALSE))
    time.read  <- system.time(readRDS(my.file))
  } else if (type.file == 'rds-readr') {
    time.write <- system.time(write_rds(x = my.df, path = my.file,
                                        compress = 'none'))
    time.read  <- system.time(read_rds(path = my.file))
  } else if (type.file == 'fst') {
    time.write <- system.time(write.fst(x = my.df, path = my.file))
    time.read  <- system.time(read_fst(my.file))
  } else if (type.file == 'csv-readr') {
    time.write <- system.time(write_csv(x = my.df, path = my.file))
    time.read  <- system.time(read_csv(file = my.file,
                                       col_types = cols(x = col_double(),
                                                        char.vec = col_character())))
  } else if (type.file == 'csv-base') {
    time.write <- system.time(write.csv(x = my.df, file = my.file))
    time.read  <- system.time(read.csv(file = my.file))
  }

  # clean up
  file.remove(my.file)

  # save output
  df.out <- data_frame(type.file = type.file,
                       type.hd = type.hd,
                       N = N,
                       type.time = c('write', 'read'),
                       times = c(time.write[3], time.read[3]))

  return(df.out)
}
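A single call illustrates the output format (a quick sanity check; this assumes the HDD mount path in the function exists on your machine):

# time one write/read round trip: 100,000 rows, base-R rds, on the HDD
bench.fct(N = 100000, type.file = 'rds-base', type.hd = 'HDD')
# returns a two-row dataframe: one elapsed time for 'write', one for 'read'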
Now that we have the function, it's time to use it for all combinations of number of rows, file format and type of drive:
library(purrr)

df.grid <- expand.grid(N = seq(1, 500000, by = 50000),
                       type.file = c('rds-readr', 'rds-base', 'fst',
                                     'csv-readr', 'csv-base'),
                       type.hd = c('HDD', 'SSD'),
                       stringsAsFactors = FALSE)

l.out <- pmap(list(N = df.grid$N,
                   type.file = df.grid$type.file,
                   type.hd = df.grid$type.hd),
              .f = bench.fct)

df.res <- do.call(what = bind_rows, args = l.out)
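As a side note, recent versions of purrr also provide pmap_dfr(), which row-binds the results as it goes, so the do.call step could be dropped (an equivalent alternative, assuming a purrr version that includes pmap_dfr):

# same mapping, but rows are bound directly into a dataframe
df.res <- pmap_dfr(list(N = df.grid$N,
                        type.file = df.grid$type.file,
                        type.hd = df.grid$type.hd),
                   .f = bench.fct)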
Let's check the result in a nice plot:
library(ggplot2)

p <- ggplot(df.res, aes(x = N, y = times, linetype = type.hd)) +
  geom_line() +
  facet_grid(type.file ~ type.time)

print(p)
As you can see, the csv-base format is messing with the y axis. Let’s remove it for better visualization:
library(ggplot2)

p <- ggplot(filter(df.res, !(type.file %in% c('csv-base'))),
            aes(x = N, y = times, linetype = type.hd)) +
  geom_line() +
  facet_grid(type.file ~ type.time)

print(p)
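Another option, instead of dropping csv-base, is a log scale on the y axis, which keeps all formats visible on comparable footing (a possible tweak, not what I used above):

p <- ggplot(df.res, aes(x = N, y = times, linetype = type.hd)) +
  geom_line() +
  scale_y_log10() +  # compresses the slow csv-base times into view
  facet_grid(type.file ~ type.time)

print(p)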
When it comes to the file format, we learn:

- By far, the fst format is the best. It takes less time to read and write than the others. However, it's probably unfair to compare it to csv and rds, as it uses the 16 cores of my computer.
- readr is a great package for writing and reading csv files. You can see a large time difference compared to the base functions. This is likely due to the use of low-level functions to write and read the text files.
- When using the rds format, the base functions do not differ much from the readr functions.
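If you want a fairer, single-threaded comparison against csv and rds, the fst package offers threads_fst() to get or set the number of threads it uses (available in recent fst versions; a minimal sketch):

library(fst)

n.threads <- threads_fst()  # query the current number of threads
threads_fst(1)              # restrict fst to a single core
# ... rerun the fst benchmarks here for an apples-to-apples test ...
threads_fst(n.threads)      # restore the original setting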
As for the effect of using an SSD, it's clear that it DOES NOT affect the time of reading and writing. The differences between the HDD and the SSD look like noise. Seeking a more robust analysis, let's formally test this hypothesis using a simple t-test for the means:
tab <- df.res %>%
  group_by(type.file, type.time) %>%
  summarise(mean.HDD = mean(times[type.hd == 'HDD']),
            mean.SSD = mean(times[type.hd == 'SSD']),
            p.value = t.test(times[type.hd == 'SSD'],
                             times[type.hd == 'HDD'])$p.value)

print(tab)

## # A tibble: 10 x 5
## # Groups:   type.file [?]
##    type.file type.time mean.HDD mean.SSD p.value
##    <chr>     <chr>        <dbl>    <dbl>   <dbl>
##  1 csv-base  read        0.365   0.304     0.607
##  2 csv-base  write       0.460   0.455     0.975
##  3 csv-readr read        0.143   0.137     0.903
##  4 csv-readr write       0.0724  0.0714    0.964
##  5 fst       read        0.0116  0.00640   0.285
##  6 fst       write       0.0076  0.0072    0.798
##  7 rds-base  read        0.0383  0.0399    0.893
##  8 rds-base  write       0.0300  0.0298    0.982
##  9 rds-readr read        0.038   0.0384    0.973
## 10 rds-readr write       0.0294  0.0316    0.810
As we can see, the null hypothesis of equal means easily fails to be rejected at the 10% level for every file type and operation. The closest case is the fst format in a read operation (p = 0.285), but even that is far from significant. In other words, statistically, it makes no difference in time whether an SSD or an HDD is used to read or write files in these formats.
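With only ten timings per cell, a nonparametric check is a reasonable complement to the t-test; a Wilcoxon rank-sum test over the same groups would look like this (using base R's wilcox.test):

tab.w <- df.res %>%
  group_by(type.file, type.time) %>%
  summarise(p.value = wilcox.test(times[type.hd == 'SSD'],
                                  times[type.hd == 'HDD'])$p.value)

print(tab.w)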
I am very surprised by this result. Regardless of file format, I expected a large difference, since SSD drives are generally much faster than HDDs. Am I missing something? Is this due to the OS being on the SSD? What do you think?