
Benchmarking a SSD drive in reading and writing files with R

[This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers.]

I recently bought a new computer for home and it came with two drives, one HDD and one SSD. The latter is used for the OS and the former for all of my files. Of all the computers I have had, both at home and at work, this is definitely the fastest. While some of the merit is due to the newer CPU and RAM, the SSD can make all the difference in file operations.

My research usually deals with large files from financial markets. Being efficient in reading those files is key to my productivity. Given that, I was very curious to understand how much speed I would gain by reading and writing files on my SSD instead of the HDD. For that, I wrote a simple function that times a particular operation. The function takes as input the number of rows in the data (N), the format and package used to save the file (rds-base, rds-readr, fst, csv-readr or csv-base) and the type of drive (HDD or SSD):

bench.fct <- function(N = 2500000, type.file = 'rds-base', type.hd = 'HDD') {
  # Function for timing read and write operations
  #
  # INPUT: N - Number of rows in the dataframe to be written and read
  #        type.file - format and package of output file
  #                    ('rds-base', 'rds-readr', 'fst', 'csv-readr', 'csv-base')
  #        type.hd - where to save ('HDD' or 'SSD')
  #
  # OUTPUT: A dataframe with results
  library(tidyverse)
  library(fst)
  
  my.df <- tibble(x = runif(N),
                  char.vec = sample(letters, size = N, 
                                    replace = TRUE))
  
  path.file <- switch(type.hd,
                      'SSD' = '~',
                      'HDD' = '/mnt/HDD/')
  
  my.file <- file.path(path.file, 
                       switch (type.file,
                               'rds-base' = 'temp_rds.rds',
                               'rds-readr' = 'temp_rds.rds',
                               'fst' = 'temp_fst.fst',
                               'csv-readr' = 'temp_csv.csv',
                               'csv-base' = 'temp_csv.csv'))
  
  if (type.file == 'rds-base') {
    time.write <- system.time(saveRDS(my.df, my.file, compress = FALSE))
    time.read <- system.time(readRDS(my.file))
  } else if (type.file == 'rds-readr') {
    time.write <- system.time(write_rds(my.df, my.file, compress = 'none'))
    time.read <- system.time(read_rds(my.file))
  } else if (type.file == 'fst') {
    time.write <- system.time(write_fst(x = my.df, path = my.file))
    time.read <- system.time(read_fst(my.file))
  } else if (type.file == 'csv-readr') {
    time.write <- system.time(write_csv(my.df, my.file))
    time.read <- system.time(read_csv(my.file, col_types = cols(x = col_double(),
                                                                char.vec = col_character())))
  } else if (type.file == 'csv-base') {
    time.write <- system.time(write.csv(x = my.df, file = my.file))
    time.read <- system.time(read.csv(file = my.file))
  }
  
  # clean up
  file.remove(my.file)
  
  # save output
  df.out <- tibble(type.file = type.file,
                   type.hd = type.hd,
                   N = N,
                   type.time = c('write', 
                                 'read'),
                   # keep only the elapsed (wall-clock) time
                   times = c(time.write[['elapsed']], 
                             time.read[['elapsed']]))
  
  return(df.out)
  
}
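
To make the timing pattern inside the function easier to follow, here is a minimal sketch of a single write/read measurement using only base R. It is not part of the original benchmark: the names toy.df, toy.file and toy.res are made up for illustration, and tempfile() stands in for the SSD and HDD paths, which are specific to my machine.

```r
# Minimal sketch of the write/read timing pattern (base R only).
# Assumption: tempfile() replaces the machine-specific SSD/HDD paths.
n.rows <- 10000
toy.df <- data.frame(x = runif(n.rows),
                     char.vec = sample(letters, size = n.rows, replace = TRUE),
                     stringsAsFactors = FALSE)

toy.file <- tempfile(fileext = '.rds')

# system.time() returns user, system and elapsed times;
# 'elapsed' is the wall-clock time (the third element)
time.write <- system.time(saveRDS(toy.df, toy.file, compress = FALSE))
time.read  <- system.time(toy.back <- readRDS(toy.file))

# clean up
file.remove(toy.file)

toy.res <- data.frame(type.time = c('write', 'read'),
                      times = c(time.write[['elapsed']],
                                time.read[['elapsed']]))
print(toy.res)
```

Each operation is timed only once here, just as in bench.fct, so a single measurement can be affected by whatever else the machine is doing at that moment.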

Now that we have the function, it’s time to use it for all combinations of number of rows, file format and type of drive:

library(purrr)
df.grid <- expand.grid(N = seq(1, 500000, by = 50000), 
                       type.file = c('rds-readr', 'rds-base', 'fst', 'csv-readr', 'csv-base'), 
                       type.hd = c('HDD', 'SSD'), stringsAsFactors = FALSE)

l.out <- pmap(list(N = df.grid$N,
               type.file = df.grid$type.file,
               type.hd = df.grid$type.hd), .f = bench.fct)

df.res <- bind_rows(l.out)
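
As a quick sanity check on the size of the experiment, the short sketch below rebuilds the same grid with base R and counts the combinations. The name df.grid.check is only used here for illustration.

```r
# Rebuild the grid of combinations used above and inspect it
df.grid.check <- expand.grid(N = seq(1, 500000, by = 50000),
                             type.file = c('rds-readr', 'rds-base', 'fst',
                                           'csv-readr', 'csv-base'),
                             type.hd = c('HDD', 'SSD'),
                             stringsAsFactors = FALSE)

# 10 values of N x 5 file types x 2 drives
n.comb <- nrow(df.grid.check)
print(n.comb)

# seq() stops at the last value not exceeding 500000
print(range(df.grid.check$N))
```

The grid has 100 rows, and the largest N is 450001 rather than 500000. Since bench.fct runs once per row, every point in the results comes from a single timing for each of write and read.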

Let’s check the result in a nice plot:

library(ggplot2)

p <- ggplot(df.res, aes(x = N, y = times, linetype = type.hd)) + 
  geom_line() + facet_grid(type.file ~ type.time)

print(p)

As you can see, the csv-base timings are so much larger than the rest that they distort the y axis. Let’s remove them for better visualization:

library(ggplot2)

p <- ggplot(filter(df.res, !(type.file %in% c('csv-base'))),
            aes(x = N, y = times, linetype = type.hd)) + 
  geom_line() + facet_grid(type.file ~ type.time)

print(p)

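When it comes to the file format, the mean times reported in the table further down give a clear ranking: fst is the fastest in both reading and writing, the two rds variants (base and readr) come next and are practically identical to each other, csv-readr follows, and csv-base is by far the slowest.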

As for the effect of using the SSD, it’s clear that it DOES NOT affect the time of reading and writing. The differences between HDD and SSD look like noise. To provide a more robust analysis, let’s formally test this hypothesis using a simple t-test for the means:

tab <- df.res %>%
  group_by(type.file, type.time) %>%
  summarise(mean.HDD = mean(times[type.hd == 'HDD']),
            mean.SSD = mean(times[type.hd == 'SSD']),
            p.value = t.test(times[type.hd == 'SSD'],
                             times[type.hd == 'HDD'])$p.value)


print(tab)

## # A tibble: 10 x 5
## # Groups:   type.file [?]
##    type.file type.time mean.HDD mean.SSD p.value
##    <chr>     <chr>        <dbl>    <dbl>   <dbl>
##  1 csv-base  read       0.554    0.463    0.605 
##  2 csv-base  write      0.405    0.405    0.997 
##  3 csv-readr read       0.142    0.126    0.687 
##  4 csv-readr write      0.0711   0.0706   0.982 
##  5 fst       read       0.015    0.0084   0.0584
##  6 fst       write      0.00900  0.00910  0.964 
##  7 rds-base  read       0.0321   0.0303   0.848 
##  8 rds-base  write      0.0253   0.025    0.969 
##  9 rds-readr read       0.0323   0.0304   0.845 
## 10 rds-readr write      0.0251   0.0247   0.957

As we can see, the null hypothesis of equal means is far from being rejected at the 10% level for almost all file types and operations. The exception is the fst format in the reading operation (p-value of 0.0584). In other words, statistically, there is no detectable difference in time between using the SSD or the HDD to read or write files in the different formats.
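
One thing to keep in mind when reading this table: within each group, the timings come from different values of N, so much of the spread in each sample reflects file size rather than the drive. A possible refinement, which is not part of the analysis above, is to pair the HDD and SSD timings by N. The sketch below uses simulated timings (the numbers are made up for illustration, not measured) only to show how the two calls to t.test differ.

```r
set.seed(1)

# Same values of N as in the benchmark grid
N <- seq(1, 500000, by = 50000)

# Simulated (NOT measured) timings: time grows linearly with N and the
# SSD is assumed to be a constant 2 milliseconds faster, plus small noise
time.hdd <- 1e-7 * N + 0.005 + rnorm(length(N), sd = 2e-4)
time.ssd <- 1e-7 * N + 0.003 + rnorm(length(N), sd = 2e-4)

# Two-sample (Welch) t-test, as used in the table above
p.welch <- t.test(time.ssd, time.hdd)$p.value

# Paired t-test: compares SSD and HDD at the same N
p.paired <- t.test(time.ssd, time.hdd, paired = TRUE)$p.value

print(c(welch = p.welch, paired = p.paired))
```

In this simulated setting the paired test should report a much smaller p-value, because the variation caused by N cancels out in the differences. Whether the same happens with the real timings would need to be checked.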

I am very surprised by this result. Regardless of the format, I expected a large difference, as SSD drives are much faster within an OS. Am I missing something? Is this due to the OS being on the SSD? What do you guys think?
