Quick Hit: Comparison of “Whole File Reading” Methods
(This is part 1 of n posts using this same data; n will likely be 2-3, and the posts are more around optimization than anything else.)
I recently had to analyze HTTP response headers (generated by a HEAD request) from around 74,000 sites (each response stored in a text file). They look like this:
HTTP/1.1 200 OK
Date: Mon, 08 Jun 2020 14:40:45 GMT
Server: Apache
Last-Modified: Sun, 26 Apr 2020 00:06:47 GMT
ETag: "ace-ec1a0-5a4265fd413c0"
Accept-Ranges: bytes
Content-Length: 967072
X-Frame-Options: SAMEORIGIN
Content-Type: application/x-msdownload
I do this quite a bit in R when we create new studies at work, but I’m usually only working with a few files. In this case I had to go through all these files to determine if a condition hypothesis (more on that in one of the future posts) was accurate.
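Parsing the headers is a topic for a later post, but to make the data concrete, here is a minimal sketch of turning one of those header blobs into a named character vector (it assumes a hypothetical hdr string holding the contents of a single file, like the example above):

# minimal sketch: one header blob -> named character vector
# `hdr` is a hypothetical string holding one file's contents
hdr_lines <- strsplit(hdr, "\r?\n")[[1]]
hdr_lines <- hdr_lines[-1]                    # drop the "HTTP/1.1 200 OK" status line
hdr_lines <- hdr_lines[nzchar(hdr_lines)]     # drop any trailing blank lines
setNames(
  sub("^[^:]+:\\s*", "", hdr_lines),          # header values
  sub(":.*$", "", hdr_lines)                  # header names
)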
Reading in a bunch of files (each one into a string) is fairly straightforward in R since readChar() will do the work of reading and we just wrap that in an iterator:
length(fils)
## [1] 73514

# check file size distribution
summary(
  vapply(
    X = fils,
    FUN = file.size,
    FUN.VALUE = numeric(1),
    USE.NAMES = FALSE
  )
)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    19.0   266.0   297.0   294.8   330.0  1330.0

# they're all super small

system.time(
  vapply(
    X = fils,
    FUN = function(.f) readChar(.f, file.size(.f)),
    FUN.VALUE = character(1),
    USE.NAMES = FALSE
  ) -> tmp
)
##    user  system elapsed
##   2.754   1.716   4.475
NOTE: You can use lapply() or sapply() to equal effect, as they all come in at around 5 seconds on a modern SSD-backed system.
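For reference, a minimal sketch of the sapply() variant (assuming the same fils vector of file paths as above):

# same read, via sapply(); assumes `fils` is the character vector of file paths from above
system.time(
  tmp <- sapply(
    X = fils,
    FUN = function(.f) readChar(.f, file.size(.f)),
    USE.NAMES = FALSE
  )
)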
Now, five seconds is completely acceptable (though that brief pause does feel awfully slow for some reason), but can we do better? I mean we do have some choices when it comes to slurping up the contents of a file into a length 1 character vector:
base::readChar()
readr::read_file()
stringi::stri_read_raw() (+ rawToChar())
Do any of them beat {base}? Let’s see (using the largest of the files):
library(stringi)
library(readr)
library(microbenchmark)

largest <- fils[which.max(sapply(fils, file.size))]

file.size(largest)
## [1] 1330

microbenchmark(
  base = readChar(largest, file.size(largest)),
  readr = read_file(largest),
  stringi = rawToChar(stri_read_raw(largest)),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##     expr     min       lq      mean   median       uq     max neval
##     base  79.862  93.5040  98.02751  95.3840 105.0125 161.566  1000
##    readr 163.874 186.3145 190.49073 189.1825 192.1675 421.256  1000
##  stringi  52.113  60.9690  67.17392  64.4185  74.9895 249.427  1000
I had predicted that the {stringi} approach would be slower given that we have to explicitly turn the raw vector into a character vector, but it is modestly faster. ({readr} has quite a bit of functionality baked into it — for good reasons — which doesn’t help it win any performance contests).
I still felt there had to be an even faster method, especially since I knew that the files all had HTTP response headers and that every one of them could be easily read into a string in (pretty much) one file read operation. That knowledge will let us make a C++ function that cuts some corners (more like “sands” some corners, really). We’ll do that right in R via {Rcpp} in this function (annotated in C++ code comments):
library(Rcpp)

cppFunction(code = '
String cpp_read_file(std::string fil) {

  // our input stream
  std::ifstream in(fil, std::ios::in | std::ios::binary);

  if (in) { // we can work with the file

#ifdef Win32
    struct _stati64 st;           // gosh i hate windows
    _stati64(fil.c_str(), &st);   // this shld work but I did not test it
#else
    struct stat st;
    stat(fil.c_str(), &st);
#endif

    std::string out;              // where we will store the contents of the file
    out.resize(st.st_size);       // make string size == file size
    in.seekg(0, std::ios::beg);   // ensure we are at the beginning
    in.read(&out[0], out.size()); // read in the file
    in.close();

    return(out);

  } else {
    return(NA_STRING);            // file missing or other errors returns NA
  }

}
', includes = c(
  "#include <fstream>",
  "#include <string>",
  "#include <sys/stat.h>"
))
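Before benchmarking it, a quick sanity check (just a sketch) that the corner-cutting reader hands back the same string as readChar() for the largest file:

# sanity check: the C++ reader should return the exact same string as readChar()
identical(
  cpp_read_file(largest),
  readChar(largest, file.size(largest))
)
# expect TRUE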
Is that going to be faster?
microbenchmark(
  base = readChar(largest, file.size(largest)),
  readr = read_file(largest),
  stringi = rawToChar(stri_read_raw(largest)),
  rcpp = cpp_read_file(largest),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##     expr     min       lq      mean   median       uq     max neval
##     base  80.500  91.6910  96.82752  94.3475 100.6945 295.025  1000
##    readr 161.679 175.6110 185.65644 186.7620 189.7930 399.850  1000
##  stringi  51.959  60.8115  66.24508  63.9250  71.0765 171.644  1000
##     rcpp  15.072  18.3485  21.20275  21.0930  22.6360  62.988  1000
It sure looks like it, but let’s put it to the test:
system.time(
  vapply(
    X = fils,
    FUN = cpp_read_file,
    FUN.VALUE = character(1),
    USE.NAMES = FALSE
  ) -> tmp
)
##    user  system elapsed
##   0.446   1.244   1.693
I’ll take a two-second wait over a five-second wait any day!
FIN
I have a few more cases coming up where there will be 3-5x the number of (similar) files that I’ll need to process, and this optimization will shave time off as I iterate through each analysis, so the trivial benefits here will pay off more down the road.
The next post in this particular series will show how to use the {future} family to reduce the time it takes to turn those HTTP headers into data we can use.
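(As a hedged teaser of the general idea, not the actual next post: {future.apply} can parallelize the per-file work with very little ceremony. The base reader is used in the sketch below since a cppFunction()-compiled function generally won’t transfer to fresh background worker sessions.)

# minimal sketch, assuming {future} / {future.apply} are installed;
# spreads the per-file reads across local background R sessions
library(future.apply)

plan(multisession)

tmp <- future_vapply(
  X = fils,
  FUN = function(.f) readChar(.f, file.size(.f)),
  FUN.VALUE = character(1),
  USE.NAMES = FALSE
)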
If I missed your favorite file slurping function, drop a note in the comments and I’ll update the post with new benchmarks.