In a post earlier this month, it seemed as though compressing a data file before reading it into R could save you some time. With some feedback from readers and further experimentation, we might need to revisit that conclusion.
To recap, in our previous experiment it took 170 seconds to read a 182 MB text file into R. But if we compressed the file first, it only took 65 seconds. Apparently, the benefits of reducing the amount of disk access (by dealing with a smaller file) far outweighed the CPU time required to decompress the file for reading.
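(For reference, here is roughly how that setup looks in practice. This is just a sketch: it assumes gzip is available on your system, and that your version of R lets read.table decompress a .gz file on the fly.)

# Make a gzipped copy of the text file, leaving the original in place
system("gzip -c bigdata.txt > bigdata-compressed.txt.gz")

# read.table() handles the decompression transparently
system.time(read.table("bigdata-compressed.txt.gz", sep=","))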
In that experiment, though, each file was only read once. If you simply repeat the read statement on the uncompressed file, you see a sudden decrease in the time required to read it:
> system.time(read.table("bigdata.txt", sep=","))
user system elapsed
165.042 1.316 165.807
> system.time(read.table("bigdata.txt", sep=","))
user system elapsed
94.248 0.934 94.673
(This was on MacOS, using the R GUI. I also tried using R from the terminal on MacOS, and also from the R GUI in Windows, using both regular R and REvolution R. There were some slight variations in the timings, but in general I got similar results.)
So what’s going on here (other than my embarrassing failure as a statistician to replicate my measurements the first time round)? One possibility is that we’re seeing the effects of disk cache: when you access data on a hard drive, most modern drives will temporarily store some of the data in high-speed memory. This makes it faster to access the file in subsequent attempts, for as long as the file data remains in the cache. But that doesn’t explain why we don’t see a similar speedup in repeated readings of the compressed file:
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user system elapsed
89.464 0.868 90.436
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user system elapsed
97.651 1.035 98.887
< color="#061D99">
< >
I’d expect the second reading to be faster if disk cache had an effect, so I don’t think disk cache is the culprit here. More revealing is the fact that the first use of read.table in any R session takes longer than subsequent ones. Reading from the gzipped file is slower than reading from the uncompressed file if it’s the first read of the session:
< color="#000000" face="'Trebuchet MS', Verdana, sans-serif">
> system.time(read.table("bigdata-compressed.txt.gz", sep=","))
user system elapsed
150.429 1.304 152.447
> system.time(read.table("bigdata.txt", sep=","))
user system elapsed
78.717 0.986 79.773
< color="#B0130E">< color="#000000">So what’s going on here? (This was using R from the terminal under MacOS; I got similar results using the R GUI on MacOS.) I don’t have a good explanation, frankly. Maybe the additional time is required by R to load libraries or to page in the R executable (but why would it scale with the file size, then?). Note that we got the speed benefits from reading the uncompressed file second, which rules out disk cache having any significant benefits. If any one has any good explanations, I’d love to hear them.
So what file type is the fastest for reading into R? Reader Peter M. Li took a much more systematic approach to answering that question than I did, running fifty trials for compressed and uncompressed files using both read.table and scan. (We can safely assume that this level of replication nullifies any first-read or caching effects.) He also tested Stata files (an open, binary data file format that R can both read and write), and he tested several sizes of each file type: one thousand, ten thousand, one hundred thousand, one million, and ten million observations. His results are summarized in the graph below, with log(file size) on the X axis and log(time to read) on the Y axis:
[Graph: time to read vs. file size (log-log scale) for each file format and read method]
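Peter's exact code isn't reproduced here, but a replicated benchmark along these lines might look something like the sketch below (the file names are placeholders, the data is assumed to be numeric for scan, and the Stata file is read with read.dta from the foreign package):

library(foreign)   # read.dta() reads Stata files

# Return the elapsed time, in seconds, of evaluating an expression
time_read <- function(expr) system.time(expr)[["elapsed"]]

trials <- 50
times <- data.frame(
  read.table = replicate(trials, time_read(read.table("bigdata.txt", sep=","))),
  read.gz    = replicate(trials, time_read(read.table("bigdata-compressed.txt.gz", sep=","))),
  scan       = replicate(trials, time_read(scan("bigdata.txt", sep=",", what=numeric(), quiet=TRUE))),
  stata      = replicate(trials, time_read(read.dta("bigdata.dta")))
)
colMeans(times)   # mean read time per method, in seconds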
So, what can we conclude from all of this? Let’s see:
- In general, compressing your text data doesn’t speed up reading it into R. If anything, it’s slower.
- The only time compressing files might be beneficial is for large files read with read.table (but not scan).
- There’s a speed penalty the first time you use read.table in an R session.
- Reading data from Stata files has significant performance benefits compared to text-based files.
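If you want to try the Stata route yourself, the foreign package that ships with R can both write and read Stata files. A minimal sketch (again, the file names are placeholders):

library(foreign)

# Convert an existing text file to Stata format once...
dat <- read.table("bigdata.txt", sep=",")
write.dta(dat, "bigdata.dta")

# ...then subsequent reads come from the binary file
system.time(dat2 <- read.dta("bigdata.dta"))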