Handling Large Data with R
The following experiments are inspired by this excellent presentation by Ryan Rosario: http://statistics.org.il/wp-content/uploads/2010/04/Big_Memory%20V0.pdf. R provides many I/O functions for reading and writing data, such as read.table and write.table (see http://cran.r-project.org/doc/manuals/R-intro.html#Reading-data-from-files). With data growing larger by the day, new approaches have become available for faster I/O operations.
The presentation proposes several solutions in the form of R packages. Here are some benchmarking results for I/O.
Testing bigmemory package
Test Background & Motivation
R holds its working data in RAM, which can cause performance problems with large datasets. The bigmemory package creates a variable, X <- big.matrix(...), such that X is a pointer to a dataset stored either in RAM or on the hard drive. Just as in the C world, we create a reference to the object rather than a copy of it. R objects such as matrices are held in memory and accessed through this pointer, which lets several parallel R processes work on the same object and makes memory-efficient parallel analysis possible.
The bigmemory package mainly uses a binary file format, in contrast to the ASCII (text) format used by the classic functions in the R utils package.
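The reference behaviour described above can be seen in a small sketch. It assumes the bigmemory package is installed; the variable names are illustrative only and do not come from the original experiments.

```r
library(bigmemory)

# A small in-memory big.matrix; the R object x holds a pointer to the data.
x <- big.matrix(nrow = 3, ncol = 3, type = "double", init = 0)

# Assignment copies the pointer, not the data, so y refers to the same matrix.
y <- x
y[1, 1] <- 42
from_x <- x[1, 1]   # the change made through y is visible through x

# deepcopy() makes an independent copy when one is needed.
z <- deepcopy(x)
z[1, 1] <- -1
still_x <- x[1, 1]  # x is unaffected by changes made to z
```

This differs from ordinary R matrices, where y <- x followed by a modification of y leaves x unchanged.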
Testing tools
- Package home: Big Memory
- Timing tool: https://www.r-bloggers.com/here%E2%80%99s-an-improved-system-time-function-for-r/
- Faster file reading function : https://www.r-bloggers.com/faster-files-in-r/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29. This function proposes a faster file parsing mechanism and implementation with respect to read.csv. The code reads blocks of data (bytes) which makes it read faster.
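The block-reading idea can be sketched in base R as follows. This is not the code from the linked post: read_lines_block is a hypothetical helper written for illustration, and it assumes the file uses "\n" line endings and fits in memory as a single string.

```r
# Hypothetical helper: read a whole text file in one block, then split it into lines.
read_lines_block <- function(path) {
  size <- file.info(path)$size                   # file size in bytes
  buf <- readChar(path, size, useBytes = TRUE)   # one read for the entire file
  strsplit(buf, "\n", fixed = TRUE, useBytes = TRUE)[[1]]
}

# Small demonstration on a temporary file written with "\n" line endings.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeLines(c("1,2,3", "4,5,6", "7,8,9"), con, sep = "\n")
close(con)

block_lines <- read_lines_block(tmp)
base_lines <- readLines(tmp)
```

Note that a helper like this returns raw lines of text; the fields still have to be split and converted to numbers, which read.csv does as part of its work.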
Test Scenario
Reading and writing a large matrix using (write.table, read.csv) vs (big.matrix, read.big.matrix).
i. Create a large matrix of random double values.
x1 <- matrix(rnorm(10000, 1.0, 10.0), nrow=10000, ncol=10000)
ii. Write and read a large matrix using write.table and read.csv.
timeit({write.table(x1, file = filepath, sep = ",", eol = "\n", dec = ".", col.names = FALSE)})
timeit({foo = read.csv(filepath)})
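A scaled-down version of step ii can be run in a few milliseconds to check the round trip before committing to the full 10000 x 10000 matrix. The sketch below makes some assumptions of its own: elapsed is a hypothetical stand-in for the timeit helper built on system.time, the matrix is only 100 x 100, and row.names = FALSE and header = FALSE are added so that the file contains nothing but the numeric values.

```r
# Hypothetical stand-in for the timeit helper: returns elapsed seconds.
elapsed <- function(expr) system.time(expr)[["elapsed"]]

set.seed(1)
small <- matrix(rnorm(100 * 100, mean = 1.0, sd = 10.0), nrow = 100, ncol = 100)
csv_path <- tempfile(fileext = ".csv")

# Write without row or column names so the file holds only the numeric values.
t_write <- elapsed(
  write.table(small, file = csv_path, sep = ",", eol = "\n", dec = ".",
              row.names = FALSE, col.names = FALSE)
)

# header = FALSE keeps the first data row from being consumed as column names.
t_read <- elapsed(
  small_back <- read.csv(csv_path, header = FALSE)
)
```

Comparing small with as.matrix(small_back) afterwards is a quick way to confirm that the values survive the trip through the text file.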
iii. Write and read a large matrix using bigmemory package
timeit({big.matrix(x1,nrow = 10000, ncol = 10000, type = "double", separated = FALSE, backingfile = "BigMem.bin", descriptorfile = "BigMem.desc", shared = TRUE)})
timeit({foo <- read.big.matrix(filepath, sep = ",", header = FALSE, col.names = NULL, row.names = NULL,
has.row.names = FALSE, ignore.row.names = FALSE,
type = "double", backingfile = "BigMem.bin",
descriptorfile = "BigMem.desc", shared = TRUE)})
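The same read can be tried on a tiny file first. The following is a minimal sketch, assuming the bigmemory package is installed; the file names small.csv, small.bin and small.desc and the temporary directory are made up for the example and are not the files used in the benchmark.

```r
library(bigmemory)

# Write a small numeric CSV (no header, no row names) into a temporary directory.
work_dir <- tempfile("bigmem_demo")
dir.create(work_dir)
csv_path <- file.path(work_dir, "small.csv")
small <- matrix(as.numeric(1:20), nrow = 5, ncol = 4)
write.table(small, file = csv_path, sep = ",", row.names = FALSE, col.names = FALSE)

# Parse the CSV into a file-backed big.matrix; the .bin and .desc files
# are created in work_dir next to the CSV.
bm <- read.big.matrix(csv_path, sep = ",", header = FALSE, type = "double",
                      backingfile = "small.bin", backingpath = work_dir,
                      descriptorfile = "small.desc")

created <- file.exists(file.path(work_dir, c("small.bin", "small.desc")))
values <- bm[, ]   # pull the contents back into an ordinary R matrix
```

Extracting with bm[, ] copies the data into a regular matrix, which is fine at this size but defeats the purpose for a matrix that does not fit comfortably in RAM.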
iv. Testing using my.read.lines
timeit({ foo = my.read.lines(filepath)})
Test Results
Platform: Dell Precision Desktop with Intel Core 2 Duo Quad CPU @ 2.66GHz, 7.93 GB RAM.
utils function | Total elapsed time (sec) | bigmemory function | Total elapsed time (sec) | File size on disk (.csv) | Computation time saved by bigmemory
---|---|---|---|---|---
write.table | 369.79 | big.matrix | 1.51 | 1.7 GB | 99%
read.csv | 313.03 | read.big.matrix | 141.50 | 1.7 GB | 55%
* my.read.lines(filepath) took 23.73 secs.
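The "time saved" column follows directly from the elapsed times. As a quick check, the arithmetic can be reproduced in R; the numbers are copied from the table above, and the vector names are chosen here for illustration.

```r
# Elapsed times in seconds, copied from the results table.
utils_sec  <- c(write = 369.79, read = 313.03)
bigmem_sec <- c(write = 1.51,   read = 141.50)

# Share of time saved and the equivalent speed-up factor.
saved_pct <- (1 - bigmem_sec / utils_sec) * 100
speedup   <- utils_sec / bigmem_sec

round(saved_pct, 1)   # write: 99.6, read: 54.8
round(speedup, 1)     # write: 244.9, read: 2.2
```

The saving is therefore concentrated in the write step; the read step is a little over twice as fast.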
Test Discussion
The timing results show that the bigmemory package provides large gains in speed for I/O operations. The values read back into foo are accurate.
The read.big.matrix function creates a .bin file of size 789MB. This permits storing large objects (matrices etc.) in memory (in RAM) using pointer objects as references. Please see the parameters 'backingfile' and 'descriptorfile'. When a new R session is started, the user obtains a reference to the data through the descriptor file: attach.big.matrix('BigMem.desc'). This way several R processes can share memory objects via 'call by reference'.
The .desc file is an S4 type object -> https://github.com/hadley/devtools/wiki/S4
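The attach step can be illustrated within a single session. This is a sketch under stated assumptions: bigmemory is installed, and the files demo.bin and demo.desc in a temporary directory are hypothetical stand-ins for BigMem.bin and BigMem.desc. In practice the second handle would be created in a different R process.

```r
library(bigmemory)

work_dir <- tempfile("bigmem_attach")
dir.create(work_dir)

# Create a file-backed matrix; this writes demo.bin and demo.desc into work_dir.
a <- filebacked.big.matrix(nrow = 4, ncol = 2, type = "double", init = 0,
                           backingfile = "demo.bin", backingpath = work_dir,
                           descriptorfile = "demo.desc")

# A second handle, as another R session would obtain it from the descriptor file.
b <- attach.big.matrix(file.path(work_dir, "demo.desc"))

# Both handles point at the same backing data.
a[1, 1] <- 3.14
seen_by_b <- b[1, 1]
```

Because attaching only maps the existing backing file, it avoids parsing the CSV again, which is where most of the read time in the benchmark was spent.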
Advantages:
i. Faster in computation
ii. Takes less space on the file system.
iii. Subsequent loading of the data can be achieved using ‘call by reference’