R Code Optimization

curiouscaseofsai

11 years ago

[This article was first published on iamdata, and kindly contributed to R-bloggers].

Handling Large Data with R

The following experiments are inspired by this excellent presentation by Ryan Rosario: http://statistics.org.il/wp-content/uploads/2010/04/Big_Memory%20V0.pdf. R offers many I/O functions for reading and writing data, such as ‘read.table’ and ‘write.table’ (see http://cran.r-project.org/doc/manuals/R-intro.html#Reading-data-from-files). With data growing larger by the day, many new approaches are available for faster I/O operations.

The presentation above proposes many solutions in the form of R libraries. Here are some benchmarking results for I/O.

Testing bigmemory package

Test Background & Motivation

R keeps its working data in RAM, which can cause performance issues with large datasets. The bigmemory package creates a variable X <- big.matrix(...), such that X is a pointer to a dataset that is stored in RAM or on the hard drive. As in the C world, we work with a reference to the object rather than a copy of it. Because the R objects (such as matrices) are reached through a pointer reference, several parallel R processes can access the same memory objects, which allows memory-efficient parallel analysis.
The bigmemory package mainly uses a binary file format, as opposed to the ASCII (text) format used by the classic I/O functions in the R utils package.
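
The pointer behaviour described above can be illustrated with a minimal sketch. It assumes the bigmemory package is installed; the helper set_first_cell is a hypothetical name used only for this illustration.

```r
library(bigmemory)

# A small in-RAM big.matrix; the R variable holds a pointer to the data.
X <- big.matrix(nrow = 3, ncol = 2, type = "double", init = 0)

# Hypothetical helper: writes into whatever matrix-like object it is handed.
set_first_cell <- function(m, value) {
  m[1, 1] <- value
  invisible(NULL)
}

set_first_cell(X, 42)
X[1, 1]   # reflects the change, because the big.matrix was not copied

# An ordinary R matrix is copied when modified inside the function.
Y <- matrix(0, nrow = 3, ncol = 2)
set_first_cell(Y, 42)
Y[1, 1]   # unchanged in the caller
```

The contrast between X and Y is the point: a regular matrix follows R's usual copy-on-modify rules, while a big.matrix is modified in place.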

Testing tools

Package home: Big Memory
Timing tool: https://www.r-bloggers.com/here%E2%80%99s-an-improved-system-time-function-for-r/
Faster file reading function : https://www.r-bloggers.com/faster-files-in-r/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29. This function proposes a faster file parsing mechanism than read.csv. It reads the data in blocks of bytes, which makes reading faster.
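
For readers who do not want to follow the links, here is a rough base R sketch of both ideas. The names time_elapsed and read_lines_blockwise are hypothetical stand-ins, not the linked timeit and my.read.lines implementations, which may differ in detail.

```r
# Hypothetical stand-in for the timing tool: evaluates an expression and
# returns the elapsed wall-clock seconds reported by system.time().
time_elapsed <- function(expr) {
  unname(system.time(expr)["elapsed"])
}

# Hypothetical block reader: pulls the whole file in as one block of bytes,
# then splits it into lines, instead of parsing field by field.
read_lines_blockwise <- function(path) {
  size <- file.info(path)$size
  buf <- readChar(path, nchars = size, useBytes = TRUE)
  rows <- strsplit(buf, "\n", fixed = TRUE)[[1]]
  sub("\r$", "", rows)   # tolerate Windows line endings
}

# Usage on a small temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,2,3", "4,5,6"), tmp)
secs <- time_elapsed(rows_txt <- read_lines_blockwise(tmp))

# Turn the raw lines into a numeric matrix, one row per line.
fields <- strsplit(rows_txt, ",", fixed = TRUE)
m <- matrix(as.numeric(unlist(fields)), nrow = length(rows_txt), byrow = TRUE)
```

Reading the file as a single block avoids per-field type guessing, but the parsing into numbers still has to be done afterwards, as in the last two lines.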

Test Scenario

Reading and writing a large matrix using (write.table, read.csv) vs (big.matrix, read.big.matrix).
i. Create a large matrix of random double values.

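# Draw one value per cell; rnorm(10000, ...) alone would be recycled across every column.
x1 <- matrix(rnorm(10000 * 10000, mean = 1.0, sd = 10.0), nrow = 10000, ncol = 10000)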

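ii. Write and read a large matrix using write.table and read.csv.

# Write first so the file exists; row.names = FALSE keeps the CSV purely numeric.
timeit({write.table(x1, file = filepath, sep = ",", eol = "\n", dec = ".", row.names = FALSE, col.names = FALSE)})
# header = FALSE because the file was written without column names.
timeit({foo = read.csv(filepath, header = FALSE)})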

iii. Write and read a large matrix using bigmemory package

# as.big.matrix() copies the existing matrix x1 into a file-backed big.matrix.
timeit({as.big.matrix(x1, type = "double", separated = FALSE,
backingfile = "BigMem.bin", descriptorfile = "BigMem.desc", shared = TRUE)})

timeit({foo <- read.big.matrix(filepath, sep = ",", header = FALSE, col.names = NULL, row.names = NULL,
has.row.names = FALSE, ignore.row.names = FALSE,
type = "double", backingfile = "BigMem.bin",
descriptorfile = "BigMem.desc", shared = TRUE)})

iv. Testing using my.read.lines

timeit({ foo = my.read.lines(filepath)})

Test Results

Platform: Dell Precision Desktop with Intel Core 2 Duo Quad CPU @ 2.66GHz, 7.93 GB RAM.

utils function	Elapsed time (sec)	bigmemory function	Elapsed time (sec)	File size on disk (.csv)	Time saved by bigmemory
write.table	369.79	big.matrix	1.51	1.7 GB	99%
read.csv	313.03	read.big.matrix	141.50	1.7 GB	55%

* my.read.lines(filepath) took 23.73 secs.
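
The "time saved" figures can be recomputed from the elapsed times with a few lines of R. This is only a sketch of the arithmetic; the variable names are made up for the example, and the numbers are copied from the results above.

```r
# Elapsed times (seconds) copied from the results above.
baseline <- c(write = 369.79, read = 313.03)   # write.table, read.csv
bigmem   <- c(write = 1.51,   read = 141.50)   # big.matrix, read.big.matrix

# Percentage of the baseline time that was saved.
pct_saved <- 100 * (baseline - bigmem) / baseline
round(pct_saved, 1)

# The same comparison for my.read.lines against read.csv.
pct_saved_readlines <- 100 * (313.03 - 23.73) / 313.03
round(pct_saved_readlines, 1)
```

The results come out at roughly 99.6% for writing and 54.8% for reading, in line with the 99% and 55% reported, and at roughly 92.4% for my.read.lines relative to read.csv.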

Test Discussion

The timing results show that bigmemory provides large speed gains for I/O operations: writing took 1.51 seconds instead of 369.79, and reading took 141.50 seconds instead of 313.03. The values read back into foo are accurate.
The read.big.matrix function creates a .bin file of size 789MB. This backing file lets large objects (matrices etc.) be accessed from memory through pointer objects used as references. Please see the parameters ‘backingfile’ and ‘descriptorfile’. When a new R session is loaded, the user provides the reference to the pointer via the descriptor file: attach.big.matrix("BigMem.desc"). This way several R processes can share memory objects via ‘call by reference’.
The .desc file stores the description of an S4 object -> https://github.com/hadley/devtools/wiki/S4
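
The attach step can be sketched on a small matrix as follows. This assumes the bigmemory package is installed; the file names Demo.bin and Demo.desc and the use of tempdir() are choices made for this example only, and the two "sessions" are simulated inside one R process.

```r
library(bigmemory)

work_dir <- tempdir()   # assumption: any writable directory will do

# "Session A": create a file-backed matrix plus its descriptor file.
A <- filebacked.big.matrix(nrow = 4, ncol = 3, type = "double", init = 0,
                           backingfile = "Demo.bin",
                           backingpath = work_dir,
                           descriptorfile = "Demo.desc")

# "Session B": attach to the same data through the descriptor file.
B <- attach.big.matrix(file.path(work_dir, "Demo.desc"))

# A write through one handle is visible through the other,
# because both refer to the same backing file.
A[2, 2] <- 3.14
B[2, 2]
```

In a real setup the attach call would run in a separate R process, which is what lets several workers share one copy of the data.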

Advantages:
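i. Faster I/O (see the elapsed times above).
ii. Takes less space on the file system: the .bin file is 789MB, versus 1.7GB for the .csv file.
iii. Subsequent loading of the data can be achieved using ‘call by reference’, by attaching the descriptor file instead of re-reading the data.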
