matrixStats: Optimized subsetted matrix calculations

Posted on December 16, 2015 by Henrik Bengtsson in R bloggers | 0 Comments

[This article was first published on jottR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The matrixStats package provides highly optimized functions for computing common summaries over rows and columns of matrices. In a previous blog post, I showed that, instead of using apply(X, MARGIN=2, FUN=median), we can speed up calculations dramatically by using colMedians(X). In the most recent release (version 0.50.0), matrixStats has been extended to perform optimized calculations also on a subset of rows and/or columns specified via new arguments rows and cols, e.g. colMedians(X, cols=1:50).

It’s time to leave apply() behind.

For instance, assume we wish to find the median value of the first 50 columns of matrix X with 1,000,000 rows and 100 columns. For simplicity, assume

> X <- matrix(rnorm(1e6*100), nrow=1e6, ncol=100)

To get the median values without matrixStats, we would do

> y <- apply(X[,1:50], MARGIN=2, FUN=median)
> str(y)
 num [1:50] -0.001059 0.00059 0.001316 0.00103 0.000814 ...

As in the past, we could use matrixStats to do

> y <- colMedians(X[,1:50])

which is much faster than apply() with median().

However, both approaches require that X is subsetted before the actual calculations can be performed, i.e. the temporary object X[,1:50] is created. In this example, the size of the original matrix is ~760 MiB and the subsetted one is ~380 MiB;

> object.size(X)
800000200 bytes
> object.size(X[,1:50])
400000100 bytes

This temporary object is created by (i) R first allocating the size for it and then (ii) copying all its values over from X. After the medians have been calculated this temporary object is automatically discarded and eventually (iii) R's garbage collector will deallocate its memory. This introduces overhead in form of extra memory usage as well as processing time.

Starting with matrixStats 0.50.0, we can avoid this overhead by instead using

> y <- colMedians(X, cols=1:50)

This uses less memory, because no internal copy of X[,1:50] has to be created. Instead all calculations are performed directly on the source object X. Because of this, the latter approach of subsetting is also faster.

Bootstrapping example

Subsetted calculations occur naturally in bootstrap analysis. Assume we want to calculate the median for each column of a 100-by-10,000 matrix X where the rows are resampled with replacement 1,000 times. Without matrixStats, this can be done as

B <- 1000
Y <- matrix(NA_real_, nrow=B, ncol=ncol(X))
for (b in seq_len(B)) {
  rows <- sample(seq_len(nrow(X)), replace=TRUE)
  Y[b,] <- apply(X[rows,], MARGIN=2, FUN=median)
}

However, powered with the new matrixStats we can do

B <- 1000
Y <- matrix(NA_real_, nrow=B, ncol=ncol(X))
for (b in seq_len(B)) {
  rows <- sample(seq_len(nrow(X)), replace=TRUE)
  Y[b,] <- colMedians(X, rows=rows)
}

In the first approach, with explicit subsetting (X[rows,]), we are creating a large number of temporary objects - each of size object.size(X[rows,]) == object.size(X) - that all need to be allocated, copied and deallocated. Thus, if X is a 100-by-10,000 double matrix of size 8,000,200 bytes = 7.6 MiB we are allocating and deallocating a total of 7.5 GiB worth of RAM when using 1,000 bootstrap samples. With a million bootstrap samples, we're consuming a total of 7.3 TiB RAM. In other words, we are wasting lots of compute resources on memory allocation, copying, deallocation and garbage collection. Instead, by using the optimized subsetted calculations available in matrixStats (>= 0.50.0), which is used in the second approach, we spare the computer all that overhead.

Not only does the peak memory requirement go down by roughly a half, but the overall speedup is also substantial; using a regular notebook the above 1,000 bootstrap samples took 660 seconds (=11 minutes) to complete using apply(X[rows,]), 85 seconds (8x speedup) using colMedians(X[rows,]) and 45 seconds (15x speedup) using colMedians(X, rows=rows).

Availability

The matrixStats package can be installed on all common operating systems as

> install.packages("matrixStats")

The source code is available on GitHub.

Credits

Support for optimized calculations on subsets was implemented by Dongcan Jiang. Dongcan is a Master's student in Computer Science at Peking University and worked on this project from April to August 2015 through support by the Google Summer of Code 2015 program. This GSoC project was mentored jointly by me and Hector Corrada Bravo at University of Maryland. We would like to thank Dongcan again for this valuable addition to the package and the community. We would also like to thank Google and the R Project in GSoC for making this possible.

Any type of feedback, including bug reports, is always appreciated!

Links

CRAN package: http://cran.r-project.org/package=matrixStats
Source code and bug reports: https://github.com/HenrikBengtsson/matrixStats
Google Summer of Code (GSoC): https://developers.google.com/open-source/gsoc/
R Project in GSoC (R-GSoC): https://github.com/rstats-gsoc
matrixStats in R-GSoC 2015: https://github.com/rstats-gsoc/gsoc2015/wiki/matrixStats

PACKAGE: matrixStats 0.13.1 - Methods that Apply to Rows and Columns of a Matrix (and Vectors)

A new release 0.13.1 of matrixStats is now on CRAN. The source code is available on GitHub. What does it do? The matrixStats package provides highly optimized functions for computing common summaries over rows and columns of matrices, e.g. rowQuantiles(). There are also functions that operate on vectors, e.g. logSumExp().…

January 26, 2015

In "R bloggers"

Introducing the fastverse: An Extensible Suite of High-Performance and Low-Dependency Packages for Statistical Computing and Data Manipulation

The fastverse is a suite of complementary high-performance packages for statistical computing and data manipulation in R. Developed independently by various people, fastverse packages jointly contribute to the objectives of: Speeding up R throug...

August 12, 2021

In "R bloggers"

matrixStats: Consistent Support for Name Attributes via GSoC Project

Author: Angelina Panagopoulou, GSoC student developer, undergraduate in the Department of Informatics & Telecommunications (DIT), University of Athens, Greece We are glad to announce recent CRAN releases of matrixStats with support for hand...

August 22, 2021

In "R bloggers"

To leave a comment for the author, please follow the link and comment on their blog: jottR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

matrixStats: Optimized subsetted matrix calculations

Bootstrapping example

Availability

Credits

Links

Related

PACKAGE: matrixStats 0.13.1 - Methods that Apply to Rows and Columns of a Matrix (and Vectors)

Introducing the fastverse: An Extensible Suite of High-Performance and Low-Dependency Packages for Statistical Computing and Data Manipulation

matrixStats: Consistent Support for Name Attributes via GSoC Project

Bootstrapping example

Availability

Credits

Links

Related

PACKAGE: matrixStats 0.13.1 - Methods that Apply to Rows and Columns of a Matrix (and Vectors)

Introducing the fastverse: An Extensible Suite of High-Performance and Low-Dependency Packages for Statistical Computing and Data Manipulation

matrixStats: Consistent Support for Name Attributes via GSoC Project

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)