Tip: Optimize your Rcpp loops
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this post, I will show you how to optimize your Rcpp
loops so that they are 2 to 3 times faster than a standard implementation.
Context
Real data example
For this post, I will use a big.matrix
which represents genotypes for 15,283 individuals, corresponding to the number of mutations (0, 1 or 2) at 287,155 different loci. Here, I will use only the first 10,000 loci (columns).
What you need to know about the big.matrix
format:
- you can easily and quickly access matrice-like objects stored on disk,
- you can use different types of storage (I use type
char
to store each element on only 1 byte), - it is column-major ordered as standard
R
matrices, - you can access elements of a
big.matrix
usingX[i, j]
inR
, - you can access elements of a
big.matrix
usingX[j][i]
inRcpp
, - you can get a
RcppEigen
orRcppArmadillo
view of abig.matrix
(see Appendix). - for more details, go to the GitHub repo.
Peek at the data:
print(dim(X)) ## [1] 15283 10000 print(X[1:10, 1:12]) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] ## [1,] 2 0 2 0 2 2 2 1 2 2 2 2 ## [2,] 2 0 1 2 1 1 1 1 2 1 2 2 ## [3,] 2 0 2 2 2 2 1 1 2 1 2 2 ## [4,] 2 2 0 2 0 0 0 2 2 2 0 2 ## [5,] 2 1 2 2 2 2 2 1 2 2 2 2 ## [6,] 2 1 2 1 2 2 1 1 2 2 2 2 ## [7,] 2 0 2 0 2 2 2 0 2 1 2 2 ## [8,] 2 1 1 2 1 1 1 1 2 1 2 2 ## [9,] 2 1 2 2 2 2 2 2 2 2 2 2 ## [10,] 2 0 2 1 2 2 2 0 2 1 2 1
What I needed
I needed a fast matrix-vector multiplication between a big.matrix
and a vector. Moreover, I could not use any RcppEigen
or RcppArmadillo
multiplication because I needed some options of efficiently subsetting columns or rows in my matrix (see Appendix).
Writing this multiplication in Rcpp
is no more than two loops:
// [[Rcpp::depends(RcppEigen, bigmemory, BH)]] #include <RcppEigen.h> #include <bigmemory/MatrixAccessor.hpp> using namespace Rcpp; // [[Rcpp::export]] NumericVector prod1(XPtr<BigMatrix> bMPtr, const NumericVector& x) { MatrixAccessor<char> macc(*bMPtr); int n = bMPtr->nrow(); int m = bMPtr->ncol(); NumericVector res(n); int i, j; for (j = 0; j < m; j++) { for (i = 0; i < n; i++) { res[i] += macc[j][i] * x[j]; } } return res; }
One test:
y <- rnorm(ncol(X)) print(system.time( test <- prod1(X@address, y) )) ## user system elapsed ## 0.664 0.004 0.668
What comes next should be transposable to other applications and other types of data.
Unrolling optimization
While searching for optimizing my multiplication, I came across this Stack Overflow answer.
Unrolling in action:
// [[Rcpp::depends(RcppEigen, bigmemory, BH)]] #include <RcppEigen.h> #include <bigmemory/MatrixAccessor.hpp> using namespace Rcpp; // [[Rcpp::export]] NumericVector prod4(XPtr<BigMatrix> bMPtr, const NumericVector& x) { MatrixAccessor<char> macc(*bMPtr); int n = bMPtr->nrow(); int m = bMPtr->ncol(); NumericVector res(n); int i, j; for (j = 0; j <= m - 4; j += 4) { for (i = 0; i < n; i++) { // unrolling optimization res[i] += (x[j] * macc[j][i] + x[j+1] * macc[j+1][i]) + (x[j+2] * macc[j+2][i] + x[j+3] * macc[j+3][i]); } // The parentheses are somehow important. Try without. } for (; j < m; j++) { for (i = 0; i < n; i++) { res[i] += x[j] * macc[j][i]; } } return res; } require(microbenchmark) print(microbenchmark( PROD1 = test1 <- prod1(X@address, y), PROD4 = test2 <- prod4(X@address, y), times = 5 )) ## Unit: milliseconds ## expr min lq mean median uq max neval ## PROD1 609.0916 612.6428 613.7418 613.3740 616.4907 617.1096 5 ## PROD4 262.2658 267.7352 267.0268 268.0026 268.0785 269.0521 5 print(all.equal(test1, test2)) ## [1] TRUE
Nice! Let’s try more. Why not using 8 or 16 rather than 4?
Rcpp::sourceCpp('https://privefl.github.io/blog/code/prods.cpp') print(bench <- microbenchmark( PROD1 = prod1(X@address, y), PROD2 = prod2(X@address, y), PROD4 = prod4(X@address, y), PROD8 = prod8(X@address, y), PROD16 = prod16(X@address, y), times = 5 )) ## Unit: milliseconds ## expr min lq mean median uq max neval ## PROD1 620.9375 627.9209 640.6087 631.1818 659.4236 663.5798 5 ## PROD2 407.6275 418.1752 417.1746 418.4589 419.0665 422.5451 5 ## PROD4 267.1687 271.4726 283.1928 271.9553 279.6698 325.6979 5 ## PROD8 241.5542 242.9120 255.4974 246.5218 267.7683 278.7307 5 ## PROD16 212.4335 213.5228 217.4781 217.1801 221.5119 222.7423 5 time <- summary(bench)[, "median"] step <- 2^(0:4) plot(step, time, type = "b", xaxt = "n", yaxt = "n", xlab = "size of each step") axis(side = 1, at = step) axis(side = 2, at = round(time))
Conclusion
We have seen that unrolling can dramatically improve performances on loops. Steps of size 8 or 16 are of relatively little extra gain compared to 2 or 4.
As pointed out in the SO answer, it can behave rather differently between systems. So, if it is for your personal use, use the maximum gain (try 32!), but as I want my function to be used by others in a package, I think it’s safer to choose a step of 4.
Appendix
You can do a big.matrix
-vector multiplication easily with RcppEigen
or RcppArmadillo
(see this code) but it lacks of efficient subsetting option.
Indeed, you still can’t use subsetting in Eigen
, but this will come as said in this feature request. For Armadillo
, you can but it is rather slow:
Rcpp::sourceCpp('https://privefl.github.io/blog/code/prods2.cpp') n <- nrow(X) ind <- sort(sample(n, size = n/2)) print(microbenchmark( EIGEN = test3 <- prodEigen(X@address, y), ARMA = test4 <- prodArma(X@address, y), ARMA_SUB = test5 <- prodArmaSub(X@address, y, ind - 1), times = 5 )) ## Unit: milliseconds ## expr min lq mean median uq max neval ## EIGEN 567.5607 570.1843 717.2433 572.9402 576.2028 1299.3285 5 ## ARMA 1242.3581 1263.8803 1329.1212 1264.7070 1284.5612 1590.0993 5 ## ARMA_SUB 455.1174 457.5862 466.3982 461.5883 465.9056 491.7935 5 print(all( all.equal(test3, test), all.equal(as.numeric(test4), test), all.equal(as.numeric(test5), test[ind]) )) ## [1] TRUE
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.