Vectorized vs Devectorized
This gist from @hadley has been on my mind for some time. It was itself a follow-up to this post from @johnmyleswhite. The problem was that sugar vectorised code suffered a performance penalty compared to devectorised code. This is particularly troubling to me because the whole point of sugar is to have nice syntax without paying the price of the abstraction.
The code using sugar looks like this:
```cpp
void rcpp_vectorised() {
    NumericVector a = NumericVector::create(1, 1);
    NumericVector b = NumericVector::create(2, 2);
    NumericVector x(2);
    for (int j = 0; j < 1e6; ++j) {
        x = a + b;
    }
}
```
and the devectorised version, using a manual for loop:
```cpp
void rcpp_devectorised() {
    NumericVector a = NumericVector::create(1, 1);
    NumericVector b = NumericVector::create(2, 2);
    NumericVector x(2);
    for (int j = 0; j < 1e6; ++j) {
        for (int i = 0; i < 2; ++i) {
            x[i] = a[i] + b[i];
        }
    }
}
```
So we have two `NumericVector`s, `a` and `b`, of length 2, that we assign into another `NumericVector` of length 2. Here are benchmark results from Hadley's gist:
```r
microbenchmark(
    rcpp_vectorised(),
    rcpp_devectorised(),
    unit = "s")
# Unit: seconds
#                 expr      min       lq   median       uq     max neval
#    rcpp_vectorised() 0.012043 0.012308 0.012453 0.012580 0.01319   100
#  rcpp_devectorised() 0.000837 0.000848 0.000865 0.000887 0.00101   100
```
So quite a difference indeed. There were a few reasons for this. The first was that sugar needed some modernisation. In the next release of `Rcpp11`, we've made progress in that direction: some sugar expressions now directly control how to apply themselves into their target vector. This differs from historic implementations (as found e.g. in `Rcpp`), which were entirely based on `operator[]`.
The relevant working code for `operator+` between two vectors is now a direct call to `std::transform`:
```cpp
template <typename Target>
inline void apply(Target& target) const {
    std::transform(
        sugar_begin(lhs), sugar_end(lhs),
        sugar_begin(rhs),
        target.begin(),
        op);
}
```
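For contrast, here is a rough sketch of what the historic `operator[]`-based scheme looks like (illustrative code, not the actual `Rcpp` source): the target is filled in a loop, and every element access goes through the expression's `operator[]`:

```cpp
// Illustrative sketch of the operator[]-based scheme (not the actual
// Rcpp source): the target pulls each element out of the expression,
// so expr[i] recomputes lhs[i] + rhs[i] on every access.
template <typename Expr>
void assign_by_index(NumericVector& target, const Expr& expr) {
    int n = target.size();
    for (int i = 0; i < n; i++) {
        target[i] = expr[i];
    }
}
```

Letting the expression apply itself instead exposes one tight `std::transform` loop to the compiler.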
With this in hand, here are new timings of the same code, built against `Rcpp11`:
```
$ Rcpp11Script /tmp/sug.cpp
> require(microbenchmark)
Loading required package: microbenchmark
> microbenchmark(vec = rcpp_vectorised(), devec = rcpp_devectorised(),
+     unit = "s")
Unit: seconds
  expr         min          lq      median          uq         max neval
   vec 0.004440250 0.004457024 0.004464254 0.004479742 0.006363814   100
 devec 0.001306806 0.001307507 0.001308996 0.001324055 0.001420102   100
```
The vectorised version does much better than with `Rcpp`:
```
$ RcppScript /tmp/sug.cpp
> require(microbenchmark)
Loading required package: microbenchmark
> microbenchmark(vec = rcpp_vectorised(), devec = rcpp_devectorised(),
+     unit = "s")
Unit: seconds
  expr         min          lq      median          uq         max neval
   vec 0.014118233 0.014155100 0.014201735 0.014477436 0.015247361   100
 devec 0.001256794 0.001259593 0.001269401 0.001278607 0.001394016   100
```
But it is still not optimal. However, we can't go much further, and I'll try to explain why. When we do:
```cpp
for (int i = 0; i < 2; ++i) {
    x[i] = a[i] + b[i];
}
```
the number of elements to process is known and assumed equal between `x`, `a` and `b`. But when we do:
```cpp
x = a + b;
```
the assignment operator first has to check that the sizes match, so we need to get the length of at least `x` and `a` (we'll assume `a` and `b` have matching sizes). That's your time difference right there.
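To make this concrete, here is a hedged sketch of what the vectorised assignment conceptually does (illustrative names, not the actual `Rcpp11` source):

```cpp
// Sketch of what a sugar assignment conceptually does (illustrative
// names, not the actual Rcpp11 source). The size check is precisely
// the work that the devectorised loop never pays for.
template <typename Expr>
void assign_expression(NumericVector& target, const Expr& expr) {
    if (expr.size() != target.size()) {  // two length lookups per assignment
        stop("size mismatch");
    }
    expr.apply(target);  // e.g. the std::transform call shown earlier
}
```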
Getting the length of a vector is not a free operation. It is even quite an expensive operation in `Rcpp`, which delegates it to the `Rf_length` function from the R API:
```c
INLINE_FUN R_len_t length(SEXP s)
{
    int i;
    switch (TYPEOF(s)) {
    case NILSXP:
        return 0;
    case LGLSXP:
    case INTSXP:
    case REALSXP:
    case CPLXSXP:
    case STRSXP:
    case CHARSXP:
    case VECSXP:
    case EXPRSXP:
    case RAWSXP:
        return LENGTH(s);
    case LISTSXP:
    case LANGSXP:
    case DOTSXP:
        i = 0;
        while (s != NULL && s != R_NilValue) {
            i++;
            s = CDR(s);
        }
        return i;
    case ENVSXP:
        return Rf_envlength(s);
    default:
        return 1;
    }
}
```
This contains redundant work for our use case: we don't need the switch, since we already know the R type of the object; that's the whole point of wrapping these R objects in C++ classes. Furthermore, `LENGTH` itself has work to do. Depending on whether there is support for long vectors, it is implemented either like this:
```c
# define SHORT_VEC_LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
# define LENGTH(x) (IS_LONG_VEC(x) ? R_BadLongVector(x, __FILE__, __LINE__) : SHORT_VEC_LENGTH(x))
```
or directly like this:
```c
# define LENGTH(x) (((VECSEXP) (x))->vecsxp.length)
```
So getting the length of a vector is not free, and that is why we can't match the devectorised version in this case: for such small vectors, the timings are completely dominated by length retrieval.
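This is also where a C++ wrapper can win: since a `NumericVector` is known to hold a `REALSXP`, its `size()` can in principle skip the switch and read the length directly. A minimal sketch, assuming access to R's internal representation (the `NumVec` class is hypothetical):

```cpp
#define USE_RINTERNALS   // assumption: makes LENGTH expand to the direct macro
#include <Rinternals.h>

// Minimal sketch (hypothetical class): a wrapper that knows its SEXP
// is a vector type can read the length directly, with no type switch
// and no function call.
class NumVec {
    SEXP data;
public:
    explicit NumVec(SEXP x) : data(x) {}
    R_len_t size() const { return LENGTH(data); }
};
```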
Here are variations of @hadley's code, but measuring internally to reduce various overheads (`.Call`, ...):
```cpp
#include <Rcpp.h>
#include <chrono>
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::duration<double, std::micro> microseconds;
typedef Clock::time_point time_point;

// [[Rcpp::export]]
double rcpp_devectorised(NumericVector a, NumericVector b,
                         NumericVector x, int n, int niter) {
    time_point start = Clock::now();
    for (int j = 0; j < niter; ++j) {
        for (int i = 0; i < n; ++i) {
            x[i] = a[i] + b[i];
        }
    }
    return std::chrono::duration_cast<microseconds>(Clock::now() - start).count();
}

// [[Rcpp::export]]
double rcpp_vectorised(NumericVector a, NumericVector b,
                       NumericVector x, int n, int niter) {
    time_point start = Clock::now();
    for (int j = 0; j < niter; ++j) {
        x = a + b;
    }
    return std::chrono::duration_cast<microseconds>(Clock::now() - start).count();
}

// [[Rcpp::export]]
List rcpp_length_vec(NumericVector x, NumericVector a, int niter) {
    time_point start = Clock::now();
    int n = 0;
    for (int j = 0; j < niter; ++j) {
        n += x.size() + a.size();
    }
    time_point end = Clock::now();
    // n is returned so that the compiler will not outsmart us and
    // optimise the loop away, which would be likely to happen otherwise
    return List::create(
        std::chrono::duration_cast<microseconds>(end - start).count(),
        n);
}
```
So we have the devectorised version, the vectorised version, and a third function calling `.size()` as many times as the sugar implementation needs to. We can compile this file with both `Rcpp` and `Rcpp11` and run our own benchmark. An interesting thing to note is that none of these functions allocate between the two clock ticks, so there is no garbage-collector pollution of the timings.
```r
require(microbenchmark)

env <- new.env()
Rcpp::sourceCpp("/tmp/sugar.cpp", env = env)
Rcpp_vec   <- env[["rcpp_vectorised"]]
Rcpp_devec <- env[["rcpp_devectorised"]]
Rcpp_len   <- env[["rcpp_length_vec"]]

attributes::sourceCpp("/tmp/sugar.cpp")
Rcpp11_vec   <- rcpp_vectorised
Rcpp11_devec <- rcpp_devectorised
Rcpp11_len   <- rcpp_length_vec

bench <- function(n, niter) {
    a <- rnorm(n)
    b <- rnorm(n)
    x <- numeric(n)
    do.call(rbind, lapply(list(
        Rcpp_vec     = replicate(100, Rcpp_vec(a, b, x, n, niter)),
        Rcpp11_vec   = replicate(100, Rcpp11_vec(a, b, x, n, niter)),
        Rcpp_devec   = replicate(100, Rcpp_devec(a, b, x, n, niter)),
        Rcpp11_devec = replicate(100, Rcpp11_devec(a, b, x, n, niter)),
        Rcpp_len     = replicate(100, Rcpp_len(x, a, niter)[[1]]),
        Rcpp11_len   = replicate(100, Rcpp11_len(x, a, niter)[[1]])
    ), summary))
}
```
| timings (microseconds) | n = 2 | n = 10 | n = 100 | n = 1000 |
|---|---:|---:|---:|---:|
| vectorised (Rcpp) | 14476.33 | 27838.04 | 198466.63 | 1931035.86 |
| vectorised (Rcpp11) | 4715.86 | 10003.69 | 44361.24 | 413752.18 |
| devectorised (Rcpp) | 3776.75 | 7960.56 | 39628.00 | 410612.11 |
| devectorised (Rcpp11) | 3775.82 | 7899.00 | 39644.37 | 410655.31 |
| lengths (Rcpp) | 8517.55 | 8513.23 | 8492.28 | 8501.91 |
| lengths (Rcpp11) | 939.88 | 939.88 | 939.88 | 939.88 |
A few things from reading this table:

- For small vectors, there is a noticeable performance difference between the vectorised and devectorised code using `Rcpp11`, but the difference is fixed, so its weight becomes smaller as the size of the vectors grows.
- Just getting the lengths of vectors is cheaper with `Rcpp11`, as we use a closer-to-the-metal macro rather than the overcautious `Rf_length`.
This is, I think, good enough. A potential source of improvement for the `Rcpp11` version could be to take advantage of multithreading, either manually or by leveraging a threaded STL implementation: all we would need is a threaded version of `std::transform`. I'll save this for later, especially since we can't yet use `<thread>` on Windows with the current version of Rtools.
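For the curious, here is a minimal sketch of what a manually threaded transform could look like (my own illustration, not part of `Rcpp11`); given the cost of spawning a thread, it would only pay off for large vectors:

```cpp
#include <algorithm>
#include <iterator>
#include <thread>

// Sketch of a two-way parallel std::transform (illustrative only).
// Assumes random access iterators. The first half of the range runs
// on a worker thread, the second half on the calling thread.
template <typename In1, typename In2, typename Out, typename Op>
void threaded_transform(In1 first1, In1 last1, In2 first2, Out out, Op op) {
    auto half = std::distance(first1, last1) / 2;
    std::thread worker([=] {
        std::transform(first1, first1 + half, first2, out, op);
    });
    std::transform(first1 + half, last1, first2 + half, out + half, op);
    worker.join();  // both halves are done when we return
}
```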
A design question
Good, if you made it all the way down here, you might be able to help me. At the moment, `Vector::size()` delegates to `SHORT_VEC_LENGTH` (I'll get to long vector support later), which is cheap but not free. An alternative would be to store the length as part of the C++ object. With this, retrieving the length would be instant, but we would have to compute it every time we create a `Vector` or update its underlying data.
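To make the trade-off concrete, here is a rough sketch of the two alternatives (hypothetical classes and a hypothetical `set_data` update hook, not the actual `Rcpp11` code):

```cpp
// Alternative 1: read the length from the R object header on every
// call. Cheap (a couple of dereferences) but not free.
class VectorA {
    SEXP data;
public:
    explicit VectorA(SEXP x) : data(x) {}
    R_len_t size() const { return SHORT_VEC_LENGTH(data); }
};

// Alternative 2: cache the length in the C++ object. size() becomes a
// plain member read, but the cache must be refreshed every time the
// underlying SEXP is created or replaced.
class VectorB {
    SEXP data;
    R_len_t len;
public:
    explicit VectorB(SEXP x) : data(x), len(SHORT_VEC_LENGTH(x)) {}
    void set_data(SEXP x) { data = x; len = SHORT_VEC_LENGTH(x); }
    R_len_t size() const { return len; }
};
```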
I'm happy with either option. Ideas? Comments?