I’ve been playing with parallelising the Rcpp11 implementation of sugar. For example, suppose we have a NumericVector variable x and we want to compute e.g. sqrt(exp(x)) + 2.0. With sugar, we can do:

NumericVector y = sqrt(exp(x)) + 2.0 ;
and this does not generate useless temporaries the way R would. That ship has sailed already. Here are some benchmarks with a serial sugar-based version:
// [[export]]
NumericVector cpp_serial( NumericVector x ){
    NumericVector y = sqrt(exp(x)) + 2.0 ;
    return y ;
}
and a pure R version:
R <- function(x) sqrt(exp(x)) + 2.0
we get these results:
> x <- rnorm(1e7)
> microbenchmark(R(x), cpp_serial(x), times = 10)
Unit: milliseconds
          expr      min       lq   median       uq      max neval
          R(x) 233.4660 233.6885 234.0395 234.5152 265.7339    10
 cpp_serial(x) 155.5668 155.8760 156.4211 164.3740 164.8633    10
Nice, we got a speedup. But we are still doing a serial computation, and therefore not taking advantage of the available hardware concurrency.
Concurrency is part of the C++11 standard, and using threads and related facilities has never been easier. I started working on taking advantage of that for sugar, and after a few hours I’m happy with it. I’ve pushed the code to the master development branch on github. It is not enabled by default because, unfortunately, the C++11 support available on some platforms is incomplete and does not provide the <thread> header. I’ll keep my ranting about windows for another … thread. Anyway, to enable the feature, you need to define the RCPP11_EXPERIMENTAL_PARALLEL macro.
A parallel version of the previous code looks like this:
#define RCPP11_EXPERIMENTAL_PARALLEL
#include <Rcpp11>
using namespace Rcpp11 ;

// [[export]]
NumericVector cpp_4threads( NumericVector x ){
    NumericVector y = threads(4) >> sqrt(exp(x)) + 2.0 ;
    return y ;
}
The syntax is not set in stone, but I like how the specification of the number of threads to use and the actual sugar expression are separated. With auto, we can split the code into two statements without losing performance.
// [[export]]
NumericVector cpp_4threads( NumericVector x ){
    auto fudge = sqrt(exp(x)) + 2.0 ;
    NumericVector y = threads(4) >> fudge ;
    return y ;
}
With the same data as before, I get these benchmark results:
$ Rcpp11Script -v /tmp/test.cpp
/Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB 'file163f711c09c0.cpp'
clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp11/include" -fPIC -O3 -c file163f711c09c0.cpp -o file163f711c09c0.o
clang++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/usr/local/lib -o file163f711c09c0.so file163f711c09c0.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation

> x <- rnorm(1e+07)
> require(microbenchmark)
Loading required package: microbenchmark
> R <- function(x) sqrt(exp(x)) + 2
> microbenchmark(R(x), cpp_serial(x), cpp_4threads(x),
+     times = 10)
Unit: milliseconds
            expr      min        lq    median        uq       max neval
            R(x) 233.5048 234.21418 242.78265 243.67149 246.35764    10
   cpp_serial(x) 155.4508 155.57387 155.80123 164.56398 188.42263    10
 cpp_4threads(x)  55.1221  55.38649  56.85528  64.20854  65.46276    10
Of course, as usual with threading, we don’t get a 4x performance improvement just because we use 4 threads, but close to 3x is already pretty good.
Don’t just go littering your code with threads(4) >> everywhere though; the performance gain depends on many factors, such as the hardware, the data size, and the individual cost of the scalar operation. If the data size is too small, threading can even hurt performance, as the timings become dominated by the overhead of creating threads, …
For these reasons, this feature is not used automatically yet, i.e. you don’t get parallel sugar for free; you need to express it explicitly with the threads(4) >> syntax. This might change in the future.
Gory details
For the fans of the gory details: internally, all of this is implemented by a parallel version of std::transform:
template <typename InputIterator, typename OutputIterator, typename Function>
void transform( int nthreads, InputIterator begin, InputIterator end, OutputIterator target, Function fun ){
    std::vector<std::thread> workers(nthreads-1) ;
    R_xlen_t chunk_size = std::distance(begin, end) / nthreads ;
    R_xlen_t start = 0 ;
    for( int i=0; i<nthreads-1; i++, start += chunk_size ){
        workers[i] = std::thread(
            std::transform<InputIterator, OutputIterator, Function>,
            begin + start, begin + start + chunk_size, target + start, fun
        ) ;
    }
    std::transform( begin + start, end, target + start, fun ) ;
    for( int i=0; i<nthreads-1; i++) workers[i].join() ;
}