Site icon R-bloggers

Using the RcppArmadillo-based Implementation of R’s sample()

[This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Overview and Motivation

All of R’s (r*, p*, q*, d*) distribution functions are available in C++ via the R API. R is written in C, and the R API has no concept of a vector (at least not in the STL sense). Consequently, R’s sample() function can’t just be exported via the R API, despite its importance and usefulness. The purpose of RcppArmadilloExtensions/sample.h (written by Christian Gunning) is to provide this functionaility within C++ programs.

Given sampling’s central role in statistical programming, it is surprising that no standard library implementation for C or C++ is commonly available. There have been repeated questions about this on the Rcpp mailing list. StackExchange contains an extensive discussion of this question, but there is no “canonical” implementation.

In general, it’s best to use R’s tried-and-true RNG-related functions leaving the praise (and blame) for their performance to others. R’s C routines for sampling can be found in src/main/random.c, with a discussion of the associated algorithms in Ripley 87.

Goal

The goal is to exactly reproduce the behavior of R’s sample() function in a templated C++/Rcpp function. So far, we’ve reproduced everything but R’s implementation of the Walker Alias method (used only when sampling with replacement using >200 non-zero weights). It uses convenience functions from Armadillo, and thus is added to RcppArmadillo as an extension. (The hope is that future extensions will follow!) All you need is this simple call, which should work for any Rcpp Vector: RcppArmadillo::sample(x, size, replace, prob).

Dependencies

Make sure you have a recent version of RcppArmadillo. The earliest adequate release is 3.800.1. The usual install.packages("RcppArmadillo") will help if you need to update.

You are ready to go from there:

require(RcppArmadillo)


Loading required package: RcppArmadillo


Loading required package: Rcpp

Quick Example

Here’s a quick test to make sure it works.

Some C++ code that can be hooked into with sourceCpp():

// [[Rcpp::depends(RcppArmadillo)]]

#include <RcppArmadilloExtensions/sample.h>

using namespace Rcpp ;

// [[Rcpp::export]]
CharacterVector csample_char( CharacterVector x, 
                              int size,
                              bool replace, 
                              NumericVector prob = NumericVector::create()
                              ) {
  RNGScope scope ;
  CharacterVector ret = RcppArmadillo::sample(x, size, replace, prob) ;
  return ret ;
}

Notice that we only need #include <RcppArmadilloExtensions/sample.h> because sample.h then #include-s RcppArmadillo.

We invoke the (automatically defined) csample_char() R function:

N <- 10
set.seed(7)

sample.r <- sample(letters, N, replace=T)

set.seed(7)
sample.c <- csample_char(letters, N, replace=T)

print(identical(sample.r, sample.c))


[1] TRUE

Of course, R’s sample() function is “internally” vectorized and already fast. This functionality was not added to speed up sample()! Instead, this lets you stay in C++ when you need to sample from an Rcpp Vector, be it Numeric, Character, or whatever else.

Performance

That said, performance is still a concern. A quick test shows a dead-heat for sampling with replacement when compared to vanilla R:

#include <RcppArmadilloExtensions/sample.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp ;

// [[Rcpp::export]]
NumericVector csample_num( NumericVector x,
                           int size,
                           bool replace,
                           NumericVector prob = NumericVector::create()
                           ) {
  RNGScope scope;
  NumericVector ret = RcppArmadillo::sample(x, size, replace, prob);
  return ret;
}

Consider the following timing where we compare vanilla R’s sample() to RcppArmadillo::sample(). See the results for sampling with replacement with and without probability weights.

require(rbenchmark)


Loading required package: rbenchmark

set.seed(7)

## Definition of Sampling Frame
n.elem <- 1e2
frame1 <- rnorm(n.elem)
probs1 <- runif(n.elem)

## Definition of sampling regime
## Use replacement throughout
.replace <- TRUE
## Samplesize
n.samples1 <- 1e4

## Without probabilities
benchmark(r = sample(frame1, n.samples1, replace=.replace),        
          cpp = csample_num(frame1, n.samples1, replace=.replace),
          replications = 1e3,
          order = 'relative',
          columns = c("test", "replications", "relative", "elapsed")
          )


  test replications relative elapsed
2  cpp         1000    1.000   0.181
1    r         1000    1.182   0.214

## With probabilities
benchmark(r.prob = sample(frame1, n.samples1, prob = probs1, replace = .replace),
          cpp.prob = csample_num(frame1, n.samples1, prob = probs1, replace = .replace),
          replications = 1e3,
          order = 'relative',
          columns = c("test", "replications", "relative", "elapsed")
          )


      test replications relative elapsed
1   r.prob         1000    1.000   0.759
2 cpp.prob         1000    1.026   0.779

The two perform equally well.

Next we look at the performance of sampling without replacement. The number of draws can be no larger than the number of elements. Thus we’re sampling fewer elements. Otherwise, the code is identical.

## Use the same sampling frame as before
## Definition of sampling regime
## No replacement
.replace <- FALSE
## Since sample size can't exceed number elements, set them equal
n.samples2 <- n.elem

## Without probabilities
benchmark(r = sample(frame1, n.samples2, replace=.replace),        
          cpp = csample_num(frame1, n.samples2, replace=.replace),
          replications = 1e3,
          order = 'relative',
          columns = c("test", "replications", "relative", "elapsed")
          )


  test replications relative elapsed
2  cpp         1000    1.000   0.011
1    r         1000    1.273   0.014

## With probabilities
benchmark(r.prob = sample(frame1, n.samples2, prob = probs1, replace = .replace),
          cpp.prob = csample_num(frame1, n.samples2, prob = probs1, replace = .replace),
          replications = 1e3,
          order = 'relative',
          columns = c("test", "replications", "relative", "elapsed")
          )


      test replications relative elapsed
1   r.prob         1000        1   0.029
2 cpp.prob         1000        1   0.029

Finally, what we haven’t done. For sampling with replacement and more than 200 non-zero weights, R uses Walker’s Alias method. This method can be substantially faster than the vanilla sampling method (with replacement, less than 200 non-zero weights). Rather than risk leading users astray with inefficient and inappropriate methods, we throw an error.

## Definition of Sampling Frame
n.elem <- 1e3
frame2 <- rnorm(n.elem)
probs2 <- runif(n.elem)

## Definition of sampling regime
## Use replacement throughout
.replace <- TRUE
## Samplesize
n.samples1 <- 1e4

## With probabilities
r.prob <- sample(frame2, n.samples1, prob = probs2, replace = .replace)


Warning: Walker's alias method used: results are different from R < 2.2.0

cpp.prob <- csample_num(frame2, n.samples1, prob = probs2, replace = .replace)

To leave a comment for the author, please follow the link and comment on their blog: Rcpp Gallery.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.