Site icon R-bloggers

Algorithms for unweighted sampling without replacement

[This article was first published on R on Ralf Stubner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I am currently working on weighted sampling for dqrng, c.f. #72, for which also have to decide on the algorithm(s) to use for weighted sampling without replacement. Before looking at that I wanted to verify my decisions for the unweighted case.

Using the new header file dqrng_sample.h from the currently released version v0.3.1 and the ability to access the global RNG from the current development version, it is easy to write functions that make use of the three provided algorithms: Partial Fisher-Yates shuffle, rejection sampling using a hash set and rejection sampling using a bit set:

#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
// requires dqrng > v0.3.1
#include <dqrng.h>
#include <dqrng_sample.h>

// [[Rcpp::export]]
Rcpp::IntegerVector sample_shuffle(int n, int size) {
  dqrng::random_64bit_accessor rng;
  return dqrng::sample::no_replacement_shuffle<Rcpp::IntegerVector, uint32_t>
    (rng, uint32_t(n), uint32_t(size), 1);
}

// [[Rcpp::export]]
Rcpp::IntegerVector sample_hashset(int n, int size) {
  dqrng::random_64bit_accessor rng;
  using set_t = dqrng::minimal_hash_set<uint32_t>;
      
  return dqrng::sample::no_replacement_set<Rcpp::IntegerVector, uint32_t, set_t>
    (rng, uint32_t(n), uint32_t(size), 1);
}

// [[Rcpp::export]]
Rcpp::IntegerVector sample_bitset(int n, int size) {
  dqrng::random_64bit_accessor rng;
  using set_t = dqrng::minimal_bit_set;
      
  return dqrng::sample::no_replacement_set<Rcpp::IntegerVector, uint32_t, set_t>
    (rng, uint32_t(n), uint32_t(size), 1);
}

Next we can benchmark these algorithms against each other and the implementation from R itself for different population sizes n and selection ratios r:

bp <- bench::press(
  n = 10^(1:8),
  r = c(0.7, 0.5, 10^-(1:4)),
  {
    size <- ceiling(r * n)
    bench::mark(
      sample.int(n, size),
      sample_shuffle(n, size),
      sample_hashset(n, size),
      sample_bitset(n, size),
      check = FALSE,
      time_unit = "s"
    )
  }
) |> mutate(label = as.factor(attr(expression, "description")))
## Warning: Some expressions had a GC in every iteration; so filtering is
## disabled.

## Warning: Some expressions had a GC in every iteration; so filtering is
## disabled.

## Warning: Some expressions had a GC in every iteration; so filtering is
## disabled.

## Warning: Some expressions had a GC in every iteration; so filtering is
## disabled.
ggplot(bp, aes(x = n, y = median, color = label)) + 
  geom_line() + scale_x_log10() + scale_y_log10() + facet_wrap(vars(r))

We learn:

This is exactly how it is implemented in dqrng::sample<VEC, INT>, which is quite reassuring.

To leave a comment for the author, please follow the link and comment on their blog: R on Ralf Stubner.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version