Roulette-wheel selection for dqrng (part 2)
I have blogged about weighted sampling before. There I found that the stochastic acceptance method suggested by Lipowski and Lipowska (2012) (also at https://arxiv.org/abs/1109.3627) is very promising:
```cpp
// [[Rcpp::depends(dqrng,BH,sitmo)]]
#include <Rcpp.h>
#include <dqrng_distribution.h>

auto rng = dqrng::generator<>(42);

// [[Rcpp::export]]
Rcpp::IntegerVector sample_prob(int size, Rcpp::NumericVector prob) {
  Rcpp::IntegerVector result(Rcpp::no_init(size));
  double max_prob = Rcpp::max(prob);
  uint32_t n(prob.length());
  std::generate(result.begin(), result.end(),
                [n, prob, max_prob] () {
    while (true) {
      // pick a candidate index uniformly and accept it with
      // probability prob[index] / max_prob
      int index = (*rng)(n);
      if (dqrng::uniform01((*rng)()) < prob[index] / max_prob)
        return index + 1;
    }
  });
  return result;
}
```
For relatively even weight distributions, as created by `runif(n)` or `sample(n)`, performance is good, especially for larger populations:
```r
sample_R <- function (size, prob) {
  sample.int(length(prob), size, replace = TRUE, prob)
}

size <- 1e4
prob10 <- sample(10)
prob100 <- sample(100)
prob1000 <- sample(1000)
bm <- bench::mark(
  sample_R(size, prob10),
  sample_prob(size, prob10),
  sample_R(size, prob100),
  sample_prob(size, prob100),
  sample_R(size, prob1000),
  sample_prob(size, prob1000),
  check = FALSE
)
knitr::kable(bm[, 1:6])
```
| expression | min | median | itr/sec | mem_alloc | gc/sec |
|---|---|---|---|---|---|
| sample_R(size, prob10) | 228µs | 251µs | 3936.183 | 41.6KB | 2.015455 |
| sample_prob(size, prob10) | 218µs | 229µs | 4252.202 | 41.6KB | 4.057445 |
| sample_R(size, prob100) | 410µs | 431µs | 2248.562 | 63.5KB | 2.018458 |
| sample_prob(size, prob100) | 245µs | 254µs | 3830.093 | 47.6KB | 2.015839 |
| sample_R(size, prob1000) | 481µs | 487µs | 1975.758 | 53.4KB | 2.016079 |
| sample_prob(size, prob1000) | 253µs | 260µs | 3726.544 | 49.5KB | 4.070501 |
The nice performance breaks down when an uneven weight distribution is used. Here the largest element `n` is replaced by `n * n`, severely deteriorating the performance of the stochastic acceptance method:
```r
size <- 1e4
prob10 <- sample(10)
prob10[which.max(prob10)] <- 10 * 10
prob100 <- sample(100)
prob100[which.max(prob100)] <- 100 * 100
prob1000 <- sample(1000)
prob1000[which.max(prob1000)] <- 1000 * 1000
bm <- bench::mark(
  sample_R(size, prob10),
  sample_prob(size, prob10),
  sample_R(size, prob100),
  sample_prob(size, prob100),
  sample_R(size, prob1000),
  sample_prob(size, prob1000),
  check = FALSE
)
knitr::kable(bm[, 1:6])
```
| expression | min | median | itr/sec | mem_alloc | gc/sec |
|---|---|---|---|---|---|
| sample_R(size, prob10) | 160.21µs | 166.64µs | 5763.84477 | 41.6KB | 4.214877 |
| sample_prob(size, prob10) | 502.18µs | 513.8µs | 1881.10791 | 41.6KB | 2.014034 |
| sample_R(size, prob100) | 237.97µs | 247.06µs | 3876.39289 | 42.9KB | 4.061176 |
| sample_prob(size, prob100) | 3.51ms | 3.63ms | 271.02831 | 41.6KB | 0.000000 |
| sample_R(size, prob1000) | 469.18µs | 474.7µs | 2023.85555 | 53.4KB | 2.015792 |
| sample_prob(size, prob1000) | 34.21ms | 34.79ms | 28.37365 | 41.6KB | 0.000000 |
A good way to think about this was described by Keith Schwarz (2011).¹ The stochastic acceptance method can be compared to randomly throwing darts at a bar chart of the weight distribution. If the weight distribution is very uneven, there is a lot of empty space on the chart, i.e. many throws land in empty space before one hits a bar. To quantify this, one can use `max_weight / average_weight`, which measures how many tries one needs on average before a throw is successful (a quick check of these ratios follows the list):
- This is 1 for an unweighted distribution.
- This is around 2 for a random or a linear weight distribution.
- This equals the number of elements in the extreme case where all the weight is concentrated on a single element.
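These ratios are easy to check directly in R. A minimal sketch, where `tries()` is only an illustrative helper and not part of dqrng:

```r
# expected number of tries per accepted draw for stochastic acceptance
tries <- function(prob) max(prob) / mean(prob)

n <- 1000
tries(rep(1, n))            # unweighted: exactly 1
tries(sample(n))            # linear weights: about 2
tries(c(rep(0, n - 1), 1))  # all weight on a single element: n
```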
The above page also discusses an alternative: the alias method, originally suggested by Walker (1974, 1977) in the efficient formulation of Vose (1991), which is also used by R in certain cases. The general idea is to redistribute weight from high-weight items to an alias table associated with low-weight items. Let's implement it in C++:
```cpp
#include <queue>
// [[Rcpp::depends(dqrng,BH,sitmo)]]
#include <Rcpp.h>
#include <dqrng_distribution.h>

auto rng = dqrng::generator<>(42);

// [[Rcpp::export]]
Rcpp::IntegerVector sample_alias(int size, Rcpp::NumericVector prob) {
  uint32_t n(prob.size());
  std::vector<int> alias(n);
  // normalize the weights so that the average weight is one
  Rcpp::NumericVector p = prob * n / Rcpp::sum(prob);
  std::queue<int> high;
  std::queue<int> low;
  for (int i = 0; i < n; ++i) {
    if (p[i] < 1.0)
      low.push(i);
    else
      high.push(i);
  }
  // move excess weight from high-weight items into the alias slots
  // of low-weight items
  while (!low.empty() && !high.empty()) {
    int l = low.front();
    low.pop();
    int h = high.front();
    alias[l] = h;
    p[h] = (p[h] + p[l]) - 1.0;
    if (p[h] < 1.0) {
      low.push(h);
      high.pop();
    }
  }
  // remaining items keep their full slot (guards against round-off)
  while (!low.empty()) {
    p[low.front()] = 1.0;
    low.pop();
  }
  while (!high.empty()) {
    p[high.front()] = 1.0;
    high.pop();
  }
  Rcpp::IntegerVector result(Rcpp::no_init(size));
  std::generate(result.begin(), result.end(), [n, p, alias] () {
    // pick a column uniformly, then either keep it or take its alias
    int index = (*rng)(n);
    if (dqrng::uniform01((*rng)()) < p[index])
      return index + 1;
    else
      return alias[index] + 1;
  });
  return result;
}
```
First we need to make sure that all algorithms select the different possibilities with the same probabilities, which seems to be the case:
```r
library(tidyr)    # pivot_longer()
library(ggplot2)

size <- 1e6
n <- 10
prob <- sample(n)
data.frame(
  sample_R = sample_R(size, prob),
  sample_prob = sample_prob(size, prob),
  sample_alias = sample_alias(size, prob)
) |>
  pivot_longer(cols = starts_with("sample")) |>
  ggplot(aes(x = value, fill = name)) +
  geom_bar(position = "dodge2")
```
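The visual comparison can be backed up by a quick numerical check of the empirical frequencies against the normalized weights; a minimal sketch, not part of the original analysis:

```r
# compare empirical selection frequencies with the normalized weights
empirical <- prop.table(table(sample_alias(size, prob)))
theoretical <- prob / sum(prob)
round(cbind(empirical, theoretical), 3)
```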
Next we benchmark the three methods for a range of different population sizes `n` and numbers of returned samples `size`. First for a linear weight distribution:
```r
library(dplyr)    # mutate()

bp1 <- bench::press(
  n = 10^(1:4),
  size = 10^(0:5),
  {
    prob <- sample(n)
    bench::mark(
      sample_R = sample_R(size, prob),
      sample_prob = sample_prob(size, prob = prob),
      sample_alias = sample_alias(size, prob = prob),
      check = FALSE,
      time_unit = "s"
    )
  }
) |>
  mutate(label = as.factor(attr(expression, "description")))
ggplot(bp1, aes(x = n, y = median, color = label)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(vars(size))
```
For `n > size` stochastic acceptance still seems to work very well. But when many samples are created, the work done up front to even out the weights pays off. This seems to give a good way to decide which method to use; a rough sketch of such a rule follows.
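Purely as an illustration (the cut-off is a guess based on the benchmark above, not something dqrng provides), such a rule could look like this:

```r
# hypothetical helper, not part of dqrng: choose the sampler based on how many
# draws are requested; the cut-off size >= length(prob) is only a rough guess
sample_weighted <- function(size, prob) {
  if (size >= length(prob)) {
    sample_alias(size, prob)  # many draws: the setup cost of the alias table pays off
  } else {
    sample_prob(size, prob)   # few draws: stochastic acceptance skips the setup
  }
}
```

And how about an uneven weight distribution?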
```r
bp2 <- bench::press(
  n = 10^(1:4),
  size = 10^(0:5),
  {
    prob <- sample(n)
    prob[which.max(prob)] <- n * n
    bench::mark(
      sample_R = sample_R(size, prob),
      sample_prob = sample_prob(size, prob = prob),
      sample_alias = sample_alias(size, prob = prob),
      check = FALSE,
      time_unit = "s"
    )
  }
) |>
  mutate(label = as.factor(attr(expression, "description")))
ggplot(bp2, aes(x = n, y = median, color = label)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(vars(size))
```
Here the alias method is the fastest as long as more than one element is generated. But how uneven does the weight distribution have to be before one should use the alias method (almost) everywhere? Further investigations are needed …
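One possible starting point, purely as a sketch, is to look at the same `max_weight / average_weight` ratio for the uneven distributions used above; the numbers are illustrative only:

```r
# max weight / average weight for the uneven distributions used above:
# the expected number of tries per draw grows roughly like 2 * n / 3
sapply(10^(1:4), function(n) {
  prob <- sample(n)
  prob[which.max(prob)] <- n * n
  max(prob) / mean(prob)
})
```

Comparing this ratio with the cost of setting up the alias table might yield a usable criterion.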
References
¹ It looks like this method was invented more than once …