Sampling without replacement with unequal probabilities by @ellis2013nz

[This article was first published on free range statistics - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Not proportional to w

A week ago I was surprised to read on Thomas Lumley’s Biased and Inefficient blog that when using R’s sample() function without replacement and with unequal probabilities of individual units being sampled:

“What R currently has is sequential sampling: if you give it a set of priorities w it will sample an element with probability proportional to w from the population, remove it from the population, then sample with probability proportional to w from the remaining elements, and so on. This is useful, but a lot of people don’t realise that the probability of element i being sampled is not proportional to w_i”

This surprised me profoundly – in the way that Bertrand Russell once wrote of surprise in some epistemological context of getting to the bottom of the stairs and finding there was one more stair to go than you unconsciously expected. That is, I hadn’t even been consciously aware of assuming that the probability of an element being sampled is proportional to the weight given to it, but if I had been asked if this was the case I would have given the wrong answer (‘yes’).

So I didn’t know this, but I should have.

Thomas had some examples in his blog relating to streams of data but I wanted to see for myself. So I wrote a function to simulate sampling from a small, finite population, with equal or unequal weights. So mostly, I sampled 10 units at a time from a population of 20; and gave them weights from about 0.005 to 0.095 ( a big range). The weights are forced to add up to 1 so they can be easily compared with that element’s proportion of the eventual sample of samples. And then I take some thousands of these samples of 10, and look at the proportion of the total collection that is each element.

In the charts the follow, the brown diagonal line would be where the eventual proportion of element i being sampled is proportional to its weight. First, here is the main situation of concern – unequal weights, sampling without replacement, sample size is quite a big proportion of the population (10 out of 20):

OK! so we see the elements with low weights get sampled a bit more than naively expected, and the elements with high weights a bit less.

Just to check I’m not dreaming, here’s the same simulation but this time we are sampling with replacement. Now, everything works as intuitively might be expected, the eventual proportion in our sample of samples is exactly matched to the original weights:

Here’s the code for that function and those two runs of it.

library(tidyverse)
library(glue)

compare_ppswor <- function(n = 10,
                           N = 20,
                           replace = FALSE,
                           reps = 1e5,
                           prob = 1:N / sum(1:N),
                           FUN = sample){

  x <- paste0("unit", 1:N)
  x <- factor(x, levels = x)
  prob <- prob / sum(prob)
  
  
  samples <- lapply(rep(n, reps), function(size){
    FUN(x, size = size, prob = prob, replace = replace)
    })
  
  s <- ifelse(replace, 'with replacement', 'without replacement')
  m <- case_when(identical(FUN, sample) ~ "R's native `sample()` function.", 
                 identical(FUN, sample_unequal) ~ "experimental function based on Brewer (1975).",
                 identical(FUN, sample_brewer) ~ "Brewer (1975) as implemented by Tillé/Matei.")
  
  p <- tibble(
    original = prob,
    selected = as.numeric(table(unlist(samples))) / (reps * n)
  )  |>
    ggplot(aes(x = original, y = selected)) +
    geom_abline(slope = 1 , intercept = 0, colour = "orange") +
    geom_point(colour = "steelblue") +
    labs(x = "Original 'probability' or weight",
         y = "Actual proportion of selections",
         subtitle = glue("Population of {N}, sample size {n}, sampling {s}.\nUsing {m}"),
         title = "Use of `sample()` with unequal probabilities of sampling") +
    coord_equal() 
  
  return(p)
}

compare_ppswor()
compare_ppswor(replace = TRUE)

When the sample is a smaller proportion of a population, the no-replacement discrepancy is less, as seen in these examples with population sizes of 50 and 250.

At the extreme, if the sample size is the same size as the population (20 each in the simulation below), then of course, all samples without replacement are identical to the population, so it is impossible for the end proportion to be anything other than 1/N each:

Brewer’s 1975 algorithm

Thomas was writing because there are plans to change this behaviour of sample(), or at least add an option to change it, so instead the eventual proportion of samples containing an element will be proportional to the weight even when sampling with unequal probabilities and without replacement. The method that is going to be implemented in R (written in C of course before being hooked into R – it’s an inherently iterative thing that will only be efficient in compiled code) is based on a method published way back in 1975 by Ken Brewer.

An aside – I had the privilege and pleasure to be taught sampling by Professor Brewer back in the 1990s at the Australian National Universitiy – at least it was a pleasure at the time (sheer entertainment value as well as the pleasure of witnessing someone who was a virtuoso in the subject matter, master of the theory of sampling), and I now belatedly realise what a privilege it was. I do think he struggled occasionally with some of the students though; one of my vivid memories of my whole education at ANU was of him in confusion when one of the less mathematically inclined asked him to explain what an equation taking up a fair proportion of the blackboard “means” – “Means? means? it means what it says!”

I don’t have easy access to the 1975 paper but I found this very interesting question relating to it on Cross-Validated. From that discussion and reading the part of the SAS manual referred to, it sounds like SAS implements the algorithm only for when the sample in a given strata is 2, but Brewer had generalised his method to larger samples. The answer given by the ever-reliable StasK seems pretty clear on how this is done, so presuming he got it right I had a go at implementing this method into R, with code below in the form of two functions P() and sample_unequal():

#' @param p probabilities of remaining units
#' @param n total sample size
#' @param k which sequence of the sample this is for the sampling of
P <- function(p, n, k){
  r <- n - k + 1
  D <- sum((p * (1 - p)) / (1 - r * p))
  new_p <- p * (1 - p) / (D * (1 - r * p))
  return(new_p)
}

sample_unequal <- function(x, size, prob, replace = FALSE, keep = FALSE){

  if(size > length(x)){
    stop("Sample size cannot be larger than the population of units")
  }
  
  if(replace){
    stop("Only sampling without replacement implemented at this point")
  }
  
  if(size > (1 / max(prob))){
    stop(glue("Sample size ({size}) cannot be larger than 1 / max(prob) (which is {1 / max(prob)})"))
  }
  
  d <- tibble(x, prob)
  
  the_sample <- x[NULL]
  remnants <- x
  remnant_p <- prob
  
  for(k in 1:size){
    new_p <- P(remnant_p, size, k)
    
    if(keep){
      d2 <- tibble(x = remnants, prob = new_p)
      names(d2)[2] <- paste0("k", k)
      d <- d |>
        left_join(d2, by = "x")
    }
    
    if(min(new_p) < 0){
      warning("Some negative probabilities returned")
      new_p <- pmax(0, new_p)
    }
    
    latest_sample <- sample(remnants, size = 1, prob = new_p)
    which_chosen <- which(remnants == latest_sample)
    remnants <- remnants[-which_chosen]
    remnant_p <- remnant_p[-which_chosen]
    the_sample <- c(the_sample, latest_sample)
    
      }
  
  if(keep){
    return(list(the_sample = the_sample, d = d)) 
  } else {
    return(the_sample)
  }
}  

I’m not guaranteeing I got it right - feedback is welcome.

Of course, I designed this to be plugged into my function I made earlier for comparing eventual sample proportions to the weights given, so with compare_ppswor(FUN = sample_unequal, reps = 10000) I can get this comparison:

Interestingly, it’s still not exactly ‘right’ in the sense that the eventual proportions in the sample aren’t exactly those given in the form of weights. But it’s much better at doing this than the current sample() method is.

And if we have a sample that is say 25% of the population instead of 50% with compare_ppswor(n = 5, FUN = sample_unequal, reps = 10000), then we get results where the proportions and weights are pretty indistinguishable:

OK that’s all. I played around with this a bit more, particularly trying to understand the limitations on the algorithm which breaks down as the sample size gets bigger in relation to the population, but none of it was really interesting enough to write up. A full script with a bit more in it is available in the source code for this blog.

I guess the meta-lesson is to watch out for things you’re assuming without noticing you are.

Late addition!

When I posted about this on Mastodon, Thomas Lumley drew my attention to the sampling library by Yves Tillé and Alina Matei, which contains multiple algorithms for sampling with unequal probabilities without replacement. I hadn’t even stopped to look because really this was about me learning, but yes, of course there’s an R package for that. So I wrapped their UPbrewer function in a little function to work with my comparison. Unlike me they had read Brewer’s actual article, and their implementation of course works perfectly:

So the version I hacked together based on a Cross-Validated answer clearly doesn’t do justice to the method proposed by Brewer (or more likely was some other method he proposed separately).

For the record here’s the little convenience function to test the Tillé/Matei implementation:

#--------------using sampling library------------------
library(sampling)
sample_brewer <- function(x, size, prob, replace = FALSE, keep = FALSE){
 pik <- prob / sum(prob) * size
 s <- UPbrewer(pik)
 the_sample <- x[which(s == 1)]
 return(the_sample)
 }

compare_ppswor(FUN = sample_brewer, reps = 10000)

I’ts much faster too!

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)