
Bayesian Binomial Test in R


Summary: in this post, I implement an R function for computing \( P(\theta_1 > \theta_2) \), where \( \theta_1 \) and \( \theta_2 \) are beta-distributed random variables. This is useful for estimating the probability that one binomial proportion is greater than another.

I am working on a project in which I need to compare two binomial proportions to see if one is likely to be greater than the other. I wanted to do this in a Bayesian framework; fortunately, estimating binomial proportions is one of the first subjects taught to introduce Bayesian statistics (and statistics in general, for that matter) so the method is well established. Rasmus Bååth has a very nice article describing the Bayesian binomial test, and an estimation approach using JAGS.

I made the problem even simpler by using the fact that the beta distribution is the conjugate prior for the binomial distribution. That is, if the prior is \( \mathrm{beta}(\alpha, \beta) \) distributed, then the posterior after observing \( k \) successes in \( n \) trials is

\( \mathrm{beta}(\alpha + k, \beta + n - k) \)

If you want to start with a noninformative prior, then you can use the beta distribution with \( \alpha = \beta = 1 \), which collapses to the uniform distribution. This makes the posterior

\( \mathrm{beta}(1 + k, 1 + n - k) \)

What I am interested in is comparing two binomial proportions, \( \theta_1 \) and \( \theta_2 \). I want to calculate

\( P(\theta_1 > \theta_2) \)

We observe \( k_1 \) successes out of \( n_1 \) trials and estimate a posterior distribution for \( \theta_1 \), and \( k_2 \) successes out of \( n_2 \) trials for \( \theta_2 \), so that \( \theta_1 \sim \mathrm{beta}(1 + k_1, 1 + n_1 - k_1) \) and \( \theta_2 \sim \mathrm{beta}(1 + k_2, 1 + n_2 - k_2) \). Then the probability in question is

\( P(\theta_1 > \theta_2) = \int_0^1 \int_0^{\theta_1} p(\theta_1) \, p(\theta_2) \, d\theta_2 \, d\theta_1 \)

So far, so textbook. At this point, the simplest way to get an estimate is to draw two sets of random numbers, distributed according to the two respective beta posteriors, and count how often the first random number is greater than the second:

set.seed(1)

k1 <- 5 # Five successes
k2 <- 1 # One success
n1 <- 10 # Ten trials
n2 <- 10 # Ten trials

b1 <- rbeta(10000, 1 + k1, 1 + n1 - k1)
b2 <- rbeta(10000, 1 + k2, 1 + n2 - k2)

# Probability estimate of P(theta_1 > theta_2)
sum(b1 > b2) / 10000
## [1] 0.9688
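
(Equivalently, mean(b1 > b2) computes the same proportion.)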

This works fine, but it would be nice to have an exact solution. I found several papers describing different approaches.


Nadarajah and Kotz (2007) derived a functional form for the distribution of the difference \( \theta_1 - \theta_2 \) (Chen and Luo (2011) later fixed a typo in their work). This approach is implemented in the tolerance package via the functions ddiffprop, pdiffprop, qdiffprop, and rdiffprop. The a1 and a2 parameters let you adjust the priors; a1 = a2 = 1 gives a uniform prior for both proportions.

library(tolerance)

# Probability that the first proportion (5 successes out of 10 trials)
# is greater than the second (1 success out of 10 trials)
pdiffprop(0,
          k1 = 5,
          k2 = 1,
          n1 = 10,
          n2 = 10,
          a1 = 1,
          a2 = 1,
          lower.tail = FALSE)
## [1] 0.9682663

The functions work well, but I found that they stop working for large values of \( n_1 \) or \( n_2 \). This is because several of the terms in the density function use the gamma function, and the R implementation returns infinity and throws a warning for any input larger than 171.
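
You can see the cutoff directly at the console (the largest finite double is about \( 1.8 \times 10^{308} \)):

gamma(171) # about 7.26e306: still representable as a double
gamma(172) # Inf, with a warning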

The second approach is described in Raineri et al. (2014), who derived a recursive formula for \( P(\theta_1 > \theta_2) \). I unrolled the recursion so it could handle large \( n \) without overflowing the stack, and implemented it in R. The first two functions, h and g, come directly from Raineri et al., and pdiff provides a wrapper for comparing two binomial proportions.

# Ratio of beta functions, computed on the log scale to avoid overflow:
# h(a, b, c, d) = B(a + c, b + d) / (B(a, b) * B(c, d))
h <- function(a, b, c, d) {
  exp(lbeta(a + c, b + d) - (lbeta(a, b) + lbeta(c, d)))
}

# Probability that a beta(a, b) random variable exceeds a beta(c, d) one
g <- function(a, b, c, d) {
  stopifnot(a >= 1 && b >= 1 && c >= 1 && d >= 1)
  accum <- 0.5  # base case: g(a, b, a, b) = 0.5 by symmetry
  # Step the parameters down toward the base case, accumulating one
  # term of the recursion per iteration
  while (a != c || b != d) {
    if (a > c) {
      accum <- accum + h(a - 1, b, c, d) / (a - 1)
      a <- a - 1
    } else if (b > d) {
      accum <- accum - h(a, b - 1, c, d) / (b - 1)
      b <- b - 1
    } else if (c > a) {
      accum <- accum - h(a, b, c - 1, d) / (c - 1)
      c <- c - 1
    } else if (d > b) {
      accum <- accum + h(a, b, c, d - 1) / (d - 1)
      d <- d - 1
    }
  }
  accum
}

# P(theta_1 > theta_2) after observing k1/n1 and k2/n2 successes;
# a1 = a2 = 1 corresponds to uniform priors on both proportions
pdiff <- function(k1, k2, n1, n2, a1 = 1, a2 = 1) {
  g(1 + k1, a1 + n1 - k1, 1 + k2, a2 + n2 - k2)
}

# Probability that the first proportion (5 successes out of 10 trials)
# is greater than the second (1 success out of 10 trials)
pdiff(k1 = 5,
      k2 = 1,
      n1 = 10,
      n2 = 10)
## [1] 0.9682663
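
Two quick sanity checks of my own (not from the paper): identical posteriors should give exactly 0.5, and \( P(\mathrm{beta}(2, 1) > \mathrm{beta}(1, 1)) \) has the closed form \( \int_0^1 2\theta \cdot \theta \, d\theta = 2/3 \).

g(3, 7, 3, 7) # identical distributions: exactly 0.5
g(2, 1, 1, 1) # analytic answer: 2/3 = 0.6666667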

It can also handle large values for \( n_1 \) and \( n_2 \):

pdiff(k1 = 420,
      k2 = 400,
      n1 = 1000,
      n2 = 1000)
## [1] 0.8182686
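
As a cross-check, the same simulation approach as before (with more draws; the seed and draw count here are arbitrary choices of mine) should land near the exact value:

set.seed(1)
b1 <- rbeta(100000, 1 + 420, 1 + 1000 - 420)
b2 <- rbeta(100000, 1 + 400, 1 + 1000 - 400)
mean(b1 > b2) # should be close to 0.818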

The results from the simulation, tolerance package, and this new function all agree, which is a good sign. I ended up using the pdiff function for my project.

A baseball example

Let’s apply this to a simple example. Suppose we want to estimate an “ability score” for every baseball team, so we can see which teams are better than others. We model each team’s ability score as a binomial proportion: the underlying probability that the team wins a given game. By feeding in each team’s win and loss record for a season, we estimate a posterior distribution for each team’s score, which we can use to rank pairs of teams by how likely one is to have a higher ability score than the other.

First, we filter for the team records from 2016:

library(tidyverse)
library(Lahman)
library(knitr)

teams <- Teams %>%
  filter(yearID == 2016) %>%
  select(teamID, W, L) %>%
  mutate(teamID = as.character(teamID),
         P = paste0(round(W / (W + L) * 100), "%"))

Then we generate a data frame with the win and loss numbers for every pair of teams:

teams_compared <- expand.grid(teams$teamID, teams$teamID, stringsAsFactors = FALSE) %>%
  rename(team1 = Var1, team2 = Var2) %>%
  filter(team1 != team2) %>%
  left_join(teams, by = c(team1 = "teamID")) %>%
  left_join(teams, by = c(team2 = "teamID"), suffix = c("_1", "_2"))
  
head(teams_compared)
##   team1 team2 W_1 L_1 P_1 W_2 L_2 P_2
## 1   ATL   ARI  68  93 42%  69  93 43%
## 2   BAL   ARI  89  73 55%  69  93 43%
## 3   BOS   ARI  93  69 57%  69  93 43%
## 4   CHA   ARI  78  84 48%  69  93 43%
## 5   CHN   ARI 103  58 64%  69  93 43%
## 6   CIN   ARI  68  94 42%  69  93 43%

Now, we can use our pdiff function to calculate the probability that team 1’s win percentage is greater than team 2’s win percentage.

teams_compared <- teams_compared %>%
  mutate(p_1_greater = pmap(list(k1 = W_1, k2 = W_2, n1 = W_1 + L_1, n2 = W_2 + L_2), pdiff)) %>%
  unnest(p_1_greater)
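
As an aside (a variation of my own), purrr’s pmap_dbl() returns a plain numeric vector instead of a list column, which skips the unnest() step:

teams_compared <- teams_compared %>%
  mutate(p_1_greater = pmap_dbl(list(k1 = W_1, k2 = W_2,
                                     n1 = W_1 + L_1, n2 = W_2 + L_2), pdiff))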

Let’s see which teams are most likely to have a higher win percentage than other teams:

teams_compared %>%
  filter(P_1 > P_2) %>%
  arrange(-p_1_greater) %>%
  head()
##   team1 team2 W_1 L_1 P_1 W_2 L_2 P_2 p_1_greater
## 1   CHN   MIN 103  58 64%  59 103 36%   0.9999997
## 2   TEX   MIN  95  67 59%  59 103 36%   0.9999694
## 3   WAS   MIN  95  67 59%  59 103 36%   0.9999694
## 4   CHN   CIN 103  58 64%  68  94 42%   0.9999631
## 5   CHN   SDN 103  58 64%  68  94 42%   0.9999631
## 6   CHN   TBA 103  58 64%  68  94 42%   0.9999631

The Cubs (CHN), Rangers, and Nationals all have extremely high posterior probabilities of having a better win percentage than the Minnesota Twins.

Let’s see which pairs of teams are most evenly matched, i.e. which probabilities are closest to 0.5:

teams_compared %>%
  filter(P_1 > P_2) %>%
  arrange(p_1_greater) %>%
  head()
##   team1 team2 W_1 L_1 P_1 W_2 L_2 P_2 p_1_greater
## 1   TEX   CLE  95  67 59%  94  67 58%   0.5186491
## 2   WAS   CLE  95  67 59%  94  67 58%   0.5186491
## 3   NYN   DET  87  75 54%  86  75 53%   0.5206036
## 4   SFN   DET  87  75 54%  86  75 53%   0.5206036
## 5   ARI   ATL  69  93 43%  68  93 42%   0.5257282
## 6   OAK   ATL  69  93 43%  68  93 42%   0.5257282

Texas and Cleveland had very similar records, so the posterior probability that Texas was better than Cleveland is only 52%.
