Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Summary: in this post, I implemenent an R function for computing \( P(\theta_1 > \theta2) \), where \( \theta_1 \) and \( \theta_2 \) are beta-distributed random variables. This is useful for estimating the probability that one binomial proportion is greater than another.
I am working on a project in which I need to compare two binomial proportions to see if one is likely to be greater than the other. I wanted to do this in a Bayesian framework; fortunately, estimating binomial proportions is one of the first subjects taught to introduce Bayesian statistics (and statistics in general, for that matter) so the method is well established. Rasmus Bååth has a very nice article describing the Bayesian binomial test, and an estimation approach using JAGS.
I made the problem even simpler by using the fact that the beta distribution is the conjugate prior to the binomial distribution. That is, if the prior is \( \mathrm{beta}(\alpha, \beta) \) distributed, then the posterior after observing \( k \) successes in \( n \) trials is
If you want to start with a noninformative prior, then you can use the beta distribution with \( \alpha = \beta = 1\), which collapses to the uniform distribution. This makes the posterior
What I am interested is in comparing two binomial proportions, \( \theta_1 \) and \( \theta_2 \). I want to calculate
We observe \( k_1 \) successes out of \( n_1 \) trials and estimate a posterior distribution for \( \theta_1 \), and \( k_2 \) successes out of \( n_2 \) trials for \( \theta_2 \). Then the probability in question is
So far, so textbook. At this point, the simplest way to get an estimate is to draw two sets of random numbers, distributed by the two respective beta distributions, and see how many times the first random numbers is greater than the second:
This works fine, but it would be nice to have an exact solution. I found several papers describing different approaches.
< !--more-->Nadarajah and Kotz (2007) derived a functional form for the distribution of \( P(\theta_1 – \theta_2) \) (and Chen and Luo fixed a typo in their work in 2011.) This approach is implemented in the tolerance
package via the functions ddiffprop
, pdiffprop
, qdiffprop
, and rdiffprop
. The a1
and a2
parameters let you adjust the priors, where a1 = a2 = 1
is a uniform prior for both proportions.
The functions work well, but I found that they stop working for large values of \( n_1 \) or \( n_2 \). This is because several of the terms in the density function use the gamma function, and the R implementation returns infinity and throws a warning for any input larger than 171.
The second approach I found is described in Raineri et al. 2014, in which they derived a recursive formula for \( P(\theta_1 > \theta_2) \). I unrolled the recursion so it could handle large \( n \) without overflowing the stack, and implemented it in R. The first two functions, h
and g
, come directly from Raineri et al. and pdiff
provides a wrapper for estimating differences in binomial proportions.
It can also handle large values for \( n_1 \) and \( n_2 \):
The results from the simulation, tolerance
package, and this new function all agree, which is a good sign. I ended up using the pdiff
function for my project.
A baseball example
Let’s apply this to a simple example. Suppose we want to estimate an “ability score” for every baseball team, so we can see which teams are better than others. We model each team’s ability score as a binomial random variable representing the team’s proportion of wins to total games. By feeding in each team’s win and loss record for a season, we’ll estimate a posterior distribution that we can use to rank pairs of teams by how likely they are to have different ability scores.
First, we filter for the team records from 2016:
And generate a dataframe that has the win and loss numbers from every pair of teams:
Now, we can use our pdiff
function to calculate the probability that team 1’s win percentage is greater than team 2’s win percentage.
Let’s see which teams are most likely to have a higher win percentage than other teams:
The Cubs (CHN), Rangers, and Nationals all have extremely high posterior probabilities of having a better win percentage than the Minnesota Twins.
Let’s see which teams are most likely to be the same:
Texas and Cleveland had very similar records, so the posterior probability that Texas was better than Cleveland is only 52%.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.