A Bayesian Twist on Tukey’s Flogs
In the last post I described flogs, a useful transform for proportions data introduced by John Tukey in his Exploratory Data Analysis. Flogging a proportion (such as “two out of three computers were Macs”) consists of two steps: first we “start” the proportion by adding 1/6 to each of the counts, and then we “fold” it using what is basically a rescaled log odds transform. So for the proportion two out of three computers were Macs we would first add 1/6 to both the Mac and non-Mac counts, resulting in the “started” proportion (2 + 1/6) / (2 + 1/6 + 1 + 1/6) = 0.65. Then we would take this proportion and transform it to the log odds scale.
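For concreteness, here is how the two steps could look in R. This is a minimal sketch: the function name flog is mine, and I am assuming the “fold” is half the natural-log odds (see the last post for the exact rescaling).

flog <- function(n_yes, n_no) {
  # "Start" the counts by adding 1/6 to each of the two counts
  started_p <- (n_yes + 1/6) / (n_yes + n_no + 2/6)
  # "Fold" the started proportion: half the natural-log odds
  0.5 * log(started_p / (1 - started_p))
}
flog(2, 1)  # two Macs, one non-Mac: started proportion 0.65, flog roughly 0.31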
The last log odds transform is fine, I’ll buy that, but what are we really doing when we “start” the proportions? And why start them with the “magic” number 1/6? Maybe John Tukey has the answer? From page 496 in Tukey’s EDA:
The desirability of treating “none seen below” as something other than zero is less clear, but also important. Here practice has varied, and a number of different usages exist, some with excuses for being and others without. The one we recommend does have an excuse but since this excuse (i) is indirect and (ii) involves more sophisticated considerations, we shall say no more about it. What we recommend is adding 1/6 to all […] counts, thus “starting” them.
Not particularly helpful… I don’t know what John Tukey’s original reason was (if you know, please tell me in the comments below!) but yesterday I figured out a reason that I’m happy with. It turns out that starting proportions by adding 1/6 to the counts is an approximation to the posterior median of the $\theta$ parameter of a Bernoulli distribution when using a Jeffreys prior on $\theta$. I’ll show you in a minute why this is the case, but first I just want to point out that intuitively this is roughly what we would want to get when “starting” proportions.
The reason for starting proportions in the first place was to gracefully handle cases such as 1/1 and 0/1, where the “raw” proportions would result in the edge proportions 100 % and 0 %. But surely we don’t believe that a person is a 100 % Apple fanboy just after learning that that person has one Mac out of one computer in total. That person is probably more likely to purchase a new Mac rather than a new PC, but to estimate this probability as 100 % after seeing one data point seems a bit rash. This kind of thinking is easily accommodated in a Bayesian framework by using a prior distribution on the proportion/probability parameter $\theta$ of the Bernoulli distribution that puts some prior probability on all possible proportions. One such prior is the Jeffreys prior, which is $\theta \sim \text{Beta}(0.5, 0.5)$ and looks like:
# Plot the density of the Jeffreys prior, Beta(0.5, 0.5)
theta <- seq(0, 1, length.out = 1000)
plot(theta, dbeta(theta, 0.5, 0.5), type = "l",
     ylim = c(0, 3.5), col = "blue", lwd = 3)
So instead of calculating the raw proportions we could estimate them using the following model:
$$x_i \sim \text{Bernoulli}(\theta)$$
$$\theta \sim \text{Beta}(0.5,0.5)$$
Here the $x_i$ are zeroes and ones coding for, e.g., non-Macs (0) and Macs (1). This model is easy to estimate using the fact that the posterior distribution of $\theta$ is given by $\text{Beta}(0.5 + n_1, 0.5 + n_0)$, where $n_0$ and $n_1$ are the number of zeroes and ones respectively. Running with our example of the guy with the single Mac, the estimated posterior distribution of the proportion $\theta$ would be:
# Plot the posterior after seeing one Mac (n1 = 1) and no non-Macs (n0 = 0)
theta <- seq(0, 1, length.out = 1000)
plot(theta, dbeta(theta, 0.5 + 1, 0.5 + 0), type = "l",
     ylim = c(0, 3.5), col = "blue", lwd = 3)
# Mark the posterior median, here estimated by simulation
abline(v = median(rbeta(99999, 0.5 + 1, 0.5 + 0)), col = "red", lty = 2, lwd = 3)
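To see how well starting with 1/6 approximates the posterior median, we can compare the exact median from qbeta with the started proportion for our two examples. A small sketch; the helper name compare_start is my own:

# Posterior median under the Jeffreys prior vs. Tukey's started proportion
compare_start <- function(n_yes, n_no) {
  c(posterior_median   = qbeta(0.5, 0.5 + n_yes, 0.5 + n_no),
    started_proportion = (n_yes + 1/6) / (n_yes + n_no + 2/6))
}
compare_start(2, 1)  # two out of three computers were Macs
compare_start(1, 0)  # one Mac out of one computer

The two numbers agree well in the 2-out-of-3 case and reasonably well even in the extreme 1-out-of-1 case, which is the sense in which adding 1/6 to the counts approximates the Jeffreys posterior median.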