Bayesian and Frequentist Approaches: Ask the Right Question
It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best way to answer a question. Once you have the right question, then the right approach will naturally suggest itself to you. It could be a frequentist approach, it could be a bayesian one, it could be both — even while solving the same problem.
Let’s take the example that Bayesians love to hate: significance testing, especially in clinical trial style experiments. Clinical trial experiments are designed to answer questions of the form “Does treatment X have a discernible effect on condition Y, on average?” To be specific, let’s use the question “Does drugX reduce hypertension, on average?” Assuming that your experiment does show a positive effect, the statistical significance tests that you run should check for the sorts of problems that John discussed in our previous article, Worry about correctness and repeatability, not p-values: What are the chances that an ineffective drug could produce the results that I saw? How likely is it that another researcher could replicate my results with the same size trial?
We can argue about whether or not the question we are answering is the correct question — but given that it is the question, the procedure to answer it and to verify the statistical validity of the results is perfectly appropriate.
So what is the correct question? From your family doctor’s viewpoint, a clinical trial answers the question “If I prescribe drugX to all my hypertensive patients, will their blood pressure improve, on average?” That isn’t the question (hopefully) that your doctor actually asks, though possibly your insurance company does. Your doctor should be asking “If I prescribe drugX to this patient, the one sitting in my examination room, will the patient’s blood pressure improve?” There is only one patient, so there is no such thing as “on average.”
If your doctor has a master’s degree in statistics, the question might be phrased as “If I prescribe drugX to this patient, what is the posterior probability that the patient’s blood pressure will improve?” And that’s a bayesian question.

Let’s run through a small toy example. We will run a 500-patient clinical trial on drugX. All the patients have “moderately high” blood pressure, and are of similar age, health, family history, and so on. We will measure whether or not drugX reduces their blood pressure to “normal” (somewhere in the region of 120/80). The control group will be on the sort of diet recommended for hypertensive patients (say a low-sodium, low-cholesterol, high-fiber diet) and will take a placebo. The treatment group will be on the same diet, plus drugX.
Now suppose that the diet alone will normalize blood pressure in about 10% of the population. And also suppose that (unknown to the researchers) there is a hidden factor HF (a genetic factor, perhaps) that moderates whether or not drugX actually works. There are two types of people. 90% of the population are HFA, and drugX has no effect on them. 10% of the population are HFB, and drugX completely normalizes blood pressure for 95% of the HFB population.
So you, the omniscient readers of this article, now know that drugX is only effective on about 9.5% of the general population, although an overlapping 10% of the general population will show improvement from diet alone. This gives you the luxury of comparing the “right answer” with what could happen in an actual experiment. Now let’s see what might be observed.
Here’s some R code to simulate the trial.
#
# 2 populations: HFA, HFB. A not affected by the drug
#
n = 500
spontaneous = 0.1   # effectiveness of diet alone
effectiveness = c(0, 0.95)
names(effectiveness) = c("HFA", "HFB")

# set the HF for the population
hfcoin = runif(n)
hf = ifelse(hfcoin < 0.9, "HFA", "HFB")

# assign control and treatment groups
group = runif(n)
group = ifelse(group < 0.5, "control", "drug")

# assign outcomes
spontcoin = runif(n)
drugcoin = runif(n)
outcome = ( (spontcoin < spontaneous) |
            (drugcoin < effectiveness[hf]*(group=="drug")) )

expframe = data.frame(group=group, hf=hf, improved=outcome)
Here are the summaries I got when I ran the code:
> summary(expframe)
     group          hf       improved
 control:255   HFA:449   Mode :logical
 drug   :245   HFB: 51   FALSE:437
                         TRUE :63
                         NA's :0

> with(expframe[group=="control",], table(hf, improved))
     improved
hf    FALSE TRUE
  HFA   209   16
  HFB    26    4

> with(expframe[group=="drug",], table(hf, improved))
     improved
hf    FALSE TRUE
  HFA   201   23
  HFB     1   20

> tab = with(expframe, table(group, improved))
> tab
         improved
group     FALSE TRUE
  control   235   20
  drug      202   43
The last contingency table, tab, is the only one of the above summaries known to the researchers. From it, you can see that the drug group had a 100*43/(202+43) = 17.5% improvement rate, and the control group had a 100*20/(235+20) = 7.8% improvement rate. So, empirically, drugX more than doubled the probability of improvement (17.5/7.8 = 2.25; this is called the risk ratio). If you think in odds like a gambler does (odds of improvement are 20 to 235 for the control group), then we have also more than doubled the odds of improvement ((43/202)/(20/235) = 2.5; this is called the odds ratio). Now we want to test whether these results are real (and not a fluke).
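For concreteness, here is one way those two ratios might be computed directly from tab in R (a small sketch; the article itself just reports the numbers):

# rates of improvement in each group, read off the contingency table
rate.control = tab["control", "TRUE"] / sum(tab["control", ])   # 20/255
rate.drug    = tab["drug", "TRUE"] / sum(tab["drug", ])          # 43/245

# risk ratio: ratio of the improvement rates
rate.drug / rate.control   # about 2.24 (the 2.25 above comes from the rounded percentages)

# odds ratio: ratio of the odds of improvement
(tab["drug", "TRUE"] / tab["drug", "FALSE"]) /
  (tab["control", "TRUE"] / tab["control", "FALSE"])   # about 2.5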
Frequentist Approach
One way to check the significance of the results (from a frequentist viewpoint) is to check whether the rows and columns of the contingency table tab are independent. Under the null hypothesis that improvement is independent of whether or not the patient took the drug, the odds of improvement should be the same for both the control and the drug groups. We can test this using Fisher’s Exact Test for Count Data (or we can use the chi-squared test, which is an approximation of Fisher’s exact test). In Fisher’s test, the null hypothesis is that the odds ratio is 1.
> fisher.test(tab)

        Fisher's Exact Test for Count Data

data:  tab
p-value = 0.001158
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.384802 4.635614
sample estimates:
odds ratio
  2.496745
So now we know that our results are significant at the 0.05 level (in fact, at the 0.01 level: if the drug had no effect, we would see a result this good or better no more than 1% of the time). We also know that if our estimate of the odds ratio is correct, then when other researchers repeat our experiment, 95% of the time they will see an odds ratio between about 1.38 and 4.63, definitely greater than one. So we can reject the null hypothesis and conclude that drugX increases the improvement rate in the population, relative to diet alone.
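As noted above, the chi-squared test is the usual large-sample approximation to this check; a quick sketch (its p-value will differ slightly from the exact test’s):

# chi-squared test of independence on the same contingency table
# (R applies Yates' continuity correction by default for 2x2 tables)
chisq.test(tab)
# the p-value should again fall well below 0.05, consistent with Fisher's test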
Bayesian Approach
But what about the poor patient sitting in the doctor’s examination room? What are the chances that his blood pressure will improve if he takes drugX? Roughly 17%, which is better than the 10% chance from dietary changes alone, but still isn’t very high. Let’s verify this statement using the bayesian approach.
The bayesian approach assumes that the quantity that you are interested in, in this case the rate of improvement p, is distributed according to some distribution Prior(p). Once you have a set of observations, x, you update your estimate of the distribution to
Posterior(p | x) = Prior(p) * f(x | p) / C,
where f is the probability of the data conditioned on the parameter, and C is the total probability of the data over all possible settings of the parameter. Usually, calculating C is hard. Fortunately, for some common scenarios, like coin-flipping, calculating the posterior is quite easy.
Estimating the improvement rate p of drugX is a coin-flipping problem, where p is the (unknown) probability of the coin coming up heads. If you model the coin flips as a binomial distribution, and the distribution of p as a Beta distribution with parameters α and β, then the posterior is also a Beta distribution, with α’ = α + nheads and β’ = β + ntails.
The mode of the distribution (which is what is usually used as a point estimate for p) is (α - 1)/(α + β - 2), defined when α and β are both greater than 1.
The mean of the distribution (which is close to the mode when α and β are large) is α/(α + β).
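As a sanity check on the conjugate-update shortcut, here is a small sketch (not part of the original write-up) that computes C by numerical integration and compares the resulting posterior density to the closed-form Beta posterior. It uses the drug-group counts from the contingency table (43 improved out of 245) and, for illustration, the α = 0.1, β = 0.9 prior adopted below.

# prior parameters (the ones used in the next section)
prior_alpha = 0.1
prior_beta  = 0.9

# observed data for the drug group: 43 "heads" out of 245 "flips"
k_drug = 43
n_drug = 245

# f(x | p): probability of the observed data given improvement rate p
likelihood = function(p) dbinom(k_drug, n_drug, p)

# C: total probability of the data over all possible values of p
# (stay just inside the endpoints, where the prior density blows up)
C = integrate(function(p) dbeta(p, prior_alpha, prior_beta) * likelihood(p),
              lower=1e-9, upper=1 - 1e-9)$value

# posterior density at p = 0.175, computed two ways
p0 = 0.175
dbeta(p0, prior_alpha, prior_beta) * likelihood(p0) / C    # Bayes' rule directly
dbeta(p0, prior_alpha + k_drug, prior_beta + n_drug - k_drug)  # conjugate shortcut
# the two numbers should agree up to numerical integration error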
Now back to our problem. Suppose that we already knew (never mind how) that the hypertension improvement rate from diet alone was about 10%. We can set the prior to have a mean value of 0.1 by setting α = 0.1 and β = 0.9. That looks like this:
It’s a nasty prior (notice that it goes to infinity at both 0 and 1), but it spreads the probability mass all along the unit interval, which is what we want, since we don’t want to start with a very strong bias about the improvement rate. Another common prior is the Jeffreys prior: α = β = 0.5. The Jeffreys prior is maximally uninformative (or minimally biased) and has a mean of 0.5.
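For reference, the prior pictured above can be drawn directly with dbeta(); a minimal sketch (it assumes the ggplot2 package, which the next code block also uses):

library(ggplot2)

# density of the Beta(0.1, 0.9) prior over the unit interval
# (stay away from the exact endpoints, where the density is infinite)
x = seq(from=0.005, to=0.995, by=0.005)
priorframe = data.frame(x=x, y=dbeta(x, 0.1, 0.9))
ggplot(priorframe, aes(x=x, y=y)) + geom_line()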
Now let’s calculate the posterior, its mean and its mode, in R:
library(reshape2)
library(ggplot2)

# The mean of the Beta distribution
beta_mean = function(alpha, beta) {
  alpha/(alpha+beta)
}

# The mode of the Beta distribution (defined when alpha > 1 and beta > 1)
beta_mode = function(alpha, beta) {
  (alpha-1)/(alpha+beta-2)
}

# prior, mean 0.1, mode not defined
alpha = 0.1
beta = 0.9

# The values from the contingency table for the experiment
improved.control = tab[1,2]      # 20
notimproved.control = tab[1,1]   # 235
improved.drug = tab[2,2]         # 43
notimproved.drug = tab[2,1]      # 202

# update the distribution for the treatment group
alpha.drug = alpha + improved.drug
beta.drug = beta + notimproved.drug

# calculate the mean and the mode for the treatment group
beta_mean(alpha.drug, beta.drug)   # 0.1752033
beta_mode(alpha.drug, beta.drug)   # 0.172541

# update the distribution for the control group
alpha.control = alpha + improved.control
beta.control = beta + notimproved.control

# calculate the mean and the mode for the control group
beta_mean(alpha.control, beta.control)   # 0.07851563
beta_mode(alpha.control, beta.control)   # 0.07519685

# plot both distributions to compare
# the function dbeta() returns the value of the Beta density
# at point x, for a given alpha and beta
x = seq(from=0.0, to=0.3, by=0.005)
frame = melt(data.frame(x=x,
                        control=dbeta(x, alpha.control, beta.control),
                        drug=dbeta(x, alpha.drug, beta.drug)),
             measure.vars=c("control", "drug"),
             variable.name="treatment", value.name="y")
ggplot(frame, aes(x=x, y=y, color=treatment)) + geom_line()
The means and modes of both distributions are about where we estimated them from the naive calculations directly on the contingency table; if we use the mode as our point estimate of the improvement rates for both groups, then the spreads of the distributions give us the uncertainty around that estimate, based on the size of our data sample. The distributions don’t overlap much (the result we expected, based on our frequentist analysis); the two populations do in fact have different improvement rates. The difference (mostly a philosophical one, perhaps) is that this analysis gives us directly what our family doctor wants: an estimate of the posterior probability of a patient’s blood pressure improving when prescribed drugX. We can calculate what is called the 95% credible interval for each distribution: the interval that with 95% probability contains the true improvement rate:
credible_interval = function(conf, alpha, beta) {
  p = (1-conf)
  lower = p/2
  upper = 1-lower
  c(qbeta(lower, alpha, beta), qbeta(upper, alpha, beta))
}

credible_interval(0.95, alpha.drug, beta.drug)
# 0.1303853 0.2250201
credible_interval(0.95, alpha.control, beta.control)
# 0.04887709 0.11437549
Based on this data, if you take drugX for your hypertension, the probability of normalizing your blood pressure is likely somewhere in the range of 13 to 22 percent, compared to roughly 5 to 11 percent from diet alone. So you will improve your chance of normalizing your blood pressure, but it’s more likely that your blood pressure will remain high.
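The posterior distributions also let us ask directly how sure we are that the drug group’s true improvement rate is higher than the control group’s. A small sketch (not in the original analysis), estimated by sampling from the two Beta posteriors:

# Monte Carlo estimate of P(drug improvement rate > control improvement rate)
set.seed(22315)   # arbitrary seed, just for reproducibility
N = 100000
draws.drug    = rbeta(N, alpha.drug, beta.drug)
draws.control = rbeta(N, alpha.control, beta.control)
mean(draws.drug > draws.control)
# should be very close to 1, consistent with the barely-overlapping curves above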
The credible interval, by the way, is what most people think the confidence interval is. With 95% probability (based on the available evidence), the true improvement rate is in the 95% credible interval. The 95% confidence interval is the interval that is produced by a construction procedure such that, if you repeated the experiment again and again, the constructed confidence interval contains the true improvement rate 95% of the time. This still makes it likely that you’ve bracketed the true improvement rate, and in practice, the confidence interval is probably a good stand-in for the credible interval. It’s just not really answering the question you actually asked, philosophically speaking.
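To make that repeated-experiment reading concrete, here is a small simulation sketch (not from the original article): generate many trials of the same size with known improvement rates close to the observed ones, and count how often the Fisher confidence interval brackets the true odds ratio.

# hypothetical "true" improvement rates, chosen to match the observed ones
p.drug = 0.175
p.control = 0.078
true.odds.ratio = (p.drug/(1-p.drug)) / (p.control/(1-p.control))

set.seed(3141)   # arbitrary seed, just for reproducibility
ntrials = 1000
covered = replicate(ntrials, {
  drug.improved    = rbinom(1, 245, p.drug)
  control.improved = rbinom(1, 255, p.control)
  m = matrix(c(255 - control.improved, control.improved,
               245 - drug.improved,    drug.improved),
             nrow=2, byrow=TRUE)
  ci = fisher.test(m)$conf.int
  ci[1] <= true.odds.ratio && true.odds.ratio <= ci[2]
})
mean(covered)
# should be at least about 0.95 (Fisher's interval tends to be conservative)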
The Hidden Factor
Suppose the researchers had suspected that the hidden factor HF might be implicated in the drug’s performance, and had been able to measure it in the experiment.
tabfull = aggregate(numeric(dim(expframe)[1])+1,
                    by=list(expframe$group, expframe$hf, expframe$improved),
                    FUN=sum)

> tabfull
  Group.1 Group.2 Group.3   x
1 control     HFA   FALSE 209
2    drug     HFA   FALSE 201
3 control     HFB   FALSE  26
4    drug     HFB   FALSE   1
5 control     HFA    TRUE  16
6    drug     HFA    TRUE  23
7 control     HFB    TRUE   4
8    drug     HFB    TRUE  20
In this case, we can also estimate the posterior probabilities of improvement for each group, using the bayesian approach. I’ll just give you the graph.
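The graph can be reproduced the same way as before; a sketch (reusing the alpha and beta prior parameters from the earlier code and the counts from tabfull above):

library(reshape2)
library(ggplot2)

# posterior for each (group, hf) combination:
# Beta(alpha + improved, beta + not improved), with counts read off tabfull
x = seq(from=0, to=1, by=0.005)
frame = melt(data.frame(x = x,
                        control_HFA = dbeta(x, alpha + 16, beta + 209),
                        control_HFB = dbeta(x, alpha + 4,  beta + 26),
                        drug_HFA    = dbeta(x, alpha + 23, beta + 201),
                        drug_HFB    = dbeta(x, alpha + 20, beta + 1)),
             id.vars="x", variable.name="population", value.name="y")
ggplot(frame, aes(x=x, y=y, color=population)) + geom_line()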
From this evidence, HFB people taking drugX have better than 75% probability of improving their blood pressure; everyone else has probability less than 25%. Just looking at the modes of the distributions, you might naively think that HFA people also have a higher improvement rate when they are taking drugX, or that HFB people have a higher improvement rate than HFA people even in the control group. But the distributions overlap substantially; there is no real evidence that the three groups on the left of the graph have different improvement rates. In other words, if your family doctor knows that you are type HFB, it would make sense to prescribe drugX for your high blood pressure; if you are type HFA, then it doesn’t.
This is the kind of reasoning promoted by the personalized medicine movement. In fact it is what your family doctor already tries to do, by taking into account your family and previous health history, and so on. So far, your doctor can only do this in a negative way — if you have a family history of colon cancer, then start your annual colonoscopies sooner, otherwise, don’t bother — and as far as I know (though I’m not a doctor or a medical researcher) most published medical research isn’t designed to help doctors make “bayesian type” assessments in a more positive way.
But Don’t Throw Out Frequentism
So we’ve established that determining individual patient outcomes is a bayesian question. You might then wonder why anyone would use the frequentist approach at all. But some problems really are frequentist. A medical practitioner in public health, rather than in direct patient care, is interested in the effects of treatments over entire populations, rather than on individuals. Similarly, an insurance company deciding whether or not to approve coverage for drugX is interested in whether the drug helps anyone at all, or whether it is no better than diet alone. In those situations, a frequentist analysis of drugX does in fact answer the question that is being asked.