Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
I saw an interesting problem that requires Bayes’ Theorem and some simple R programming while reading a bioinformatics textbook. I will discuss the math behind solving this problem in detail, and I will illustrate some very useful plotting functions to generate a plot from R that visualizes the solution effectively.
The Problem
The following question is a slightly modified version of Exercise #1.2 on Page 8 in “Biological Sequence Analysis” by Durbin, Eddy, Krogh and Mitchison.
An occasionally dishonest casino uses 2 types of dice. Of its dice, 97% are fair but 3% are unfair, and a “five” comes up 35% of the time for these unfair dice. If you pick a die randomly and roll it, how many “fives” in a row would you need to see before it was most likely that you had picked an unfair die?”
Translating the Problem into Math
To translate this problem into the language of probability,
- let
denote the outcome of rolling any die. The possible outcomes are . - let
denote whether the die is fair or not. The possible outcomes are . - let
denote the number of consecutive times a “five” is observed after rolling a die. The possible outcomes are all non-negative integers.
We know that
The die that you picked is either fair or unfair, regardless of how many times you roll the die. Thus, given that we observe
To see why this equation formulates our problem accurately, let’s assume that it is true. Combining Equation
and this is the premise of the question.
Using Bayes’ Theorem
To calculate
- stating the definition of conditional probability,
- applying the law of total probability.
To apply the definition of conditional probability for our problem,
Now, re-write the joint probability in the numerator by applying the definition of conditional probability again, but reversing the conditionality.
Finally, apply the law of total probability to the denominator. (Because of the difficulties of writing LaTeX in WordPress, the entire denominator is written on the second line. Please excuse the awkward line placement.)
By Equation
Solving for X
To solve the problem, we need to find the minimum value of
Notice that I needed to use the axis() function twice to add all of the tick marks that I wanted. For some reason, I could not add ’0′ and ’1′ among the first set of tick marks, but I could add them separately in a second calling of axis().
I also found a document by Tian Zheng from Columbia University that lists the names of various colours in R. It turns out that there are many more colours than the standard ones that I usually use from the colours of a rainbow!
As the plot shows, if you see 5 or more consecutive “fives”, you have reason to suspect that the die is unfair.
##### Detecting an Unfair Die with Bayes' Theorem ##### By Eric Cai - The Chemical Statistician # set the possible values of the number of consecutive "fives" observed from rolling a die x = 1:20 # calculate the probability of the die being unfair given the observed number of consecutive "fives" prob.loaded = (0.35^x)*0.03/((0.35^x)*0.03 + ((1/6)^x)*0.97) # export the plot as a PNG image file png('INSERT YOUR DIRECTORY PATH HERE/unfair die plot.png') # plot the calculated probabilities # notice that the "yaxt = 'n'" option suppresses the verticle axis; I want to add my own custom axis later plot(x, prob.loaded, col=ifelse(x == 5, "forestgreen", "black"), ylab = 'P(Die is Fair | X)', ylim = c(-0.05, 1.05), xlab = 'X - Number of Consecutive Fives Observed', yaxt='n', main = "Using Bayes' Theorem to Detect an Unfair Die") # add a custom horizontal line to show where P(Die = Unfair | X = x) = 0.5 abline(0.5, 0, col = 'firebrick1') # add a custom tick mark with the colour 'firebrick1' # the label is left intentionally blank, because its colour can only be black # use mtext() to set the tick label's colour axis(2, at = c(0.5), col = 'firebrick1', labels = ' ') # add a custom tick label at P(Die = Unfair | X = x) = 0.5 # distinguish it with the colour 'firebrick1' mtext(text = '0.5', side = 2, line = 1, col = "firebrick1") # add text to denote the point of interest text(9, 0.55, 'P(Die is loaded | X = 5) = 0.5581', col = 'forestgreen') # add some other tick marks to show the other values along the vertical axis axis(2, at = c(0.1, 0.3, 0.7, 0.9), labels = c('0.1', '0.3', '0.7', '0.9')) axis(2, at = c(0, 1), labels = c('0', '1')) dev.off()
Reference
Durbin, R. (Ed.). (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press.
Filed under: Plots, Practical Applications of Chemistry, Probability, R programming Tagged: abline(), axis(), Bayes’ Theorem, data visualization, dev.off(), dice, die, mtext(), plot, plots, plotting, PNG, probability, R, R programming, statistics, text
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.