Site icon R-bloggers

An intuitive real life example of a binomial distribution and how to simulate it in R: Learn it once, use it everyday

[This article was first published on Posts | SERDAR KORUR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Last week, I came across a data that I thought it is a great opportunity to write about Binomial probability distributions.

What is a binomial distribution and why we need to know it?

Binomial distributions are formed when we repeat a set of events and each single event in a set has two possible outcomes. Bi- in binomial distributions refers to those outcomes. Two possibilities are usually described as Success or no Success. A “yes” or “no”.

For example, in flipping coins, the two possibilities are getting a head (success) or tails (no success) or vice versa. And one set can be, e.g. 10 or 50 times coin flipping. We can repeat this set as many times as we like and record how many times we got heads (success) in each repetition.

Why is this interesting?

What is the probability of getting 25 heads out of 50 coin flips? This is not very unusual. But how about getting 49 heads out of 50 flips? Is this really possible?

Let’s figure it out. We can repeat this set of 50 times coin flipping 100.000 times and record the results of each set.

Luckily, we can simulate this in R.

We can simulate a given number of repeated (here 100.000) sets (50 times of coin flipping) of experiments with rbinom() function. It takes 3 arguments.

rbinom(
n = number of repetitions = 100.000,
size = sample size = 50,
p = the probability of success (chance of throwing heads is 0.5))

Let’s compare the probabilities of getting more than 25, 35 or even 49 heads. You can combine rbinom with mean function to find the percentage of the events with a chosen outcome.

# Probability of getting 25 or more heads
mean(rbinom(100000, 50, .5) >= 25)
## [1] 0.55649
# Probability of getting 35 or more heads
mean(rbinom(100000, 50, .5) >= 35)
## [1] 0.00301
# Probability of getting 49 or more heads
mean(rbinom(100000, 50, .5) >= 49)
## [1] 0

We found the probability of throwing 49 or more heads to be 0. But to be technically precise it is one in 375 trillion times (= \(1/((1/(2^{49}) + (1/2^{50})))\)).

Let’s visualize our simulation. The bars in red represents the sets which had 35 or more heads.

library(tidyverse) # ggplot2, dplyr, tidyr, readr, 
                   # purrr, tibble, stringr, forcats
library(viridis)

heads <- rbinom(100000, 50, 0.5)
heads <- data.frame(heads)
heads <- heads %>% mutate(events = ifelse(heads > 35, "> 35", "< 35"))
heads %>%  ggplot(aes(x=heads, fill = events)) + 
  geom_histogram(binwidth = 0.5) + 
  scale_fill_manual(values = c("black", "red")) +
  theme_classic() +
  theme(text = element_text(size = 18),
        legend.position = c(0.85, 0.85)) +
  labs(x = "Number of heads in 50 coin flips")

Even to get 35 or more heads has very low probability. Most of the events will be an integer between 15 and 35.

Binomial probability distributions help us to understand the likelihood of such rare events.

You can also see here a key difference of a binomial distribution with normal distribution is that they can take only discrete values. It is not possible to have 35.5 heads. Although, there are differences, when the sample size is large enough their shape will be similar and normal distributions can be used to estimate binomial probabilities.

So what are all those will be useful for?

Coin flipping expertise may have limited real life applications but let’s give some other examples.

  • Binomial distributions in machine learning

You built a machine learning model with a binary outcome. Let’s say pathological image recognition algorithm for liver cancer that works with 90% accuracy. You tested 100 patients and you want to know your 95% confidence interval? Or your new results showed that your model detected less than 70 patients correctly. Is it possible? Or you should start optimizing your parameters again?

  • Number of patients responding to a treatment.

Let’s say you have a new therapy for cancer which has 10% probability to cure a patient. You have 500 patients which took the drug. The expected number of recovering patients is 50. But you found that 75 patients responded. Is that due to chance or a significant effect? Or you should start looking underlying factors if there is something about the therapy or the patient group?

  • Think about a hospital emergency station.

You are a hospital manager and you want to organize the staff numbers correctly for different weekdays. You know total number of patients came in to a emergency station because of alcohol poisoning in a given time period. You can analyse the distribution of patient numbers for each day of the week. Most likely you will have more such cases in the weekend and you need larger staff.

This will be also true for other businesses. They can use binomial distributions to calculate changes in demand and plan accordingly.

  • If you are running a webserver.

You can allocate your resources better by identifying times when traffic will be higher.

Some other questions in which binomial distributions will come in handy are:

  • Number of people who answered ‘yes’ to a survey question
  • How many games a team will win in one season?
  • Vote counts for a candidate in an election.
  • Number of defective products in a production run.

Binomial distributions are common and they have many applications of real life situations.

Let’s use some real life data to apply our knowledge so far. The data comes from TidyTuesday which I introduced in my last post.

It contains information about Horror Movies released since 2012. And the question I asked was whether horror movies are more likely be released at the 13th each month?

Being in the right mindset for anything gives us a better feeling of it. If you go to the cinema at the 13th to watch a horror movie, do you like it more than other days? We don’t know if this is true but movie makers might be counting on that. I analyzed Horror movies data and calculated number of releases in different days of the month. This is a good example of a binomial probability distribution. Let’s look at the data.

horror_movies <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv")

dim(horror_movies)
## [1] 3328   12
head(horror_movies)
## # A tibble: 6 x 12
##   title genres release_date release_country movie_rating review_rating
##   <chr> <chr>  <chr>        <chr>           <chr>                <dbl>
## 1 Gut ~ Drama~ 26-Oct-12    USA             <NA>                   3.9
## 2 The ~ Horror 13-Jan-17    USA             <NA>                  NA  
## 3 Slee~ Horror 21-Oct-17    Canada          <NA>                  NA  
## 4 Trea~ Comed~ 23-Apr-13    USA             NOT RATED              3.7
## 5 Infi~ Crime~ 10-Apr-15    USA             <NA>                   5.8
## 6 In E~ Horro~ 2017         UK              <NA>                  NA  
## # ... with 6 more variables: movie_run_time <chr>, plot <chr>, cast <chr>,
## #   language <chr>, filming_locations <chr>, budget <chr>

I need some data pre-processing before I can make my visualizations. Dates are given in day:month:year format. I need to split them to individual columns. Also couple of movies do not have the day of the month. I will remove them.

horror_date <-  horror_movies %>% 
                separate(
                  release_date, 
                  c("day", "month", "year"),
                  sep = "-")

horror_date$day <- as.numeric(horror_date$day)

# Remove rows without Date of the month

horror_date <- horror_date %>% filter(day < 32) 

# I am excluding Day 1 from the analysis (Most aggreements starts at 1st)

horror_date_table <- horror_date %>% filter(day > 1)

# Let's check what is the most common day in the month for a horror movie release
horror_date_table <-  horror_date_table %>%
  group_by(day) %>% 
  count() %>% 
  arrange(desc(n))
horror_date_table
## # A tibble: 30 x 2
## # Groups:   day [30]
##      day     n
##    <dbl> <int>
##  1    13   124
##  2    18   119
##  3    25   119
##  4    21   110
##  5    31   107
##  6    28   102
##  7    10   100
##  8    20   100
##  9     5    99
## 10     7    98
## # ... with 20 more rows

Let’s visualize the data.

# Final
horror_date_table$day <- as.character(horror_date_table$day)

my_title <- "Highest number of Horror movies are released at the 13th"

p <- horror_date_table %>% 
    ungroup() %>% 
    mutate(day=fct_reorder(day, n, .desc=TRUE)) %>% 
    ggplot(aes(x=day, y=n)) +
    geom_col(aes(fill=n)) +
    scale_fill_viridis( direction =-1) + 
    theme(
      plot.title = element_text(size=24, color= "black", hjust=0.5, vjust = -1),
      plot.subtitle = element_text(size=36, color= "red", hjust=0.5, vjust = -1),
      panel.background = element_rect(fill = "white"), 
      plot.background = element_rect(fill = "white"),
      panel.grid = element_blank(),
      legend.position = "none", 
      text = element_text(size=18), 
      axis.text.x =element_text(vjust=12, size=17, colour= "white", face= "bold"),
      axis.title.x = element_text(vjust=9.5), 
      axis.text.y=element_blank(),
      axis.ticks= element_blank(), 
      plot.caption = element_text(hjust = 1, vjust = 10)) +
    labs(
          caption= "Source: IMDb, Plot: @dataatomic",
          x = "Day of the Month", 
          y = "Number of movies released",
          title = my_title) +
    geom_label(aes(label = n), size=5, fill="yellow", alpha=0.9) 
p

Is this significant?

In the data, there were 2782 movies associated with a release date. So expected movie release per day is 92 (2782 / 30).

But, we see above that some days are much higher than that. What we want to know is, which days are in the range of random chance and which days there is a significant preference or an aversion to release a movie. So that we can conclude our insight about movie makers.

# Size of our distribution is the total number of movies released 
n_value <- horror_date_table %>% ungroup() %>% summarize(n2 = sum(n))
size <- n_value$n2
size
## [1] 2782
# The probability (= success rate = a given outcome to occur = movie released)
p <-  1/30  # Since it can occur any of the 30 days in a months

You can simulate this 2782 movie release dates events 100.000 times with rbinom function and calculate the mean and variance.

# Simulated statistics
 
estimates <-  rbinom(100000, 2782, 1/30)

# Simulated mean
mean(estimates)
## [1] 92.7603
# Simulated variance
var(estimates)
## [1] 89.4458

The average value is around 92 movies released in one day.

We can also calculate theoretical values by the derived mathematical formulas that define the binomial function:

Mean = size * p
Variance = size * p * (1 – p)

Let’s calculate.

# Theoretical statistics

# Expected mean = size * p
mean_theoretical <- 2782 * 1/30
mean_theoretical
## [1] 92.73333
# Expected Variance = size * p * (1-p)
var_theoretical <- size * 1/30 * (1-1/30)
var_theoretical
## [1] 89.64222

Great, simulated and theoretical values are almost the same. So, I can use my simulations to find out 95% confidence interval which will contain the values that can happen due to random chance.

I will define an interval that contains 95% of probabilities in our simulated distributions. And the values outside will be the ones which were not due to random chance. To do this I need 2.5th and 97.5th quantiles of the distribution.

We can do this by the qbinom() function in r. For example qbinom(0.975, size, p) will return the value which will define the cut off which contains 0.975 of the probabilities. And our confidence interval will be the interval between:

qbinom(0.025, size, p) < Confidence Interval < qbinom(0.975, size, p)

# Boundaries for p values smaller than 0.05

lower <- qbinom(0.975, 2782, 1/30)
lower
## [1] 112
upper <- qbinom(0.025, 2782, 1/30)
upper
## [1] 75
movies <- rbinom(100000, 2782, 1/30)
movies <- data.frame(movies)
movies <- movies %>% mutate(events = ifelse(movies > qbinom(.025, 2782, 1/30) & movies <  qbinom(.975, 2782, 1/30), "95% Conf. Int.", "significant"))
movies %>%  ggplot(aes(x=movies, fill = events)) + 
  geom_histogram(binwidth = 0.5) + 
  scale_fill_manual(values = c("black", "red")) +
  theme_classic() +
  theme(text = element_text(size = 18),
        legend.position = c(0.85, 0.85)) +
  labs(x = "Number of movie releases in a day")

95% of the time, days in one month will have between 75 and 112 movie releases. Higher or lower values than this range can not happen due to random chance according to our binomial distribution.

How to calculate the p-value for a binomial test using pbinom?

Our observed value for the 13th is 124. Above the 97.5th quantile. So it is significant. But what is the exact p value? Let’s define p value first.

P value is the sum of the probability of that event plus the sum of the probabilities of similar events that are equally likely or less likely.

For example in coin flipping, probability of heads is (0.5). If we follow our definition, p value is the sum of the probability of that event (0.5) and similar events which is equally or less likely i.e. tails (0.5). This makes our p value 0.5 + 0.5 = 1.

Similarly, in our horror movie data this will be the sum of the probabilities of getting 124 movie releases or events that are equally probable or rarer.

In R, pbinom function defines the cumulative probabilities. For example, pbinom(124, 2782, 1/30) will give us the cumulative probabilities of any number of movie releases up to 124. By using 1-pbinom(124, 2782, 1/30) we can find the sum of the probabilities with equal or lower chance than having 124.

Thus, p value for getting at least 124 movie release is;

# Probability of getting at least 124 movie releases (e.g. the 13th)

p_val_binom <- 2 * (1 - pbinom(124, 2782, 1/30))
p_val_binom
## [1] 0.001335455

We multiplied by two because same rare events can happen in the left side of our confidence interval as well.

Let’s put those p values on our barplot to highlight the significant days.

# I will add a new column so I can separately define values outside the 95% confidence interval.

horror_date_table <- horror_date_table %>% 
                     mutate(p_val = 2 * (1 - pbinom(n, 2782, 1/30, 
                            lower.tail = ifelse(n>93, TRUE, FALSE)))) %>% 
                     mutate(p_val = cut(p_val, breaks = c(-Inf, 0.001, 0.01, 0.05, Inf), 
                                        labels = c("< 0.001","< 0.01", "< 0.05", "NS")))      

# Visualize the significant days 
p_ci <- horror_date_table %>% 
    ungroup() %>% 
    mutate(day=fct_reorder(day, n, .desc=TRUE)) %>% 
    ggplot(aes(x=day, y=n)) +
    geom_col(aes(fill=p_val)) +
    scale_fill_manual(values = viridis(4))  + 
    theme(
        plot.title = element_text(size=24, color= "black", hjust=0.5, vjust = -1),
        plot.subtitle = element_text(size=36, color= "red", hjust=0.5, vjust = -1),
        panel.background = element_rect(fill = "white"), 
        plot.background = element_rect(fill = "white"),
        panel.grid = element_blank(),
        text = element_text(size=18), 
        axis.text.x =element_text(vjust=12, size=17, colour= "white", face= "bold"),
        axis.title.x = element_text(vjust=9.5), 
        axis.text.y=element_blank(),
        axis.ticks= element_blank(), 
        plot.caption = element_text(hjust = 1, vjust = 10)) +
    labs(
        caption= "",
        x = "Day of the Month", 
        y = "Number of movies released",
        title = "Calculating p values in binomial distributions") +
    geom_label(aes(label = n), size=5, fill="yellow", alpha=0.9) 
p_ci

We performed a hypothesis testing by calculating the p value by using the pbinom() function and found couple of other days where movies are more or less likely to be released.

However, another widely used way to do so is to calculate the mean (the expected probability) of our distribution and its standard deviation and to verify how many standard deviations the observed value is away from the mean (the z score).

Calculating the p-value by normal approximation

When the sample size is large, binomial distributions can be approximated by a normal distribution. To build the normal distribution, I need mean and standard deviation. I can calculate this from the horror movie data.

sample_mean <- horror_date_table %>% ungroup() %>% summarise(n=mean(n))
sample_mean
## # A tibble: 1 x 1
##       n
##   <dbl>
## 1  92.7
p <- 1/30

sample_variance <-  2782 * p * (1-p)
sample_variance
## [1] 89.64222
sample_sd <- sqrt(sample_variance)
sample_sd
## [1] 9.467958

I can calculate the z-score for our observation of 124 movies that are released on the 13th. Simply, z-score is: how many standard deviations an observation is away from the mean. Since 95% of the observations will fall within 1.96 standard deviations from the mean in a normal distribution, a higher z-score will show that our p value is indeed significant.

# Calculate z-score for observation 13th of the month = 124 movies are 
# released
observation <- 124
z_score <- (observation - sample_mean) / sample_sd
z_score
##          n
## 1 3.302367

I can calculate the exact p value by using a normal distribution function pnorm() and the z score we found.

# Calculate the p-value of observing 124 or more movie releases in a day

p_val_nor <- 2 * pnorm(3.302, lower.tail = FALSE)
p_val_nor
## [1] 0.0009599807

As expected, I found similar values (Normal: 0.00095, Binomial: 0.00133) by using an approximation of a normal distribution and by using binomial distributions. Both methods proves that Horror movies are more likely to be released at the 13th.

Future thoughts / Conclusions

Many events in real life can be explained by binomial probability distributions, and they allow us to calculate whether or not the events happened due to random chance and test our hypotheses.

It can be a fun data analysis such as in horror movies, or more serious subjects like testing of new medicines or predicting accuracy of machine learning algorithms detecting diseases.

Until next time!

Serdar

To leave a comment for the author, please follow the link and comment on their blog: Posts | SERDAR KORUR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.