Site icon R-bloggers

Credible intervals with Capybaras, R and Stan (rstan and cmdstanr interfaces)

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R and Shiny Training: If you find this blog to be interesting, please note that I offer personalized and group-based training sessions that may be reserved through Buy me a Coffee. Additionally, I provide training services in the Spanish language and am available to discuss means by which I may contribute to your Shiny project.

< section id="motivation" class="level1">

Motivation

Matsuura’s excellent book Bayesian Statistical Modeling with Stan, R, and Python uses cmdstanr for all the coding examples. You can access the codes from GitHub.

Because I am reading that book, I searched online how to obtain a credible interval with cmdstanr, but I did not find anything with open access about it. Therefore, I decided to write this post to show how to do it with both interfaces.

< section id="problem" class="level1">

Problem

We have a dataset about 100 capybaras (hydrochoerus hydrochaeris). Each capybara has 2 babies, one in each season, with each season being “birth1” and “birth2”. The data is:

< !-- insert capybara image -->

In this data, the codification is “female = 0” and “male = 1”.

baby_capybaras <- data.frame(
    birth1 = c(1,0,0,0,1,1,0,1,0,1,0,0,1,1,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,1,1,
               0,1,0,1,1,1,0,1,0,1,1,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,
               1,0,0,0,1,0,0,1,1,1,1,0,1,0,1,1,1,1,1,0,0,1,0,1,1,0,1,0,1,1,1,0,
               1,1,1,1),
    birth2 = c(0,1,0,1,0,1,1,1,0,0,1,1,1,1,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,1,1,
               1,0,1,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,
               0,1,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,1,1,0,0,0,1,1,1,
               0,0,0,0)
)

Which is the probability that a capybara has a male baby if the first baby is a female?

< section id="solution" class="level1">

Solution

We can calculate posterior beliefs about the conditional probability by using Stan. Before doing computation, let’s assume the births are dependent and that the prior probability is uniform.

The Stan for code for this problem is:

// Input
data {
  int<lower=0> N1;
  int<lower=0> N2;
  int<lower=0> n1;
  int<lower=0> n2;
}

// Parameters
parameters {
  vector<lower=0, upper=1>[2] p;
}

// Model
model {
  n1 ~ binomial(N1, p[1]);
  n2 ~ binomial(N2, p[2]);
}

From R we can either use rstan or cmdstanr. The first one has more documentation, but the second one is faster and compatible with the latest Stan version. I will show how to use both.

< section id="credible-interval-with-rstan" class="level2">

Credible interval with rstan

With rstan we do the following:

# install.packages("rstan")

library(rstan)

mod_rstan <- stan_model("baby-capybaras.stan")

baby_capybaras_list <- list(
    N1 = sum(baby_capybaras$birth1 == 0),
    N2 = sum(baby_capybaras$birth1 == 1),
    n1 = sum(baby_capybaras$birth2[baby_capybaras$birth1 == 0]),
    n2 = sum(baby_capybaras$birth2[baby_capybaras$birth1 == 1])
)

# the seed is for reproducibility
# see https://www.independent.co.uk/life-style/history/42-the-answer-to-life-the-universe-and-everything-2205734.html

fit_rstan <- sampling(
    mod_rstan,
    data = baby_capybaras_list,
    seed = 42, 
    chains = 4,
    refresh = 500,
    warmup = 1000,
    iter = 10000
)

fit_rstan

Inference for Stan model: baby-capybaras.
4 chains, each with iter=10000; warmup=1000; thin=1; 
post-warmup draws per chain=9000, total post-warmup draws=36000.

       mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
p[1]   0.78    0.00 0.06   0.66   0.75   0.79   0.83   0.89 31038    1
p[2]   0.42    0.00 0.07   0.29   0.37   0.41   0.46   0.55 31870    1
lp__ -63.57    0.01 1.01 -66.29 -63.96 -63.26 -62.85 -62.59 16559    1

Samples were drawn using NUTS(diag_e) at Mon Jul  3 14:40:25 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

This implies that the probability of a male birth if the first birth is a female is 0.78.

For the credible interval I will create a separate vector for the percentiles. The reason for this is that I typed 0.0025 instead of 0.025 to verify the computation with rstan by doing the same with cmdstanr.

My typing error made me spend more than one hour checking my code and trying to figure out why the results were different, until I realised I had this:

quantile(extract(fit_rstan, pars = "p")$p[, 1], probs = c(0.025, 0.975))

# versus

quantile(fit_cmdstanr$draws("p", format = "matrix")[, "p[1]"],
    probs = c(0.0025, 0.975))

# please note the additional 0 in the second case

The code for the 95% credible interval with rstan is:

# percentiles 0% + alpha/2 and 100% - alpha/2
# in this case 2.5% and 97.5%
alpha <- 0.05
percentiles <- c(0, 1) + (c(1, -1) * (alpha / 2))

quantile(extract(fit_rstan, pars = "p")$p[, 1], probs = percentiles)

     2.5%     97.5% 
0.6629666 0.8853788
< section id="credible-interval-with-cmdstanr" class="level2">

Credible interval with cmdstanr

With cmdstanr we do the following:

# install.packages("cmdstanr",
#  repos = c("https://mc-stan.org/r-packages/", getOption("repos")))

library(cmdstanr)

# run once and only once after installing the package
# check_cmdstan_toolchain()
# install_cmdstan(cores = 4)

mod_cmdstanr <- cmdstan_model("baby-capybaras.stan")

fit_cmdstanr <- mod_cmdstanr$sample(
  data = baby_capybaras_list,
  seed = 42, 
  chains = 4,
  refresh = 500,
  iter_warmup = 1000,
  iter_sampling = 10000
)

fit_cmdstanr

variable   mean median   sd  mad     q5    q95 rhat ess_bulk ess_tail
    lp__ -63.58 -63.27 1.01 0.74 -65.60 -62.61 1.00    18852    24730
    p[1]   0.78   0.79 0.06 0.06   0.68   0.87 1.00    36314    26137
    p[2]   0.42   0.41 0.07 0.07   0.31   0.53 1.00    39442    28938

The code for the 95% credible interval with cmdstanr is:

quantile(fit_cmdstanr$draws("p", format = "matrix")[, "p[1]"],
    probs = percentiles)

     2.5%     97.5% 
0.6623710 0.8853289

These results are equivalent to the previous part. There is a small difference for the credible interval, but it is due to the different implementations used by rstan and cmdstanr, and the implemented steps such as rounding.

To leave a comment for the author, please follow the link and comment on their blog: pacha.dev/blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version