Credible intervals with Capybaras, Python and Stan (cmdstanpy interface)

pacha.dev/blog

21 hours ago

[This article was first published on pacha.dev/blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R and Shiny Training: If you find this blog to be interesting, please note that I offer personalized and group-based training sessions that may be reserved through Buy me a Coffee. Additionally, I provide training services in the Spanish language and am available to discuss means by which I may contribute to your Shiny project.

< section id="motivation" class="level1">

Motivation

Matsuura’s excellent book Bayesian Statistical Modeling with Stan, R, and Python uses cmdstanr for all the coding examples. You can access the codes from GitHub.

Because I am reading that book, and in the previous post I explained how to obtain a credible interval with rstan and cmdstanr, I will show how to do the same with cmdstanpy.

< section id="problem" class="level1">

Problem

We have a dataset about 100 capybaras (hydrochoerus hydrochaeris). Each capybara has 2 babies, one in each season, with each season being “birth1” and “birth2”. The data is:

< !-- insert capybara image -->

In this data, the codification is “female = 0” and “male = 1”.

import pandas as pd

baby_capybaras = pd.DataFrame({
    "birth1": [1,0,0,0,1,1,0,1,0,1,0,0,1,1,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,1,1,
               0,1,0,1,1,1,0,1,0,1,1,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0,1,0,0,
               1,0,0,0,1,0,0,1,1,1,1,0,1,0,1,1,1,1,1,0,0,1,0,1,1,0,1,0,1,1,1,0,
               1,1,1,1],
    "birth2": [0,1,0,1,0,1,1,1,0,0,1,1,1,1,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,1,1,
               1,0,1,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,
               0,1,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,1,1,0,0,0,1,1,1,
               0,0,0,0]
})

Which is the probability that a capybara has a male baby if the first baby is a female?

< section id="solution" class="level1">

Solution

We can calculate posterior beliefs about the conditional probability by using Stan. Before doing computation, let’s assume the births are dependent and that the prior probability is uniform.

The Stan for code for this problem is:

// Input
data {
  int<lower=0> N1;
  int<lower=0> N2;
  int<lower=0> n1;
  int<lower=0> n2;
}

// Parameters
parameters {
  vector<lower=0, upper=1>[2] p;
}

// Model
model {
  n1 ~ binomial(N1, p[1]);
  n2 ~ binomial(N2, p[2]);
}

From Python we use cmdstanpy, which is similar to cmdstanr for R.

With cmdstanpy we do the following:

# conda activate blog
# pip install cmdstanpy

from cmdstanpy import cmdstan_path, install_cmdstan, CmdStanModel

# run once and only once after installing the package
# install_cmdstan()

mod_cmdstanpy = CmdStanModel(stan_file='baby-capybaras.stan')

baby_capybaras_list = {
    'N1': baby_capybaras['birth1'].value_counts()[0],
    'N2': baby_capybaras['birth1'].value_counts()[1],
    'n1': baby_capybaras['birth2'][baby_capybaras['birth1'] == 0].sum(),
    'n2': baby_capybaras['birth2'][baby_capybaras['birth1'] == 1].sum(),
}

# the seed is for reproducibility
# see https://www.independent.co.uk/life-style/history/42-the-answer-to-life-the-universe-and-everything-2205734.html

fit_cmdstanpy = mod_cmdstanpy.sample(
  data = baby_capybaras_list,
  seed = 42, 
  chains = 4,
  refresh = 500,
  iter_warmup = 1000,
  iter_sampling = 10000
)

fit_cmdstanpy.summary()

           Mean      MCSE    StdDev         5%        50%        95%    N_Eff  \
lp__ -63.578200  0.007388  1.013660 -65.602100 -63.267200 -62.610000  18824.6   
p[1]   0.784325  0.000298  0.057167   0.684343   0.788070   0.872173  36695.5   
p[2]   0.415275  0.000338  0.067206   0.306533   0.414058   0.528851  39566.4   

      N_Eff/s     R_hat  
lp__  12858.3  1.000140  
p[1]  25065.3  0.999977  
p[2]  27026.2  0.999925

This implies that the probability of a male birth if the first birth is a female is 0.78.

The code for the 95% credible interval with cmdstanpy is:

# percentiles 0% + alpha/2 and 100% - alpha/2
# in this case 2.5% and 97.5%
alpha = 0.05
percentiles = [0 + alpha / 2, 1 - alpha / 2]

fit_cmdstanpy.draws_pd()['p[1]'].quantile([percentiles[0], percentiles[1]])

0.025    0.662371
0.975    0.885329
Name: p[1], dtype: float64

These results are equivalent to the ones obtained with cmdstanr.

To leave a comment for the author, please follow the link and comment on their blog: pacha.dev/blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Motivation

Problem

Solution

Related