Fitting a distribution in Stan from scratch


Last week the French National Institute of Health and Medical Research (Inserm) organised with the Stan Group a training programme on Bayesian Inference with Stan for Pharmacometrics in Paris.

Daniel Lee and Michael Betancourt, who ran the course over three days, are not only members of Stan’s development team, but also excellent teachers. Both were supported by Eric Novik, who also gave an Introduction to Stan talk at the Paris Dataiku User Group last week.

Eric Kramer (Dataiku), Daniel Lee, Eric Novik & Michael Betancourt (Stan Group)

I have been playing around with Stan on and off for some time, but as Eric pointed out to me, Stan is not that kind of girl (boy?). Indeed, having spent three days working with Stan has revitalised my relationship. Getting down to the basics has been really helpful, and I shall remember: Stan does not draw samples from a distribution. Instead, it builds up the joint probability density function of the model and evaluates it in log space.

Thus, here is a little example of fitting a set of random numbers generated in R to a Normal distribution with Stan. However, instead of using the built-in functions for the Normal distribution, I define the log probability density function by hand, use it in the model block, and even generate a posterior predictive sample from it, starting with a uniform random number. I do use pre-defined distributions for the priors, though.
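For reference, the function myNormal_lpdf below implements nothing more than the log of the Normal density:

\[
\log f(y \mid \mu, \sigma) = -\tfrac{1}{2}\log(2\pi) - \log(\sigma) - \frac{(y-\mu)^2}{2\sigma^2}
\]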

Why do I want to do this? This will be a template for the day when I have to use a distribution that is not predefined in Stan; the actuar package, for example, has some interesting candidates.

functions {
  // Define the log probability density function by hand
  real myNormal_lpdf(real y, real mu, real sigma) {
    return -log(2 * pi()) / 2 - log(sigma)
           - square(mu - y) / (2 * sigma^2);
  }
}
data {
  int N;
  real y[N];
}
parameters {
  real mu;
  real<lower = 0> sigma;
}
model {
  // Priors
  mu ~ normal(0, 10);
  sigma ~ cauchy(0, 10);
  // Likelihood
  for(n in 1:N){
    target += myNormal_lpdf(y[n] | mu, sigma);
  }
}
generated quantities {
  real y_ppc;
  {
    real x;
    x = uniform_rng(0, 1);
    // Phi is the CDF of the standardised Normal N(0,1) in Stan;
    // inv_Phi (the probit function) is its quantile function
    y_ppc = mu + sigma * inv_Phi(x);
  }
}
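The generated quantities block draws a posterior predictive value by inverse transform sampling: a uniform random number is pushed through the Normal quantile function. In plain R the same idea looks roughly like this (a minimal sketch, with mu and sigma set to arbitrary illustrative values rather than posterior draws):

# Inverse transform sampling from a Normal(mu, sigma), mirroring the
# generated quantities block above: draw u ~ Uniform(0, 1) and map it
# through the quantile function (qnorm plays the role of Stan's inv_Phi)
mu <- 4; sigma <- 2        # illustrative values only
u <- runif(1)              # corresponds to uniform_rng(0, 1)
y_draw <- mu + sigma * qnorm(u)
y_draw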

Testing

I start off by generating fake data: a sample of 100 random numbers drawn from a Normal distribution with a mean of 4 and a standard deviation of 2. Note that the sample mean of those 100 figures is 4.2, not 4.
Histogram of 100 random numbers drawn from N(4,2).
I then use the Stan script to fit the data, i.e. to estimate the parameters μ and σ, assuming that the data was generated from a Normal (Gaussian) distribution.
library(rstan)
library(MASS)
set.seed(1)
y <- rnorm(100, 4, 2)
truehist(y, col="#B2001D")
lines(density(y), col="skyblue", lwd=2)
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4294 3.0120 4.2280 4.2180 5.3830 8.8030
ret <- stanc(file="NormaLDistribution.stan") # Check Stan file
ret_sm <- stan_model(stanc_ret = ret) # Compile Stan code
fit <- sampling(ret_sm, warmup=100, iter=600, seed=1,
                data=list(y=y, N=length(y)))
stan_trace(fit, inc_warmup = TRUE)
stan_hist(fit)
print(fit, probs=c(0.025, 0.5, 0.975))
## Inference for Stan model: NormaLDistribution.
## 4 chains, each with iter=600; warmup=100; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
## mu 4.22 0.01 0.17 3.88 4.22 4.53 771 1
## sigma 1.82 0.00 0.13 1.57 1.81 2.10 2000 1
## y_ppc 4.22 0.04 1.86 0.70 4.21 7.88 2000 1
## lp__ -200.32 0.03 0.97 -203.02 -200.02 -199.41 1016 1
##
## Samples were drawn using NUTS(diag_e) at Mon Sep 26 07:25:09 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
summary(extract(fit, "y_ppc")[["y_ppc"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.483 2.959 4.213 4.222 5.451 10.540
plot(ecdf(y), main="Posterior predictive check")
lines(ecdf(extract(fit, "y_ppc")[["y_ppc"]]), col="#B2001D")

Traceplot of 4 chains, including warm-up phase
Histograms of posterior parameter and predictive samples
Comparison of the empirical distributions
The 95% credible intervals of the posterior parameter distributions contain the true values of both μ and σ. The posterior predictive distribution (y_ppc) is wider, as it takes the uncertainty of the parameters into account. The interquartile range and mean of my initial fake data and of the posterior predictive sample look very similar. That's good: my model generates data that looks like the original data.
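As a quick numerical check, the summary statistics of the original data and of the posterior predictive sample can be put side by side (a small sketch reusing the objects created above; y_ppc_draws is just a helper name for the extracted draws):

# Two-row matrix of Min., 1st Qu., Median, Mean, 3rd Qu., Max.
y_ppc_draws <- extract(fit, "y_ppc")[["y_ppc"]]
rbind(data = summary(y), ppc = summary(y_ppc_draws))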

Bayesian Mixer Meetup

Btw, tonight we have the 4th Bayesian Mixer Meetup in London.

Session Info

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.12 (Sierra)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base     

other attached packages:
[1] MASS_7.3-45 rstan_2.12.1 StanHeaders_2.12.0 ggplot2_2.1.0     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7      codetools_0.2-14 digest_0.6.10    grid_3.3.1      
 [5] plyr_1.8.4       gtable_0.2.0     stats4_3.3.1     scales_0.4.0    
 [9] labeling_0.3     tools_3.3.1      munsell_0.4.3    inline_0.3.14   
[13] colorspace_1.2-6 gridExtra_2.2.1
