Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The previous post outlined the general strategy of writing a MH within Gibbs sampler by breaking the code into two levels: a high level shell and a series of lower-level samplers which do the actual work.
This post discusses the first of the lower-level samplers, the sampler for the person ability parameters. The chapter gives a complete treatment of how to write the sampler, please read it for the details. Here, we summarize the development of the sampler, check that the sampler code is correct, and give two examples of how bugs might be detected. The final bug will motivate the next post where we will refactor this specific person ability sampler into a general MH sampler.
The person ability sampler
The person ability sampler will use the Metropolis-Hastings (MH) algorithm to sample from the complete conditional. See the chapter for details on how (and why!) to write the code presented in this section.
In a previous post we derived the complete conditional as:
begin{equation}
f(theta_p|text{rest}) propto
prod_{i=1}^I
pi_{pi}^{u_{pi}} (1 – pi_{pi})^{1-u_{pi}}
f_text{Normal}(theta_p| 0,sigma_theta^2) label{eq:thetacc}
end{equation}
where
The likelihood term in Equation ref{eq:thetacc} log.prob
function to calculate it on the log scale:
## log.prob ## ## Output : A matrix such that the ## column sums are item log-likelihood values, and ## row sums are person log-likelihood values. log.prob <- function(U.data, th, a, b) { ## Step 1: Build the matrix of probabilities ## (see Post 2 for details) term.1 <- outer(th, a) term.2 <- matrix(rep(a*b, P.persons), nrow=P.persons, byrow=TRUE) P.prob <- plogis(term.1 - term.2) #### 1/(1 + exp(term.2 - term.1)) ## Step 2: Build the likelihood from Equation 1 on the log scale log.bernoulli <- log(P.prob^U.data) + log((1-P.prob)^(1-U.data)) return(log.bernoulli) }
The chapter gives details on how to implement the MH sampler. A more highly commented version of the code is given here:
sample.th <- function(U.data, old) { ## Access the current (soon to be old) state of the chain ## N.B. This is vector. All operations in this function ## are on the complete vector instead of element-at-a-time th.old <- old$th ## Draw the MH proposal ## ... get the MH proposal variance MH.th <- old$MH$th ## ... get the number of persons in the data P.persons <- length(th.old) ## ... draw a new person ability parameter for each person ## centered around their previous (th.old) parameter, with ## a fixed variance (MH.th) th.star <- rnorm(P.persons,th.old,MH.th) ## Calculate the probability of accepting the proposal draw ## ... N.B. All calculations are done on the log scale to keep ## numerical accuracy when dealing with small numbers ## ... The acceptance probability is a function of four things: ## ## ... ... 1: The complete conditional at the proposal log.cc.star <- ( ## ... ... likelihood term apply(log.prob(U.data, th.star, old$a, old$b),1,sum) ## ... ... prior term + log(dnorm(th.star,0,sqrt(old$s2)))) ## ## ... ... 2: The complete conditional at the old value log.cc.old <- ( ## ... ... likelihood term apply(log.prob(U.data, th.old, old$a, old$b),1,sum) ## ... ... prior term + log(dnorm(th.old,0,sqrt(old$s2)))) ## ## ... ... 3: The MH proposal density at the proposal log.prop.star <- log(dnorm(th.star,th.old,MH.th)) ## ## ... ... 4: The MH proposal density at the old value log.prop.old <- log(dnorm(th.old,th.star,MH.th)) ## ## ... Calculate the acceptance probability on the log scale log.alpha.star <- pmin(log.cc.star + log.prop.old - log.cc.old - log.prop.star,0) ## Accept or reject the proposals by flipping coins with the ## calculated acceptance probabilities acc.new < - ifelse(log(runif(P.persons))
Please see the chapter for additional details.
Running the sampler
We can run the sampler as follows:
## Load the necessary code from this and previous posts source("http://mcmcinirt.stat.cmu.edu/setup/post-4.R") ## Set the seed to keep results reproducible set.seed(314159) ## Run the sampler with overdispersed theta.abl ## values and the true, simulated values for the other ## parameters. run.B <- run.chain.2pl( ## No burnin, keep 1000 iterations, do not thin M.burnin = 0, M.keep = 1000, M.thin = 1, ## Use the fake data simulated in Post #2 U.data = U, ## Pass in the hyperpriors from Post #1 hyperpars = hyperpars, ## Generate starting values uniformly from -5 to 5 th.init = runif( length(theta.abl), min=-5, max=5 ), ## Set the other parameters to their true values ## from Post #2 a.init = a.disc, b.init = b.diff, s2.init = sig2.theta, ## Set the person ability proposal variance to 1, ## the others to NULL MH.th=1, MH.a=NULL, MH.b=NULL, ## Output the acceptance rates verbose=TRUE) ## [1] 100 ## [1] 200 ## ... snip ... ## [1] 1000 ## Average acceptance rates: ## theta.abl: 0.481 ## a.disc: NaN ## b.diff: NaN ## Convert to coda run.B.mcmc <- mcmc( t(run.B) )
Note that the acceptance rate reported by the sampler is about 48%, which is near the optimal rate. I achieved this by changing the value for MH.th
until it was "just right". The tuning runs are not shown (see Post 9 for how to tune the sampler).
Testing the sampler
Using the coda
library and the get.2pl.params
function from Post 3 (both of which were loaded by the source("http://mcmcinirt.stat.cmu.edu/setup/post-4.R")
line in the code above), we can quickly visualize the trace plots for
using the default plot
function:
## N.B. ylim specifies the scale on the y-axis. plot( run.B.mcmc[, get.2pl.params(1:9,NULL,NULL,NULL)], density=FALSE, ylim=c(-5,5), smooth=TRUE )
From the trace plots, it appears that each chain quickly converges to the mode from its starting value. We can test that the chains are getting the right answers by calculating the average values of the chains (Expected A Posteriori estimates; EAP estimates),
theta.eap.B <- apply( run.B.mcmc[, get.2pl.params(1:P.persons, NULL, NULL, NULL)], MARGIN=2, mean )
and graphically comparing them with the simulated true values:
To generate the graph, we define a special visualization function
## We'll be doing these checks a lot, so let's make a function check.sampler.graph <- function(true.values, estimated.values, desc, ylab, col) { par(mfrow=c(1,2)) plot( true.values, estimated.values, xlab='True values', ylab=ylab, main=paste('Parameter recovery:',desc), col=col) abline(0,1) plot( ecdf( true.values ), xlab=deparse(substitute(true.values)), main=paste('Empirical CDFs:',desc) ) plot( ecdf( estimated.values ), col=col, lty='dashed', add=TRUE ) legend( 'topleft', c('True values', ylab), col=c('black', col), lty=c('solid','dashed'), bg='white') ## Add in the KS test legend( 'bottomright', paste("Kolmogorov-Smirnov p value:", round( ks.test( true.values, estimated.values )$p.value, 2)), bg='white') par(mfrow=c(1,1)) }
and call it for this specific run of the sampler:
## Run the function for this example: check.sampler.graph(theta.abl, theta.eap.B, 'Person Ability', 'EAP estimates', 'blue') ## Parameter recovery looks a little biased ## Empirical CDF looks very close
Both the parameter recovery scatter plot and the empirical CDF plot suggest that the sampler is properly recovering the simulated values. This is good news; it suggests that we implemented the sampler correctly.
What if there is a bug in the code?
Bug 1: Good sampler properties, but wrong answers
In my experience, a common bug in writing a sampler is not calculating the log-likelihood correctly. Here we introduce a bug in the calculation of the log-likelihood to see what happens with the sampler.
## Introduce a bug into the calculation of the log-likelihood log.prob.buggy <- function(U.data, th, a, b) { term.1 <- outer(th, b) ## This should be 'a' not 'b' term.2 <- matrix(rep(a*b, P.persons), nrow=P.persons, byrow=TRUE) P.prob <- plogis(term.1 - term.2) log.bernoulli <- log(P.prob^U.data) + log((1-P.prob)^(1-U.data)) return(log.bernoulli) }
Now we swap in the buggy log.prob
function, make a buggy run of the sampler, and the swap out the buggy log.prob
function for the correct one:
## Set the seed to keep results reproducible set.seed(314159) ## Swap in the buggy log likelihood log.prob.correct <- log.prob log.prob <- log.prob.buggy ## Make a buggy run run.B.buggy <- run.chain.2pl( ## No burnin, keep 1000 iterations, do not thin M.burnin = 0, M.keep = 1000, M.thin = 1, ## Use the fake data simulated in Post #2 U.data = U, ## Pass in the hyperpriors from Post #1 hyperpars = hyperpars, ## Generate starting values uniformly from -5 to 5 th.init = runif( length(theta.abl), min=-5, max=5 ), ## Set the other parameters to their true values ## from Post #2 a.init = a.disc, b.init = b.diff, s2.init = sig2.theta, MH.th=1, MH.a=NULL, MH.b=NULL, verbose=TRUE) ## Swap out the buggy sampler with the correct one log.prob <- log.prob.correct ## Convert to coda run.B.buggy.mcmc <- mcmc( t(run.B.buggy) )
Now, the first nine person ability parameters still converge quickly to their mode:
|
plot( run.B.buggy.mcmc[, get.2pl.params(1:9,NULL,NULL,NULL)], density=FALSE, ylim=c(-5,5), smooth=TRUE ) |
However, those modes do not correspond to the true values:
Lesson learned from bug #1
Sometimes the sampler will converge well to a global mode, but that mode will be the wrong answer because there is an error in the way that the likelihood (or prior) is being calculated. Only a check with simulated data -- where the true values are known -- will be able to spot this bug.
Bug 2: Terrible sampler properties, but correct answers
In my experience, a common bug in writing a sampler is making a small error in how the Metropolis-Hastings algorithm is implemented. Here we introduce a bug in the calculation of the acceptance probability:
sample.th.buggy <- function(U.data, old) { th.old <- old$th MH.th <- old$MH$th P.persons <- length(th.old) th.star <- rnorm(P.persons,th.old,MH.th) log.cc.star <- apply( log.prob(U.data, th.star, old$a, old$b),1,sum) + log(dnorm(th.star,0,sqrt(old$s2))) log.cc.old <- apply( log.prob(U.data, th.old, old$a, old$b),1,sum) + log(dnorm(th.old ,0,sqrt(old$s2))) log.prop.star <- log(dnorm(th.star,th.old,MH.th)) log.prop.old <- log(dnorm(th.old,th.star,MH.th)) ## Bug: add log.prop.star, instead of subtract log.alpha.star <- pmin(log.cc.star + log.prop.old - log.cc.old + log.prop.star,0) acc.new < - ifelse(log(runif(P.persons))
Since the bug involves the acceptance rate, we must adjust the variance of the Metropolis-Hastings proposal to get reasonable acceptance rates:
## Set the seed to keep results reproducible set.seed(314159) ## Swap in sample.th.correct <- sample.th sample.th <- sample.th.buggy ## Make a buggy run run.B.buggy.th <- run.chain.2pl( ## No burnin, keep 1000 iterations, do not thin M.burnin = 0, M.keep = 1000, M.thin = 1, ## Use the fake data simulated in Post #2 U.data = U, ## Pass in the hyperpriors from Post #1 hyperpars = hyperpars, ## Generate starting values uniformly from -5 to 5 th.init = runif( length(theta.abl), min=-5, max=5 ), ## Set the other parameters to their true values ## from Post #2 a.init = a.disc, b.init = b.diff, s2.init = sig2.theta, ## Note that MH.th is 0.44 instead of 1 MH.th=0.44, MH.a=NULL, MH.b=NULL, verbose=TRUE) ## Average acceptance rates: ## theta.abl: 0.443 ## Swap out the buggy sampler with the correct one sample.th <- sample.th.correct ## Convert to coda run.B.buggy.th.mcmc <- mcmc( t(run.B.buggy.th) )
Again, the first nine person ability parameters still converge quickly to their modes:
|
plot( run.B.buggy.th.mcmc[, get.2pl.params(1:9,NULL,NULL,NULL)], density=FALSE, ylim=c(-5,5), smooth=TRUE ) |
And the chains even recover the true values:
So, if the chains converge quickly to their modes and recover their true values, in what sense is this a bug?
This bug is about efficiency. Note that the buggy chain has much higher auto-correlation than the correct chain, even though both acceptance rates are around 0.44.
|
par(mfrow=c(1,2)) acf( run.B.mcmc[, 'theta.abl 1'], main="ACF: correct sampler", col='blue') acf( run.B.buggy.th.mcmc[, 'theta.abl 1'], main="ACF: buggy sampler", col='purple') par(mfrow=c(1,1)) |
Since the autocorrelation is so high in the buggy sampler, there is less information about the posterior distribution in the chain. One metric that is both simple to calculate and to interpret is the effective sample size. This is (roughly) the number of draws from an independent sampler that would have the same amount of information as the (in this case, 1,000) draws from the correlated sampler. In R, we can calculate an effective sample size for each
Note that the 1,000 draws from the correct sampler is worth about 200 draws from a hypothetical independent sampler, while the 1,000 draws from the buggy sampler is worth only about 50. Therefore, this bug does affect our ability to draw inferences -- mainly because our standard error of estimation will be quite a bit worse.
Lesson learned from bug #2
Sometimes a sampler will appear to recover the true values, but the sampler will be very inefficient. This could be due to a mistake in the algorithm. Unfortunately, detecting these bugs is quite difficult.
In the next post, we give a partial solution to avoiding such bugs.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.