understanding computational Bayesian statistics

Posted on October 9, 2011 by xi'an in R bloggers | 0 Comments

[This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers].

I have just finished reading this book by Bill Bolstad (University of Waikato, New Zealand) which a previous ‘Og post pointed out when it appeared, shortly after our Introducing Monte Carlo Methods with R. My family commented that the cover was nicer than those of my own books, which is true. Before I launch into a review, let me warn the ‘Og reader that, as an author of three books on computational Bayesian statistics, I cannot be very objective on the topic: I do favour the way we approached Bayesian computational methods and, after reading Bolstad’s Understanding computational Bayesian statistics, would still have written the books the way we did. Be warned, thus.

Understanding computational Bayesian statistics covers the basics of Monte Carlo and (fixed dimension) Markov chain Monte Carlo methods, with a fair chunk dedicated to prerequisites in Bayesian statistics and Markov chain theory. Even though I have only glanced at the table of contents of Bolstad’s Introduction to Bayesian Statistics [using almost the same nice whirl picture albeit in bronze rather than cobalt], it seems to me that the current book is the continuation of the earlier one, going beyond the Binomial, Poisson, and normal cases, to cover generalised linear models, via MCMC methods. (In this respect, it corresponds to Chapter 4 of Bayesian Core.) The book is associated with Minitab macros and an R package (written by James Curran), Bolstad2, in continuation of Bolstad, written for Introduction to Bayesian Statistics. Overall, the level of the book is such that it should be accessible to undergraduate students, MCMC methods being reduced to Gibbs, random walk and independent Metropolis-Hastings algorithms, and convergence assessments being done via autocorrelation graphs, the Gelman and Rubin (1992) intra-/inter-variance criterion, and a forward coupling device. The illustrative chapters cover logistic regression (Chap. 8), Poisson regression (Chap. 9), and normal hierarchical models (Chap. 10). Again, the overall feeling is that the book should be understandable to undergraduate students, even though it may make MCMC seem easier than it is by sticking to fairly regular models. In a sense, it is more a book of the [roaring MCMC] 90’s in that it does not incorporate advances from 2000 onwards (as seen from the reference list) like adaptive MCMC and the resurgence of importance sampling via particle systems and sequential Monte Carlo.

“Since we are uncertain about the true values of the parameters, in Bayesian statistics we will consider them to be random variables. This contrasts with the frequentist idea that the parameters are fixed but unknown constants.” W. Bolstad, p.3

To get into more details, I find the book’s introduction to Bayesian statistics (Chap. 1) somewhat unbalanced, with statements like the above and like “statisticians have long known that the Bayesian approach offered clear cut advantages over the frequentist approach” (p.1) [which makes one wonder why there is any frequentist left!], or “clearly, the Bayesian approach is more straightforward [than the frequentist p-value]” (p.53), because antagonistic presentations are likely to be lost to the neophyte. (I also disagree with the statement that for a Bayesian, there is no fixed value for the parameter!) The statement that the MAP estimator is associated with the 0-1 loss function (footnote 4, p.10) is alas found in many books and papers, thus cannot truly be blamed on the author. The statement that ancillary statistics “only work in exponential families” (footnote 5, p.13) is either unclear or wrong. The discussion about Bayesian inference in the presence of nuisance parameters (pp.15-16) is also confusing: “the Bayesian posterior density of θ₁ found by marginalizing θ₂ out of the joint posterior density, and the profile likelihood function of θ₁ turn out to have the same shape” (p.15) [under a flat prior] sounds wrong to me.
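As a hedged illustration of why the "same shape" statement cannot hold in general, the sketch below takes a case that is not from the book: a normal sample with unknown mean and variance and a flat prior on both (an assumption made here purely for the example, as are the data values and function names). Maximising the variance out gives a profile likelihood proportional to $A(\mu)^{-n/2}$, with $A(\mu)=\sum_i (x_i-\mu)^2$, while integrating it out gives a marginal posterior proportional to $A(\mu)^{-(n-2)/2}$, so the two curves decay at different rates.

```python
import math

def log_joint(mu, sigma2, data):
    """Normal log-likelihood up to a constant; under a flat prior on
    (mu, sigma2) this is also the log joint posterior up to a constant."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(sigma2) - 0.5 * ss / sigma2

def log_profile(mu, data):
    # Profile likelihood: plug in the maximising variance for this mu.
    s2_hat = sum((x - mu) ** 2 for x in data) / len(data)
    return log_joint(mu, s2_hat, data)

def log_marginal(mu, data, grid=20000, lo=1e-3, hi=1e4):
    # Marginal posterior: integrate sigma2 out (midpoint rule, log-spaced grid).
    step = (math.log(hi) - math.log(lo)) / grid
    total = 0.0
    for i in range(grid):
        s2 = math.exp(math.log(lo) + (i + 0.5) * step)
        # d(sigma2) = sigma2 * d(log sigma2)
        total += math.exp(log_joint(mu, s2, data)) * s2 * step
    return math.log(total)

if __name__ == "__main__":
    data = [0.1, -0.4, 1.3, 0.7, -0.9]  # made-up observations
    xbar = sum(data) / len(data)
    # Drop in log-height when moving one unit away from the mode:
    print("profile :", log_profile(xbar, data) - log_profile(xbar + 1, data))
    print("marginal:", log_marginal(xbar, data) - log_marginal(xbar + 1, data))
```

If the two functions had the same shape, the two printed drops would coincide; in this example the profile likelihood falls off faster than the marginal posterior.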

“It is not possible to do any inference about the parameter θ from the unscaled posterior.” W. Bolstad, p.25

The chapter about simulation methods (Chap. 2) contains a mistake that someone might deem of little importance. However, I do not and here it is: sampling-importance-resampling is presented as an exact simulation method (p.34), omitting the bias due to normalising the importance weights.
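To make the point about the bias concrete, here is a minimal Python sketch of sampling-importance-resampling. It is not taken from the book: the N(2,1) target, the N(0,9) proposal, the sample sizes, and all function names are assumptions chosen for illustration. With a single proposal draw the resampled value is simply a draw from the proposal, and the output only approaches the target as the number of proposal draws grows.

```python
import math
import random

def unnorm_target(x):
    # Unnormalised N(2, 1) density: the normalising constant is dropped on purpose.
    return math.exp(-0.5 * (x - 2.0) ** 2)

def proposal_pdf(x, scale=3.0):
    # N(0, scale^2) density.
    return math.exp(-0.5 * (x / scale) ** 2) / (scale * math.sqrt(2.0 * math.pi))

def sir_draw(m, rng, scale=3.0):
    """Return one SIR draw built from m proposal values."""
    xs = [rng.gauss(0.0, scale) for _ in range(m)]
    ws = [unnorm_target(x) / proposal_pdf(x, scale) for x in xs]
    # random.choices rescales the weights to sum to one; this self-normalisation
    # is what makes the output only approximately distributed from the target.
    return rng.choices(xs, weights=ws, k=1)[0]

def sir_mean(m, reps=2000, seed=1):
    # Average of `reps` independent SIR draws, each using m proposal values.
    rng = random.Random(seed)
    return sum(sir_draw(m, rng) for _ in range(reps)) / reps

if __name__ == "__main__":
    for m in (1, 5, 200):
        print(m, round(sir_mean(m), 3))
```

The averages should move from the proposal mean (0) towards the target mean (2) as `m` increases, which is the sense in which SIR is approximate rather than exact for any finite `m`.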

The chapter on conjugate priors (Chap. 4), although fine, feels as if it does not belong to this book but should rather be in Bolstad’s previous Introduction to Bayesian Statistics, especially as it is on the long side. The following Chap. 5 gives an introduction to Markov chain theory in the finite state case, with a nice illustration on the differences in convergence time through two 5×5 matrices. (But why do we need six decimals?!)
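The kind of comparison described above can be sketched in a few lines of Python. The two 3×3 transition matrices below are hypothetical stand-ins, not the 5×5 matrices of the book, and the tolerance and function names are likewise assumptions. The sketch raises each matrix to successive powers and reports the first power at which all rows agree, that is, when the chain has forgotten its starting state.

```python
def mat_mul(a, b):
    # Plain matrix product for small square matrices stored as lists of rows.
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def steps_to_mix(p, tol=5e-4, max_steps=10_000):
    """Smallest n such that all rows of P^n agree entrywise within tol."""
    power = [row[:] for row in p]
    for n in range(1, max_steps + 1):
        # Largest disagreement between rows, taken column by column.
        spread = max(max(col) - min(col) for col in zip(*power))
        if spread < tol:
            return n
        power = mat_mul(power, p)
    return None  # did not mix within max_steps

# Hypothetical chains: FAST moves freely, SLOW mostly stays where it is.
FAST = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
SLOW = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]

if __name__ == "__main__":
    print("fast chain:", steps_to_mix(FAST), "steps")
    print("slow chain:", steps_to_mix(SLOW), "steps")
```

Both chains share the same uniform stationary distribution; only the second-largest eigenvalue of the matrix differs, and it alone drives the gap in convergence time.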

“MCMC methods are more efficient than the direct [simulation] procedures for drawing samples from the posterior when we have a large number of parameters.” W. Bolstad, p.127

MCMC methods are presented through two chapters, the second one being entitled “Statistical inference from a Markov chain Monte Carlo sample” (Chap. 7), which is a neat idea to cover the analysis of an MCMC output. The presentation is mainly one-dimensional, which makes the recommendation to use independent Metropolis-Hastings algorithms found throughout the book [using a t proposal based on curvature at the mode] more understandable if misguided. The presentation of the blockwise Metropolis-Hastings algorithm of Hastings through the formula (p.145)

$P(\theta,A)=\prod_{j=1}^J P_j(\theta_j,A_j|\theta_{-j})$

is a bit confusing as the update of the conditioners in the conditional kernels is not indicated. (The following algorithm is correct, though.) I also disliked the notion that “the sequence of draws from the chain (..) is not a random sample” (p.161) because of the correlation: the draws are random, if not independent… This relates to the recommendation of using heavy thin-in with a gap that “should be the same as the burn-in time” (p.169), which sounds like a waste of simulation power, as burn-in and thin-in of a Markov chain are two different features. The author disagrees with the [my] viewpoint that keeping all the draws in the estimates improves on the precision: e.g., “one school considers that you should use all draws (…) However, it is not clear how good this estimate would be” (p.168) and “values that were thinned out wouldn’t be adding very much to the precision” (p.169). I did not see any mention made of effective sample size and the burn-in size is graphically determined via autocorrelation graphs, Gelman-Rubin statistics, and a rather fantastic use of coupling from the past (pp.172-174). (In fact, the criterion is a forward coupling device that only works for independent chains.)
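The thinning question can be checked with a small simulation. The Python sketch below is an illustration under stated assumptions and not an analysis of the book's examples: an AR(1) chain with autocorrelation 0.9 stands in for MCMC output, and the chain length, gap, number of replications, and function names are arbitrary choices. It compares the variance of the posterior mean estimate computed from all draws with the one computed from a heavily thinned subsample, and gives the usual large-sample effective sample size for such a chain.

```python
import random
import statistics

def ar1_chain(n, rho, rng):
    """Stationary AR(1) chain with N(0, 1) marginal, a stand-in for MCMC output."""
    x = rng.gauss(0.0, 1.0)
    innov_sd = (1.0 - rho ** 2) ** 0.5
    chain = []
    for _ in range(n):
        x = rho * x + innov_sd * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain

def compare(n=2000, rho=0.9, gap=50, reps=300, seed=42):
    """Variance of the mean estimate over replications: all draws vs thinned."""
    rng = random.Random(seed)
    full, thinned = [], []
    for _ in range(reps):
        chain = ar1_chain(n, rho, rng)
        kept = chain[::gap]  # keep one draw out of every `gap`
        full.append(sum(chain) / len(chain))
        thinned.append(sum(kept) / len(kept))
    return statistics.pvariance(full), statistics.pvariance(thinned)

def ar1_ess(n, rho):
    # Large-n effective sample size of the mean for an AR(1) chain.
    return n * (1.0 - rho) / (1.0 + rho)

if __name__ == "__main__":
    v_full, v_thin = compare()
    print("variance using all draws:", v_full)
    print("variance after thinning :", v_thin)
    print("approximate ESS of the full chain:", ar1_ess(2000, 0.9))
```

In this setting the thinned estimate is noticeably more variable than the one using every draw, even though the discarded values are strongly correlated.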

“We should always use proper priors in the hierarchical model, particularly for scale parameters. When improper priors are used (…) overall the posterior is improper.” W. Bolstad, p.257.

The final chapters apply MCMC methods to logistic (Chap. 8) and Poisson regressions (Chap. 9), again using an independent proposal in the Metropolis-Hastings algorithm. (Actually, we also used a proposal based on the MLE solutions for the logistic regression in Introducing Monte Carlo Methods with R; however, it was in an importance sampling illustration for Chapter 4.) It is a nice introduction to handling generalised linear models with MCMC. The processing of the selection of variables (pp.195-198 and pp.224-226) could have been done in a more rigorous manner, had Bayes factors been introduced. It is also a nice idea to conclude with Gibbs sampling applied to hierarchical models (Chap. 10), a feature missing in the first edition of our Bayesian Core; however, the chapter crucially misses an advanced example, like mixed linear models. This chapter covers the possible misbehaviour of posteriors associated with improper priors, with a bit too strong a warning (see above), and it also unnecessarily [in my opinion] goes into a short description of the empirical Bayes approach (pp.245-247).

The style of Understanding computational Bayesian statistics is often repetitive, sentences from early paragraphs of a chapter being repeated verbatim a few pages later. While the idea of opposing likelihood-based inference to Bayesian inference by an illustration through a dozen graphs (Chap. 1) is praiseworthy, I fear the impact is weakened by the poor 3-D readability of the graphs. Another praiseworthy idea is the inclusion of a “Main points” section at the end of each chapter; however, they should have been more focused in my opinion. Maybe the itemized presentation did not help.

Inevitably (trust me!), there are typing mistakes in the book and they will most likely be corrected in a future printing/edition. I am however puzzled by the high number of “the the”, or the misspelling (p.261) of Jeffreys’ prior into Jeffrey’s prior (maybe a mistake from the copy-editor?). (A few normal densities are missing a ½ on p.247, by the way.)