Fun with the proto package: building an MCMC sampler for Bayesian regression
[This article was first published on Last Resort Software, and kindly contributed to R-bloggers.]
The proto package is my latest favourite R goodie. It brings prototype-based programming to the R language – a style of programming that lets you do many of the things you can do with classes, but with a lot less up-front work. Louis Kates and Thomas Petzoldt provide an excellent introduction to using proto in the package vignette.
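If you haven't met proto before, here is a minimal sketch of the basic idea (my own toy example, not from the vignette): an object is just a bundle of data and functions, and a child object delegates to its parent for anything it doesn't define itself. Note that, as in the code later in this post, the first argument to proto() is the parent object.

library(proto)

# A parent object holding a default value and a method
# (the `.` argument is the object itself, supplied automatically)
Account <- proto(
    balance = 0,
    deposit = function(., amount) .$balance <- .$balance + amount
)

# A child object with its own balance; the deposit method is
# found in the parent by delegation
a <- proto(Account, balance = 100)
a$deposit(50)
a$balance        # 150
Account$balance  # still 0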
As a learning exercise I concocted the example below involving Bayesian logistic regression. It was inspired by an article on Matt Shotwell's blog about using R environments, rather than lists, to store the state of a Markov Chain Monte Carlo sampler. Here I use proto to create a parent, class-like object (or trait in proto-ese) that contains the regression functions, and child objects that hold both data and results for individual analyses.
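To see why environments (on which proto objects are built) suit this job, here is a quick illustration of the point from Shotwell's article – my own sketch, not his code. An environment is a mutable reference, so a sampler can update its state in place; a list would have to be copied, returned and reassigned at every step.

# An environment is passed by reference, so a function can update it
# directly; no copy is made and nothing needs to be returned
state <- new.env()
state$beta <- c(0, 0)

step <- function(s) {
    s$beta <- s$beta + rnorm(2)   # modifies 'state' itself
    invisible(NULL)
}

step(state)
state$beta   # updated in place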
First here’s an example session…
# Make up some data with a continuous predictor and binary response
nrec <- 500
x <- rnorm(nrec)
y <- rbinom(nrec, 1, plogis(2 - 4*x))

# Predictor matrix with a col of 1s for intercept
pred <- matrix(c(rep(1, nrec), x), ncol=2)
colnames(pred) <- c("intercept", "X")

# Load the proto package
library(proto)

# Use the Logistic parent object to create a child object which will
# hold the data and run the regression (the $ operator references
# functions and data within a proto object)
lr <- Logistic$new(pred, y)
lr$run(5000, 1000)

# lr now contains both data and results
str(lr)

proto object
 $ cov      : num [1:2, 1:2] 0.05 -0.0667 -0.0667 0.1621
  ..- attr(*, "dimnames")=List of 2
 $ prior.cov: num [1:2, 1:2] 100 0 0 100
 $ prior.mu : num [1:2] 0 0
 $ beta     : num [1:5000, 1:2] 2.09 2.09 2.09 2.21 2.21 ...
  ..- attr(*, "dimnames")=List of 2
 $ adapt    : num 1000
 $ y        : num [1:500] 0 1 1 1 1 1 1 1 1 1 ...
 $ x        : num [1:500, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "dimnames")=List of 2
 parent: proto object

# Use the Logistic summary function to tabulate and plot results
lr$summary()

From 5000 samples after 1000 iterations burning in
   intercept           X
 Min.   :1.420   Min.   :-5.296
 1st Qu.:1.840   1st Qu.:-3.915
 Median :2.000   Median :-3.668
 Mean   :1.994   Mean   :-3.693
 3rd Qu.:2.128   3rd Qu.:-3.455
 Max.   :2.744   Max.   :-2.437
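Since the data were simulated with an intercept of 2 and a slope of -4, a quick sanity check (not part of the session above) is to set the posterior means against an ordinary maximum-likelihood fit:

# Maximum-likelihood fit for comparison; both sets of estimates
# should be close to the true values (2, -4) used to simulate the data
glm(y ~ x, family = binomial)

# Posterior means from the sampler
colMeans(lr$beta)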
And here's the code for the Logistic trait...
Logistic <- proto()

Logistic$new <- function(., x, y) {
    # Creates a child object to hold data and results
    #
    # x - a design matrix (i.e. the predictors)
    # y - a binary response vector

    proto(., x=x, y=y)
}

Logistic$run <- function(., niter, adapt=1000) {
    # Perform the regression by running the MCMC sampler
    #
    # niter - number of iterations to sample
    # adapt - number of prior iterations to run
    #         for the burn-in period

    require(mvtnorm)

    # Set up variables used by the sampler
    .$adapt <- adapt
    total.iter <- niter + adapt
    .$beta <- matrix(0, nrow=total.iter, ncol=ncol(.$x))
    colnames(.$beta) <- colnames(.$x)  # carry predictor names through to the results
    .$prior.mu <- rep(0, ncol(.$x))
    .$prior.cov <- diag(100, ncol(.$x))
    .$cov <- diag(ncol(.$x))

    # Run the sampler
    b <- rep(0, ncol(.$x))
    for (i in 1:total.iter) {
        b <- .$update(i, b)
        .$beta[i,] <- b
    }

    # Trim the results matrix to remove the burn-in period
    if (.$adapt > 0) {
        .$beta <- .$beta[(.$adapt + 1):total.iter,]
    }
}

Logistic$update <- function(., it, beta.old) {
    # Perform a single iteration of the MCMC sampler using the
    # Metropolis-Hastings algorithm.
    # Adapted from code by Brian Neelon published at:
    # http://www.duke.edu/~neelo003/r/
    #
    # it       - iteration number
    # beta.old - vector of coefficient values from
    #            the previous iteration

    # Update the proposal covariance from recent samples once we
    # are far enough through the sampling
    if (.$adapt > 0 && it > 2 * .$adapt) {
        .$cov <- cov(.$beta[(it - .$adapt):(it - 1),])
    }

    # Generate proposed new coefficient values
    beta.new <- c(beta.old + rmvnorm(1, sigma=.$cov))

    # Calculate the prior density and log-likelihood of the current
    # values on the first iteration (thereafter they are cached)
    if (it == 1) {
        .$..log.prior.old <- dmvnorm(beta.old, .$prior.mu, .$prior.cov, log=TRUE)
        .$..probs.old <- plogis(.$x %*% beta.old)
        .$..LL.old <- sum(log(ifelse(.$y, .$..probs.old, 1 - .$..probs.old)))
    }

    log.prior.new <- dmvnorm(beta.new, .$prior.mu, .$prior.cov, log=TRUE)
    probs.new <- plogis(.$x %*% beta.new)
    LL.new <- sum(log(ifelse(.$y, probs.new, 1 - probs.new)))

    # Metropolis-Hastings acceptance ratio (log scale)
    ratio <- LL.new + log.prior.new - .$..LL.old - .$..log.prior.old

    if (log(runif(1)) < ratio) {
        # Accept the proposal and cache its prior and likelihood
        .$..log.prior.old <- log.prior.new
        .$..probs.old <- probs.new
        .$..LL.old <- LL.new
        return(beta.new)
    } else {
        return(beta.old)
    }
}

Logistic$summary <- function(., show.plot=TRUE) {
    # Summarize the results as a table of sample quantiles and,
    # optionally, a density plot for each coefficient

    cat("From", nrow(.$beta), "samples after", .$adapt, "iterations burning in\n")
    base::print(base::summary(.$beta))

    if (show.plot) {
        par(mfrow=c(1, ncol(.$beta)))
        for (i in 1:ncol(.$beta)) {
            plot(density(.$beta[,i]), main=colnames(.$beta)[i])
        }
    }
}
Now that's probably not the greatest design in the world, but it only took me a few minutes to put it together and it's incredibly easy to modify or extend. Try it!
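For instance, here is one way you might extend it (my suggestion, not part of the code above): bolt a trace-plot method onto the Logistic trait after the fact, and every existing child object, such as lr, picks it up immediately through delegation.

# Add a trace plot method to the parent trait; existing children
# inherit it without being re-created
Logistic$traceplot <- function(.) {
    # One trace plot per coefficient, stacked vertically
    par(mfrow=c(ncol(.$beta), 1))
    for (i in 1:ncol(.$beta)) {
        plot(.$beta[,i], type="l",
             xlab="iteration", ylab=colnames(.$beta)[i])
    }
}

lr$traceplot()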
Thanks to Brian Neelon for making his MCMC logistic regression code available (http://www.duke.edu/~neelo003/r/).