Stochastic search variable selection (SSVS) identifies promising subsets of multiple regression covariates via Gibbs sampling (George and McCulloch 1993). Here’s a short SSVS demo with JAGS and R.
Assume we have a multiple regression problem:

$$\boldsymbol{Y} = \alpha + \boldsymbol{X\beta} + \boldsymbol{\epsilon}, \quad \epsilon_i \sim N(0, \sigma^2_y)$$
We suspect only a subset of the elements of $\boldsymbol{\beta}$ are non-zero, i.e. some of the covariates have no effect.
Assume each element of $\boldsymbol{\beta}$ arises from one of two normal mixture components, depending on a latent indicator variable $\gamma_i$:

$$\beta_i \mid \gamma_i \sim (1 - \gamma_i) N(0, \tau_i^2) + \gamma_i N(0, c_i^2 \tau_i^2)$$
$\tau_i$ is positive but small, such that $\beta_i$ is close to zero when $\gamma_i = 0$; $c_i$ is large enough to allow reasonable deviations from zero when $\gamma_i = 1$. The prior probability that covariate $i$ has a nonzero effect is $Pr(\gamma_i = 1) = p_i$. Important subtleties about the choice of these priors are covered in George and McCulloch (1993) and elsewhere.
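To make this concrete, the simulation below uses a spike variance of $0.004$ and multiplies it by $1000$ for the slab, giving

$$\mathrm{sd}(\beta_i \mid \gamma_i = 0) = \sqrt{0.004} \approx 0.06, \qquad \mathrm{sd}(\beta_i \mid \gamma_i = 1) = \sqrt{0.004 \times 1000} = 2,$$

so "excluded" coefficients are pinned near zero while "included" coefficients can wander well away from it.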
Let’s simulate a dataset in which some covariates have strong effects on the linear predictor and others don’t: suppose we have 20 candidate covariates, but only 60 observations.
```r
ncov <- 20          # number of candidate covariates
nobs <- 60          # number of observations
var_beta <- .004    # "spike" variance: effectively-zero coefficients
c <- 1000           # variance multiplier for the "slab" (nonzero) component
p_inclusion <- .5   # prior probability that a covariate has an effect
sigma_y <- 1        # residual standard deviation

# generate covariates
X <- array(dim = c(nobs, ncov))
for (i in 1:ncov){
  X[, i] <- rnorm(nobs, 0, 1)
}

# draw each coefficient from the spike or the slab
included <- rbinom(ncov, 1, p_inclusion)
coefs <- rnorm(n = ncov, mean = 0,
               sd = ifelse(included == 1,
                           sqrt(var_beta * c),
                           sqrt(var_beta)))
coefs <- sort(coefs)  # sorted for display only; 'included' is not used again

# simulate the response
Y <- rnorm(nobs, mean = X %*% coefs, sd = sigma_y)
```
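As a quick sanity check (an illustrative addition, not part of the original analysis), we can count how many coefficients escaped the spike; this should roughly match the number of covariates flagged by `included`:

```r
# coefficients more than 3 spike-sds from zero are almost surely slab draws
sum(abs(coefs) > 3 * sqrt(var_beta))
round(coefs, 2)
```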
Specifying the model:
```r
cat("model{
  # typical regression priors
  alpha ~ dnorm(0, .001)
  sd_y ~ dunif(0, 100)
  tau_y <- pow(sd_y, -2)

  # ssvs priors
  sd_bet ~ dunif(0, 100)
  tau_in <- pow(sd_bet, -2)
  tau[1] <- tau_in         # coef effectively zero
  tau[2] <- tau_in / 1000  # nonzero coef
  p_ind[1] <- 1/2
  p_ind[2] <- 1 - p_ind[1]

  for (j in 1:ncov){
    indA[j] ~ dcat(p_ind[])  # returns 1 or 2
    gamma[j] <- indA[j] - 1  # returns 0 or 1
    beta[j] ~ dnorm(0, tau[indA[j]])
  }

  # likelihood
  for (i in 1:nobs){
    Y[i] ~ dnorm(alpha + X[i, ] %*% beta[], tau_y)
  }
}",
file = "ssvs.txt")
```
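As an aside, the `dcat` trick above is one of two common ways to write this prior in JAGS. A minimal sketch of the equivalent parameterization with a Bernoulli indicator (not the code used here, but it defines the same model):

```r
# equivalent indicator prior using dbern instead of dcat
for (j in 1:ncov){
  gamma[j] ~ dbern(0.5)                  # 1 = slab ("in"), 0 = spike ("out")
  beta[j] ~ dnorm(0, tau[gamma[j] + 1])  # +1 because JAGS indexing starts at 1
}
```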
Fitting the model and assessing performance against known values:
```r
library(gridExtra)
library(runjags)
library(ggmcmc)
library(coda)

dat <- list(Y = Y, X = X, nobs = nobs, ncov = ncov)
vars <- c("alpha", "sd_bet", "gamma", "beta", "tau_in", "sd_y")
out2 <- run.jags("ssvs.txt", vars, data = dat, n.chains = 3,
                 method = "parallel", adapt = 5000, burnin = 5000)
outdf <- ggs(as.mcmc.list(out2))

# extract E(gamma_i), the posterior inclusion probabilities
probs <- summary(out2)$statistics[(2 + ncov):(1 + 2 * ncov), 1]

# build labels so the caterpillar plot orders beta[1], ..., beta[ncov]
labels <- rep(NA, ncov)
for (i in 1:ncov){
  labels[i] <- paste("beta[", i, "]", sep = "")
}
xdf <- data.frame(Parameter = labels, value = 1:ncov)

# left panel: posterior credible intervals with true coefficients overlaid
p1 <- ggs_caterpillar(outdf, "beta", X = xdf) +
  theme_classic() +
  geom_vline(xintercept = 0, linetype = "longdash") +
  geom_point(data = data.frame(coefs, pos = 1:ncov),
             aes(x = coefs, y = pos),
             size = 5, col = "green4", alpha = .7) +
  xlab("Value") +
  ylab("Coefficient") +
  geom_hline(yintercept = 1:ncov, alpha = .05) +
  scale_y_continuous(breaks = seq(0, ncov, 1))

# right panel: true effect size vs. posterior inclusion probability
df <- data.frame(probs = probs, coefs = coefs)
p2 <- ggplot(df, aes(x = abs(coefs), y = probs)) +
  geom_point(size = 5, alpha = .7) +
  theme_classic() +
  xlab("Absolute value of true coefficient") +
  ylab("Posterior probability of non-zeroness")

grid.arrange(p1, p2, ncol = 2)
```
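One fragile spot above is the positional indexing used to pull out the $E(\gamma_i)$ values; if the set of monitored variables changes, the row range changes too. A sketch of a name-based alternative, assuming `summary(out2)` returns a matrix with parameter names as row names and a `Mean` column (as in recent runjags versions):

```r
# pull posterior inclusion probabilities by row name instead of position
summ <- summary(out2)
probs <- summ[grep("^gamma\\[", rownames(summ)), "Mean"]
```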
On the left, green points show the true coefficient values, and the black lines show the posterior credible intervals. The right plot shows the relationship between the true magnitude of each effect and the posterior probability that its coefficient is non-zero, $E(\gamma_i \mid \boldsymbol{Y})$.
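If a single subset of covariates is needed rather than posterior probabilities, one common rule of thumb is the median probability model: keep every covariate whose posterior inclusion probability exceeds 0.5. A minimal sketch using the `probs` vector computed above:

```r
# covariates included in the median probability model
selected <- which(probs > 0.5)
selected
```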