Site icon R-bloggers

Stochastic search variable selection in JAGS

[This article was first published on Ecology in silico, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Stochastic search variable selection (SSVS) identifies promising subsets of multiple regression covariates via Gibbs sampling (George and McCulloch 1993). Here’s a short SSVS demo with JAGS and R.

Assume we have a multiple regression problem:

We suspect only a subset of the elements of $\boldsymbol{\beta}$ are non-zero, i.e. some of the covariates have no effect.

Assume $\boldsymbol{\beta}$ arises from one of two normal mixture components, depending on a latent variable $\gamma_i$:

$\tau_i$ is positive but small s.t. $\beta_i$ is close to zero when $\gamma_i = 0$. $c_i$ is large enough to allow reasonable deviations from zero when $\gamma_i = 1$. The prior probability that covariate $i$ has a nonzero effect is $Pr(\gamma_i = 1) = p_i$. Important subtleties about priors are covered in George and McCulloch (1993) and elsewhere.

Let’s simulate a dataset in which some covariates have strong effects on the linear predictor, and other don’t: suppose we have 20 candidate variables, but only 60 observations.

< notextile>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
ncov <- 20
nobs <- 60
var_beta <- .004
c <- 1000
p_inclusion <- .5
sigma_y <- 1

# generate covariates
X <- array(dim=c(nobs, ncov))
for (i in 1:ncov){
  X[, i] <- rnorm(nobs, 0, 1)
}

included <- rbinom(ncov, 1, p_inclusion)
coefs <- rnorm(n=ncov,
               mean=0,
               sd=ifelse(included==1,
                         sqrt(var_beta * c),
                         sqrt(var_beta)
                         )
               )
coefs <- sort(coefs)
Y <- rnorm(nobs, mean=X %*% coefs, sd=sigma_y)

Specifying the model:

< notextile>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
cat("model{
  # typical regression priors
  alpha ~ dnorm(0, .001)
  sd_y ~ dunif(0, 100)
  tau_y <- pow(sd_y, -2)

  # ssvs priors
  sd_bet ~ dunif(0, 100)
  tau_in <- pow(sd_bet, -2)
  tau[1] <- tau_in            # coef effectively zero
  tau[2] <- tau_in / 1000     # nonzero coef
  p_ind[1] <- 1/2
  p_ind[2] <- 1 - p_ind[1]

  for (j in 1:ncov){
    indA[j] ~ dcat(p_ind[]) # returns 1 or 2
    gamma[j] <- indA[j] - 1   # returns 0 or 1
    beta[j] ~ dnorm(0, tau[indA[j]])
  }

  # likelihood
  for (i in 1:nobs){
    Y[i] ~ dnorm(alpha + X[i ,] %*% beta[], tau_y)
  }
}
    "
    , file="ssvs.txt")

Fitting the model and assessing performance against known values:

< notextile>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
require(gridExtra)
require(runjags)
require(ggmcmc)
library(coda)
dat <- list(Y=Y, X=X, nobs=nobs, ncov=ncov)
vars <- c("alpha", "sd_bet", "gamma", "beta", "tau_in", "sd_y")
out2 <- run.jags("ssvs.txt", vars, data=dat, n.chains=3,
                 method="parallel", adapt=5000, burnin=5000)
outdf <- ggs(as.mcmc.list(out2))

# extract E(gamma_i) for posterior inclusion probabilities
probs <- summary(out2)$statistics[((2 + ncov):(1+2*ncov)), 1]

labels <- rep(NA, ncov)
for (i in 1:ncov){
  labels[i] <- paste("beta[", i, "]", sep="")
}
xdf <- data.frame(Parameter = labels, value = 1:ncov)
p1 <- ggs_caterpillar(outdf, "beta", X=xdf) +
  theme_classic() +
  geom_vline(xintercept = 0, linetype = "longdash") +
  geom_point(data=data.frame(coefs, pos = 1:ncov),
              aes(x=coefs, y=pos), size=5, col="green4", alpha=.7) +
  xlab("Value") +
  ylab("Coefficient") +
  geom_hline(yintercept = 1:ncov, alpha=.05) +
  scale_y_continuous(breaks=seq(0, ncov, 1))

df <- data.frame(probs=probs, coefs = coefs)
p2 <- ggplot(df, aes(x=abs(coefs), y=probs)) +
  geom_point(size=5, alpha=.7) +
  theme_classic() +
  xlab("Absolute value of true coefficient") +
  ylab("Posterior probability of non-zeroness")

grid.arrange(p1, p2, ncol=2)

On the left, green points indicate true coefficient values, with black posterior Bayesian credible intervals. The right plot shows the relationship between the true magnitude of the effect and the posterior probability that the coefficient was non-zero, $E(\gamma_i \mid \boldsymbol{Y})$.

To leave a comment for the author, please follow the link and comment on their blog: Ecology in silico.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.