Site icon R-bloggers

Better living through zero-one inflated beta regression

[This article was first published on Ecology in silico, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Dealing with proportion data on the interval $[0, 1]$ is tricky. I realized this while trying to explain variation in vegetation cover. Unfortunately this is a true proportion, and can’t be made into a binary response. Further, true 0’s and 1’s rule out beta regression. You could arcsine square root transform the data (but shouldn’t; Warton and Hui 2011). Enter zero-and-one inflated beta regression.

The zero-and-one-inflated beta distribution facilitates modeling fractional or proportional data that contains both 0’s and 1’s (Ospina and Ferrari 2010 – highly recommended). The general idea is to model the response variable (call it $y$) as a mixture of Bernoulli and beta distributions, from which the true 0’s and 1’s, and the values between 0 and 1 are generated, respectively. The probability density function is

where $0 < \alpha, \gamma, \mu < 1$, and $\phi>0$. $f(y; \mu, \phi)$ is the probability density function for the beta distribution, parameterized in terms of its mean $\mu$ and precision $\phi$:

$\alpha$ is a mixture parameter that determines the extent to which the Bernoulli or beta component dominates the pdf. $\gamma$ determines the probability that $y=1$ if it comes from the Bernoulli component. $\mu$ and $\phi$ are the expected value and the precision for the beta component, which is usually parameterized in terms of $p$ and $q$ (Ferrari and Cribari-Neto 2004). $\mu = \frac{p}{p + q}$, and $\phi=p+q$.

Although ecologists often deal with proportion data, I haven’t found any examples of 0 & 1 inflated beta regression in the ecological literature. Closest thing I’ve found was Nishii and Tanaka (2012) who take a different approach, where values between 0 and 1 are modeled as logit-normal.

Here’s a quick demo in JAGS with simulated data. For simplicity, I’ll assume 1) there is one covariate that increases the expected value at the same rate for both the Bernoulli and beta components s.t. $\mu = \gamma$, and 2) the Bernoulli component dominates extreme values of the covariate, where the expected value is near 0 or 1.

< notextile>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
n <- 100
x <- runif(n, -3, 3)
a0 <- -3
a1 <- 0
a2 <- 1
antilogit <- function(x){
  exp(x) / (1 + exp(x))
}
alpha <- antilogit(a0 + a1 * x + a2 * x^2)

b0 <- 0
b1 <- 2
mu <- antilogit(b0 + b1 * x)
phi <- 5
p <- mu * phi
q <- phi - mu * phi

y.discrete <- rbinom(n, 1, alpha)
y <- rep(NA, n)
for (i in 1:n){
  if (y.discrete[i] == 1){
    y[i] <- rbinom(1, 1, mu[i])
  } else {
    y[i] <- rbeta(1, p[i], q[i])
  }
}

# split the data into discrete and continuous components
y.d <- ifelse(y == 1 | y == 0, y, NA)
y.discrete <- ifelse(is.na(y.d), 0, 1)
y.d <- y.d[!is.na(y.d)]
x.d <- x[y.discrete == 1]
n.discrete <- length(y.d)

which.cont <- which(y < 1 & y > 0)
y.c <- ifelse(y < 1 & y > 0, y, NA)
y.c <- y.c[!is.na(y.c)]
n.cont <- length(y.c)
x.c <- x[which.cont]

Now we can specify our model in JAGS, following the factorization of the likelihood given by Ospina and Ferrari (2010), estimate our parameters, and see how well the model performs.

< notextile>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
# write model
cat(
  "
  model{
  # priors
  a0 ~ dnorm(0, .001)
  a1 ~ dnorm(0, .001)
  a2 ~ dnorm(0, .001)
  b0 ~ dnorm(0, .001)
  b1 ~ dnorm(0, .001)
  t0 ~ dnorm(0, .01) 
  tau <- exp(t0)
  
  # likelihood for alpha
  for (i in 1:n){
    logit(alpha[i]) <- a0 + a1 * x[i] + a2 * x[i] ^ 2
    y.discrete[i] ~ dbern(alpha[i])
  }
  
  # likelihood for gamma
  for (i in 1:n.discrete){
    y.d[i] ~ dbern(mu[i])
    logit(mu[i]) <- b0 + b1 * x.d[i]
  }
  
  # likelihood for mu and tau
  for (i in 1:n.cont){
    y.c[i] ~ dbeta(p[i], q[i])
    p[i] <- mu2[i] * tau
    q[i] <- (1 - mu2[i]) * tau
    logit(mu2[i]) <- b0 + b1 * x.c[i]
  }
  }  
  ", file="beinf.txt"
)
require(rjags)
jd <- list(x=x, y.d=y.d, y.c=y.c, y.discrete = y.discrete,
           n.discrete=n.discrete, n.cont = n.cont,
           x.d = x.d, x.c=x.c, n=n)
mod <- jags.model("beinf.txt", data= jd, n.chains=3, n.adapt=1000)
out <- coda.samples(mod, c("a0", "a1", "a2", "b0", "b1", "tau"),
                    n.iter=6000, n.thin=6)
# plot(out) check for convergence

require(ggmcmc)
ggd <- ggs(out)
a0.post <- subset(ggd, Parameter == "a0")$value
a1.post <- subset(ggd, Parameter == "a1")$value
a2.post <- subset(ggd, Parameter == "a2")$value
n.stored <- length(a2.post)
P.discrete <- array(dim=c(n.stored, n))
for (i in 1:n){
  P.discrete[, i] <- antilogit(a0.post + a1.post * x[i] + a2.post * x[i] ^ 2)
}
pdd <- melt(P.discrete, varnames = c("iteration", "site"), value.name = "Pr.discrete")
pdd$x <- x[pdd$site]
b0.post <- subset(ggd, Parameter == "b0")$value
b1.post <- subset(ggd, Parameter == "b1")$value
expect <- array(dim=c(n.stored, n))
for (i in 1:n){
  expect[, i] <- antilogit(b0.post + b1.post * x[i])
}
exd <- melt(expect, varnames=c("iteration", "site"), value.name = "Expectation")
exd$x <- x[exd$site]
obsd <- data.frame(x=x, y=y,
                   component = factor(ifelse(y < 1 & y > 0, "Continuous", "Discrete")))
trued <- data.frame(x=x, mu=mu)

ggplot(pdd) +
  geom_line(aes(x=x, y=Pr.discrete, group=iteration), alpha=0.05, color="grey") +
  geom_line(aes(x=x, y=Expectation, group=iteration), data=exd, color="blue", alpha=0.05) +
  geom_point(aes(x=x, y=y, fill=component), data=obsd, size=3, color="black",
             position = position_jitter(width=0, height=.01), pch=21) +
  scale_fill_manual(values = c("red", "black")) +
  ylab("y") +
  geom_line(aes(x=x, y=mu), data=trued, color="green", linetype="dashed") +
  theme_bw()

Here the posterior probability that $y$ comes from the discrete Bernoulli component is shown in grey, and the posterior expected value for both the Bernoulli and beta components across values of the covariate are shown in blue. The dashed green line shows the true expected value that was used to generate the data. Finally, the observed data are shown as jittered points, color coded as being derived from the continuous beta component, or the discrete Bernoulli component.

To leave a comment for the author, please follow the link and comment on their blog: Ecology in silico.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.