Occupancy model fit & AUC
Occupancy models are used to understand species distributions while accounting for imperfect detection. In this post, I’ll demonstrate a method to evaluate the performance of occupancy models based on the area under a receiver operating characteristic curve (AUC), as published last year by Elise Zipkin and colleagues in Ecological Applications.
Suppose we want to fit a multi-year occupancy model for one species and evaluate the fit by how well the model predicts occupancy in the final year of the study. Start by simulating some data (for details on the structure of these simulated data, refer to this post and references therein):
# define convenience functions
logit <- function(x) {
  log(x / (1 - x))
}
antilogit <- function(x) {
  exp(x) / (1 + exp(x))
}

# initialize parameters
nsite <- 100
nyear <- 5
nrep <- 3

# yearly Pr(colonization)
p_beta <- 0.5
beta <- logit(p_beta)

# yearly Pr(persistence)
p_rho <- 0.6
rho <- logit(p_rho)

# assume there are 2 site-level covariates: covariate and treatment
# response to covariate
p_alpha1 <- 0.6
alpha1 <- logit(p_alpha1)

# response to treatment
p_alpha2 <- 0.5
alpha2 <- logit(p_alpha2)

# interaction between covariate & treatment
p_alpha3 <- 0.2
alpha3 <- logit(p_alpha3)

# site-specific covariates
covariate <- rnorm(nsite, 0, 1)
treat <- rbinom(nsite, size = 1, prob = 0.5)

# initial occupancy states
z0 <- rbinom(nsite, 1, 0.5)

# subsequent occupancy
z <- array(dim = c(nsite, nyear))
lpsi <- array(dim = c(nsite, nyear))
psi <- array(dim = c(nsite, nyear))
for (j in 1:nsite) {
  for (t in 1:nyear) {
    if (t == 1) {
      # logit psi
      lpsi[j, t] <- beta + rho * z0[j] + covariate[j] * alpha1 +
        treat[j] * alpha2 + alpha3 * covariate[j] * treat[j]
      psi[j, t] <- antilogit(lpsi[j, t])
      z[j, t] <- rbinom(1, 1, psi[j, t])  # true occupancy state
    } else {
      lpsi[j, t] <- beta + rho * z[j, t - 1] + covariate[j] * alpha1 +
        treat[j] * alpha2 + alpha3 * covariate[j] * treat[j]
      psi[j, t] <- antilogit(lpsi[j, t])
      z[j, t] <- rbinom(1, 1, psi[j, t])
    }
  }
}

# detection probability
p <- 0.6

# observations
x <- array(dim = c(nsite, nyear, nrep))
for (j in 1:nsite) {
  for (t in 1:nyear) {
    for (k in 1:nrep) {
      x[j, t, k] <- rbinom(1, 1, p * z[j, t])
    }
  }
}

# subset the data to exclude the final year, which we'll use for model validation
xsub <- x
xsub[, nyear, ] <- NA
For illustration, I included a strong interaction between treatment and the continuous site-level covariate (which could represent elevation, area, etc.). A measure of predictive performance such as AUC should therefore identify the saturated model as the best fitting. Handily, AUC can be treated as a derived parameter: its posterior is built from quantities that occupancy models already estimate. To generate a posterior for AUC, we need predicted occupancy probabilities ($\psi$) and realized occupancy states ($Z$) in the final year. The predicted occupancy probabilities come from a model fit to the data from previous years, and the realized occupancy states are taken from the posterior for $Z$ under a single-year model fit to the final year's data.
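To make the derived-parameter idea concrete: for a single posterior draw, AUC has a rank-based interpretation as the probability that a randomly chosen occupied site receives a higher predicted occupancy probability than a randomly chosen unoccupied site. Writing $n_1$ and $n_0$ for the numbers of occupied and unoccupied sites in that draw (notation mine, not from the paper),

$$ \text{AUC} = \Pr(\psi_i > \psi_j \mid Z_i = 1, Z_j = 0) = \frac{1}{n_1 n_0} \sum_{i:\, Z_i = 1} \; \sum_{j:\, Z_j = 0} \left[ I(\psi_i > \psi_j) + \tfrac{1}{2} I(\psi_i = \psi_j) \right], $$

where $I(\cdot)$ is the indicator function. Computing this quantity for every posterior draw of $\psi$ and $Z$ yields a posterior distribution for AUC.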
Fitting a saturated model
Begin by modeling occupancy probabilities as a function of both covariates and their interaction, predicting $\psi$ in the final year:
# model specification
cat("
model{
  #### priors
  p_beta ~ dbeta(1, 1)
  beta <- log(p_beta / (1 - p_beta))
  p_rho ~ dbeta(1, 1)
  rho <- log(p_rho / (1 - p_rho))
  p_rho0 ~ dbeta(1, 1)
  rho0 <- log(p_rho0 / (1 - p_rho0))
  p ~ dbeta(1, 1)
  pa1 ~ dbeta(1, 1)
  pa2 ~ dbeta(1, 1)
  pa3 ~ dbeta(1, 1)
  a1 <- log(pa1 / (1 - pa1))
  a2 <- log(pa2 / (1 - pa2))
  a3 <- log(pa3 / (1 - pa3))

  #### occupancy model
  for (j in 1:nsite) {
    z0[j] ~ dbern(p_rho0)  # initial occupancy state, with probability p_rho0
    logit(psi[j, 1]) <- beta + rho * z0[j] + a1 * covariate[j] +
      a2 * treat[j] + a3 * covariate[j] * treat[j]
    z[j, 1] ~ dbern(psi[j, 1])
    for (t in 2:nyear) {
      logit(psi[j, t]) <- beta + rho * z[j, t-1] + a1 * covariate[j] +
        a2 * treat[j] + a3 * covariate[j] * treat[j]
      z[j, t] ~ dbern(psi[j, t])
    }
  }

  #### observation model
  for (j in 1:nsite){
    for (t in 1:nyear){
      mu[j, t] <- z[j, t] * p
      for (k in 1:nrep){
        x[j, t, k] ~ dbern(mu[j, t])
      }
    }
  }
}
", fill = TRUE, file = "saturated.txt")

# bundle data
data <- list(x = xsub, nrep = nrep, nsite = nsite, nyear = nyear,
             covariate = covariate, treat = treat)

# initial values
zinit <- array(dim = c(nsite, nyear))
for (j in 1:nsite) {
  for (t in 1:nyear) {
    zinit[j, t] <- max(xsub[j, t, ])
  }
}
inits <- function() {
  list(z = zinit)
}

# parameters to monitor
params <- c("p", "beta", "rho", "psi", "pa1", "pa2", "pa3")

# use JAGS
require(rjags)

# build model
nchain <- 3
ocmod <- jags.model(file = "saturated.txt", inits = inits, data = data,
                    n.chains = nchain)

# specify MCMC settings and start sampling
nburn <- 2000
niter <- 2000
nthin <- 2
update(ocmod, n.iter = nburn)
out <- coda.samples(ocmod, n.iter = niter, thin = nthin,
                    variable.names = params)
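Before using these draws, it's worth confirming that the chains have mixed. The original workflow doesn't show this step, but a quick look at the scalar parameters with coda's Gelman-Rubin diagnostic might look like this sketch:

# optional convergence check (not part of the original workflow):
# Gelman-Rubin diagnostics for the monitored scalar parameters
require(coda)
gelman.diag(out[, c("p", "beta", "rho", "pa1", "pa2", "pa3")],
            multivariate = FALSE)

Point estimates near 1 suggest the chains agree; if not, increase nburn and niter before moving on.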
Now that we have our posteriors for $\psi$ at each site in the final year, we can fit a single-year model to the final year’s data to estimate $Z$.
xlast <- x[, nyear, ]

# model specification
cat("
model{
  #### priors
  p_beta ~ dbeta(1, 1)
  beta <- log(p_beta / (1 - p_beta))
  p ~ dbeta(1, 1)
  pa1 ~ dbeta(1, 1)
  pa2 ~ dbeta(1, 1)
  pa3 ~ dbeta(1, 1)
  a1 <- log(pa1 / (1 - pa1))
  a2 <- log(pa2 / (1 - pa2))
  a3 <- log(pa3 / (1 - pa3))

  #### occupancy model
  for (j in 1:nsite) {
    logit(psi[j]) <- beta + a1 * covariate[j] + a2 * treat[j] +
      a3 * covariate[j] * treat[j]
    z[j] ~ dbern(psi[j])
  }

  #### observation model
  for (j in 1:nsite){
    mu[j] <- z[j] * p
    for (k in 1:nrep){
      x[j, k] ~ dbern(mu[j])
    }
  }
}
", fill = TRUE, file = "saturated2.txt")

# bundle data
data2 <- list(x = xlast, nrep = nrep, nsite = nsite,
              covariate = covariate, treat = treat)

# initial values
zinit <- array(dim = c(nsite))
for (j in 1:nsite) {
  zinit[j] <- max(xlast[j, ])
}
inits <- function() {
  list(z = zinit)
}

# parameters to monitor
params2 <- c("p", "beta", "pa1", "pa2", "pa3", "z")

# build model
ocmod2 <- jags.model(file = "saturated2.txt", inits = inits, data = data2,
                     n.chains = nchain)

# start sampling
update(ocmod2, n.iter = nburn)
out2 <- coda.samples(ocmod2, n.iter = niter, thin = nthin,
                     variable.names = params2)
To set up the data for AUC calculations, produce site by iteration arrays for $\psi$ and $Z$:
require(ggmcmc)
ggd <- ggs(out)
ggd2 <- ggs(out2)
nstore <- niter / nthin
psi.pred <- array(dim = c(nsite, nstore * nchain))
Z.est <- array(dim = c(nsite, nstore * nchain))
for (j in 1:nsite){
  indx <- which(ggd$Parameter == paste("psi[", j, ",", nyear, "]", sep = ""))
  psi.pred[j, ] <- ggd$value[indx]
  indx <- which(ggd2$Parameter == paste("z[", j, "]", sep = ""))
  Z.est[j, ] <- ggd2$value[indx]
}
Now generate the posterior for AUC and store data on the true and false positive rates to produce ROC curves.
require(ROCR)
AUC1 <- rep(NA, nstore * nchain)
fpr <- array(dim = c(nstore * nchain, nsite + 1))
tpr <- array(dim = c(nstore * nchain, nsite + 1))
for (i in 1:(nstore * nchain)){
  psi.vals <- psi.pred[, i]
  Z.vals <- Z.est[, i]
  pred <- prediction(psi.vals, factor(Z.vals, levels = c("0", "1")))
  perf <- performance(pred, "auc")
  AUC1[i] <- perf@y.values[[1]]
  perf <- performance(pred, "tpr", "fpr")
  fpr[i, ] <- perf@x.values[[1]]
  tpr[i, ] <- perf@y.values[[1]]
}

require(reshape2)
fprm <- melt(fpr, varnames = c("iter", "site"))
tprm <- melt(tpr, varnames = c("iter", "site"))
ROC1 <- data.frame(fpr = fprm$value, tpr = tprm$value, iter = fprm$iter)
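As a sanity check (my addition, not part of the original workflow), the ROCR value for any single draw can be compared against the rank-based definition of AUC given earlier:

# recompute AUC for the first posterior draw directly from the rank-based
# definition (ties count 1/2); this should match AUC1[1]
i <- 1
psi_occ <- psi.pred[Z.est[, i] == 1, i]  # predicted psi at sites estimated occupied
psi_un  <- psi.pred[Z.est[, i] == 0, i]  # predicted psi at sites estimated unoccupied
mean(outer(psi_occ, psi_un, ">") + 0.5 * outer(psi_occ, psi_un, "=="))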
Fitting a simpler model
Having fitted a saturated model, we can now fit a simpler model that includes only main effects:
cat("
model{
  #### priors
  p_beta ~ dbeta(1, 1)
  beta <- log(p_beta / (1 - p_beta))
  p_rho ~ dbeta(1, 1)
  rho <- log(p_rho / (1 - p_rho))
  p_rho0 ~ dbeta(1, 1)
  rho0 <- log(p_rho0 / (1 - p_rho0))
  p ~ dbeta(1, 1)
  pa1 ~ dbeta(1, 1)
  pa2 ~ dbeta(1, 1)
  a1 <- log(pa1 / (1 - pa1))
  a2 <- log(pa2 / (1 - pa2))

  #### occupancy model
  for (j in 1:nsite) {
    z0[j] ~ dbern(p_rho0)  # initial occupancy state, with probability p_rho0
    logit(psi[j, 1]) <- beta + rho * z0[j] + a1 * covariate[j] + a2 * treat[j]
    z[j, 1] ~ dbern(psi[j, 1])
    for (t in 2:nyear) {
      logit(psi[j, t]) <- beta + rho * z[j, t-1] + a1 * covariate[j] + a2 * treat[j]
      z[j, t] ~ dbern(psi[j, t])
    }
  }

  #### observation model
  for (j in 1:nsite){
    for (t in 1:nyear){
      mu[j, t] <- z[j, t] * p
      for (k in 1:nrep){
        x[j, t, k] ~ dbern(mu[j, t])
      }
    }
  }
}
", fill = TRUE, file = "mainef.txt")

# bundle data
data <- list(x = xsub, nrep = nrep, nsite = nsite, nyear = nyear,
             covariate = covariate, treat = treat)

# initial values
zinit <- array(dim = c(nsite, nyear))
for (j in 1:nsite) {
  for (t in 1:nyear) {
    zinit[j, t] <- max(xsub[j, t, ])
  }
}
inits <- function() {
  list(z = zinit)
}

# parameters to monitor
params <- c("p", "beta", "rho", "psi", "pa1", "pa2")

# build model
ocmod <- jags.model(file = "mainef.txt", inits = inits, data = data,
                    n.chains = nchain)

# specify MCMC settings and start sampling
update(ocmod, n.iter = nburn)
out <- coda.samples(ocmod, n.iter = niter, thin = nthin,
                    variable.names = params)

# fit a single-year model for the final year to estimate Z
# model specification
cat("
model{
  #### priors
  p_beta ~ dbeta(1, 1)
  beta <- log(p_beta / (1 - p_beta))
  p ~ dbeta(1, 1)
  pa1 ~ dbeta(1, 1)
  pa2 ~ dbeta(1, 1)
  a1 <- log(pa1 / (1 - pa1))
  a2 <- log(pa2 / (1 - pa2))

  #### occupancy model
  for (j in 1:nsite) {
    logit(psi[j]) <- beta + a1 * covariate[j] + a2 * treat[j]
    z[j] ~ dbern(psi[j])
  }

  #### observation model
  for (j in 1:nsite){
    mu[j] <- z[j] * p
    for (k in 1:nrep){
      x[j, k] ~ dbern(mu[j])
    }
  }
}
", fill = TRUE, file = "mainef2.txt")

# bundle data
data2 <- list(x = xlast, nrep = nrep, nsite = nsite,
              covariate = covariate, treat = treat)

# initial values
zinit <- array(dim = c(nsite))
for (j in 1:nsite) {
  zinit[j] <- max(xlast[j, ])
}
inits <- function() {
  list(z = zinit)
}

# parameters to monitor
params2 <- c("p", "beta", "pa1", "pa2", "z")

# build model
ocmod2 <- jags.model(file = "mainef2.txt", inits = inits, data = data2,
                     n.chains = nchain)

# start sampling
update(ocmod2, n.iter = nburn)
out2 <- coda.samples(ocmod2, n.iter = niter, thin = nthin,
                     variable.names = params2)

# generate the AUC posterior for the main effects model
ggd <- ggs(out)
ggd2 <- ggs(out2)
nstore <- niter / nthin
psi.pred <- array(dim = c(nsite, nstore * nchain))
Z.est <- array(dim = c(nsite, nstore * nchain))
for (j in 1:nsite){
  indx <- which(ggd$Parameter == paste("psi[", j, ",", nyear, "]", sep = ""))
  psi.pred[j, ] <- ggd$value[indx]
  indx <- which(ggd2$Parameter == paste("z[", j, "]", sep = ""))
  Z.est[j, ] <- ggd2$value[indx]
}

AUC2 <- rep(NA, nstore * nchain)
fpr <- array(dim = c(nstore * nchain, nsite + 1))
tpr <- array(dim = c(nstore * nchain, nsite + 1))
for (i in 1:(nstore * nchain)){
  psi.vals <- psi.pred[, i]
  Z.vals <- Z.est[, i]
  pred <- prediction(psi.vals, factor(Z.vals, levels = c("0", "1")))
  perf <- performance(pred, "auc")
  AUC2[i] <- perf@y.values[[1]]
  perf <- performance(pred, "tpr", "fpr")
  fpr[i, ] <- perf@x.values[[1]]
  tpr[i, ] <- perf@y.values[[1]]
}
fprm <- melt(fpr, varnames = c("iter", "site"))
tprm <- melt(tpr, varnames = c("iter", "site"))
ROC2 <- data.frame(fpr = fprm$value, tpr = tprm$value, iter = fprm$iter)
Comparing models
How well did our models predict occupancy in the final year of the study, and was one better than the other? We can answer these questions by inspecting the posteriors for AUC (larger values are better) and the ROC curves.
require(ggplot2)  # attached by ggmcmc, but loaded explicitly here
p1 <- ggplot(ROC1, aes(x = fpr, y = tpr, group = iter)) +
  geom_line(alpha = 0.05, color = "blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  ylab("True positive rate") +
  xlab("False positive rate") +
  theme_bw()

p2 <- ggplot(ROC2, aes(x = fpr, y = tpr, group = iter)) +
  geom_line(alpha = 0.05, color = "red") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  ylab("True positive rate") +
  xlab("False positive rate") +
  theme_bw()

p3 <- ggplot(data.frame(AUC = c(AUC1, AUC2),
                        Model = rep(c("Saturated model", "Main effects model"),
                                    each = length(AUC1))),
             aes(x = AUC, fill = Model, color = Model)) +
  geom_density(alpha = .8) +
  theme_bw() +
  ylab("Posterior density") +
  xlab("Area under receiver operating characteristic curve (AUC)") +
  scale_colour_manual(values = c("red", "blue")) +
  scale_fill_manual(values = c("red", "blue")) +
  theme(legend.position = "top") +
  theme(legend.title = element_blank()) +
  ggtitle("Posterior AUC and ROC curve comparison")

require(gridExtra)
grid.arrange(p3, arrangeGrob(p2, p1, ncol = 2), ncol = 1)
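The plots tell most of the story, but a numerical summary can help too. This sketch (my addition) reports posterior means, 95% credible intervals, and a Monte Carlo estimate of the probability that the saturated model attains the higher AUC:

# summarize the two AUC posteriors
summ <- function(a) c(mean = mean(a), quantile(a, c(0.025, 0.975)))
rbind("Saturated model"    = summ(AUC1),
      "Main effects model" = summ(AUC2))

# proportion of draws (paired across the two independent posteriors)
# in which the saturated model's AUC exceeds the main effects model's
mean(AUC1 > AUC2)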
As expected, the saturated model, which matches the data-generating process, predicts final-year occupancy better than the main effects model that excludes the strong interaction term. Note that AUC measures the accuracy of model predictions and does not penalize model complexity.