Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Presence probability, typically obtained with presence-(pseudo)absence modelling methods like GLM, GAM, GBM or Random Forest, is conditional not only on the suitability of the environmental conditions, but also on the general prevalence (proportion of presences) of the species in the study area. So, a species with few presences will generally have low presence probabilities, even in suitable conditions, simply because its presence is indeed rare.
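To make the link between prevalence and predicted probability concrete, here is a minimal sketch with simulated data (the variable `x`, the two made-up species and all coefficients are assumptions for illustration, not part of the analysis in this post). Both species respond to the environment in the same way and differ only in how common they are.

```r
# Simulated presence/absence data (hypothetical, for illustration only)
set.seed(1)
n <- 300
x <- rnorm(n)                                  # one made-up environmental variable
rare   <- rbinom(n, 1, plogis(-2 + 1.5 * x))   # low-prevalence species
common <- rbinom(n, 1, plogis( 0 + 1.5 * x))   # same response to x, higher prevalence

mod_rare   <- glm(rare   ~ x, family = binomial)
mod_common <- glm(common ~ x, family = binomial)

# In a logistic GLM with an intercept, the mean predicted probability
# equals the observed prevalence of the modelled data
c(prevalence = mean(rare),   mean_prob = mean(fitted(mod_rare)))
c(prevalence = mean(common), mean_prob = mean(fitted(mod_common)))
```

So the rarer species gets lower probabilities on average, whatever the suitability of the conditions.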
As species distribution modellers often want to remove the effect of unbalanced prevalence from model predictions, a common procedure is to pick only as many (pseudo)absences as there are presences, for a modelled prevalence of 50%, though this may imply a significant loss of data. Alternatively, most modelling functions (e.g. glm() of base R, but also functions that implement GAM, GBM, Random Forests, etc. in a variety of R packages) allow attributing different weights to presences and absences, although they typically do not do this by default. However, some modelling packages that use these functions, like ENMTools and biomod2, alter these defaults and do apply different weights to presences and absences, to balance their contributions and thus produce prevalence-independent suitability predictions. Beware: as per these packages’ help files (which users should always read!), ENMTools does this by default, while biomod2 does it by default when the pseudoabsences or background points are automatically generated by the package, but not when they are provided by the user.
A less compromising alternative may be the favourability function (Real et al., 2006; Acevedo & Real, 2012), which removes the effect of species prevalence from predictions of actual presence probability, without the need to restrict the number of (pseudo)absences to the same as the number of presences (i.e. without losing data), and without the need to alter the actual contributions of the data by attributing them different weights. Below is a simple and reproducible comparison between favourability, raw probability, and probability based on a model with down-weighted absences so that they balance the number of presences. I’ve used GLM, but this applies to other presence-(pseudo)absence models as well.
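For reference, the favourability function of Real et al. (2006) can be written as follows. The notation is mine: P is the predicted presence probability, and n_1 and n_0 are the numbers of presences and absences in the modelled data.

```latex
F = \frac{\dfrac{P}{1-P}}{\dfrac{n_1}{n_0} + \dfrac{P}{1-P}}
\qquad\Longleftrightarrow\qquad
\operatorname{logit}(F) = \operatorname{logit}(P) - \ln\frac{n_1}{n_0}
```

In other words, favourability subtracts the log-odds of the sample prevalence from the log-odds of presence, so F = 0.5 exactly where P equals the prevalence n_1 / (n_1 + n_0).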
library(fuzzySim)

data("rotif.env")
names(rotif.env)
spp_cols <- names(rotif.env)[18:47]
var_cols <- names(rotif.env)[5:17]

nrow(rotif.env)  # 291 sites
sort(sapply(rotif.env[ , spp_cols], sum))  # from 99 to 172 presences
sort(sapply(rotif.env[ , spp_cols], fuzzySim::prevalence))  # from 34 to 59%

species <- spp_cols[8]  # 8 for example; try with others too
species

npres <- sum(rotif.env[ , species])
nabs <- nrow(rotif.env) - npres
npres
nabs
prevalence(rotif.env[ , species])

# set weights as in weights="equal" of ENMTools::enmtools.glm():
weights <- rep(NA, nrow(rotif.env))
weights[rotif.env[ , species] == 1] <- 1
weights[rotif.env[ , species] == 0] <- npres / nabs
weights
sum(weights[rotif.env[ , species] == 1])
sum(weights[rotif.env[ , species] == 0])  # same

formula <- reformulate(termlabels = var_cols, response = species)
formula

mod <- glm(formula, data = rotif.env, family = binomial)
modw <- glm(formula, data = rotif.env, family = binomial, weights = weights)

pred <- predict(mod, rotif.env, type = "response")
predw <- predict(modw, rotif.env, type = "response")
fav <- Fav(mod)  # note favourability is only applicable to unweighted predictions

par(mfrow = c(1, 3))
plot(pred, predw, pch = 20, cex = 0.2)  # curve
plot(pred, fav, pch = 20, cex = 0.2)  # cleaner curve
plot(fav, predw, pch = 20, cex = 0.2)  # ~linear but with noise

par(mfrow = c(1, 1))
plot(pred[order(pred)], pch = 20, cex = 0.5, col = "grey30", ylab = "Prediction")
# higher, as expected after down-weighting the unbalancedly numerous absences:
points(predw[order(pred)], pch = 20, cex = 0.5, col = "blue")
# higher like predw, but with less noise (more like the original pred):
points(fav[order(pred)], pch = 20, cex = 0.5, col = "salmon")
legend("topleft", legend = c("probability", "weighted probability", "favourability"), pch = 20, col = c("grey30", "blue", "salmon"), bty = "n")
As you can see, favourability and weighted probability (which serve the same purpose of removing the effect of unbalanced sample prevalence on model predictions) are highly similar. However, favourability does not alter the original data in any way (i.e., it lets the model weigh presences and absences proportionally to the numbers in which they actually occur in the data); and it provides less noisy results that are more aligned with the original (unweighted, non-manipulated) probability.
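To see why favourability stays so closely aligned with the unweighted probability, it may help to compute it by hand. The sketch below uses simulated data and a hypothetical helper, `fav_from_prob()`, which is not part of fuzzySim; it simply applies the formula of Real et al. (2006) to a vector of probabilities. I would expect it to agree with the output of Fav() for the same model, but I have not checked that here.

```r
# Hypothetical helper (not part of fuzzySim): favourability from probability,
# given the numbers of presences (n1) and absences (n0) in the modelled data
fav_from_prob <- function(p, n1, n0) {
  odds <- p / (1 - p)
  odds / (n1 / n0 + odds)
}

# Simulated data, for illustration only
set.seed(42)
n <- 500
x <- rnorm(n)                          # made-up environmental variable
y <- rbinom(n, 1, plogis(-1.5 + x))    # made-up presence/absence records
n1 <- sum(y == 1)
n0 <- sum(y == 0)

mod <- glm(y ~ x, family = binomial)   # unweighted model
p <- fitted(mod)
f <- fav_from_prob(p, n1, n0)

# Equivalent form: a constant shift on the logit scale
f_logit <- plogis(qlogis(p) - qlogis(n1 / (n1 + n0)))
all.equal(f, f_logit)

# A probability equal to the sample prevalence maps to favourability 0.5
fav_from_prob(n1 / (n1 + n0), n1, n0)
```

Because this is a strictly increasing transformation of the unweighted probability, favourability keeps the same ranking of sites as the original model, whereas a weighted model is refitted and can rank sites differently. With the rotifer example, the equivalent call would be `fav_from_prob(pred, npres, nabs)`.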
I’ve checked this for some species already, but further tests and feedback are welcome!