Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Ecologists commonly collect data representing counts of organisms. Generalized linear models (GLMs) provide a powerful tool for analyzing count data. [As mentioned previously, you should generally not transform your data to fit a linear model and, particularly, do not log-transform count data.] The starting point for count data is a GLM with Poisson-distributed errors, but not all count data meet the assumptions of the Poisson distribution. Thus, we need to test if the variance>mean or if the number of zeros is greater than expected. Below, we will walk through the basic steps to determine which GLM to use to analyze your data.
First, we will define a few of the variables that are used repeatedly throughout the subsequent code. [Note: click here to get all code from this post in a single file.] We are using an unrealistic sample size for most ecological studies because we do not want to be misled by a strange draw from any of the distributions.
n = 100 # No. of samples per treatment mean.A = 10 # Mean count for treatment A mean.B = 5 # Mean count for treatment B nd = data.frame(Trt = c("A","B")) # Used in predict( ) function
Poisson data
We generate random variates from a Poisson distribution with the rpois( ) function. Because mean=variance in a Poisson distribution, we only need to specify the number of random variates and the expected mean of the distribution.
data.pois = data.frame(Trt = c(rep("A", n), rep("B", n)), Response = c(rpois(n, mean.A), rpois(n, mean.B)) )
Poisson model
Now, we run the GLM and set the error distribution to Poisson,
model.pois = glm(Response ~ Trt, data = data.pois, family = poisson) summary(model.pois)
which produces the following output.
Call: glm(formula = Response ~ Trt, family = poisson, data = data.pois) Deviance Residuals: Min 1Q Median 3Q Max -3.1000 -0.6647 0.1045 0.6032 2.2611 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.30558 0.03158 73.02 < 2e-16 *** TrtB -0.74323 0.05562 -13.36 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 374.94 on 199 degrees of freedom Residual deviance: 183.85 on 198 degrees of freedom AIC: 934.08 Number of Fisher Scoring iterations: 4
We test for goodness-of-fit of the model with a chi-square test based on the residual deviance and degrees of freedom.
1 - pchisq(summary(model.pois)$deviance, summary(model.pois)$df.residual )
[1] 0.7565471
The GOF test indicates that the Poisson model fits the data (p > 0.05). If this were your actual data, you could breathe a sigh of relief because you could stop here. Well, not quite here. You will still want to use the model to predict mean counts for each treatment and standard errors for each parameter.
cbind(nd, Mean = predict(model.pois, newdata=nd, type="response"), SE = predict(model.pois, newdata=nd, type="response", se.fit=T)$se.fit )
Trt Mean SE 1 A 10.03 0.3167015 2 B 4.77 0.2184031
Because we used a large sample size, the predicted means are similar to the expected means of 10 and 5.
Negative binomial data
Next we will use the 'MASS' package to generate random deviates from a negative binomial distribution, which involves a parameter, theta, that controls the variance of the distribution.
library(MASS) data.nb = data.frame(Trt = c(rep("A", n), rep("B", n)), Response=c(rnegbin(n, mean.A, 5), rnegbin(n, mean.B, 5)) )
Poisson model
We first test if a Poisson model fits this data. Remember that we are trying to simulate the steps you would take if you did not know the properties of the distribution (beyond knowing that you have integers bound at zero and infinity).
model.pois.2 = glm(Response ~ Trt, data = data.nb, family = poisson) summary(model.pois.2)
Call: glm(formula = Response ~ Trt, family = poisson, data = data.nb) Deviance Residuals: Min 1Q Median 3Q Max -3.7258 -1.4732 -0.0846 1.1024 4.5505 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.32923 0.03120 74.64 < 2e-16 *** TrtB -0.70589 0.05428 -13.01 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 740.63 on 199 degrees of freedom Residual deviance: 560.82 on 198 degrees of freedom AIC: 1283.8 Number of Fisher Scoring iterations: 5
1 - pchisq(summary(model.pois.2)$deviance, summary(model.pois.2)$df.residual )
[1] 0
As expected, the Poisson model does not fit the data (p < 0.05), but we will still take a look at the predicted values.
cbind(nd, Mean = predict(model.pois.2, newdata = nd, type = "response"), SE = predict(model.pois.2, newdata = nd, type = "response", se.fit = T)$se.fit )
Trt Mean SE 1 A 10.27 0.3204684 2 B 5.07 0.2251666
Make a note of the SEs in this output because we will come back to this after we run a GLM based on a negative binomial error distribution.
Negative binomial model
model.nb = glm.nb(Response ~ Trt, data = data.nb) summary(model.nb)
Call: glm.nb(formula = Response ~ Trt, data = data.nb, init.theta = 4.114111123, link = log) Deviance Residuals: Min 1Q Median 3Q Max -2.5705 -0.8806 -0.0454 0.5676 2.2889 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.32923 0.05835 39.920 < 2e-16 *** TrtB -0.70589 0.08836 -7.989 1.36e-15 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Negative Binomial(4.1141) family taken to be 1) Null deviance: 280.56 on 199 degrees of freedom Residual deviance: 216.42 on 198 degrees of freedom AIC: 1136.8 Number of Fisher Scoring iterations: 1 Theta: 4.114 Std. Err.: 0.668 2 x log-likelihood: -1130.825
The model estimates the dispersion parameter as 4.114; we set theta = 5 when generating random variates.
1 - pchisq(summary(model.nb)$deviance, summary(model.nb)$df.residual )
[1] 0.1756968
The GOF test indicates that the negative binomial model fits the data.
cbind(nd, Mean = predict(model.nb, newdata = nd, type = "response"), SE = predict(model.nb, newdata = nd, type="response", se.fit = T)$se.fit )
Trt Mean SE 1 A 10.27 0.5992233 2 B 5.07 0.3364221
Here you see the 'danger' of ignoring overdispersion in the Poisson model. The SE estimates are lower for the Poisson model than for the negative binomial model, which increases the likelihood of incorrectly detecting a significant treatment effect in the Poisson model.
Zero-inflated Poisson data
Lastly, we will add more more layer of complication to the story. If you have lots of zeros in your data, and have determined that Poisson and negative binomial models do not fit your data well, then you should turn to zero-inflated models with either Poisson or negative binomial error distributions. We need the 'VGAM' package to generate random variates from a zero-inflated Poisson distribution using the rzipois( ) function. The 3rd argument to the rzipois( ) function specifies the probability of drawing a zero beyond the expected number of zeros for a Poisson distribution with the specified mean. Here were are introducing a relatively small proportion of extra zeros and the same proportion for each treatment.
library(VGAM) data.zip = data.frame(Trt = c(rep("A", n), rep("B", n)), Response = c(rzipois(n, mean.A, 0.2), rzipois(n, mean.B, 0.2)) )
Poisson model
We first fit the Poisson model.
model.pois.3 = glm(Response ~ Trt, data = data.zip, family = poisson) summary(model.pois.3)
Call: glm(formula = Response ~ Trt, family = poisson, data = data.zip) Deviance Residuals: Min 1Q Median 3Q Max -3.9166 -1.8131 0.1183 1.3324 2.8984 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.03732 0.03611 56.42 < 2e-16 *** TrtB -0.64107 0.06147 -10.43 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 869.33 on 199 degrees of freedom Residual deviance: 754.93 on 198 degrees of freedom AIC: 1337.3 Number of Fisher Scoring iterations: 5
1 - pchisq(summary(model.pois.3)$deviance, summary(model.pois.3)$df.residual )
[1] 0
The Poisson model does not fit the data.
cbind(nd, Mean = predict(model.pois.3, newdata = nd, type = "response"), SE = predict(model.pois.3, newdata = nd, type = "response", se.fit = T)$se.fit )
Trt Mean SE 1 A 7.67 0.2769476 2 B 4.04 0.2009973
The Poisson model also does not predict the correct mean counts.
Negative binomial model
Next, we fit a negative binomial model.
model.nb.2 = glm.nb(Response ~ Trt, data = data.zip) summary(model.nb.2)
Call: glm.nb(formula = Response ~ Trt, data = data.zip, init.theta = 1.47231218, link = log) Deviance Residuals: Min 1Q Median 3Q Max -2.3189 -0.9259 0.0472 0.5432 1.2916 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 2.03732 0.08998 22.643 < 2e-16 *** TrtB -0.64107 0.13177 -4.865 1.14e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Negative Binomial(1.4723) family taken to be 1) Null deviance: 280.22 on 199 degrees of freedom Residual deviance: 256.67 on 198 degrees of freedom AIC: 1122 Number of Fisher Scoring iterations: 1 Theta: 1.472 Std. Err.: 0.234 2 x log-likelihood: -1116.006
1 - pchisq(summary(model.nb.2)$deviance, summary(model.nb.2)$df.residual )
[1] 0.003149441
The negative binomial model does not fit the data.
cbind(nd, Mean = predict(model.nb.2, newdata = nd, type = "response"), SE = predict(model.nb.2, newdata = nd, type = "response", se.fit = T)$se.fit )
Trt Mean SE 1 A 7.67 0.6901218 2 B 4.04 0.3889176
And does not predict the correct means.
Zero-inflated Poisson models
We load the 'pscl' package to fit the zero-inflated model. First, we fit a model where we assume that the probability of zero is the same for both treatments (with '~ Trt|1').
library(pscl) model.zip = zeroinfl(Response ~ Trt|1, data = data.zip) summary(model.zip)
Call: zeroinfl(formula = Response ~ Trt | 1, data = data.zip) Pearson residuals: Min 1Q Median 3Q Max -1.5073 -0.9286 0.1330 0.8941 2.1819 Count model coefficients (poisson with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) 2.26040 0.03612 62.584 < 2e-16 *** TrtB -0.55430 0.06203 -8.937 < 2e-16 *** Zero-inflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) -1.1903 0.1681 -7.082 1.42e-12 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Number of iterations in BFGS optimization: 13 Log-likelihood: -477 on 3 Df
As expected, the model output indicates that there are significantly more zeros than expected for a Poisson distribution.
cbind(nd, Count = predict(model.zip, newdata = nd, type = "count"), Zero = predict(model.zip, newdata = nd, type = "zero") )
Trt Count Zero 1 A 9.586959 0.2332005 2 B 5.507434 0.2332005
The zero-inflated model predicts the correct mean counts and probability of zero.
If we fit a zero-inflated model to test a treatment effect for both the counts and the zeros (with '~ Trt|Trt'),
model.zip.2 = zeroinfl(Response ~ Trt|Trt, data = data.zip) summary(model.zip.2)
Call: zeroinfl(formula = Response ~ Trt | Trt, data = data.zip) Pearson residuals: Min 1Q Median 3Q Max -1.62159 -0.96198 0.06977 0.91545 2.20244 Count model coefficients (poisson with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) 2.26039 0.03612 62.580 < 2e-16 *** TrtB -0.55348 0.06194 -8.936 < 2e-16 *** Zero-inflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) -1.3866 0.2501 -5.545 2.94e-08 *** TrtB 0.3769 0.3383 1.114 0.265 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Number of iterations in BFGS optimization: 16 Log-likelihood: -476.4 on 4 Df
we see that there are significantly more zeros than expected, but that the probability of zero is not significantly different between the two treatments.
cbind(nd, Count = predict(model.zip.2, newdata = nd, type = "count"), Zero = predict(model.zip.2, newdata = nd, type = "zero") )
Trt Count Zero 1 A 9.586842 0.1999451 2 B 5.511897 0.2670400
Zero-inflated negative binomial model
We can test for overdispersion in the count part of the zero-inflated model by specifying a negative binomial distribution.
model.zip.3 = zeroinfl(Response ~ Trt|1, data = data.zip, dist = "negbin") summary(model.zip.3)
Call: zeroinfl(formula = Response ~ Trt | 1, data = data.zip, dist = "negbin") Pearson residuals: Min 1Q Median 3Q Max -1.5072 -0.9285 0.1330 0.8940 2.1818 Count model coefficients (negbin with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) 2.26040 0.03612 62.574 < 2e-16 *** TrtB -0.55431 0.06203 -8.935 < 2e-16 *** Log(theta) 10.26954 66.78568 0.154 0.878 Zero-inflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) -1.1903 0.1681 -7.082 1.42e-12 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Theta = 28840.7026 Number of iterations in BFGS optimization: 32 Log-likelihood: -477 on 4 Df
The estimated theta parameter is not significant indicating that the zero-inflated Poisson model is appropriate.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.