Site icon R-bloggers

A minor update to simstudy provides an excuse to talk a bit about the negative binomial and Poisson distributions

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I just updated simstudy to version 0.1.5 (available on CRAN) so that it now includes several new distributions – exponential, discrete uniform, and negative binomial.

As part of the release, I thought I’d explore the negative binomial just a bit, particularly as it relates to the Poisson distribution. The Poisson distribution is a discrete (integer) distribution of outcomes of non-negative values that is often used to describe count outcomes. It is characterized by a mean (or rate) and its variance equals its mean.

Added variation

In many situations, when count data are modeled, it turns out that the variance of the data exceeds the mean (a situation called over-dispersion). In this case an alternative model is used that allows for the greater variance, which is based on the negative binomial distribution. It turns out that if the negative binomial distribution has mean \(\mu\), it has a variance of \(\mu + \theta \mu^2\), where \(\theta\) is called a dispersion parameter. If \(\theta = 0\), we have the Poisson distribution, but otherwise the variance of a negative binomial random variable will exceed the variance of a Poisson random variable as long as they share the same mean, because \(\mu > 0\) and \(\theta \ge 0\).

We can see this by generating data from each distribution with mean 15, and a dispersion parameter of 0.2 for the negative binomial. We expect a variance around 15 for the Poisson distribution, and 60 for the negative binomial distribution.

library(simstudy)
library(ggplot2)

# for a less cluttered look

theme_no_minor <- function(color = "grey90") {    
  theme(panel.grid.minor = element_blank(),
        panel.background = element_rect(fill="grey95")
  )
}

options(digits = 2)

# define data

defC <- defCondition(condition = "dist == 0", formula = 15, 
                     dist = "poisson", link = "identity")

defC <- defCondition(defC, condition = "dist == 1", formula = 15, 
                     variance = 0.2, dist = "negBinomial", 
                     link = "identity")

# generate data

set.seed(50)
dt <- genData(500)
dt <- trtAssign(dt, 2, grpName = "dist")
dt <- addCondition(defC, dt, "y")
genFactor(dt, "dist", c("Poisson", "Negative binomial"))

# compare distributions

dt[, .(mean = mean(y), var = var(y)), keyby = fdist]
##                fdist mean var
## 1:           Poisson   15  15
## 2: Negative binomial   15  54
ggplot(data = dt, aes(x = y, group = fdist)) +
  geom_density(aes(fill=fdist), alpha = .4) +
  scale_fill_manual(values = c("#808000", "#000080")) +
  scale_x_continuous(limits = c(0,60), 
                     breaks = seq(0, 60, by = 20)) +
  theme_no_minor() +
  theme(legend.title = element_blank(),
        legend.position = c(0.80, 0.83))

Underestimating standard errors

In the context of a regression, misspecifying a model as Poisson rather than negative binomial, can lead to an underestimation of standard errors, even though the point estimates may be quite reasonable (or may not). The Poisson model will force the variance estimate to be equal to the mean at any particular point on the regression curve. The Poisson model will effectively ignore the true extent of the variation, which can lead to problems of interpretation. We might conclude that there is an association when in fact there is none.

In this simple simulation, we generate two predictors (\(x\) and \(b\)) and an outcome (\(y\)). The outcome is a function of \(x\) only:

library(broom)
library(MASS)

# Generating data from negative binomial dist

def <- defData(varname = "x", formula = 0, variance = 1, 
               dist = "normal")
def <- defData(def, varname = "b", formula = 0, variance = 1, 
               dist = "normal")
def <- defData(def, varname = "y", formula = "0.9 + 0.6*x", 
               variance = 0.3, dist = "negBinomial", link = "log")

set.seed(35)
dt <- genData(500, def)

ggplot(data = dt, aes(x=x, y = y)) +
  geom_jitter(width = .1) +
  ggtitle("Outcome as function of 1st predictor") +
  theme_no_minor()

ggplot(data = dt, aes(x=b, y = y)) +
  geom_jitter(width = 0) +
  ggtitle("Outcome as function of 2nd predictor") +
  theme_no_minor()

I fit two models using both predictors. The first assumes (incorrectly) a Poisson distribution, and the second assumes (correctly) a negative binomial distribution. We can see that although the point estimates are quite close, the standard error estimates for the predictors in the Poisson model are considerably greater (about 50% higher) than the negative binomial model. And if we were basing any conclusion on the p-value (which is not always the obvious way to do things), we might make the wrong call since the p-value for the slope of \(b\) is estimated to be 0.029. Under the correct model model, the p-value is 0.29.

glmfit <- glm(y ~ x + b, data = dt, family = poisson (link = "log") )
tidy(glmfit)
##          term estimate std.error statistic  p.value
## 1 (Intercept)    0.956     0.030      32.3 1.1e-228
## 2           x    0.516     0.024      21.9 1.9e-106
## 3           b   -0.052     0.024      -2.2  2.9e-02
nbfit <- glm.nb(y ~ x + b, data = dt)
tidy(nbfit)
##          term estimate std.error statistic  p.value
## 1 (Intercept)    0.954     0.039      24.2 1.1e-129
## 2           x    0.519     0.037      14.2  7.9e-46
## 3           b   -0.037     0.036      -1.1  2.9e-01

A plot of the fitted regression curve and confidence bands of \(b\) estimated by each model reinforces the difference. The lighter shaded region is the wider confidence band of the negative binomial model, and the darker shaded region the based on the Poisson model.

newb <- data.table(b=seq(-3,3,length = 100), x = 0)

poispred <- predict(glmfit, newdata = newb, se.fit = TRUE, 
                    type = "response")
nbpred <-predict(nbfit, newdata = newb, se.fit = TRUE, 
                 type = "response")

poisdf <- data.table(b = newb$b, y = poispred$fit, 
                     lwr = poispred$fit - 1.96*poispred$se.fit,
                     upr = poispred$fit + 1.96*poispred$se.fit)

nbdf <- data.table(b = newb$b,  y = nbpred$fit, 
                     lwr = nbpred$fit - 1.96*nbpred$se.fit,
                     upr = nbpred$fit + 1.96*nbpred$se.fit)

ggplot(data = poisdf, aes(x=b, y = y)) +
  geom_line() +
  geom_ribbon(data=nbdf, aes(ymin = lwr, ymax=upr), alpha = .3, 
              fill = "red") +
  geom_ribbon(aes(ymin = lwr, ymax=upr), alpha = .5, 
              fill = "red") +
  theme_no_minor()

And finally, if we take 500 samples of size 500, and estimate slope for \(b\) each time and calculate the standard deviation of those estimates, it is quite close to the standard error estimate we saw in the model of the original simulated data set using the negative binomial assumption (0.036). And the mean of those estimates is quite close to zero, the true value.

result <- data.table()

for (i in 1:500) {
  
  dt <- genData(500, def)
  glmfit <- glm(y ~ x + b, data = dt, family = poisson)
  nbfit <- glm.nb(y ~ x + b, data = dt)

  result <- rbind(result, data.table(bPois = coef(glmfit)["b"],
                                     bNB = coef(nbfit)["b"])
  )
  
}

result[,.(sd(bPois), sd(bNB))]   # observed standard error
##       V1    V2
## 1: 0.037 0.036
result[,.(mean(bPois), mean(bNB))] # observed mean
##        V1     V2
## 1: 0.0025 0.0033

Negative binomial as mixture of Poissons

An interesting relationship between the two distributions is that a negative binomial distribution can be generated from a mixture of individuals whose outcomes come from a Poisson distribution, but each individual has her own rate or mean. Furthermore, those rates must have a specific distribution – a Gamma. (For much more on this, you can take a look here.) Here is a little simulation:

mu = 15
disp = 0.2

# Gamma distributed means

def <- defData(varname = "gmu", formula = mu, variance = disp, 
               dist = "gamma")

# generate data from each distribution

defC <- defCondition(condition = "nb == 0", formula = "gmu", 
                     dist = "poisson")

defC <- defCondition(defC, condition = "nb == 1", formula = mu,
                     variance = disp, dist = "negBinomial")

dt <- genData(5000, def)
dt <- trtAssign(dt, 2, grpName = "nb")
genFactor(dt, "nb", labels = c("Poisson-Gamma", "Negative binomial"))

dt <- addCondition(defC, dt, "y")

# means and variances should be very close

dt[, .(Mean = mean(y), Var = var(y)), keyby = fnb]
##                  fnb Mean Var
## 1:     Poisson-Gamma   15  62
## 2: Negative binomial   15  57
# plot

ggplot(data = dt, aes(x = y, group = fnb)) +
  geom_density(aes(fill=fnb), alpha = .4) +
  scale_fill_manual(values = c("#808000", "#000080")) +
  scale_x_continuous(limits = c(0,60), 
                     breaks = seq(0, 60, by = 20)) +
  theme_no_minor() +
  theme(legend.title = element_blank(),
        legend.position = c(0.80, 0.83))

“`

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.