Count Data Models
Introduction:
When the response variable is a count (a non-negative integer), using linear regression may violate the normality assumption, so the classical statistical tests are no longer valid for evaluating the model. However, as we do with logistic regression, we can use a generalized linear model (GLM) instead by specifying a suitable distribution.
The usual distributions for this type of data are the discrete Poisson and negative binomial distributions. The former is the best choice if the mean and the variance of the response variable are close to each other; if they are not, however, and we insist on using this distribution, we may run into the problem of overdispersion in the residuals. In that case we can use the latter distribution, which does not have this restriction.
There is another alternative if neither the Poisson nor the negative binomial distribution is suitable, called quasi-maximum likelihood. The advantage of this method is that it uses only the relationship between the mean and the variance and does not require any prespecified distribution. Moreover, its estimators are approximately as efficient as the maximum likelihood estimators.
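To make this concrete, here is the variance assumption behind the quasi-Poisson approach used later in this article (the standard textbook formulation, not something estimated yet): \[E(Y)=\mu, \qquad V(Y)=\phi\,\mu\] where \(\phi\) is a dispersion parameter estimated from the data; \(\phi=1\) recovers the Poisson case, while \(\phi>1\) indicates overdispersion.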
Data preparation
To understand how to model count data, we will use the DoctorVisits data from the AER package, in which the variable visits will be our target. Let's load this data along with the packages we need throughout this article.
```r
ssh <- suppressPackageStartupMessages
ssh(library(performance))
ssh(library(ModelMetrics))
ssh(library(corrr))
ssh(library(purrr))
ssh(library(MASS))
ssh(library(tidyverse))
ssh(library(AER))
ssh(library(broom))

data("DoctorVisits")
doc <- DoctorVisits
glimpse(doc)
```

```
Observations: 5,190
Variables: 12
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...
$ gender    <fct> female, female, male, male, male, female, female, female,...
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.1...
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.1...
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, ...
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0,...
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, ...
$ private   <fct> yes, yes, no, no, no, no, no, no, yes, yes, no, no, no, n...
$ freepoor  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
$ freerepat <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no, no, ...
$ nchronic  <fct> no, no, no, no, yes, yes, no, no, no, no, no, no, yes, ye...
$ lchronic  <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
```
This data comes from an Australian health survey, where visits is the number of doctor visits in the past two weeks, together with the 11 features listed above.
First, we display a summary of the data to check for any issues.
```r
summary(doc)
```

```
     visits          gender          age             income      
 Min.   :0.0000   male  :2488   Min.   :0.1900   Min.   :0.0000  
 1st Qu.:0.0000   female:2702   1st Qu.:0.2200   1st Qu.:0.2500  
 Median :0.0000                 Median :0.3200   Median :0.5500  
 Mean   :0.3017                 Mean   :0.4064   Mean   :0.5832  
 3rd Qu.:0.0000                 3rd Qu.:0.6200   3rd Qu.:0.9000  
 Max.   :9.0000                 Max.   :0.7200   Max.   :1.5000  
    illness         reduced            health       private    freepoor  
 Min.   :0.000   Min.   : 0.0000   Min.   : 0.000   no :2892   no :4968  
 1st Qu.:0.000   1st Qu.: 0.0000   1st Qu.: 0.000   yes:2298   yes: 222  
 Median :1.000   Median : 0.0000   Median : 0.000                        
 Mean   :1.432   Mean   : 0.8619   Mean   : 1.218                        
 3rd Qu.:2.000   3rd Qu.: 0.0000   3rd Qu.: 2.000                        
 Max.   :5.000   Max.   :14.0000   Max.   :12.000                        
 freerepat   nchronic   lchronic  
 no :4099   no :3098   no :4585  
 yes:1091   yes:2092   yes: 605  
```
As we can see, there are no missing values, and the visits values range from 0 to 9; however, visits should be of integer type rather than double. Similarly, the variable illness should be converted to a factor since it takes only a few distinct values.
```r
doc$visits <- as.integer(doc$visits)
doc$illness <- as.factor(doc$illness)
tab <- table(doc$visits)
tab
```

```
   0    1    2    3    4    5    6    7    8    9 
4141  782  174   30   24    9   12   12    5    1 
```
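Before going further, a quick sanity check of the equidispersion assumption mentioned in the introduction is to compare the sample mean and variance of the target (an informal check, not a formal test):

```r
# For a Poisson-distributed variable these two values should be close;
# a variance clearly larger than the mean hints at overdispersion.
mean(doc$visits)
var(doc$visits)
```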
A good way to start analyzing the data is by displaying the correlation coefficient of each pair of variables: any predictor that is highly correlated with the target variable is likely to be relevant in our future model. Notice that our target variable is not continuous, hence we will use the Spearman correlation. As required by the correlate function from the corrr package, all the variables must be numeric, so we convert the factors to integers.
```r
doc1 <- modify_if(doc, is.factor, as.integer)
```
Notice that we have stored the result in another object, doc1, to keep our original data intact.
```r
M <- correlate(doc1, method = "spearman")
rplot(shave(M), colours = c("red", "white", "blue")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Looking at this plot, all the correlations have low values. However, these correlations assess only monotonic relations; they say nothing about other forms of relationship.
First, let's compare the empirical distribution of the variable visits with the theoretical Poisson distribution whose parameter \(\lambda\) equals the mean of visits, mean(doc$visits) \(\approx 0.302\), scaled by the total number of observations, 5190.
```r
pos <- dpois(0:9, 0.302) * 5190
both <- numeric(20)
both[1:20 %% 2 != 0] <- tab
both[1:20 %% 2 == 0] <- pos
labels <- character(20)
labels[1:20 %% 2 == 0] <- as.character(0:9)
barplot(both, col = rep(c("red", "yellow"), 10), names = labels)
```
As we see, the two distributions are roughly close to each other. Let's now check the negative binomial distribution by first estimating the clumping parameter \(k=\frac{\bar x^2}{s^2-\bar x}\).
```r
k <- mean(doc$visits)^2 / (var(doc$visits) - mean(doc$visits))  # k is approximately 0.27
bin <- dnbinom(0:9, size = k, mu = 0.302) * 5190
both1 <- numeric(20)
both1[1:20 %% 2 != 0] <- tab
both1[1:20 %% 2 == 0] <- bin
labels <- character(20)
labels[1:20 %% 2 == 0] <- as.character(0:9)
barplot(both1, col = rep(c("red", "yellow"), 10), names = labels)
```
With this distribution it seems that the empirical distribution is closer to the negative binomial than to the Poisson.
Note: this data has a very large number of zeros for the outcome compared to the other values, which means that any trained model that does not take this anomaly into account will be biased towards predicting the zero value. At the end of this article I will show two well-known models designed to handle this type of count data, called the hurdle model and the zero-inflated model.
Data partition
In order to evaluate our models, we hold out 20% of the data as a testing set.
```r
set.seed(123)
index <- sample(2, nrow(doc), replace = TRUE, prob = c(0.8, 0.2))
train <- doc[index == 1, ]
test <- doc[index == 2, ]
```
Poisson model
This model belongs to the generalized linear model family, so in the function glm we set the argument family to poisson. In practice this model is sufficient for a wide range of count data.
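Concretely, the Poisson GLM links the expected count to the predictors through the log link: \[\log(\mu_i)=x_i^\top\beta, \qquad y_i\sim\mathrm{Poisson}(\mu_i)\] which guarantees that the fitted means \(\mu_i=\exp(x_i^\top\beta)\) are always positive.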
```r
set.seed(123)
model1 <- glm(visits ~ ., data = train, family = "poisson")
tidy(model1)
```

```
# A tibble: 16 x 5
   term         estimate std.error statistic   p.value
   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)   -2.70      0.141    -19.2   9.14e- 82
 2 genderfemale   0.193     0.0620     3.11  1.88e-  3
 3 age            0.436     0.184      2.37  1.77e-  2
 4 income        -0.161     0.0928    -1.74  8.23e-  2
 5 illness1       0.944     0.113      8.35  6.76e- 17
 6 illness2       1.21      0.118     10.3   1.15e- 24
 7 illness3       1.11      0.132      8.43  3.51e- 17
 8 illness4       1.28      0.140      9.13  7.14e- 20
 9 illness5       1.44      0.139     10.4   2.34e- 25
10 reduced        0.126     0.00560   22.6   6.85e-113
11 health         0.0348    0.0112     3.10  1.91e-  3
12 privateyes     0.111     0.0795     1.39  1.64e-  1
13 freepooryes   -0.344     0.190     -1.81  7.00e-  2
14 freerepatyes   0.0377    0.104      0.363 7.16e-  1
15 nchronicyes    0.0186    0.0732     0.254 7.99e-  1
16 lchronicyes    0.0255    0.0916     0.279 7.81e-  1
```
Looking at the p-values, income is not significant at the 5% level, so we remove this variable and re-estimate the model.
```r
set.seed(123)
model1 <- glm(visits ~ . - income, data = train, family = "poisson")
tidy(model1)
```

```
# A tibble: 15 x 5
   term         estimate std.error statistic   p.value
   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)   -2.83      0.121    -23.4   7.36e-121
 2 genderfemale   0.213     0.0609     3.51  4.56e-  4
 3 age            0.479     0.183      2.62  8.91e-  3
 4 illness1       0.946     0.113      8.38  5.44e- 17
 5 illness2       1.21      0.118     10.3   8.29e- 25
 6 illness3       1.12      0.132      8.50  1.93e- 17
 7 illness4       1.28      0.140      9.17  4.71e- 20
 8 illness5       1.45      0.139     10.5   1.05e- 25
 9 reduced        0.126     0.00560   22.6   1.12e-112
10 health         0.0350    0.0112     3.11  1.84e-  3
11 privateyes     0.100     0.0793     1.27  2.06e-  1
12 freepooryes   -0.290     0.188     -1.55  1.22e-  1
13 freerepatyes   0.0683    0.102      0.667 5.05e-  1
14 nchronicyes    0.0171    0.0731     0.235 8.15e-  1
15 lchronicyes    0.0282    0.0914     0.308 7.58e-  1
```
To interpret the coefficient estimates, we should exponentiate them, since the Poisson model uses the log link function to preclude negative values; the exponentiated coefficients act multiplicatively on the expected count (they are rate ratios). For a continuous predictor, say age, if this predictor increases by one unit, ceteris paribus, we expect the number of doctor visits to be \(\exp(0.47876624)=1.614082\) times larger. For a categorical predictor, say gender, females have \(\exp(0.21342446)=1.23791\) times more doctor visits than males.
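We can get all these multiplicative effects at once by exponentiating the coefficient vector (a convenience step, not part of the fitting itself):

```r
# Exponentiated coefficients: multiplicative effects on the expected count
exp(coef(model1))
```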
Judging by the p-values, most of the predictors are significant. However, we still have to check other statistics and metrics.
```r
glance(model1)
```

```
# A tibble: 1 x 7
  null.deviance df.null logLik   AIC   BIC deviance df.residual
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
1         4565.    4154 -2685. 5399. 5494.    3486.        4140
```
A classical rough check for overdispersion is to compare the residual deviance (3485.905) with the residual degrees of freedom (4140); a ratio well above 1 would be a warning sign. Such a rule of thumb, however, is not a formal test. Fortunately, the AER package provides a super easy way to test the significance of overdispersion via the function dispersiontest.
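For reference, the rough ratios behind this rule of thumb can be computed directly from the fitted model (the Pearson-based version is usually considered more reliable for sparse count data):

```r
# Deviance-based dispersion ratio (values well above 1 suggest overdispersion)
deviance(model1) / df.residual(model1)

# Pearson-based dispersion ratio
sum(residuals(model1, type = "pearson")^2) / df.residual(model1)
```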
```r
dispersiontest(model1)
```

```
	Overdispersion test

data:  model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true dispersion is greater than 1
sample estimates:
dispersion 
  1.397176 
```
If our target variable really follows a Poisson distribution, then its variance \(V\) should be approximately equal to its mean \(\mu\); this is the null hypothesis of the dispersiontest, tested against the alternative hypothesis that the variance has the form: \[V=\mu+\alpha\cdot \mathrm{trafo}(\mu)\]
where trafo is a transformation function that can be specified as an argument of this test. The popular choices for this argument are:
- trafo = NULL (default): \(V=(1+\alpha)\mu\)
- trafo = 1: \(V=\mu+\alpha\cdot\mu\)
- trafo = 2: \(V=\mu+\alpha\cdot\mu^2\)
If the first form is true, the data would be better modeled by a quasi-Poisson model than a Poisson model. If one of the last two forms is true, the negative binomial model would be better than the Poisson model.
Once trafo is defined, the test estimates \(\alpha\) such that:
- if \(\alpha = 0\): equidispersion (the null hypothesis)
- if \(\alpha < 0\): underdispersion
- if \(\alpha > 0\): overdispersion
Therefore, the result depends on the direction of the test, chosen via the alternative argument: two.sided, greater (the default, for overdispersion), or less (for underdispersion).
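For instance, the direction can be changed explicitly (shown here for illustration; the results above use the default greater):

```r
# Same test with a two-sided alternative instead of the default "greater"
dispersiontest(model1, alternative = "two.sided")
```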
With this in mind, the first test above (run with the default values) tested for overdispersion against the quasi-Poisson alternative, and since the p-value 1.714e-10 is very small, we do have an overdispersion problem, suggesting the use of a quasi-Poisson model instead.
Now let's test the negative binomial alternatives.
```r
dispersiontest(model1, trafo = 1)
```

```
	Overdispersion test

data:  model1
z = 6.278, p-value = 1.714e-10
alternative hypothesis: true alpha is greater than 0
sample estimates:
    alpha 
0.3971763 
```
The test suggests the use of a negative binomial model with a linear variance function, again with a very tiny p-value 1.714e-10. This model is known as NB1 (negative binomial with linear variance function).
```r
dispersiontest(model1, trafo = 2)
```

```
	Overdispersion test

data:  model1
z = 7.4723, p-value = 3.939e-14
alternative hypothesis: true alpha is greater than 0
sample estimates:
  alpha 
0.95488 
```
If the relation is of quadratic form, the model is called NB2. Since this p-value 3.939e-14 is even smaller than the previous one, NB2 could be more appropriate than NB1.
Quasi-Poisson model
The first test suggested the use of the quasi-Poisson model, so let's train this model with the same predictors as before.
```r
set.seed(123)
model2 <- glm(visits ~ . - income, data = train, family = "quasipoisson")
tidy(model2)
```

```
# A tibble: 15 x 5
   term         estimate std.error statistic  p.value
   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)   -2.83      0.140    -20.2   1.25e-86
 2 genderfemale   0.213     0.0705     3.03  2.47e- 3
 3 age            0.479     0.212      2.26  2.39e- 2
 4 illness1       0.946     0.131      7.24  5.36e-13
 5 illness2       1.21      0.136      8.89  9.15e-19
 6 illness3       1.12      0.152      7.34  2.49e-13
 7 illness4       1.28      0.162      7.92  2.91e-15
 8 illness5       1.45      0.160      9.06  2.01e-19
 9 reduced        0.126     0.00647   19.5   4.79e-81
10 health         0.0350    0.0130     2.69  7.14e- 3
11 privateyes     0.100     0.0917     1.09  2.74e- 1
12 freepooryes   -0.290     0.217     -1.34  1.81e- 1
13 freerepatyes   0.0683    0.119      0.576 5.64e- 1
14 nchronicyes    0.0171    0.0846     0.203 8.39e- 1
15 lchronicyes    0.0282    0.106      0.266 7.90e- 1
```
This model uses quasi-maximum likelihood, which gives the same coefficient estimates but with different (corrected) standard errors.
We see that the two models are the same except for the correction of the standard errors, which are now larger. In other words, under overdispersion the Poisson model underestimates the standard errors, and hence its tests would be biased towards rejecting the null hypothesis.
To better understand what is going on with the quasi-Poisson model, let's put the estimates and the standard errors of both models into one table, and add a column obtained by dividing the second vector of standard errors by the first.
```r
D1 <- tidy(model1)
colnames(D1) <- NULL
D2 <- tidy(model2)
colnames(D2) <- NULL
cbind(term = D1[, 1], estimate1 = D1[, 2], std1 = D1[, 3],
      estimate2 = D2[, 2], std2 = D2[, 3]) %>%
  mutate(dispersion = std2 / std1)
```

```
           term   estimate1        std1   estimate2        std2 dispersion
1   (Intercept) -2.82868582 0.121003941 -2.82868582 0.140023371    1.15718
2  genderfemale  0.21342446 0.060883445  0.21342446 0.070453120    1.15718
3           age  0.47876624 0.183054884  0.47876624 0.211827495    1.15718
4      illness1  0.94643852 0.112984298  0.94643852 0.130743198    1.15718
5      illness2  1.21007304 0.117661029  1.21007304 0.136155018    1.15718
6      illness3  1.11941845 0.131726935  1.11941845 0.152431807    1.15718
7      illness4  1.28108306 0.139695671  1.28108306 0.161653071    1.15718
8      illness5  1.45198517 0.138529459  1.45198517 0.160303554    1.15718
9       reduced  0.12621548 0.005595111  0.12621548 0.006474552    1.15718
10       health  0.03497647 0.011228772  0.03497647 0.012993714    1.15718
11   privateyes  0.10033918 0.079266327  0.10033918 0.091725427    1.15718
12  freepooryes -0.29008562 0.187570838 -0.29008562 0.217053268    1.15718
13 freerepatyes  0.06829904 0.102422323  0.06829904 0.118521089    1.15718
14  nchronicyes  0.01714304 0.073082122  0.01714304 0.084569188    1.15718
15  lchronicyes  0.02818992 0.091425522  0.02818992 0.105795808    1.15718
```
Note: the first two numeric columns are for model1 and the following ones for model2. Not surprisingly, the last column is constant, since this is exactly what quasi-maximum likelihood does: it computes the corrected standard errors from the original ones as \(std2 = \sigma \cdot std1\), with \(\sigma\) estimated here as 1.15718. If you want to know where this value comes from, the answer is simple: it is the standard deviation of the Pearson (standardized) residuals of the original model. We can thus reproduce it by requesting Pearson residuals and computing \(\sigma\) by hand as follows:
```r
resid <- resid(model1, type = "pearson")
sqrt(sum(resid^2) / 4140)  # 4140 is the residual degrees of freedom
```

```
[1] 1.15718
```
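As a cross-check, summary.glm reports the Pearson-based dispersion estimate \(\hat\phi\) for quasi-Poisson models, so its square root should reproduce the same ratio:

```r
# Square root of the estimated dispersion parameter should equal 1.15718
sqrt(summary(model2)$dispersion)
```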
Now, to assess the predictive qualities of our models, we use the testing set by plotting the original values against the predicted ones. Let's start with model1.
```r
pred <- predict.glm(model1, newdata = test[test$visits != 0, ], type = "response")
plot(test$visits[test$visits != 0], type = "b", col = "red")
lines(round(pred), col = "blue")
```
Due to the large number of zeros in the target variable, I have intentionally removed these values in order to get a clearer plot. From this plot we can say that the model does not fit the data well; in particular the larger values are not well captured, though this may be because the data are heavily skewed towards zero.
To compare the different models we can use the root mean squared error (RMSE) and the mean absolute error (MAE), computed on all the data with the zeros included.
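For reference, these metrics are defined as: \[\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2}, \qquad \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat y_i\rvert\]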
Note: here we are using the rmse function from ModelMetrics, which expects two vectors as input, and not the function of the same name from the performance package, which expects a model object. To avoid any such ambiguity, use the fully qualified call ModelMetrics::rmse.
```r
pred <- predict.glm(model1, newdata = test, type = "response")
rmsemodelp <- ModelMetrics::rmse(test$visits, round(pred))
maemodelp <- mae(test$visits, round(pred))
rmsemodelp
maemodelp
```

```
[1] 0.7381921
[1] 0.284058
```
In the same way, let's now evaluate the quasi-Poisson model.
```r
predq <- predict.glm(model2, newdata = test[test$visits != 0, ], type = "response")
plot(test$visits[test$visits != 0], type = "b", col = "red")
lines(round(predq), col = "blue")
```
This plot does not seem very different from the previous one. The RMSE and MAE for this model are computed as follows.
```r
predq <- predict.glm(model2, newdata = test, type = "response")
rmsemodelqp <- ModelMetrics::rmse(test$visits, round(predq))
maemodelqp <- mae(test$visits, round(predq))
rmsemodelqp
maemodelqp
```

```
[1] 0.7381921
[1] 0.284058
```
We will not compare these two models yet; once we have finished with all the remaining models, we will compare them all at once.
Negative binomial model
The negative binomial distribution is used as an alternative to the Poisson distribution under overdispersion.
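In the parameterization used by glm.nb from the MASS package, the variance grows quadratically with the mean through the shape parameter \(\theta\): \[V=\mu+\frac{\mu^2}{\theta}\] so a small \(\theta\) means strong overdispersion, while \(\theta\to\infty\) recovers the Poisson model.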
```r
set.seed(123)
model3 <- glm.nb(visits ~ . - income, data = train)
summary(model3)
```

```
Call:
glm.nb(formula = visits ~ . - income, data = train, init.theta = 0.9715923611, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8413  -0.6894  -0.5335  -0.3540   3.6726  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.894820   0.135760 -21.323  < 2e-16 ***
genderfemale  0.258968   0.075352   3.437 0.000589 ***
age           0.511867   0.230297   2.223 0.026240 *  
illness1      0.880644   0.123264   7.144 9.04e-13 ***
illness2      1.171615   0.130240   8.996  < 2e-16 ***
illness3      1.118067   0.149032   7.502 6.28e-14 ***
illness4      1.263367   0.165370   7.640 2.18e-14 ***
illness5      1.378166   0.169907   8.111 5.01e-16 ***
reduced       0.141389   0.008184  17.275  < 2e-16 ***
health        0.041364   0.015029   2.752 0.005918 ** 
privateyes    0.086188   0.095173   0.906 0.365149    
freepooryes  -0.375471   0.223857  -1.677 0.093487 .  
freerepatyes  0.144928   0.127751   1.134 0.256602    
nchronicyes   0.022111   0.087590   0.252 0.800705    
lchronicyes   0.091622   0.114965   0.797 0.425477    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.9716) family taken to be 1)

    Null deviance: 3208.0  on 4154  degrees of freedom
Residual deviance: 2431.2  on 4140  degrees of freedom
AIC: 5159.6

Number of Fisher Scoring iterations: 1

              Theta:  0.972 
          Std. Err.:  0.103 

 2 x log-likelihood:  -5127.587 
```
As before we visualize the performance of this model as follows.
```r
prednb <- predict.glm(model3, newdata = test[test$visits != 0, ], type = "response")
plot(test$visits[test$visits != 0], type = "b", col = "red")
lines(round(prednb), col = "blue")
```
Again, this plot seems much the same as the previous ones, so to figure out which model is best we rely on the numeric metrics.
```r
prednb <- predict.glm(model3, newdata = test, type = "response")
rmsemodelnb <- ModelMetrics::rmse(test$visits, round(prednb))
maemodelnb <- mae(test$visits, round(prednb))
rmsemodelnb
maemodelnb
```

```
[1] 0.7808085
[1] 0.2966184
```
We will use these outputs further below.
Hurdle model
Originally proposed by Mullahy (1986), this model can take into account the fact that the data has an excess of zeros, and it can also handle the overdispersion problem. It has two components (or steps): a truncated count component defined by a chosen discrete distribution such as the Poisson or negative binomial, and a hurdle component that models zero versus larger counts (using a censored count distribution or a binomial model). In other words, this model assumes that two population distributions underlie the data: one distribution for the zero values and a different one for the positive values.
To fit this model we make use of the function hurdle from the package pscl.
Hurdle model with Poisson distribution
This model works in two steps. In the first step, it uses a binary classification to discriminate between zero and positive values; in the second step, it uses a truncated count model (Poisson or negative binomial; here we use the Poisson) for the positive values.
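Formally, writing \(f\) for the count density (Poisson here) and \(\pi\) for the probability of crossing the hurdle (a positive count), the standard hurdle specification is: \[P(Y=0)=1-\pi, \qquad P(Y=y)=\pi\,\frac{f(y)}{1-f(0)} \quad \text{for } y=1,2,\dots\]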
```r
library(pscl)
set.seed(123)
modelhp <- hurdle(visits ~ . - income, data = train, dist = "poisson")
summary(modelhp)
```

```
Call:
hurdle(formula = visits ~ . - income, data = train, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.5464 -0.4686 -0.3306 -0.2075 11.0887 

Count model coefficients (truncated poisson with log link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.977535   0.261835  -3.733 0.000189 ***
genderfemale  0.073326   0.098034   0.748 0.454480    
age           0.032762   0.287991   0.114 0.909427    
illness1      0.370071   0.251920   1.469 0.141833    
illness2      0.403514   0.256363   1.574 0.115489    
illness3      0.201724   0.278757   0.724 0.469277    
illness4      0.420285   0.277573   1.514 0.129990    
illness5      0.762209   0.269809   2.825 0.004728 ** 
reduced       0.111640   0.007967  14.013  < 2e-16 ***
health        0.007682   0.016452   0.467 0.640554    
privateyes   -0.215649   0.129860  -1.661 0.096787 .  
freepooryes   0.066277   0.269699   0.246 0.805879    
freerepatyes -0.434941   0.166196  -2.617 0.008870 ** 
nchronicyes   0.109660   0.125380   0.875 0.381779    
lchronicyes   0.135612   0.142766   0.950 0.342166    
Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.224621   0.156469 -20.609  < 2e-16 ***
genderfemale  0.305324   0.089648   3.406 0.000660 ***
age           0.700345   0.276089   2.537 0.011191 *  
illness1      0.885148   0.136951   6.463 1.02e-10 ***
illness2      1.238227   0.146059   8.478  < 2e-16 ***
illness3      1.263698   0.169344   7.462 8.50e-14 ***
illness4      1.405167   0.195388   7.192 6.40e-13 ***
illness5      1.445585   0.208425   6.936 4.04e-12 ***
reduced       0.154858   0.013488  11.481  < 2e-16 ***
health        0.070464   0.019142   3.681 0.000232 ***
privateyes    0.271192   0.112751   2.405 0.016163 *  
freepooryes  -0.546177   0.277942  -1.965 0.049406 *  
freerepatyes  0.423220   0.153994   2.748 0.005991 ** 
nchronicyes  -0.006256   0.102033  -0.061 0.951106    
lchronicyes   0.070658   0.140587   0.503 0.615251    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 22 
Log-likelihood: -2581 on 30 Df
```
As we see, this output has two tables. The upper one is for the Poisson model fitted only on the truncated positive values, and the lower one is the result of the logistic regression with two classes (zero or positive value).
As we did before we plot the results.
```r
predhp <- predict(modelhp, newdata = test[test$visits != 0, ], type = "response")
plot(test$visits[test$visits != 0], type = "b", col = "red")
lines(round(predhp), col = "blue")
```
As before, by looking at the plot alone we cannot decide which model is best, so we rely on the metrics.
```r
predhp <- predict(modelhp, newdata = test, type = "response")
rmsemodelhp <- ModelMetrics::rmse(test$visits, round(predhp))
maemodelhp <- mae(test$visits, round(predhp))
rmsemodelhp
maemodelhp
```

```
[1] 0.7375374
[1] 0.2850242
```
Hurdle model with negative binomial distribution
Now let's try the negative binomial instead of the Poisson distribution.
```r
set.seed(123)
modelhnb <- hurdle(visits ~ . - income, data = train, dist = "negbin")
summary(modelhnb)
```

```
Call:
hurdle(formula = visits ~ . - income, data = train, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-0.9078 -0.4515 -0.3201 -0.2022 10.6552 

Count model coefficients (truncated negbin with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.68462    2.65037  -1.390   0.1645    
genderfemale  0.07432    0.17299   0.430   0.6675    
age           0.26774    0.54001   0.496   0.6200    
illness1      0.27678    0.34564   0.801   0.4233    
illness2      0.37093    0.35241   1.053   0.2925    
illness3      0.04728    0.39747   0.119   0.9053    
illness4      0.40386    0.40517   0.997   0.3189    
illness5      0.68213    0.41357   1.649   0.0991 .  
reduced       0.15813    0.01935   8.171 3.05e-16 ***
health        0.01891    0.03291   0.575   0.5656    
privateyes   -0.45711    0.23118  -1.977   0.0480 *  
freepooryes   0.03334    0.55282   0.060   0.9519    
freerepatyes -0.59189    0.30437  -1.945   0.0518 .  
nchronicyes   0.08737    0.21061   0.415   0.6783    
lchronicyes   0.30274    0.25846   1.171   0.2415    
Log(theta)   -2.80552    2.80120  -1.002   0.3166    
Zero hurdle model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.224621   0.156469 -20.609  < 2e-16 ***
genderfemale  0.305324   0.089648   3.406 0.000660 ***
age           0.700345   0.276089   2.537 0.011191 *  
illness1      0.885148   0.136951   6.463 1.02e-10 ***
illness2      1.238227   0.146059   8.478  < 2e-16 ***
illness3      1.263698   0.169344   7.462 8.50e-14 ***
illness4      1.405167   0.195388   7.192 6.40e-13 ***
illness5      1.445585   0.208425   6.936 4.04e-12 ***
reduced       0.154858   0.013488  11.481  < 2e-16 ***
health        0.070464   0.019142   3.681 0.000232 ***
privateyes    0.271192   0.112751   2.405 0.016163 *  
freepooryes  -0.546177   0.277942  -1.965 0.049406 *  
freerepatyes  0.423220   0.153994   2.748 0.005991 ** 
nchronicyes  -0.006256   0.102033  -0.061 0.951106    
lchronicyes   0.070658   0.140587   0.503 0.615251    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: count = 0.0605
Number of iterations in BFGS optimization: 31 
Log-likelihood: -2524 on 31 Df
```
And let's plot the difference between the predicted and the actual values of the testing set.
```r
predhnb <- predict(modelhnb, newdata = test[test$visits != 0, ], type = "response")
plot(test$visits[test$visits != 0], type = "b", col = "red")
lines(round(predhnb), col = "blue")
```
And for the metrics.
```r
predhnb <- predict(modelhnb, newdata = test, type = "response")
rmsemodelhnb <- ModelMetrics::rmse(test$visits, round(predhnb))
maemodelhnb <- mae(test$visits, round(predhnb))
rmsemodelhnb
maemodelhnb
```

```
[1] 0.7408052
[1] 0.2879227
```
Zero-inflated model
Like the hurdle model, this model also combines two components, but with the difference that it is a mixture: a binomial component that generates extra zeros, and a Poisson (or negative binomial) component for the counts, with zero included.
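Formally, with \(\pi\) now the probability of an excess (structural) zero and \(f\) the count density, the standard zero-inflated specification is: \[P(Y=0)=\pi+(1-\pi)f(0), \qquad P(Y=y)=(1-\pi)f(y) \quad \text{for } y=1,2,\dots\] so, unlike the hurdle model, the count component also contributes zeros.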
Zero-inflated model with Poisson distribution
Here also we fit two models, one with the Poisson and one with the negative binomial distribution.
```r
set.seed(123)
modelzp <- zeroinfl(visits ~ . - income, data = train, dist = "poisson")
summary(modelzp)
```

```
Call:
zeroinfl(formula = visits ~ . - income, data = train, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.6247 -0.4791 -0.3326 -0.1783 12.3448 

Count model coefficients (poisson with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.86466    0.23608  -3.663  0.00025 ***
genderfemale  0.03280    0.09078   0.361  0.71788    
age           0.13330    0.26922   0.495  0.62050    
illness1      0.32985    0.21846   1.510  0.13107    
illness2      0.34799    0.22426   1.552  0.12072    
illness3      0.20399    0.24152   0.845  0.39833    
illness4      0.44019    0.24324   1.810  0.07034 .  
illness5      0.72462    0.23632   3.066  0.00217 ** 
reduced       0.09679    0.00809  11.964  < 2e-16 ***
health        0.02269    0.01609   1.410  0.15855    
privateyes   -0.26390    0.12796  -2.062  0.03918 *  
freepooryes   0.04860    0.27675   0.176  0.86060    
freerepatyes -0.51895    0.17070  -3.040  0.00236 ** 
nchronicyes   0.08577    0.11489   0.747  0.45536    
lchronicyes   0.10876    0.12745   0.853  0.39347    

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.44814    0.34932   7.008 2.41e-12 ***
genderfemale -0.48766    0.19038  -2.561 0.010422 *  
age          -0.88817    0.59030  -1.505 0.132426    
illness1     -0.80835    0.31247  -2.587 0.009683 ** 
illness2     -1.41462    0.35338  -4.003 6.25e-05 ***
illness3     -1.69205    0.44028  -3.843 0.000121 ***
illness4     -1.52225    0.46334  -3.285 0.001018 ** 
illness5     -1.08745    0.46493  -2.339 0.019338 *  
reduced      -0.14462    0.03861  -3.746 0.000180 ***
health       -0.05795    0.04486  -1.292 0.196418    
privateyes   -0.73945    0.22597  -3.272 0.001066 ** 
freepooryes   0.73370    0.41402   1.772 0.076372 .  
freerepatyes -1.75457    0.53937  -3.253 0.001142 ** 
nchronicyes   0.13230    0.22623   0.585 0.558678    
lchronicyes   0.03646    0.30619   0.119 0.905217    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of iterations in BFGS optimization: 80 
Log-likelihood: -2579 on 30 Df
```

```r
predzp <- predict(modelzp, newdata = test[test$visits != 0, ], type = "response")
plot(test$visits[test$visits != 0], type = "b", col = "red")
lines(round(predzp), col = "blue")
```
```r
predzp <- predict(modelzp, newdata = test, type = "response")
rmsemodelzp <- ModelMetrics::rmse(test$visits, round(predzp))
maemodelzp <- mae(test$visits, round(predzp))
rmsemodelzp
maemodelzp
```

```
[1] 0.7485897
[1] 0.2898551
```
Zero-inflated model with negative binomial distribution
This time let's try the negative binomial distribution.
```r
set.seed(123)
modelznb <- zeroinfl(visits ~ ., data = train, dist = "negbin")
summary(modelznb)
```

```
Call:
zeroinfl(formula = visits ~ ., data = train, dist = "negbin")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.0440 -0.4582 -0.3031 -0.1680 14.2061 

Count model coefficients (negbin with log link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.383088   0.241780  -5.720 1.06e-08 ***
genderfemale  0.038982   0.090534   0.431 0.666775    
age           0.028836   0.290843   0.099 0.921022    
income       -0.196594   0.147560  -1.332 0.182763    
illness1      0.613261   0.174653   3.511 0.000446 ***
illness2      0.692295   0.179665   3.853 0.000117 ***
illness3      0.664614   0.196063   3.390 0.000699 ***
illness4      0.760167   0.204138   3.724 0.000196 ***
illness5      0.944761   0.206098   4.584 4.56e-06 ***
reduced       0.102651   0.008776  11.697  < 2e-16 ***
health        0.044012   0.015536   2.833 0.004612 ** 
privateyes   -0.168877   0.138680  -1.218 0.223321    
freepooryes  -0.422737   0.306654  -1.379 0.168034    
freerepatyes -0.383568   0.163995  -2.339 0.019341 *  
nchronicyes   0.033375   0.107881   0.309 0.757040    
lchronicyes   0.065839   0.128987   0.510 0.609748    
Log(theta)    0.473940   0.142626   3.323 0.000891 ***

Zero-inflation model coefficients (binomial with logit link):
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.280216   0.532170   4.285 1.83e-05 ***
genderfemale  -0.726916   0.275312  -2.640  0.00828 ** 
age           -2.002886   0.920168  -2.177  0.02951 *  
income        -0.180281   0.393267  -0.458  0.64665    
illness1      -0.332734   0.347965  -0.956  0.33896    
illness2      -1.111993   0.449603  -2.473  0.01339 *  
illness3      -0.953342   0.512702  -1.859  0.06296 .  
illness4      -1.551433   0.739821  -2.097  0.03599 *  
illness5      -1.229820   0.859693  -1.431  0.15256    
reduced       -1.298180   0.457644  -2.837  0.00456 ** 
health        -0.001444   0.055085  -0.026  0.97909    
privateyes    -0.817933   0.317762  -2.574  0.01005 *  
freepooryes    0.239439   0.664835   0.360  0.71874    
freerepatyes -13.563347 519.415149  -0.026  0.97917    
nchronicyes    0.045017   0.298234   0.151  0.88002    
lchronicyes   -0.163689   0.495085  -0.331  0.74093    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta = 1.6063 
Number of iterations in BFGS optimization: 68 
Log-likelihood: -2512 on 33 Df
```

```r
predznb <- predict(modelznb, newdata = test, type = "response")
rmsemodelznb <- ModelMetrics::rmse(test$visits, round(predznb))
maemodelznb <- mae(test$visits, round(predznb))
rmsemodelznb
maemodelznb
```

```
[1] 0.7309579
[1] 0.2753623
```
Finally let’s compare all the above models.
```r
rmse <- c(rmsemodelp, rmsemodelqp, rmsemodelnb, rmsemodelhp, rmsemodelhnb,
          rmsemodelzp, rmsemodelznb)
mae <- c(maemodelp, maemodelqp, maemodelnb, maemodelhp, maemodelhnb,
         maemodelzp, maemodelznb)
models <- c("pois", "q_pois", "nb", "h_pois", "h_nb", "zer_pois", "zer_nb")
data.frame(models, rmse, mae) %>%
  arrange(rmse)
```

```
    models      rmse       mae
1   zer_nb 0.7309579 0.2753623
2   h_pois 0.7375374 0.2850242
3     pois 0.7381921 0.2840580
4   q_pois 0.7381921 0.2840580
5     h_nb 0.7408052 0.2879227
6 zer_pois 0.7485897 0.2898551
7       nb 0.7808085 0.2966184
```
Both metrics chose the zero-inflated negative binomial model as the best, with the minimum RMSE value 0.7309579 and the minimum MAE value 0.2753623. This result is in line with the fact that this kind of model handles the zero-inflated data and the overdispersion problem at the same time.
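As a complementary in-sample check, the likelihood-based models could also be compared by AIC; this is a sketch using the model objects fitted above (the quasi-Poisson model2 is excluded because quasi-likelihood does not define an AIC):

```r
# AIC comparison of the likelihood-based models (lower is better)
AIC(model1, model3, modelhp, modelhnb, modelzp, modelznb)
```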
Conclusion:
If the data truly follows a Poisson distribution, then all the other models have extra parameters that, during the training process, converge to the optimal parameter values of the Poisson model; this relationship is analogous to that between linear regression and generalized least squares. However, if the data is heavily skewed towards zero, it is better to use the last two model types to take care of this issue.
Further reading:
- Michael J. Crawley, The R Book, Wiley, UK, 2013. http://www.bio.ic.ac.uk/research/mjcraw/therbook/index.htm