More Flexible Approaches to Model Frequency
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
(The post below is motivated by my friend Matt Flynn https://www.linkedin.com/in/matthew-flynn-1b443b11)
In the context of operational loss forecast models, the standard Poisson regression is the most popular way to model frequency measures. Conceptually speaking, there is a restrictive assumption for the standard Poisson regression, namely Equi-Dispersion, which requires the equality between the conditional mean and the variance such that E(Y) = var(Y). However, in real-world frequency outcomes, the assumption of Equi-Dispersion is always problematic. On the contrary, the empirical data often presents either an excessive variance, namely Over-Dispersion, or an insufficient variance, namely Under-Dispersion. The application of a standard Poisson regression to the over-dispersed data will lead to deflated standard errors of parameter estimates and therefore inflated t-statistics.
In cases of Over-Dispersion, the Negative Binomial (NB) regression has been the most common alternative to the standard Poisson regression by including a dispersion parameter to accommodate the excessive variance in the data. In the formulation of NB regression, the variance is expressed as a quadratic function of the conditional mean such that the variance is guaranteed to be higher than the conditional mean. However, it is not flexible enough to allow for both Over-Dispersion and Under-Dispersion. Therefore, more generalizable approaches are called for.
Two additional frequency modeling methods, including Quasi-Poisson (QP) regression and Conway-Maxwell Poisson (CMP) regression, are discussed. In the case of Quasi-Poisson, E(Y) = λ and var(Y) = θ • λ. While θ > 1 addresses Over-Dispersion, θ < 1 governs Under-Dispersion. Since QP regression is estimated with QMLE, likelihood-based statistics, such as AIC and BIC, won’t be available. Instead, quasi-AIC and quasi-BIC are provided. In the case of Conway-Maxwell Poisson, E(Y) = λ ** (1 / v) – (v – 1) / (2 • v) and var(Y) = (1 / v) • λ ** (1 / v), where λ doesn’t represent the conditional mean anymore but a location parameter. While v < 1 enables us to model the long-tailed distribution reflected as Over-Dispersion, v > 1 takes care of the short-tailed distribution reflected as Under-Dispersion. Since CMP regression is estimated with MLE, likelihood-based statistics, such as AIC and BIC, are available at a high computing cost.
Below demonstrates how to estimate QP and CMP regressions with R and a comparison of their computing times. If the modeling purpose is mainly for the prediction without focusing on the statistical reference, QP regression would be an excellent choice for most practitioners. Otherwise, CMP regression is an elegant model to address various levels of dispersion parsimoniously.
# data source: www.jstatsoft.org/article/view/v027i08 load("../Downloads/DebTrivedi.rda") library(rbenchmark) library(CompGLM) benchmark(replications = 3, order = "user.self", quasi.poisson = { m1 <- glm(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, family = "quasipoisson") }, conway.maxwell = { m2 <- glm.comp(ofp ~ health + hosp + numchron + privins + school + gender + medicaid, data = DebTrivedi, lamStart = m1$coefficient s) } ) # test replications elapsed relative user.self sys.self user.child # 1 quasi.poisson 3 0.084 1.000 0.084 0.000 0 # 2 conway.maxwell 3 42.466 505.548 42.316 0.048 0 summary(m1) summary(m2)
Quasi-Poisson Regression
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.886462 0.069644 12.729 < 2e-16 *** healthpoor 0.235673 0.046284 5.092 3.69e-07 *** healthexcellent -0.360188 0.078441 -4.592 4.52e-06 *** hosp 0.163246 0.015594 10.468 < 2e-16 *** numchron 0.144652 0.011894 12.162 < 2e-16 *** privinsyes 0.304691 0.049879 6.109 1.09e-09 *** school 0.028953 0.004812 6.016 1.93e-09 *** gendermale -0.092460 0.033830 -2.733 0.0063 ** medicaidyes 0.297689 0.063787 4.667 3.15e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for quasipoisson family taken to be 6.697556) Null deviance: 26943 on 4405 degrees of freedom Residual deviance: 23027 on 4397 degrees of freedom AIC: NA
Conway-Maxwell Poisson Regression
Beta: Estimate Std.Error t.value p.value (Intercept) -0.23385559 0.16398319 -1.4261 0.15391 healthpoor 0.03226830 0.01325437 2.4345 0.01495 * healthexcellent -0.08361733 0.00687228 -12.1673 < 2e-16 *** hosp 0.01743416 0.01500555 1.1618 0.24536 numchron 0.02186788 0.00209274 10.4494 < 2e-16 *** privinsyes 0.05193645 0.00184446 28.1581 < 2e-16 *** school 0.00490214 0.00805940 0.6083 0.54305 gendermale -0.01485663 0.00076861 -19.3292 < 2e-16 *** medicaidyes 0.04861617 0.00535814 9.0733 < 2e-16 *** Zeta: Estimate Std.Error t.value p.value (Intercept) -3.4642316 0.0093853 -369.11 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 AIC: 24467.13 Log-Likelihood: -12223.56
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.