Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In the development of operational loss forecasting models, the Frequency-Severity modeling approach, which the frequency and the severity of a Unit of Measure (UoM) are modeled separately, has been widely employed in the banking industry. However, sometimes it also makes sense to model the operational loss directly, especially for UoMs with non-material losses. First of all, given the low loss amount, the effort of developing two models, e.g. frequency and severity, might not be justified. Secondly, for UoMs with low losses due to low frequencies, modeling the frequency and the severity separately might overlook the internal connection between the low frequency and the subsequent low loss amount. For instance, when the frequency N = 0, then the loss L = $0 inevitably.
The Tweedie distribution is defined as a Poisson sum of Gamma random variables. In particular, if the frequency of loss events N is assumed a Poisson distribution and the loss amount L_i of an event i, where i = 0, 1 … N, is assumed a Gamma distribution, then the total loss amount L = SUM[L_i] would have a Tweedie distribution. When there is no loss event, e.g. N = 0, then Prob(L = $0) = Prob(N = 0) = Exp(-Lambda). However, when N > 0, then L = L_1 + … + L_N > $0 is governed by a Gamma distribution, e.g. sum of I.I.D. Gamma also being Gamma.
For the Tweedie loss, E(L) = Mu and VAR(L) = Phi * (Mu ** P), where P is called the index parameter and Phi is the dispersion parameter. When P approaches 1 and therefore VAR(L) approaches Phi * E(L), the Tweedie would be similar to a Poisson-like distribution. When P approaches 2 and therefore VAR(L) approaches Phi * (E(L) ** 2), the Tweedie would be similar to a Gamma distribution. When P is between 1 and 2, then the Tweedie would be a compound mixture of Poisson and Gamma, where P and Phi can be estimated.
To estimate a regression with the Tweedie distributional assumption, there are two implementation approaches in R with cplm and statmod packages respectively. With the cplm package, the Tweedie regression can be estimated directly as long as P is in the range of (1, 2), as shown below. In the example, the estimated index parameter P is 1.42.
> library(cplm) > data(FineRoot) > m1 <- cpglm(RLD ~ Zone + Stock, data = FineRoot) > summary(m1) Deviance Residuals: Min 1Q Median 3Q Max -1.0611 -0.6475 -0.3928 0.1380 1.9627 Estimate Std. Error t value Pr(>|t|) (Intercept) -1.95141 0.14643 -13.327 < 2e-16 *** ZoneOuter -0.85693 0.13292 -6.447 2.66e-10 *** StockMM106 0.01177 0.17535 0.067 0.947 StockMark -0.83933 0.17476 -4.803 2.06e-06 *** --- Estimated dispersion parameter: 0.35092 Estimated index parameter: 1.4216 Residual deviance: 203.91 on 507 degrees of freedom AIC: -157.33
The statmod package provides a more general and flexible solution with the two-stage estimation, which will estimate the P parameter first and then estimate regression parameters. In the real-world practice, we could do a coarse search to narrow down a reasonable range of P and then do a fine search to identify the optimal P value. As shown below, all estimated parameters are fairly consistent with ones in the previous example.
> library(tweedie) > library(statmod) > prof <- tweedie.profile(RLD ~ Zone + Stock, data = FineRoot, p.vec = seq(1.1, 1.9, 0.01), method = "series") 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.3 1.31 1.32 1.33 1.34 1.35 1.36 1.37 1.38 1.39 1.4 1.41 1.42 1.43 1.44 1.45 1.46 1.47 1.48 1.49 1.5 1.51 1.52 1.53 1.54 1.55 1.56 1.57 1.58 1.59 1.6 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 1.7 1.71 1.72 1.73 1.74 1.75 1.76 1.77 1.78 1.79 1.8 1.81 1.82 1.83 1.84 1.85 1.86 1.87 1.88 1.89 1.9 .................................................................................Done. > prof$p.max [1] 1.426531 > m2 <- glm(RLD ~ Zone + Stock, data = FineRoot, family = tweedie(var.power = prof$p.max, link.power = 0)) > summary(m2) Deviance Residuals: Min 1Q Median 3Q Max -1.0712 -0.6559 -0.3954 0.1380 1.9728 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.95056 0.14667 -13.299 < 2e-16 *** ZoneOuter -0.85823 0.13297 -6.454 2.55e-10 *** StockMM106 0.01204 0.17561 0.069 0.945 StockMark -0.84044 0.17492 -4.805 2.04e-06 *** --- (Dispersion parameter for Tweedie family taken to be 0.4496605) Null deviance: 241.48 on 510 degrees of freedom Residual deviance: 207.68 on 507 degrees of freedom AIC: NA
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.