Yesterday, I was chatting with a young and enthusiastic actuary, who asked a nice (and classical) question: is it the same, or not, to use a Tweedie regression, or two separate regressions (a Poisson one for the claim frequency, and a Gamma one for the individual costs)? For the distributions themselves, the two are equivalent. But as soon as we have heterogeneity and explanatory variables, I really think that using all the information, and running two regressions, is much more interesting.
Homogeneous case
In the homogeneous case, without any explanatory variables, the Tweedie distribution and the compound Poisson-Gamma distribution are equivalent representations of the same model (it is simply a reparametrization).
Consider a Tweedie distribution with variance function power \(p\in(1,2)\), mean \(\mu\) and scale (dispersion) parameter \(\phi\), so that the variance is \(\phi\mu^p\); then it is a compound Poisson model, with
- \(N\sim\mathcal{P}(\lambda)\) with \(\lambda=\displaystyle{\frac{\mu^{2-p}}{\phi(2-p)}}\)
- \(Y_i\sim\mathcal{G}(\alpha,\beta)\) with shape \(\alpha=\displaystyle{\frac{2-p}{p-1}}\) and rate \(\beta=\displaystyle{\frac{\mu^{1-p}}{\phi(p-1)}}\)
Conversely, consider a compound Poisson model \(N\sim\mathcal{P}(\lambda)\) and \(Y_i\sim\mathcal{G}(\alpha,\beta)\), then
- variance function power is \(p=\displaystyle{\frac{\alpha+2}{\alpha+1}}\)
- mean is \(\mu=\displaystyle{\frac{\lambda \alpha}{\beta}}\)
- scale (nuisance) parameter is
\(\phi=\displaystyle{\frac{\alpha+1}{[\lambda\alpha]^{\frac{1}{\alpha+1}}\,\beta^{\frac{\alpha}{\alpha+1}}}}\)
So the two are equivalent…
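To double-check this correspondence numerically, here is a small simulation sketch (the values \(p=1.5\), \(\mu=2\) and \(\phi=0.5\) below are arbitrary choices, just for illustration): convert the Tweedie parameters into \((\lambda,\alpha,\beta)\), simulate the compound sums, and compare the empirical mean and variance with \(\mu\) and \(\phi\mu^p\).
p = 1.5; mu = 2; phi = .5            # arbitrary Tweedie parameters (illustration only)
lambda = mu^(2-p)/(phi*(2-p))        # Poisson rate
alpha  = (2-p)/(p-1)                 # Gamma shape
beta   = mu^(1-p)/(phi*(p-1))        # Gamma rate
set.seed(1)
Ns = rpois(1e5, lambda)
Ss = sapply(Ns, function(k) sum(rgamma(k, alpha, beta)))
c(mean(Ss), mu)                      # should both be close to 2
c(var(Ss), phi*mu^p)                 # should both be close to 0.5*2^1.5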
Heterogeneous case
Now, in the context of regression, \(N_i\sim\mathcal{P}(\lambda_i)\text{ with }\lambda_i=\exp[\boldsymbol{x}_i^\top\boldsymbol{\beta}_{\lambda}]\)
and \(Y_{j,i}\sim\mathcal{G}(\mu_i,\phi)\text{ with }\mu_i=\exp[\boldsymbol{x}_i^\top\boldsymbol{\beta}_{\mu}]\), where the Gamma distribution is parametrized here by its mean \(\mu_i\) and its shape \(\phi\).
Then \(S_i=Y_{1,i}+\cdots+Y_{N_i,i}\) has a Tweedie distribution, where
- variance function power is \(p=\displaystyle{\frac{\phi+2}{\phi+1}}\)
- mean is \(\lambda_i \mu_i\)
- scale parameter is \(\displaystyle{\lambda_i^{-\frac{1}{\phi+1}}\,\mu_i^{\frac{\phi}{\phi+1}}\left(\frac{1+\phi}{\phi}\right)}\)
There are \(1+2\text{dim}(\boldsymbol{X})\) degrees of freedom here (two vectors of regression coefficients, plus the Gamma shape). For a Tweedie regression, on the other hand,
- variance function power is \(p\in(1,2)\)
- mean is \(\mu_i=\exp[\boldsymbol{x}_i^{\top}\boldsymbol{\beta}_{\text{Tweedie}}]\)
- scale parameter is \(\phi\)
There are now \(2+\text{dim}(\boldsymbol{X})\) degrees of freedom (the power, the scale, and a single vector of coefficients), so as soon as explanatory variables enter the picture, the two approaches are no longer equivalent.
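To make the difference concrete, here is a small sketch (the functions \(\lambda(x)\), \(\mu(x)\) and the shape \(\phi\) below are toy values, not estimates of anything): in the compound Poisson-Gamma specification the implied variance power is constant, but the implied Tweedie dispersion varies with \(x\), which a single Tweedie regression, with a common \(\phi\), cannot reproduce.
phi  = 2                             # Gamma shape (toy value)
lam  = function(x) exp(-2 + 2*x)     # toy frequency structure
mu   = function(x) exp( 1 - 2*x)     # toy average claim size
p    = (phi + 2)/(phi + 1)           # implied variance power (constant)
disp = function(x) lam(x)^(-1/(phi+1)) * mu(x)^(phi/(phi+1)) * (1+phi)/phi
disp(c(0, .5, 1))                    # the implied dispersion differs across policies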
In actuarial terminology
- \(N\) is the annual claim frequency
- \(Y\) is the cost of single claims
- \(S\) is the annual cost for a single insurance policy
As explained in our book, frequency and individual costs can be explained by different features, so that is, in itself, a motivation to consider two models. But consider the following simulated data
n = 1e4
a = 2
set.seed(123)
x = runif(n)                          # explanatory variable
etan = exp(-2 + a*x)                  # expected annual claim frequency, increasing in x
N = rpois(n, etan)                    # number of claims per policy
dfn = data.frame(y = N, x = x)        # frequency dataset
I = rep(1:n, N)                       # policy index, repeated for each claim
etaz = exp(2 - a*x[I])                # Gamma shape of individual costs
Z = rgamma(sum(N), etaz, 20)          # individual claim costs, Gamma(shape = etaz, rate = 20)
dfz = data.frame(y = Z, x = x[I])     # individual-costs dataset
S = tapply(Z, as.factor(I), sum)      # annual cost per policy with at least one claim
V = as.numeric(S[as.character(1:n)])  # align with all policies
V[is.na(V)] = 0                       # policies without claims cost 0
dfy = data.frame(y = V, x = x)        # aggregate annual-cost dataset
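As a quick sanity check on the simulated portfolio, we can look at the proportion of policies without any claim, and compare the empirical and theoretical average frequencies.
mean(dfy$y == 0)                      # proportion of policies without any claim
c(mean(N), mean(etan))                # empirical vs. theoretical average frequency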
We can run two regressions, one for the claim frequency and one for the individual costs
regn = glm(y~x, family=poisson(link="log"),data=dfn)
regz = glm(y~x, family=Gamma(link="log"),data=dfz)
For the Tweedie regression, let us first find the optimal variance power parameter, based on the AIC
library(statmod)
library(tweedie)
glmtw = function(t){
  # fit a Tweedie GLM with variance power t and log link, and return its AIC
  m = glm(y~x, family = tweedie(var.power = t, link.power = 0), data = dfy)
  d = NULL
  if(t == 1) d = 1                    # when t = 1, fix the dispersion at 1 (Poisson case)
  AICtweedie(m, dispersion = d)
}
vt = seq(1.01, 1.99, length = 251)    # grid of candidate variance powers
vg = Vectorize(glmtw)(vt)             # AIC profile over the grid
plot(vt, vg, log = "y", type = "l")
i = which.min(vg)                     # index of the power with the lowest AIC
and consider the associated Tweedie regression.
regy = glm(y~x, family=tweedie(var.power = vt[i], link.power = 0),data=dfy)
For the frequency, there is a clear increase of the average frequency with \(x\) (and it is highly significant)
summary(regn)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.01508    0.04135  -48.73   <2e-16 ***
x            1.99036    0.05887   33.81   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
For the individual costs, there is a clear decline of the average cost with \(x\) (and it is highly significant)
summary(regz)
Now, if we consider the average annual cost per policy, we have
summary(regy)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.00822    0.04101 -73.356   <2e-16 ***
x           -0.02226    0.07154  -0.311    0.756
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Tweedie family taken to be 0.6516459)
I.e., the average annual cost of a single policy does not depend on \(x\) (the slope is clearly not significant): the increase in frequency is compensated by the decrease in individual costs, so the product of the two tells more or less the same (flat) story…
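To visualize this, one can compare, over a grid of values of \(x\), the expected annual cost obtained as the product of the two fitted models with the one obtained from the Tweedie fit (a quick sketch, the grid being arbitrary)
nd    = data.frame(x = seq(0, 1, by = .01))
prem2 = predict(regn, newdata = nd, type = "response") *
        predict(regz, newdata = nd, type = "response")   # frequency x average cost
prem1 = predict(regy, newdata = nd, type = "response")   # Tweedie fit
plot(nd$x, prem2, type = "l", xlab = "x", ylab = "expected annual cost")
lines(nd$x, prem1, lty = 2)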
Even if the outcome, the price, is the same, one could agree that having the two regressions is much more informative for risk management (for instance, if one wants to introduce deductibles).
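For instance, here is a purely illustrative sketch with a hypothetical per-claim deductible \(d\): using the frequency model, the severity model, and the stop-loss transform of the Gamma distribution (with the shape roughly approximated by the reciprocal of the estimated dispersion of regz), we can compute the expected annual cost net of the deductible, something the aggregate Tweedie fit alone does not provide.
d     = .05                          # hypothetical per-claim deductible
nd    = data.frame(x = c(.1, .5, .9))
freq  = predict(regn, newdata = nd, type = "response")   # expected claim counts
sev   = predict(regz, newdata = nd, type = "response")   # expected claim sizes
shape = 1/summary(regz)$dispersion   # rough constant-shape approximation
stoploss = function(m, a, d){        # E[(Y-d)+] for a Gamma with mean m and shape a
  r = a/m                            # rate parameter
  m*(1 - pgamma(d, a + 1, r)) - d*(1 - pgamma(d, a, r))
}
freq * stoploss(sev, shape, d)       # expected annual cost above the deductible
This kind of what-if analysis is only possible because the frequency and the individual costs are modelled separately.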