Yesterday, I was chatting with a young and enthusiastic actuary, who asked a nice (and classical) question: is it the same, or not, to use a Tweedie regression, or two separate regressions (a Poisson one for the claim frequency, and a Gamma one for the individual costs)? For the distributions themselves, the two are equivalent. But as soon as we have heterogeneity and explanatory variables, I really think that using all the information, and running two regressions, is much more interesting.
Homogeneous case
In the homogeneous case, without any explanatory variables, the Tweedie distribution and the compound Poisson-Gamma distribution are equivalent representations of the same model (it is simply a reparametrization).
Consider a Tweedie distribution with variance function power \(p\in(1,2)\), mean \(\mu\) and scale (dispersion) parameter \(\phi\), so that the variance is \(\phi\mu^p\); then it is a compound Poisson model, with
- \(N\sim\mathcal{P}(\lambda)\) with \(\lambda=\displaystyle{\frac{\mu^{2-p}}{\phi(2-p)}}\)
- \(Y_i\sim\mathcal{G}(\alpha,\beta)\) with shape \(\alpha=\displaystyle{\frac{2-p}{p-1}}\) and rate \(\beta=\displaystyle{\frac{\mu^{1-p}}{\phi(p-1)}}\)
Conversely, consider a compound Poisson model \(N\sim\mathcal{P}(\lambda)\) and \(Y_i\sim\mathcal{G}(\alpha,\beta)\), then
- variance function power is \(p=\displaystyle{\frac{\alpha+2}{\alpha+1}}\)
- mean is \(\mu=\displaystyle{\frac{\lambda \alpha}{\beta}}\)
- scale (nuisance) parameter is
\(\phi=\displaystyle{\frac{\alpha+1}{[\lambda\alpha]^{\frac{1}{\alpha+1}}\,\beta^{\frac{\alpha}{\alpha+1}}}}\)
So the two are equivalent…
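To double-check this correspondence numerically, here is a small simulation sketch (the values \(p=1.5\), \(\mu=2\) and \(\phi=0.5\) below are arbitrary choices, just for illustration): convert the Tweedie parameters into \((\lambda,\alpha,\beta)\), simulate the compound sums, and compare the empirical mean and variance with \(\mu\) and \(\phi\mu^p\).
p = 1.5; mu = 2; phi = .5            # arbitrary Tweedie parameters (illustration only)
lambda = mu^(2-p)/(phi*(2-p))        # Poisson rate
alpha  = (2-p)/(p-1)                 # Gamma shape
beta   = mu^(1-p)/(phi*(p-1))        # Gamma rate
set.seed(1)
Ns = rpois(1e5, lambda)
Ss = sapply(Ns, function(k) sum(rgamma(k, alpha, beta)))
c(mean(Ss), mu)                      # should both be close to 2
c(var(Ss), phi*mu^p)                 # should both be close to 0.5*2^1.5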
Heterogeneous case
Now, in the context of regression, \(N_i\sim\mathcal{P}(\lambda_i)\text{ with }\lambda_i=\exp[\boldsymbol{x}_i^\top\boldsymbol{\beta}_{\lambda}]\)
and \(Y_{j,i}\sim\mathcal{G}(\mu_i,\phi)\text{ with }\mu_i=\exp[\boldsymbol{x}_i^\top\boldsymbol{\beta}_{\mu}]\), where the Gamma distribution is parametrized here by its mean \(\mu_i\) and its shape \(\phi\).
Then \(S_i=Y_{1,i}+\cdots+Y_{N_i,i}\) has a Tweedie distribution, where
- variance function power is \(p=\displaystyle{\frac{\phi+2}{\phi+1}}\)
- mean is \(\lambda_i \mu_i\)
- scale parameter is \(\displaystyle{\lambda_i^{-\frac{1}{\phi+1}}\,\mu_i^{\frac{\phi}{\phi+1}}\left(\frac{1+\phi}{\phi}\right)}\)
There are \(1+2\text{dim}(\boldsymbol{X})\) degrees of freedom here (two vectors of regression coefficients, plus the Gamma shape). For a Tweedie regression, on the other hand,
- variance function power is \(p\in(1,2)\)
- mean is \(\mu_i=\exp[\boldsymbol{x}_i^{\top}\boldsymbol{\beta}_{\text{Tweedie}}]\)
- scale parameter is \(\phi\)
There are now \(2+\text{dim}(\boldsymbol{X})\) degrees of freedom (the power, the scale, and a single vector of coefficients), so as soon as explanatory variables enter the picture, the two approaches are no longer equivalent.
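To make the difference concrete, here is a small sketch (the functions \(\lambda(x)\), \(\mu(x)\) and the shape \(\phi\) below are toy values, not estimates of anything): in the compound Poisson-Gamma specification the implied variance power is constant, but the implied Tweedie dispersion varies with \(x\), which a single Tweedie regression, with a common \(\phi\), cannot reproduce.
phi  = 2                             # Gamma shape (toy value)
lam  = function(x) exp(-2 + 2*x)     # toy frequency structure
mu   = function(x) exp( 1 - 2*x)     # toy average claim size
p    = (phi + 2)/(phi + 1)           # implied variance power (constant)
disp = function(x) lam(x)^(-1/(phi+1)) * mu(x)^(phi/(phi+1)) * (1+phi)/phi
disp(c(0, .5, 1))                    # the implied dispersion differs across policies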
In actuarial terminology
- \(N\) is the annual claim frequency
- \(Y\) is the cost of single claims
- \(S\) is the annual cost for a single insurance policy
As explained in our book, frequency and individual costs can be explained by different features, so that is, in itself, a motivation to consider two models. But consider the following simulated data
n = 1e4
a = 2
set.seed(123)
x = runif(n)                          # explanatory variable
etan = exp(-2 + a*x)                  # expected annual claim frequency, increasing in x
N = rpois(n, etan)                    # number of claims per policy
dfn = data.frame(y = N, x = x)        # frequency dataset
I = rep(1:n, N)                       # policy index, repeated for each claim
etaz = exp(2 - a*x[I])                # Gamma shape of individual costs
Z = rgamma(sum(N), etaz, 20)          # individual claim costs, Gamma(shape = etaz, rate = 20)
dfz = data.frame(y = Z, x = x[I])     # individual-costs dataset
S = tapply(Z, as.factor(I), sum)      # annual cost per policy with at least one claim
V = as.numeric(S[as.character(1:n)])  # align with all policies
V[is.na(V)] = 0                       # policies without claims cost 0
dfy = data.frame(y = V, x = x)        # aggregate annual-cost dataset
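As a quick sanity check on the simulated portfolio, we can look at the proportion of policies without any claim, and compare the empirical and theoretical average frequencies.
mean(dfy$y == 0)                      # proportion of policies without any claim
c(mean(N), mean(etan))                # empirical vs. theoretical average frequency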
We can run two regressions, one for the claim frequency and one for the individual costs
regn = glm(y~x, family=poisson(link="log"),data=dfn)
regz = glm(y~x, family=Gamma(link="log"),data=dfz)
For the Tweedie regression, let us first find the optimal variance power parameter, based on the AIC
library(statmod)
library(tweedie)
glmtw = function(t){
  # fit a Tweedie GLM with variance power t and log link, and return its AIC
  m = glm(y~x, family = tweedie(var.power = t, link.power = 0), data = dfy)
  d = NULL
  if(t == 1) d = 1                    # when t = 1, fix the dispersion at 1 (Poisson case)
  AICtweedie(m, dispersion = d)
}
vt = seq(1.01, 1.99, length = 251)    # grid of candidate variance powers
vg = Vectorize(glmtw)(vt)             # AIC profile over the grid
plot(vt, vg, log = "y", type = "l")
i = which.min(vg)                     # index of the power with the lowest AIC
and consider the associated Tweedie regression.
regy = glm(y~x, family=tweedie(var.power = vt[i], link.power = 0),data=dfy)
For the frequency, there is a clear increase of the average frequency with \(x\) (and it is highly significant)
summary(regn)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.01508    0.04135  -48.73   <2e-16 ***
x            1.99036    0.05887   33.81   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
For the individual costs, there is a clear decline of the average cost with \(x\) (and it is highly significant)
summary(regz)
Now, if we consider the average annual cost per policy, we have
summary(regy)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.00822    0.04101 -73.356   <2e-16 ***
x           -0.02226    0.07154  -0.311    0.756
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Tweedie family taken to be 0.6516459)
I.e., the average annual cost of a single policy does not depend on \(x\) (the slope is clearly not significant): the increase in frequency is compensated by the decrease in individual costs, so the product of the two tells more or less the same (flat) story…
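To visualize this, one can compare, over a grid of values of \(x\), the expected annual cost obtained as the product of the two fitted models with the one obtained from the Tweedie fit (a quick sketch, the grid being arbitrary)
nd    = data.frame(x = seq(0, 1, by = .01))
prem2 = predict(regn, newdata = nd, type = "response") *
        predict(regz, newdata = nd, type = "response")   # frequency x average cost
prem1 = predict(regy, newdata = nd, type = "response")   # Tweedie fit
plot(nd$x, prem2, type = "l", xlab = "x", ylab = "expected annual cost")
lines(nd$x, prem1, lty = 2)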
Even if the outcome, the price, is the same, one could agree that having the two regressions is much more informative for risk management (for instance, if one wants to introduce deductibles).
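For instance, here is a purely illustrative sketch with a hypothetical per-claim deductible \(d\): using the frequency model, the severity model, and the stop-loss transform of the Gamma distribution (with the shape roughly approximated by the reciprocal of the estimated dispersion of regz), we can compute the expected annual cost net of the deductible, something the aggregate Tweedie fit alone does not provide.
d     = .05                          # hypothetical per-claim deductible
nd    = data.frame(x = c(.1, .5, .9))
freq  = predict(regn, newdata = nd, type = "response")   # expected claim counts
sev   = predict(regz, newdata = nd, type = "response")   # expected claim sizes
shape = 1/summary(regz)$dispersion   # rough constant-shape approximation
stoploss = function(m, a, d){        # E[(Y-d)+] for a Gamma with mean m and shape a
  r = a/m                            # rate parameter
  m*(1 - pgamma(d, a + 1, r)) - d*(1 - pgamma(d, a, r))
}
freq * stoploss(sev, shape, d)       # expected annual cost above the deductible
This kind of what-if analysis is only possible because the frequency and the individual costs are modelled separately.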