Natura non facit saltus
(see John Wilkins’ article on the interesting history of that phrase, http://scienceblogs.com/evolvingthoughts/…). We will see, this week in class, several smoothing techniques for insurance ratemaking. As a starting point, assume that we do not want to use segmentation techniques: everyone will pay exactly the same price.
- No segmentation of the premium
And that price should be related to the pure premium, which is proportional to the frequency (or the annualized frequency, as discussed previously), since, in the collective model,

$$\mathbb{E}_{\mathbb{P}}(S)=\mathbb{E}_{\mathbb{P}}\left(\sum_{i=1}^N Y_i\right)=\mathbb{E}_{\mathbb{P}}(N)\cdot\mathbb{E}_{\mathbb{P}}(Y_i)$$

The probability measure $\mathbb{P}$ is mentioned here just to recall that we can use any measure, even $\mathbb{P}_{\boldsymbol{X}}$, based on some covariates. Without any covariate, the expected frequency should be

$$\widehat{\lambda}=\frac{\sum_i N_i}{\sum_i E_i}$$

where $E_i$ denotes the (annual) exposure of policy $i$.
> regglm0=glm(nbre~1+offset(log(exposition)),data=sinistres,family=poisson)
> summary(regglm0)

Call:
glm(formula = nbre ~ 1 + offset(log(exposition)), family = poisson,
    data = sinistres)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.5033 -0.3719 -0.2588 -0.1376 13.2700

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.6201     0.0228  -114.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 12680  on 49999  degrees of freedom
Residual deviance: 12680  on 49999  degrees of freedom
AIC: 16353

Number of Fisher Scoring iterations: 6

> exp(coefficients(regglm0))
(Intercept)
 0.07279295
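As a sanity check, note that with only an intercept and the log-exposure offset, the maximum likelihood estimator has a closed form: the total number of claims divided by the total exposure. A one-line verification (a sketch, assuming the same sinistres data frame), which should return the same 7.28% as exp(coefficients(regglm0)) above,

> sum(sinistres$nbre)/sum(sinistres$exposition)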
Thus, if we do not want to take into account potential heterogeneity, we should assume that $N\sim\mathcal{P}(\lambda)$, where $\lambda$ is close to 7.28%. Yes, as mentioned in class, it is rather common to see $\lambda$ as a percentage, i.e. a probability, since

$$\mathbb{P}(N\neq 0)=1-e^{-\lambda}\approx\lambda$$

when $\lambda$ is small, i.e. $e^{-\lambda}=\mathbb{P}(N=0)$ can be interpreted as the probability of not having a claim (see also the law of small numbers). Let us visualize this as a function of the age of the driver,
> a=18:100
> yp=predict(regglm0,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> k=23
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)
We do predict the same frequency for all drivers, e.g. for a driver aged 40,
> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k]) Frequency = 0.07279295 confidence interval 0.07611196 0.06947393
Let us now consider the case where we try to take into account heterogeneity, e.g. by age,
- The (standard) Poisson regression
The idea of the (log-)Poisson regression is to assume that instead of having $N\sim\mathcal{P}(\lambda)$, we should have $N\sim\mathcal{P}(\lambda_{\boldsymbol{X}})$, where

$$\log\lambda_{\boldsymbol{X}}=\beta_0+\beta_1X_1+\cdots+\beta_kX_k$$

in a very general setting. Here, let us consider only one explanatory variable, i.e.

$$\log\lambda_{X}=\beta_0+\beta_1X$$

where $X$ is the age of the driver.
Here, we fit this model, and compute predictions as before.
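The fitting call would presumably be along the following lines (a sketch, assuming the same variable names as above, with the driver's age as the single covariate),

> regglm1=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson)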
> yp=predict(regglm1,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)
i.e. the prediction of the annualized claim frequency for our 40-year-old driver is now 7.74% (slightly higher than the 7.28% we had before)
> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k]) Frequency = 0.07740574 confidence interval 0.08117512 0.07363636
It is possible to compute not the expected frequency $\mathbb{E}(N\vert X)$, but the ratio $\mathbb{E}(N\vert X)/\mathbb{E}(N)$.
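That ratio can be computed directly; a minimal sketch, reusing the objects defined above (yp0 holds the age-dependent predictions, regglm0 the unsegmented model), with a horizontal blue line at 1 marking the flat premium,

> lambda0=predict(regglm0,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response")
> plot(a,yp0/lambda0,type="l")
> abline(h=1,col="blue")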
Above the horizontal blue line, the premium will be higher than the one obtained without segmentation, and (of course) lower below. Here, drivers younger than 44 will pay more, while drivers older than 44 will pay less. We have discussed, in the introduction, the necessity of segmentation. If we consider two companies, one segmenting while the other one has a flat rate, then older drivers will go to the first company (since insurance is cheaper) while younger ones will go to the second one (again, it is cheaper). The problem is that the second company implicitly hopes that older drivers will compensate the risk. But since they're gone, insurance will be too cheap, and the company will lose money (if not go bankrupt). So companies have to use segmentation techniques to survive. Now, the problem is that we cannot be sure that this exponential decay of the premium is the proper way the premium should evolve as a function of the age. An alternative is to use nonparametric techniques to visualize the true influence of the age on the claims frequency.
- A pure nonparametric model
A first model is to have one premium per age. This can be done by considering the age of the driver as a factor in the regression,
> a0=sort(unique(sinistres$ageconducteur)) # ages observed in the portfolio
> regglm2=glm(nbre~as.factor(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)
> yp=predict(regglm2,newdata=data.frame(ageconducteur=a0,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a0,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
Here, the forecast for our 40-year-old driver is slightly lower than the previous one, but the confidence interval is much larger (since we focus on a very small subclass of the portfolio: drivers aged exactly 40)
Frequency = 0.06686658 confidence interval 0.08750205 0.0462311
Here, we consider classes that are too small, and the premium is too erratic: the premium decreases by roughly 20% from age 40 to 41, and then increases by more than 50% from age 41 to 42,
> diff(log(yp0[23:25]))
        24         25
-0.2330241  0.5223478
There is no chance that the company will keep its insured drivers with such a strategy. This discontinuity of the premium is clearly an important issue here.
- Using age classes
An alternative can be to consider age classes, from very young drivers to senior drivers.
> level1=seq(15,105,by=5)
> regglmc1=glm(nbre~cut(ageconducteur,level1)+offset(log(exposition)),
+ data=sinistres,family=poisson)
> summary(regglmc1)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)
(Intercept)                         -1.6036     0.1741  -9.212  < 2e-16 ***
cut(ageconducteur, level1)(20,25]   -0.4200     0.1948  -2.157   0.0310 *
cut(ageconducteur, level1)(25,30]   -0.9378     0.1903  -4.927 8.33e-07 ***
cut(ageconducteur, level1)(30,35]   -1.0030     0.1869  -5.367 8.02e-08 ***
cut(ageconducteur, level1)(35,40]   -1.0779     0.1866  -5.776 7.65e-09 ***
cut(ageconducteur, level1)(40,45]   -1.0264     0.1858  -5.526 3.28e-08 ***
cut(ageconducteur, level1)(45,50]   -0.9978     0.1856  -5.377 7.58e-08 ***
cut(ageconducteur, level1)(50,55]   -1.0137     0.1855  -5.464 4.65e-08 ***
cut(ageconducteur, level1)(55,60]   -1.2036     0.1939  -6.207 5.40e-10 ***
cut(ageconducteur, level1)(60,65]   -1.1411     0.2008  -5.684 1.31e-08 ***
cut(ageconducteur, level1)(65,70]   -1.2114     0.2085  -5.811 6.22e-09 ***
cut(ageconducteur, level1)(70,75]   -1.3285     0.2210  -6.012 1.83e-09 ***
cut(ageconducteur, level1)(75,80]   -0.9814     0.2271  -4.321 1.55e-05 ***
cut(ageconducteur, level1)(80,85]   -1.4782     0.3371  -4.385 1.16e-05 ***
cut(ageconducteur, level1)(85,90]   -1.2120     0.5294  -2.289   0.0221 *
cut(ageconducteur, level1)(90,95]   -0.9728     1.0150  -0.958   0.3379
cut(ageconducteur, level1)(95,100] -11.4694   144.2817  -0.079   0.9366
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> yp=predict(regglmc1,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,ylim=c(.03,.12),type="s")
> abline(v=40,col="grey")
> lines(a,yp1,lty=2,type="s")
> lines(a,yp2,lty=2,type="s")
Here we obtain the following predictions,
and for our 40-year-old driver, the frequency is now 6.84%.
Frequency = 0.0684573 confidence interval 0.07766717 0.05924742
But our classes were defined arbitrarily here. Perhaps we should consider other classes, to see whether the prediction is sensitive to the cut-off values,
> level2=level1-2
> regglmc2=glm(nbre~cut(ageconducteur,level2)+offset(log(exposition)),
+ data=sinistres,family=poisson)
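The prediction at age 40 is then obtained exactly as previously (a sketch of the call, reusing k=23 so that a[k] is 40),

> yp=predict(regglmc2,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> cat("Frequency =",yp$fit[k]," confidence interval",
+ yp$fit[k]+2*yp$se.fit[k],yp$fit[k]-2*yp$se.fit[k])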
which yields the following values for our 40-year-old driver,
Frequency = 0.07050614 confidence interval 0.07980422 0.06120807
So here, we did not remove the discontinuity problem. An idea can be to consider a moving window: if the goal is to predict the frequency of a 40-year-old driver, perhaps the class should be (somehow) centered on 40; the interval would be centered on 35 for drivers aged 35, and so on.
- Moving average
Thus, it is natural to consider local regressions, where only drivers aged almost 40 are taken into account. This “almost” is what the bandwidth controls: for instance, drivers between 35 and 45 can be considered as being “almost 40”. In practice, we can either work on a subset of the dataset, or use weights in the regression,
> value=40
> h=5
> sinistres$omega=(abs(sinistres$ageconducteur-value)<=h)*1
> regglmomega=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)
To see what’s going on, let us consider an animated plot, where the age of interest is changing,
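Such an animation boils down to looping over the age of interest. A minimal static sketch of that loop (a hypothetical helper, assuming every window of width 2h contains some drivers), which refits the weighted regression at each age and keeps the local prediction,

> localpred=function(value,h=5){
+ sinistres$omega=(abs(sinistres$ageconducteur-value)<=h)*1 # box weights
+ reg=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)
+ predict(reg,newdata=data.frame(ageconducteur=value,exposition=1),
+ type="response")
+ }
> yloc=sapply(a,localpred)
> plot(a,yloc,type="l")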
Here, for our 40-year-old driver, we get
Frequency = 0.06913391 confidence interval 0.07535564 0.06291218
We do obtain a curve that can be interpreted as a local regression. But here, we do not take into account that 35 is not as close to 40 as 39 is. And here, 34 is assumed to be very far away from 40. Clearly, we could improve that technique: kernel functions can be considered, i.e. the closer to 40, the larger the weight.
> value=40
> h=5
> sinistres$omega=dnorm(abs(sinistres$ageconducteur-value)/h)
> regglmomega=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)
which can be plotted below
Here, our prediction for our 40-year-old driver is
Frequency = 0.07040464 confidence interval 0.07981521 0.06099408
This is the idea of kernel regression techniques. But, as explained in the slides, other nonparametric techniques can be considered, like spline functions.
- Smoothing with splines
In R, it is simple to use spline functions (somewhat simpler, actually, than kernel smoothers)
> library(splines)
> regglmbs=glm(nbre~bs(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)
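The associated curve, with confidence bands, can be plotted exactly as before (a sketch following the same pattern as the earlier predictions),

> yp=predict(regglmbs,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> plot(a,yp$fit,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp$fit+2*yp$se.fit,lty=2)
> lines(a,yp$fit-2*yp$se.fit,lty=2)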
The prediction for our 40-year-old driver is now
Frequency = 0.06928169 confidence interval 0.07397124 0.06459215
Note that this technique is related to another class of models, the so-called Generalized Additive Models, i.e. GAMs.
> library(mgcv)
> reggam=gam(nbre~s(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)
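To inspect the estimated smooth term itself (on the scale of the linear predictor), mgcv provides a plot method for gam objects,

> plot(reggam,shade=TRUE)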
The prediction is extremely close to the one we obtained above (the main differences being observed for very old drivers)
Frequency = 0.06912683 confidence interval 0.07501663 0.06323702
- Comparison of the different models
One way or another, all those models are valid. So perhaps we should compare them,
On the graph above, we can visualize the upper and lower bounds of the prediction, for the 9 models. The horizontal line is the predicted value when heterogeneity is not taken into account. It is possible to consider relative values, with respect to this value,
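A quick way to obtain those relative values is to stack the predictions at age 40 of the fitted models (a sketch, assuming the models above are still in memory; the locally-weighted regressions are omitted since they have to be refitted for each age of interest),

> models=list(regglm0,regglm1,regglm2,regglmc1,regglmc2,regglmbs,reggam)
> pred40=sapply(models,function(m) predict(m,
+ newdata=data.frame(ageconducteur=40,exposition=1),type="response"))
> pred40/pred40[1] # ratios relative to the unsegmented premium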