Natura non facit saltus
(see John Wilkins’ article on the interesting history of that phrase, http://scienceblogs.com/evolvingthoughts/…). We will see, this week in class, several smoothing techniques for insurance ratemaking. As a starting point, assume that we do not want to use segmentation techniques: everyone will pay exactly the same price.
- No segmentation of the premium
And that price should be related to the pure premium, which is proportional to the frequency (or the annualized frequency, as discussed previously), since, in the collective model,

$$\mathbb{E}_{\mathbb{P}}(S)=\mathbb{E}_{\mathbb{P}}\left(\sum_{i=1}^N Y_i\right)=\mathbb{E}_{\mathbb{P}}(N)\cdot\mathbb{E}_{\mathbb{P}}(Y_i)$$

The probability measure $\mathbb{P}$ is mentioned here just to recall that we can use any measure, even $\mathbb{P}_{\boldsymbol{X}}$, based on some covariates. Without any covariate, the expected frequency should be

$$\widehat{\lambda}=\frac{\sum_i N_i}{\sum_i E_i}$$

where $E_i$ denotes the (annual) exposure of policy $i$.
> regglm0=glm(nbre~1+offset(log(exposition)),data=sinistres,family=poisson)
> summary(regglm0)

Call:
glm(formula = nbre ~ 1 + offset(log(exposition)), family = poisson,
    data = sinistres)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.5033 -0.3719 -0.2588 -0.1376 13.2700

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.6201     0.0228  -114.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 12680  on 49999  degrees of freedom
Residual deviance: 12680  on 49999  degrees of freedom
AIC: 16353

Number of Fisher Scoring iterations: 6

> exp(coefficients(regglm0))
(Intercept)
 0.07279295
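As a sanity check, note that with only an intercept and the log-exposure offset, the maximum likelihood estimator has a closed form: the total number of claims divided by the total exposure. A one-line verification (a sketch, assuming the same sinistres data frame), which should return the same 7.28% as exp(coefficients(regglm0)) above,

> sum(sinistres$nbre)/sum(sinistres$exposition)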
Thus, if we do not want to take into account potential heterogeneity, we should assume that $N\sim\mathcal{P}(\lambda)$, where $\lambda$ is close to 7.28%. Yes, as mentioned in class, it is rather common to see $\lambda$ as a percentage, i.e. a probability, since

$$\mathbb{P}(N\neq 0)=1-e^{-\lambda}\approx\lambda$$

when $\lambda$ is small, i.e. $e^{-\lambda}=\mathbb{P}(N=0)$ can be interpreted as the probability of not having a claim (see also the law of small numbers). Let us visualize this as a function of the age of the driver,
> a=18:100
> yp=predict(regglm0,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> k=23
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)
We do predict the same frequency for all drivers, e.g. for a driver aged 40,
> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k]) Frequency = 0.07279295 confidence interval 0.07611196 0.06947393
Let us now consider the case where we try to take into account heterogeneity, e.g. by age,
- The (standard) Poisson regression
The idea of the (log-)Poisson regression is to assume that instead of having $N\sim\mathcal{P}(\lambda)$, we should have $N\sim\mathcal{P}(\lambda_{\boldsymbol{X}})$, where

$$\log\lambda_{\boldsymbol{X}}=\beta_0+\beta_1X_1+\cdots+\beta_kX_k$$

in a very general setting. Here, let us consider only one explanatory variable, i.e.

$$\log\lambda_{X}=\beta_0+\beta_1X$$

where $X$ is the age of the driver.
Here, we fit this model, and compute predictions as before.
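The fitting call would presumably be along the following lines (a sketch, assuming the same variable names as above, with the driver's age as the single covariate),

> regglm1=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson)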
> yp=predict(regglm1,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)
i.e. the prediction of the annualized claim frequency for our 40-year-old driver is now 7.74% (slightly higher than the 7.28% we had before)
> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k]) Frequency = 0.07740574 confidence interval 0.08117512 0.07363636
It is possible to compute not the expected frequency $\mathbb{E}(N\vert X)$, but the ratio $\mathbb{E}(N\vert X)/\mathbb{E}(N)$.
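That ratio can be computed directly; a minimal sketch, reusing the objects defined above (yp0 holds the age-dependent predictions, regglm0 the unsegmented model), with a horizontal blue line at 1 marking the flat premium,

> lambda0=predict(regglm0,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response")
> plot(a,yp0/lambda0,type="l")
> abline(h=1,col="blue")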
Above the horizontal blue line, the premium will be higher than the one obtained without segmentation, and (of course) lower below. Here, drivers younger than 44 will pay more, while drivers older than 44 will pay less. We have discussed, in the introduction, the necessity of segmentation. If we consider two companies, one segmenting while the other one has a flat rate, then older drivers will go to the first company (since insurance is cheaper) while younger ones will go to the second one (again, it is cheaper). The problem is that the second company implicitly hopes that older drivers will compensate the risk. But since they're gone, insurance will be too cheap, and the company will lose money (if not go bankrupt). So companies have to use segmentation techniques to survive. Now, the problem is that we cannot be sure that this exponential decay of the premium is the proper way the premium should evolve as a function of the age. An alternative is to use nonparametric techniques to visualize the true influence of the age on the claims frequency.
- A pure nonparametric model
A first model is to have one premium per age. This can be done by considering the age of the driver as a factor in the regression,
> a0=sort(unique(sinistres$ageconducteur)) # ages observed in the portfolio
> regglm2=glm(nbre~as.factor(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)
> yp=predict(regglm2,newdata=data.frame(ageconducteur=a0,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a0,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
Here, the forecast for our 40-year-old driver is slightly lower than the previous one, but the confidence interval is much larger (since we focus on a very small subclass of the portfolio: drivers aged exactly 40)
Frequency = 0.06686658 confidence interval 0.08750205 0.0462311
Here, we consider classes that are too small, and the premium is too erratic: the premium decreases by roughly 20% from age 40 to 41, and then increases by more than 50% from age 41 to 42,
> diff(log(yp0[23:25]))
        24         25
-0.2330241  0.5223478
There is no chance that the company will keep its insured drivers with such a strategy. This discontinuity of the premium is clearly an important issue here.
- Using age classes
An alternative can be to consider age classes, from very young drivers to senior drivers.
> level1=seq(15,105,by=5)
> regglmc1=glm(nbre~cut(ageconducteur,level1)+offset(log(exposition)),
+ data=sinistres,family=poisson)
> summary(regglmc1)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)
(Intercept)                         -1.6036     0.1741  -9.212  < 2e-16 ***
cut(ageconducteur, level1)(20,25]   -0.4200     0.1948  -2.157   0.0310 *
cut(ageconducteur, level1)(25,30]   -0.9378     0.1903  -4.927 8.33e-07 ***
cut(ageconducteur, level1)(30,35]   -1.0030     0.1869  -5.367 8.02e-08 ***
cut(ageconducteur, level1)(35,40]   -1.0779     0.1866  -5.776 7.65e-09 ***
cut(ageconducteur, level1)(40,45]   -1.0264     0.1858  -5.526 3.28e-08 ***
cut(ageconducteur, level1)(45,50]   -0.9978     0.1856  -5.377 7.58e-08 ***
cut(ageconducteur, level1)(50,55]   -1.0137     0.1855  -5.464 4.65e-08 ***
cut(ageconducteur, level1)(55,60]   -1.2036     0.1939  -6.207 5.40e-10 ***
cut(ageconducteur, level1)(60,65]   -1.1411     0.2008  -5.684 1.31e-08 ***
cut(ageconducteur, level1)(65,70]   -1.2114     0.2085  -5.811 6.22e-09 ***
cut(ageconducteur, level1)(70,75]   -1.3285     0.2210  -6.012 1.83e-09 ***
cut(ageconducteur, level1)(75,80]   -0.9814     0.2271  -4.321 1.55e-05 ***
cut(ageconducteur, level1)(80,85]   -1.4782     0.3371  -4.385 1.16e-05 ***
cut(ageconducteur, level1)(85,90]   -1.2120     0.5294  -2.289   0.0221 *
cut(ageconducteur, level1)(90,95]   -0.9728     1.0150  -0.958   0.3379
cut(ageconducteur, level1)(95,100] -11.4694   144.2817  -0.079   0.9366
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> yp=predict(regglmc1,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,ylim=c(.03,.12),type="s")
> abline(v=40,col="grey")
> lines(a,yp1,lty=2,type="s")
> lines(a,yp2,lty=2,type="s")
Here we obtain the following predictions,
and for our 40-year-old driver, the frequency is now 6.84%.
Frequency = 0.0684573 confidence interval 0.07766717 0.05924742
But our classes were defined arbitrarily here. Perhaps we should consider other classes, to see whether the prediction is sensitive to the cut-off values,
> level2=level1-2
> regglmc2=glm(nbre~cut(ageconducteur,level2)+offset(log(exposition)),
+ data=sinistres,family=poisson)
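The prediction at age 40 is then obtained exactly as previously (a sketch of the call, reusing k=23 so that a[k] is 40),

> yp=predict(regglmc2,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> cat("Frequency =",yp$fit[k]," confidence interval",
+ yp$fit[k]+2*yp$se.fit[k],yp$fit[k]-2*yp$se.fit[k])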
which yields the following values for our 40-year-old driver,
Frequency = 0.07050614 confidence interval 0.07980422 0.06120807
So here, we did not remove the discontinuity problem. An idea can be to consider a moving window: if the goal is to predict the frequency of a 40-year-old driver, perhaps the class should be (somehow) centered on 40; the interval would be centered on 35 for drivers aged 35, and so on.
- Moving average
Thus, it is natural to consider local regressions, where only drivers aged almost 40 are taken into account. This “almost” is what the bandwidth controls: for instance, drivers between 35 and 45 can be considered as being “almost 40”. In practice, we can either work on a subset of the dataset, or use weights in the regression,
> value=40
> h=5
> sinistres$omega=(abs(sinistres$ageconducteur-value)<=h)*1
> regglmomega=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)
To see what’s going on, let us consider an animated plot, where the age of interest is changing,
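Such an animation boils down to looping over the age of interest. A minimal static sketch of that loop (a hypothetical helper, assuming every window of width 2h contains some drivers), which refits the weighted regression at each age and keeps the local prediction,

> localpred=function(value,h=5){
+ sinistres$omega=(abs(sinistres$ageconducteur-value)<=h)*1 # box weights
+ reg=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)
+ predict(reg,newdata=data.frame(ageconducteur=value,exposition=1),
+ type="response")
+ }
> yloc=sapply(a,localpred)
> plot(a,yloc,type="l")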
Here, for our 40-year-old driver, we get
Frequency = 0.06913391 confidence interval 0.07535564 0.06291218
We do obtain a curve that can be interpreted as a local regression. But here, we do not take into account that 35 is not as close to 40 as 39 is. And here, 34 is assumed to be very far away from 40. Clearly, we could improve that technique: kernel functions can be considered, i.e. the closer to 40, the larger the weight.
> value=40
> h=5
> sinistres$omega=dnorm(abs(sinistres$ageconducteur-value)/h)
> regglmomega=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)
which can be plotted below
Here, our prediction for our 40-year-old driver is
Frequency = 0.07040464 confidence interval 0.07981521 0.06099408
This is the idea of kernel regression techniques. But, as explained in the slides, other nonparametric techniques can be considered, like spline functions.
- Smoothing with splines
In R, it is simple to use spline functions (somewhat simpler, actually, than kernel smoothers)
> library(splines)
> regglmbs=glm(nbre~bs(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)
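The associated curve, with confidence bands, can be plotted exactly as before (a sketch following the same pattern as the earlier predictions),

> yp=predict(regglmbs,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> plot(a,yp$fit,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp$fit+2*yp$se.fit,lty=2)
> lines(a,yp$fit-2*yp$se.fit,lty=2)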
The prediction for our 40-year-old driver is now
Frequency = 0.06928169 confidence interval 0.07397124 0.06459215
Note that this technique is related to another class of models, the so-called Generalized Additive Models, i.e. GAMs.
> library(mgcv)
> reggam=gam(nbre~s(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)
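To inspect the estimated smooth term itself (on the scale of the linear predictor), mgcv provides a plot method for gam objects,

> plot(reggam,shade=TRUE)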
The prediction is extremely close to the one we obtained above (the main differences being observed for very old drivers)
Frequency = 0.06912683 confidence interval 0.07501663 0.06323702
- Comparison of the different models
One way or another, all those models are valid. So perhaps we should compare them,
On the graph above, we can visualize the upper and lower bounds of the prediction, for the 9 models. The horizontal line is the predicted value when heterogeneity is not taken into account. It is possible to consider relative values, with respect to this value,
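A quick way to obtain those relative values is to stack the predictions at age 40 of the fitted models (a sketch, assuming the models above are still in memory; the locally-weighted regressions are omitted since they have to be refitted for each age of interest),

> models=list(regglm0,regglm1,regglm2,regglmc1,regglmc2,regglmbs,reggam)
> pred40=sapply(models,function(m) predict(m,
+ newdata=data.frame(ageconducteur=40,exposition=1),type="response"))
> pred40/pred40[1] # ratios relative to the unsegmented premium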