Large claims, and ratemaking
During the course, we have seen that it is natural to assume that not only the individual claims frequency can be explained by some covariates, but individual costs too. Of course, appropriate families should be considered to model the distribution of the cost \(Y\), given some covariates \(X\). Here is the dataset we’ll use,
> sinistre=read.table("http://freakonometrics.free.fr/sinistreACT2040.txt",
+ header=TRUE,sep=";")
> sinistres=sinistre[sinistre$garantie=="1RC",]
> sinistres=sinistres[sinistres$cout>0,]
> contrat=read.table("http://freakonometrics.free.fr/contractACT2040.txt",
+ header=TRUE,sep=";")
> couts=merge(sinistres,contrat)
> tail(couts)
     nocontrat    no garantie    cout exposition zone puissance agevehicule
1919   6104006 11933      1RC 5376.04       0.37    E         6           1
1920   6107355 12349      1RC   51.63       0.74    E         4           1
1921   6108364 13229      1RC 1320.00       0.74    B         9           1
1922   6109171 11567      1RC 1320.00       0.74    B        13           1
1923   6111208 14161      1RC  970.20       0.49    E        10           5
1924   6111650 14476      1RC 1940.40       0.48    E         4           0
     ageconducteur bonus marque carburant densite region
1919            32    57     12         E      93     10
1920            45    57     12         E      72     10
1921            32   100     12         E      83      0
1922            56    50     12         E      93     13
1923            30    90     12         E      53      2
1924            69    50     12         E      93     13
Here, each line is a claim. Usual families to model the cost are the Gamma distribution, the inverse Gaussian, or the lognormal distribution (which is not in the exponential family, but one can assume that the logarithm of the cost can be modeled with a Gaussian distribution). Consider here only one covariate, e.g. the age of the car, and two different models: a Gamma one, and a lognormal one.
> age=0:20
> reggamma.sp <- glm(cout~agevehicule,family=Gamma(link="log"),
+ data=couts)
> Pgamma <- predict(reggamma.sp,newdata=data.frame(agevehicule=age),type="response")
For the Gamma regression, it is a simple GLM, so it is not difficult. For a lognormal distribution, one should remember that the expected value of a lognormal variable is not the exponential of the mean of the underlying Gaussian distribution: if \(\log Y\) is Gaussian with mean \(\mu\) and variance \(\sigma^2\), then \(E[Y]=\exp(\mu+\sigma^2/2)\). A correction should be made here, to avoid a downward bias in the average cost,
> reglm.sp <- lm(log(cout)~agevehicule,data=couts)
> sigma <- summary(reglm.sp)$sigma
> mu <- predict(reglm.sp,newdata=data.frame(agevehicule=age))
> Pln <- exp(mu+sigma^2/2)
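To see why the variance term matters, here is a small simulation sketch. The data are simulated with hypothetical parameters (intercept 7, slope -0.02, standard deviation 1.2 on the log scale), not taken from the portfolio above, and the names sim, naive, corrected and observed are introduced only for this illustration.

```r
# Simulated claims (hypothetical parameters, NOT the ACT2040 portfolio)
set.seed(1)
n <- 10000
agevehicule <- sample(0:20, n, replace = TRUE)
# log-cost is Gaussian with mean 7 - 0.02*age and standard deviation 1.2
cout <- exp(rnorm(n, mean = 7 - 0.02 * agevehicule, sd = 1.2))
sim <- data.frame(cout = cout, agevehicule = agevehicule)

# Gaussian regression on the logarithm of the cost
reg <- lm(log(cout) ~ agevehicule, data = sim)
mu <- predict(reg, newdata = sim)
sigma <- summary(reg)$sigma

naive <- mean(exp(mu))                    # ignores the variance term
corrected <- mean(exp(mu + sigma^2 / 2))  # lognormal mean, with the correction
observed <- mean(sim$cout)                # empirical average cost

round(c(naive = naive, corrected = corrected, observed = observed))
```

On such simulated data, the naive back-transformation exp(mu) falls well below the observed average cost, while the corrected version should land close to it.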
We can plot the two predictions, Pgamma and Pln, on a single graph,
> plot(age,Pgamma,xlab="",ylab="",col="red",type="b",pch=4)
> lines(age,Pln,col="blue",type="b")
Here it is,
Observe that it is also possible to use splines, since there might be no reason for the age to appear here in a multiplicative way,
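A minimal sketch of what this could look like, assuming the linear term in agevehicule is simply replaced by a B-spline basis, bs(agevehicule), from the splines package (the basis used in the regressions further down). The data here are simulated with a hypothetical, non-monotone age effect, and the names sim, reggamma.bs, reglm.bs, Pgamma.bs and Pln.bs are mine.

```r
library(splines)

# Simulated claims (hypothetical age effect, NOT the ACT2040 portfolio)
set.seed(2)
n <- 5000
agevehicule <- sample(0:20, n, replace = TRUE)
m <- exp(7 + 0.3 * sin(agevehicule / 4))    # assumed true average cost
cout <- rgamma(n, shape = 2, rate = 2 / m)  # Gamma costs with mean m
sim <- data.frame(cout, agevehicule)

age <- 0:20
new <- data.frame(agevehicule = age)

# Gamma regression, with a spline in the age of the car
reggamma.bs <- glm(cout ~ bs(agevehicule), family = Gamma(link = "log"),
                   data = sim)
Pgamma.bs <- predict(reggamma.bs, newdata = new, type = "response")

# lognormal regression, with the same spline and the variance correction
reglm.bs <- lm(log(cout) ~ bs(agevehicule), data = sim)
sigma.bs <- summary(reglm.bs)$sigma
Pln.bs <- exp(predict(reglm.bs, newdata = new) + sigma.bs^2 / 2)

plot(age, Pgamma.bs, xlab = "", ylab = "", col = "red", type = "b", pch = 4)
lines(age, Pln.bs, col = "blue", type = "b")
```

With the default arguments, bs() gives a cubic basis with no interior knots, so the age effect on the log scale is a cubic polynomial rather than a straight line.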
Here, the two models are rather close. Nevertheless, one should remember that the Gamma model can be extremely sensitive to large claims (I mean here really large claims). On the other hand, with the log-transformation, it seems that the lognormal model is less sensitive to large events. Actually, if I use the complete dataset, the regressions are the following,
i.e. with a lognormal distribution, the average cost is decreasing with the age of the car, while it is increasing with a Gamma model. The main reason here is that there is one large (not to say huge) claim in the dataset,
> couts[which.max(couts$cout),]
        cout exposition zone puissance agevehicule ageconducteur
7842 4024601       0.22    B         9          13            19
     marque carburant densite region
7842      2         E      93     24
One young driver got a $ 4 million claim, with a 13 year old car. This is an outlier for the Gamma regression, and it clearly influences the estimation (the second largest claim is only one third of this one). Since there is a clear influence of large claims on the estimation of the average cost, a natural idea might be to remove those large claims. Or perhaps to see them as different from normal claims: normal claims can be explained by some covariates, but perhaps the cost of large claims should be shared not only within the insured's own class, but among all the insured in the portfolio. To formalize this idea, observe that, for some threshold \(s\), we can write

\[E[Y|X] = E[Y|X,Y\le s]\cdot P(Y\le s|X) + E[Y|X,Y>s]\cdot P(Y>s|X)\]

where the first term is associated with normal-sized claims, while the second one corresponds to large ones. It is then possible to run three regressions: one on normal-sized claims, one on large claims, and one on the indicator of having a large claim, given that a claim occurred. The code here is something like this: a large claim, here, is a claim above $ 10,000 (one has to fix that threshold),
> s= 10000
> couts$normal=(couts$cout<=s)
> mean(couts$normal)
[1] 0.9818087
i.e. large claims represent about 2% of the claims in our dataset. We can run three sets of regressions, with smoothed regression on the age of the car. The first one to model large claims individual costs,
> indice = which(couts$cout>s)
> mean(couts$cout[indice])
[1] 34471.59
> library(splines)
> regB=glm(cout~bs(agevehicule),data=couts,
+ subset=indice,family=Gamma(link="log"))
> ypB=predict(regB,newdata=data.frame(agevehicule=age),type="response")
> ypB2=mean(couts$cout[indice])
the second one to model normal claims individual costs,
> indice = which(couts$cout<=s)
> mean(couts$cout[indice])
[1] 1335.878
> regA=glm(cout~bs(agevehicule),data=couts,
+ subset=indice,family=Gamma(link="log"))
> ypA=predict(regA,newdata=data.frame(agevehicule=age),type="response")
> ypA2=mean(couts$cout[indice])
And finally, a third one, on the probability of having a normal sized claim, given that a claim occurred
> regC=glm(normal~bs(agevehicule),data=couts,family=binomial)
> ypC=predict(regC,newdata=data.frame(agevehicule=age),type="response")
> regC2=glm(normal~1,data=couts,family=binomial)
> ypC2=predict(regC2,newdata=data.frame(agevehicule=age),type="response")
Note that we have, each time, something that can be interpreted either as a function of the covariate, e.g. \(E[Y|X,Y\le s]\), or as a constant, e.g. \(E[Y|Y\le s]\), i.e. no covariate is considered in the latter. On the graph below, we did plot

\[E[Y|X] = E[Y|X,Y\le s]\cdot P(Y\le s|X) + E[Y|X,Y>s]\cdot P(Y>s|X)\]

where Gamma regressions (with splines) are considered for the average costs, while logistic regressions (again with splines) are considered to model probabilities.
(But be careful with splines: on the borders, since we do not have a lot of observations, the behavior can be… odd, and adjustments should be made to obtain an adequate level of premium.) If it is legitimate to assume that normal-sized claims can be explained by some covariates, perhaps large claims (or extremely large ones) are just purely random, i.e. not a function of any covariate at all, i.e.

\[E[Y|X] = E[Y|X,Y\le s]\cdot P(Y\le s|X) + E[Y|Y>s]\cdot P(Y>s|X)\]
To go one step further, it might also be possible to assume that not only is the size of the claim (given that it is a large one) not a function of any covariate, but neither is the probability of having an extremely large claim,

\[E[Y|X] = E[Y|X,Y\le s]\cdot P(Y\le s) + E[Y|Y>s]\cdot P(Y>s)\]
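The three variants discussed above (covariates everywhere, purely random large claims, and a purely random probability of a large claim too) can be combined as in the following sketch. It runs on a simulated portfolio with hypothetical parameters, where about 2% of the claims are large regardless of the age of the car. The names sim, new, premium1, premium2 and premium3 are assumptions of mine, while regA, regB, regC and the yp* objects mirror the ones used above.

```r
library(splines)

# Simulated portfolio (hypothetical parameters, NOT the ACT2040 data)
set.seed(3)
n <- 20000
s <- 10000                                    # threshold for large claims
agevehicule <- sample(0:20, n, replace = TRUE)
large <- rbinom(n, 1, 0.02) == 1              # about 2% of large claims
cout <- ifelse(large,
               s + rgamma(n, shape = 1, rate = 1 / 25000),
               pmin(rgamma(n, shape = 1.5,
                           rate = 1.5 / (1500 - 20 * agevehicule)), s))
sim <- data.frame(cout, agevehicule)
sim$normal <- sim$cout <= s

age <- 0:20
new <- data.frame(agevehicule = age)

# average cost of normal claims (A) and of large claims (B), given the age
regA <- glm(cout ~ bs(agevehicule), data = sim, subset = normal,
            family = Gamma(link = "log"))
regB <- glm(cout ~ bs(agevehicule), data = sim, subset = !normal,
            family = Gamma(link = "log"))
# probability that a claim is a normal one (C), given the age
regC <- glm(normal ~ bs(agevehicule), data = sim, family = binomial)

ypA <- predict(regA, newdata = new, type = "response")
ypB <- predict(regB, newdata = new, type = "response")
ypC <- predict(regC, newdata = new, type = "response")
ypB2 <- mean(sim$cout[!sim$normal])           # large claims, no covariate
ypC2 <- mean(sim$normal)                      # probability, no covariate

premium1 <- ypA * ypC  + ypB  * (1 - ypC)     # covariates in all three parts
premium2 <- ypA * ypC  + ypB2 * (1 - ypC)     # large claims purely random
premium3 <- ypA * ypC2 + ypB2 * (1 - ypC2)    # and their probability too

round(cbind(age, premium1, premium2, premium3))
```

Going from premium1 to premium3, the cost of large claims is spread more and more evenly over the whole portfolio, so only the normal-sized part still varies with the age of the car.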
From the first part, we’ve seen that the distribution considered had an impact on the prediction, and in the second, we’ve seen that the definition of large claims (and how to deal with them) also has an impact. So clearly, actuaries have some leverage when working on ratemaking…