Exposure as a possible explanatory variable
In insurance pricing, exposure is usually used as an offset variable when modelling claim frequency. As explained many times on this blog (e.g. here), and in my notes, if we have two identical drivers, one with an exposure of 6 months and the other with an exposure of one year, it is natural to assume that, on average, the second driver will have twice as many accidents. This is the motivation for using a standard (homogeneous) Poisson process to model claim frequency. One can also see a legal issue here, since a (partial) reimbursement of a premium would be made pro rata temporis. The risk is proportional to the exposure. Thus, if $N_i$ denotes the number of claims of insured $i$, with characteristics $X_i$ and exposure $E_i$, then with a Poisson regression we would write
$$N_i \sim \mathcal{P}\left(E_i \cdot \exp[X_i^\top \beta]\right)$$
or equivalently
$$N_i \sim \mathcal{P}\left(\exp[X_i^\top \beta + \log E_i]\right)$$
From this expression, the logarithm of the exposure is an explanatory variable, but there should be no coefficient (the coefficient here is taken to be one). Can't we use the exposure as an explanatory variable? Will we get a unit parameter?
Of course, in the context of ratemaking, this is probably not a relevant question, since actuaries are required to predict annual claim frequency (insurance contracts are supposed to provide one year of coverage). But it might help us understand why people leave our portfolio (i.e. cancel their insurance policy before term, or do not renew it).
To be more specific, consider the following model: claims arrive according to a Poisson process, and policyholders are loyal to their insurance company (they never leave). Let us generate scenarios over twenty years
> n=983
> D1=as.Date("01/01/1993",'%d/%m/%Y')
> D2=as.Date("31/12/2013",'%d/%m/%Y')
> L=D1+0:(D2-D1)
> set.seed(1)
> arrival=sample(L,size=n,replace=TRUE)
> exposure=N=rep(NA,n)
> departure=rep(D2,n)
> set.seed(2)
> for(i in 1:n){
+ expo=D2-arrival[i]
+ w=0
+ while(max(w)<expo) w=c(w,max(w)+1+trunc(rexp(1,1/1000)))
+ exposure[i]=departure[i]-arrival[i]
+ N[i]=max(0,length(w)-2)}
> df=data.frame(N=N,E=exposure/365)
Here the expected time between claims is considered to be 1000 days. The (annual) intensity of the Poisson process is here
> 365/1000
[1] 0.365
so if we run a Poisson regression with the logarithm of the exposure as an offset (feel free to add other covariates if you want; the example here is just to see what happens when exposure is treated as a standard covariate), we should get an intercept close to
> log(365/1000)
[1] -1.007858
Here, the regression on a constant, with the offset variable is
> reg=glm(N~1+offset(log(E)),data=df,family=poisson)
> summary(reg)
Call:
glm(formula = N ~ 1 + offset(log(E)), family = poisson, data = df)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.4145  -0.4673   0.2367   0.8770   3.6828
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.04233    0.02532  -41.17   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 1116.9  on 982  degrees of freedom
Residual deviance: 1116.9  on 982  degrees of freedom
AIC: 3282.9
Number of Fisher Scoring iterations: 5
which is consistent with what we just said. If we run the regression with the logarithm of the exposure as a possible explanatory variable, we would expect to have a coefficient close to 1. And indeed…
> reg=glm(N~log(E),data=df,family=poisson)
> summary(reg)
Call:
glm(formula = N ~ log(E), family = poisson, data = df)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.0810  -0.8373  -0.1493   0.5676   3.9001
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.03350    0.08546  -12.09   <2e-16 ***
log(E)       1.00920    0.03292   30.66   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 2553.6  on 982  degrees of freedom
Residual deviance: 1064.2  on 981  degrees of freedom
AIC: 3762.7
Number of Fisher Scoring iterations: 5
If we keep the offset and also add the variable, we can see that the variable becomes useless (which is, in a sense, a test of a unit parameter)
> reg=glm(N~log(E)+offset(log(E)),data=df,family=poisson)
> summary(reg)
Call:
glm(formula = N ~ log(E) + offset(log(E)), family = poisson, data = df)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.0810  -0.8373  -0.1493   0.5676   3.9001
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.033503   0.085460 -12.093   <2e-16 ***
log(E)       0.009201   0.032920   0.279     0.78
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 1064.3  on 982  degrees of freedom
Residual deviance: 1064.2  on 981  degrees of freedom
AIC: 3762.7
Number of Fisher Scoring iterations: 5
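To see why the estimate moves from 1.00920 to 0.009201 while the intercept and the residual deviance are unchanged, it may help to write the two fits side by side. The symbols $\beta_0$ (intercept), $\gamma$ (coefficient of the log-exposure) and $\mu_i$ (expected count) are notation introduced here for illustration:

```latex
\log \mu_i
  = \beta_0 + \gamma \log E_i
  = \beta_0 + (\gamma - 1)\,\log E_i + \underbrace{\log E_i}_{\text{offset}}
```

Both fits describe the same model, so the coefficient reported alongside the offset is $\gamma - 1$, and its z-test against zero is a Wald test of $\gamma = 1$.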
Here, we do have pure Poisson processes, so exposure is crucial, since the parameter of the Poisson distribution is proportional to the exposure. But we cannot learn anything else from the exposure.
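As a quick sanity check on the offset regression, here is a minimal sketch (not the code used above). For an intercept-only Poisson model with offset log(E), the maximum likelihood estimate of the intercept has a closed form: the logarithm of total claims over total exposure. The sketch draws counts directly with `rpois` instead of simulating arrival dates; the uniform exposure distribution, the sample size and the seed are assumptions made for illustration.

```r
# Sketch under assumptions: exposures uniform on [0.1, 20] years, n = 5000
set.seed(42)
n   <- 5000
E   <- runif(n, min = 0.1, max = 20)  # exposure in years (illustrative choice)
lam <- 365 / 1000                      # annual intensity used in the post
N   <- rpois(n, lambda = lam * E)      # expected count proportional to exposure

# Intercept-only Poisson regression with log(E) as offset
fit_offset <- glm(N ~ 1 + offset(log(E)), family = poisson)

# Closed-form MLE of the intercept: log(total claims / total exposure)
closed_form <- log(sum(N) / sum(E))

# The first two values should agree; both should be near log(0.365)
c(glm = unname(coef(fit_offset)), closed_form = closed_form, target = log(lam))
```

The first two numbers agree up to numerical tolerance, which is why the intercept in the offset regression can be read as the log of the portfolio's annual claim frequency.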
Consider some real data.
> head(baseFREQ)
  nocontrat exposition zone puissance agevehicule
1        27       0.87    C         7           0
2       115       0.72    D         5           0
3       121       0.05    C         6           0
4       142       0.90    C        10          10
5       155       0.12    C         7           0
6       186       0.83    C         5           0
  ageconducteur bonus marque carburant densite region nbre
1            56    50     12         D      93     13    0
2            45    50     12         E      54     13    0
3            37    55     12         D      11     13    0
4            42    50     12         D      93     13    0
5            59    50     12         E      73     13    0
6            75    50     12         E      42     13    0
What do we get if we consider a Poisson regression on the logarithm of the exposure?
> reg=glm(nbre~log(exposition),data=baseFREQ,family=poisson)
> summary(reg)
Call:
glm(formula = nbre ~ log(exposition), family = poisson, data = baseFREQ)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.3988  -0.3388  -0.2786  -0.1981  12.9036
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     -2.83045    0.02822 -100.31   <2e-16 ***
log(exposition)  0.53950    0.02905   18.57   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 12931  on 49999  degrees of freedom
Residual deviance: 12475  on 49998  degrees of freedom
AIC: 16150
Number of Fisher Scoring iterations: 6
What happens if we keep the offset and add the exposure as a covariate? (Let us use a nonparametric transformation, to visualize what is going on.)
> library(gam)
> reg=gam(nbre~offset(log(exposition))+s(exposition),data=baseFREQ,family=poisson)
> plot(reg,se=TRUE)
There is a clear and significant effect. The longer the insured stay, the lower their claim frequency. Actually, this can be observed without running a regression.
> i1=which(baseFREQ$nbre>0)
> i0=which(baseFREQ$nbre==0)
> h1=hist(baseFREQ$exposition[i1],probability=TRUE)
> h0=hist(baseFREQ$exposition[i0],probability=TRUE)
> plot(h1$mids,h1$density,type='s',lwd=2,col="red")
> lines(h0$mids,h0$density,type='s',col='blue',lwd=2)
In blue, we have the density of the exposure for those who did not have claims, and in red, the density for those who had one claim or more.
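Another regression-free check is to compute the empirical annual claim frequency (total claims over total exposure) within exposure bands. The helper `freq_by_band` below is hypothetical, not part of the original analysis, and the toy vectors are made up only to show the arithmetic.

```r
# Hypothetical helper: empirical claim frequency by exposure band
freq_by_band <- function(N, E, breaks) {
  band <- cut(E, breaks = breaks, include.lowest = TRUE)
  out <- data.frame(
    band     = levels(band),
    policies = as.vector(table(band)),           # number of policies per band
    claims   = as.vector(tapply(N, band, sum)),  # total claims per band
    exposure = as.vector(tapply(E, band, sum))   # total exposure per band
  )
  out$frequency <- out$claims / out$exposure     # annualised frequency
  out
}

# Toy data (made up): three short exposures, three full-year exposures
N_toy <- c(1, 0, 0, 1, 0, 0)
E_toy <- c(0.2, 0.3, 0.5, 1, 1, 1)
out <- freq_by_band(N_toy, E_toy, breaks = c(0, 0.5, 1))
out
```

On the real portfolio, a call such as `freq_by_band(baseFREQ$nbre, baseFREQ$exposition, breaks = seq(0, 1, by = 0.25))` would give the same kind of table, assuming exposures are expressed in years and do not exceed one. Under a unit coefficient, the frequency column should be roughly constant across bands.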
So here, we cannot assume a unit value for the parameter. What does that mean? Can we reproduce such a behavior?
In order to get a better understanding, consider two possible behaviors for the insured. The first one is the following: if the company does not offer substantial discounts after several years with no claims, the insured might leave the company. For instance, if the insured has no claim during 5 years, then after 5 years, he will leave the company (to get a better price somewhere else, say). The code will be
> for(i in 1:n){
+ expo=D2-arrival[i]
+ w=c(0,0)
+ while((max(w)<expo) & (max(diff(w))<1500)) w=c(w,max(w)+trunc(rexp(1,1/1000)))
+ if(max(diff(w))>1500) departure[i]=arrival[i]+max(w[-length(w)])+1500
+ exposure[i]=departure[i]-arrival[i]
+ N[i]=max(0,length(w)-3)}
> df=data.frame(N=N,E=exposure/365)
Here, I consider 1500 days (a bit more than 4 years) instead of 5 years, but it is the same idea. So, what do we have here?
> reg=glm(N~log(E),data=df,family=poisson)
> summary(reg)
Call:
glm(formula = N ~ log(E), family = poisson, data = df)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5684  -0.9668  -0.2321   0.4244   3.6265
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.50844    0.10286  -24.39   <2e-16 ***
log(E)       1.65738    0.04494   36.88   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 2567.31  on 982  degrees of freedom
Residual deviance:  885.71  on 981  degrees of freedom
AIC: 2897.9
Here, the coefficient is (significantly) larger than 1. More precisely,
> reg=glm(N~log(E)+offset(log(E)),data=df,family=poisson)
> summary(reg)
Call:
glm(formula = N ~ log(E) + offset(log(E)), family = poisson, data = df)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5684  -0.9668  -0.2321   0.4244   3.6265
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.50844    0.10286  -24.39   <2e-16 ***
log(E)       0.65738    0.04494   14.63   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 1114.24  on 982  degrees of freedom
Residual deviance:  885.71  on 981  degrees of freedom
AIC: 2897.9
There is clearly a bias here: people who stay long are more likely to have had an accident. This is consistent with our story, since clients with low risk have left.
The second behavior is the following: sometimes, the insured are not satisfied with the way claims are handled, and they might leave after the first claim. Consider the case where, after one claim, it is likely (e.g. with probability 50%) that the insured leaves the company. Alternatively, instead of assuming that the insured did not like the claims management, consider the case where the car is so damaged that he cannot drive it anymore, so that paying an insurance premium would be useless. The code here will be
> for(i in 1:n){
+ expo=D2-arrival[i]
+ w=0
+ stay=TRUE
+ while((max(w)<expo) & (stay==TRUE)) { w=c(w,max(w)+trunc(rexp(1,1/1000)))
+ stay=sample(c(TRUE,FALSE),prob=c(.5,.5),size=1)}
+ N[i]=length(w)-2
+ if(stay==FALSE) {departure[i]=arrival[i]+max(w)
+ N[i]=length(w)-1}
+ exposure[i]=departure[i]-arrival[i]}
> df=data.frame(N=N,E=exposure/365)
Here, after each claim, the insured tosses a coin to decide whether or not to cancel the contract.
> reg=glm(N~log(E),data=df,family=poisson)
> summary(reg)
Call:
glm(formula = N ~ log(E), family = poisson, data = df)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.28402  -0.47763  -0.08215   0.33819   2.37628
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.09920    0.04251   2.334   0.0196 *
log(E)       0.30640    0.02511  12.203   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 666.92  on 982  degrees of freedom
Residual deviance: 498.29  on 981  degrees of freedom
AIC: 2666.3
This time, the parameter is (again significantly) smaller than one.
> reg=glm(N~log(E)+offset(log(E)),data=df,family=poisson)
> summary(reg)
Call:
glm(formula = N ~ log(E) + offset(log(E)), family = poisson, data = df)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.28402  -0.47763  -0.08215   0.33819   2.37628
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.09920    0.04251   2.334   0.0196 *
log(E)      -0.69360    0.02511 -27.625   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 1116.87  on 982  degrees of freedom
Residual deviance:  498.29  on 981  degrees of freedom
AIC: 2666.3
The story is now rather different: those who stay long cannot have had many opportunities to leave, so they clearly did not have many claims. The negative sign in the output above means that someone with a long exposure should have, on average, few claims.
As we can see, these models produce rather different outputs. Note that many more interpretations are possible. For instance, the results depend on the way the data were extracted:
- all policies observed, over those twenty years,
- all policies in force at some specific date, until now
- all policies in force at some specific date, until one year after
- all policies in force now
So far, we have been using the first method, but the other ones will yield different interpretations, e.g. because of survivor bias. But that’s another story… And one can read Boucher and Denuit (2008) to go further.
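To make the four extraction schemes concrete, here is a minimal sketch on a made-up policy table. The column names, the dates, the reference date and the helpers `in_force` and `observed_exposure` are all hypothetical, and the way each scheme is translated into a filter is one possible reading of the list above.

```r
# Hypothetical policy table: start and end dates of four contracts
policies <- data.frame(
  id    = 1:4,
  start = as.Date(c("1995-03-01", "2001-06-15", "2010-01-01", "2012-09-01")),
  end   = as.Date(c("1999-02-28", "2013-12-31", "2011-12-31", "2013-12-31"))
)
window_start <- as.Date("1993-01-01")
now          <- as.Date("2013-12-31")
ref_date     <- as.Date("2011-01-01")  # the "specific date" (assumed)

# TRUE for policies active on date d
in_force <- function(p, d) p$start <= d & p$end >= d

# Exposure (in years) of each policy observed between `from` and `to`
observed_exposure <- function(p, from, to) {
  days <- as.numeric(pmin(p$end, to) - pmax(p$start, from))
  pmax(days, 0) / 365
}

# 1. all policies observed over the whole window
e1 <- observed_exposure(policies, window_start, now)
# 2. policies in force at ref_date, followed until now
s2 <- in_force(policies, ref_date)
e2 <- observed_exposure(policies[s2, ], ref_date, now)
# 3. policies in force at ref_date, followed for one year
e3 <- observed_exposure(policies[s2, ], ref_date, ref_date + 365)
# 4. policies in force now
s4 <- in_force(policies, now)

list(scheme2_ids = policies$id[s2], scheme4_ids = policies$id[s4])
```

In this toy table, policy 3 ended in 2011 and therefore disappears under the fourth scheme, which is the kind of selection that leads to survivor bias.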