Poisson regression on non-integers
In the course on claims reserving techniques, I mentioned the use of Poisson regression, even when incremental payments are not integers. For instance, we considered the following incremental triangle,
> source("http://perso.univ-rennes1.fr/arthur.charpentier/bases.R") > INC=PAID > INC[,2:6]=PAID[,2:6]-PAID[,1:5] > INC [,1] [,2] [,3] [,4] [,5] [,6] [1,] 3209 1163 39 17 7 21 [2,] 3367 1292 37 24 10 NA [3,] 3871 1474 53 22 NA NA [4,] 4239 1678 103 NA NA NA [5,] 4929 1865 NA NA NA NA [6,] 5217 NA NA NA NA NA
On those payments, it is natural to use a Poisson regression to predict future payments,
> Y=as.vector(INC)
> D=rep(1:6,each=6)
> A=rep(2001:2006,6)
> base=data.frame(Y,D,A)
> reg=glm(Y~as.factor(D)+as.factor(A),data=base,family=poisson(link="log"))
> Yp=predict(reg,type="response",newdata=base)
> matrix(Yp,6,6)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3155.6 1202.1 49.8 19.1  8.2 21.0
[2,] 3365.6 1282.0 53.1 20.4  8.7 22.3
[3,] 3863.7 1471.8 60.9 23.4 10.0 25.7
[4,] 4310.0 1641.8 68.0 26.1 11.2 28.6
[5,] 4919.8 1874.1 77.6 29.8 12.8 32.7
[6,] 5217.0 1987.3 82.3 31.6 13.5 34.7
and the total amount of reserves would be
> sum(Yp[is.na(Y)==TRUE])
[1] 2426.985
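For the record, the model behind this regression is the standard log-Poisson model on incremental payments: if $Y_{i,j}$ denotes the payment for occurrence year $i$ (factor A) and development year $j$ (factor D), we assume

$$Y_{i,j}\sim\mathcal{P}(\mu_{i,j}),\qquad \log\mu_{i,j}=\gamma+\alpha_i+\beta_j,$$

the well-known GLM formulation that reproduces chain-ladder type predictions.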
Here, payments were in thousands of euros. What if they were in millions of euros?
> a=1000
> INC/a
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,] 3.209 1.163 0.039 0.017 0.007 0.021
[2,] 3.367 1.292 0.037 0.024 0.010    NA
[3,] 3.871 1.474 0.053 0.022    NA    NA
[4,] 4.239 1.678 0.103    NA    NA    NA
[5,] 4.929 1.865    NA    NA    NA    NA
[6,] 5.217    NA    NA    NA    NA    NA
We can still run a regression here
> reg=glm((Y/a)~as.factor(D)+as.factor(A),data=base,family=poisson(link="log"))
> Yp=predict(reg,type="response",newdata=base)
> sum(Yp[is.na(Y)==TRUE])*a
[1] 2426.985
and the prediction is exactly the same. Actually, we can change the currency, or multiply by any constant: as long as we use a log link function, the Poisson regression will always return the same prediction,
> homogeneity=function(a=1){
+   reg=glm((Y/a)~as.factor(D)+as.factor(A),data=base,family=poisson(link="log"))
+   Yp=predict(reg,type="response",newdata=base)
+   return(sum(Yp[is.na(Y)==TRUE])*a)
+ }
> Vectorize(homogeneity)(10^(seq(-3,5)))
[1] 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985
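To see why this homogeneity property holds, recall that with a log link the fitted coefficients $\widehat{\boldsymbol{\beta}}$ solve the first-order condition $\sum_i \boldsymbol{x}_i\,(y_i-e^{\boldsymbol{x}_i^\top\boldsymbol{\beta}})=\boldsymbol{0}$. Dividing the response by $a$ only shifts the intercept by $-\log a$, since

$$\sum_i \boldsymbol{x}_i\left(\frac{y_i}{a}-e^{\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}}-\log a}\right)=\frac{1}{a}\sum_i \boldsymbol{x}_i\left(y_i-e^{\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}}}\right)=\boldsymbol{0}$$

(the first component of $\boldsymbol{x}_i$ being the constant 1), so multiplying the predictions back by $a$ returns exactly the initial ones.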
The trick here comes from the fact that we like the Poisson interpretation. But fitting a GLM simply means solving a first-order condition, and that condition can be solved explicitly: it was obtained without any requirement that the observations be integers. To keep the code below simple, the intercept should be associated with the last cell of the triangle, not the first one, so we relevel both factors,
> base$D=relevel(as.factor(base$D),"6")
> base$A=relevel(as.factor(base$A),"2006")
> reg=glm(Y~as.factor(D)+as.factor(A),data=base,family=poisson(link="log"))
> summary(reg)

Call:
glm(formula = Y ~ as.factor(D) + as.factor(A), family = poisson(link = "log"),
    data = base)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3426  -0.4996   0.0000   0.2770   3.9355

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       3.54723    0.21921  16.182  < 2e-16 ***
as.factor(D)1     5.01244    0.21877  22.912  < 2e-16 ***
as.factor(D)2     4.04731    0.21896  18.484  < 2e-16 ***
as.factor(D)3     0.86391    0.22827   3.785 0.000154 ***
as.factor(D)4    -0.09254    0.25229  -0.367 0.713754
as.factor(D)5    -0.93717    0.32643  -2.871 0.004092 **
as.factor(A)2001 -0.50271    0.02079 -24.179  < 2e-16 ***
as.factor(A)2002 -0.43831    0.02045 -21.433  < 2e-16 ***
as.factor(A)2003 -0.30029    0.01978 -15.184  < 2e-16 ***
as.factor(A)2004 -0.19096    0.01930  -9.895  < 2e-16 ***
as.factor(A)2005 -0.05864    0.01879  -3.121 0.001799 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 46695.269  on 20  degrees of freedom
Residual deviance:    30.214  on 10  degrees of freedom
  (15 observations deleted due to missingness)
AIC: 209.52
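In matrix form, the equation we have to solve is

$$X^\top\left(\boldsymbol{y}-e^{X\boldsymbol{\beta}}\right)=\boldsymbol{0}$$

(with the exponential taken componentwise), a nonlinear system in $\boldsymbol{\beta}$ that involves the observations only through this moment condition; nothing requires the $y_i$'s to be integers.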
The first idea is to solve it numerically, running a Newton–Raphson algorithm, as follows (the starting point will be the coefficients from a linear regression on the log of the observations),
> YNA <- Y
> XNA=matrix(0,length(Y),1+5+5)
> XNA[,1]=rep(1,length(Y))
> for(k in 1:5) XNA[(k-1)*6+1:6,k+1]=k
> u=(1:(length(Y))%%6); u[u==0]=6
> for(k in 1:5) XNA[u==k,k+6]=k
> YnoNA=YNA[is.na(YNA)==FALSE]
> XnoNA=XNA[is.na(YNA)==FALSE,]
> beta=lm(log(YnoNA)~0+XnoNA)$coefficients
> for(s in 1:50){
+   Ypred=exp(XnoNA%*%beta)
+   gradient=t(XnoNA)%*%(YnoNA-Ypred)
+   omega=matrix(0,nrow(XnoNA),nrow(XnoNA));diag(omega)=exp(XnoNA%*%beta)
+   hessienne=-t(XnoNA)%*%omega%*%XnoNA
+   beta=beta-solve(hessienne)%*%gradient}
> beta
             [,1]
 [1,]  3.54723486
 [2,]  5.01244294
 [3,]  2.02365553
 [4,]  0.28796945
 [5,] -0.02313601
 [6,] -0.18743467
 [7,] -0.50271242
 [8,] -0.21915742
 [9,] -0.10009587
[10,] -0.04774056
[11,] -0.01172840
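The update in the loop above is exactly the Newton–Raphson step on the log-likelihood,

$$\boldsymbol{\beta}\leftarrow\boldsymbol{\beta}-H^{-1}\nabla\ell,\qquad \nabla\ell=X^\top\left(\boldsymbol{y}-e^{X\boldsymbol{\beta}}\right),\qquad H=-X^\top\,\mathrm{diag}\!\left(e^{X\boldsymbol{\beta}}\right)X.$$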
The coefficients are not exactly the ones given by R, but only because of the coding used in the design matrix: level k of each factor was filled with the value k instead of a 0/1 dummy, so each estimate above is the corresponding R coefficient divided by k (for instance 2 × 2.02365553 = 4.04731, the coefficient of as.factor(D)2). It is just fine if we focus on the predictions,
> matrix(exp(XNA%*%beta),6,6)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3155.6 1202.1 49.8 19.1  8.2 21.0
[2,] 3365.6 1282.0 53.1 20.4  8.7 22.3
[3,] 3863.7 1471.8 60.9 23.4 10.0 25.7
[4,] 4310.0 1641.8 68.0 26.1 11.2 28.6
[5,] 4919.8 1874.1 77.6 29.8 12.8 32.7
[6,] 5217.0 1987.3 82.3 31.6 13.5 34.7
which are exactly the ones obtained above. And here, we clearly see that there is no assumption such as "the response variable should be an integer". It is also possible to observe that the first-order condition is the same as the one of a weighted least squares model. The problem is that the weights are functions of the predictions; but iterating (this is the iteratively reweighted least squares, or IRLS, algorithm that glm() actually uses internally), we should converge,
> beta=lm(log(YnoNA)~0+XnoNA)$coefficients
> for(i in 1:50){
+   Ypred=exp(XnoNA%*%beta)
+   z=XnoNA%*%beta+(YnoNA-Ypred)/Ypred
+   REG=lm(z~0+XnoNA,weights=Ypred)
+   beta=REG$coefficients
+ }
> beta
     XnoNA1      XnoNA2      XnoNA3      XnoNA4      XnoNA5      XnoNA6
 3.54723486  5.01244294  2.02365553  0.28796945 -0.02313601 -0.18743467
     XnoNA7      XnoNA8      XnoNA9     XnoNA10     XnoNA11
-0.50271242 -0.21915742 -0.10009587 -0.04774056 -0.01172840
which are the same values as the ones we got previously. Here again, the predictions are the same as the ones we got from the so-called Poisson regression,
> matrix(exp(XNA%*%beta),6,6)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3155.6 1202.1 49.8 19.1  8.2 20.9
[2,] 3365.6 1282.0 53.1 20.4  8.7 22.3
[3,] 3863.7 1471.8 60.9 23.4 10.0 25.7
[4,] 4310.0 1641.8 68.0 26.1 11.2 28.6
[5,] 4919.8 1874.1 77.6 29.8 12.8 32.7
[6,] 5217.0 1987.3 82.3 31.6 13.5 34.7
Again, it works just fine because GLMs are essentially conditions on the first two moments, and all the numerical computations rely on the first-order condition only, which imposes fewer constraints than the interpretation in terms of an actual Poisson distribution.
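If we want R to make that second-moment reading explicit, the quasi-Poisson family relies on exactly the same first-order condition, hence the same point predictions, while only specifying the mean and a proportional variance. A minimal sketch (with new object names, on the same base data frame; the total should again be 2426.985):

# quasi-Poisson: same estimating equations, so the same point predictions,
# but only the first two moments are specified (no integer assumption)
reg_quasi = glm(Y ~ as.factor(D) + as.factor(A),
                data = base, family = quasipoisson(link = "log"))
Yq = predict(reg_quasi, type = "response", newdata = base)
sum(Yq[is.na(Y)])   # total amount of reserves, as above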