Predicting claims with a bayesian network
[This article was first published on mages' blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Here is a little Bayesian Network to predict the claims for two different types of drivers over the next year, see also example 16.16 in [1]. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Let’s assume there are good and bad drivers. The probabilities that a good driver will have 0, 1 or 2 claims in any given year are set to 70%, 20% and 10%, while for bad drivers the probabilities are 50%, 30% and 20% respectively.
Further I assume that 75% of all drivers are good drivers and only 25% would be classified as bad drivers. Therefore the average number of claims per policyholder across the whole customer base would be:
0.75*(0*0.7 + 1*0.2 + 2*0.1) + 0.25*(0*0.5 + 1*0.3 + 2*0.2) = 0.475Now a customer of two years asks for his renewal. Suppose he had no claims in the first year and one claim last year, how many claims should I predict for next year? Or in other words, how much credibility should I give him?
To answer the above question I present the data here as a Bayesian Network using the
gRain
package [2]. I start with the contingency probability tables for the driver type and the conditional probabilities for 0, 1 and 2 claims in year 1 and 2. As I assume independence between the years I set the same probabilities. I can now review my model as a mosaic plot (above) and as a graph (below) as well.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
library(gRain) | |
# Distribution of good and bad drivers | |
d <- cptable(~ driver, values=c(0.75, 0.25), | |
levels=c("good", "bad")) | |
#Expected claims by driver type | |
claims <- c(0, 1, 2) | |
c1 <- cptable(~ year_1|driver, | |
values=c(0.7, 0.2, 0.1, 0.5, 0.3, 0.2), | |
levels=as.character(claims)) | |
c2 <- cptable(~ year_2|driver, | |
values=c(0.7, 0.2, 0.1, 0.5, 0.3, 0.2), | |
levels=as.character(claims)) | |
plist <- compileCPT(list(d, c1, c2)) | |
## Review probabilities | |
pn <- grain(plist) | |
mosaicplot(pn[["cptlist"]][["year_1"]], color=TRUE, | |
main="Conditional Probabilities", | |
xlab="Number of claims per year", | |
ylab="Driver type") | |
pn[["cptlist"]] | |
## $driver | |
## driver | |
## good bad | |
## 0.75 0.25 | |
## | |
## $year_1 | |
## driver | |
## year_1 good bad | |
## 0 0.7 0.5 | |
## 1 0.2 0.3 | |
## 2 0.1 0.2 | |
## | |
## $year_2 | |
## driver | |
## year_2 good bad | |
## 0 0.7 0.5 | |
## 1 0.2 0.3 | |
## 2 0.1 0.2 | |
(average.claims <- colSums(sweep(plist[["year_1"]], 1, claims, "*"))) | |
## good bad | |
## 0.4 0.7 | |
(unconditional_claims <- sum(average.claims * plist[["driver"]])) | |
## [1] 0.475 | |
plot(pn[["dag"]], main="Claims network example", | |
attrs = list(node = list(fillcolor = "lightgreen"), | |
edge = list(color = "blue"), | |
graph = list(rankdir = "LR"))) |
Next, I set the client’s evidence (0 claims in year one and 1 claim in year two) and propagate these back through my network to estimate the probabilities that the customer is either a good (73.68%) or a bad (26.32%) driver. Knowing that a good driver has on overage 0.4 claims a year and a bad driver 0.7 claims I predict the number of claims for my customer with the given claims history as 0.4789.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Scenario a driver with no claims in year 1 and one claim in year 2. | |
pn1 <- setEvidence(pn, nslist = list(year_1 = "0", year_2 = "1")) | |
pEvidence(pn1) | |
## [1] 0.1425 | |
(cond.driver <- querygrain(pn1, nodes = "driver", type = "marginal")) | |
## $driver | |
## driver | |
## good bad | |
## 0.7368 0.2632 | |
(conditional_claims <- sum(average.claims * cond.driver$driver)) | |
## [1] 0.4789 |
Alternatively I could have added a third node for year 3 and queried the network for the probabilities of 0, 1 or 2 claims given that the customer had zero claims in year 1 and one claim in year 2. The sum product of the number of claims and probabilities gives me again an expected claims number of 0.4789.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
c3 <- cptable(~year_3 | driver, | |
values = c(0.7, 0.2, 0.1, 0.5, 0.3, 0.2), | |
levels = as.character(claims)) | |
plist2 <- compileCPT(list(d, c1, c2, c3)) | |
pn2 <- grain(plist2) | |
pn2 <- setEvidence(pn2, nslist = list(year_1 = "0", | |
year_2 = "1")) | |
(prob_year_3 <- querygrain(pn2, nodes = "year_3", | |
type = "marginal")$year_3) | |
## year_3 | |
## 0 1 2 | |
## 0.6474 0.2263 0.1263 | |
sum(prob_year_3 * claims) | |
## [1] 0.4789 |
References
[1] Klugman, S. A., Panjer, H. H. & Willmot, G. E. (2009), Loss Models: From Data to Decisions, Wiley Series in Proability and Statistics.[2] Søren Højsgaard (2012). Graphical Independence Networks with the gRain Package for R. Journal of Statistical Software, 46(10), 1-26. URL http://www.jstatsoft.org/v46/i10/
Session Info
R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] grid stats graphics grDevices utils datasets methods [8] base other attached packages: [1] Rgraphviz_2.6.0 gRain_1.2-2 gRbase_1.6-12 graph_1.40.0 loaded via a namespace (and not attached): [1] BiocGenerics_0.8.0 igraph_0.6.6 lattice_0.20-24 Matrix_1.1-0 [5] parallel_3.0.2 RBGL_1.38.0 stats4_3.0.2 tools_3.0.2
To leave a comment for the author, please follow the link and comment on their blog: mages' blog.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.