Use domain knowledge to review prior distributions

At the Insurance Data Science conference, both Eric Novik and Paul-Christian Bürkner emphasised in their talks the value of thinking about the data generating process when building Bayesian statistical models. It is also a key step in Michael Betancourt’s Principled Bayesian Workflow.

In this post, I will discuss in more detail how to set priors and how to review the prior and posterior parameter distributions, as well as the prior predictive distributions, with brms (Bürkner (2017)).

The prior predictive distribution shows me how the model behaves before I use my data. Thus, I can check if the model describes the data generating process reasonably well, before I go through the lengthy process of fitting the model.

As an example, I will get back to the non-linear hierarchical growth curve model, which is also part of the brms package vignette on non-linear models.

Load packages and data

First I load the relevant packages and data set.

library(rstan)
library(brms)
library(bayesplot)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
theme_update(text = element_text(family = "sans"))
library(data.table)
url <- "https://raw.githubusercontent.com/mages/diesunddas/master/Data/ClarkTriangle.csv"
loss <- fread(url)

The aim is to use growth curves to model the cumulative claims payments of insurance losses from different accident years (1991 - 2000) over development time (in months) and to predict future payments.
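Before building any model, it helps to take a quick look at the data. The snippet below only prints the first few rows and the range of development months; it relies on the columns AY, dev, cum and premium that are used later in this post.

head(loss)
# Development months covered by the triangle
loss[, range(dev)]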

Prior predictive distribution

I will start with the same model as in the brms vignette, but instead of fitting the model, I set the parameter sample_prior = "only" to generate samples from the prior predictive distribution only, i.e. the data will be ignored and only the prior distributions will be used.

nlform1 <- bf(cum ~ ult * (1 - exp(-(dev/theta)^omega)),
             ult ~ 1 + (1|AY), omega ~ 1, theta ~ 1, 
             nl = TRUE)
m1 <- brm(nlform1, data = loss, family = gaussian(),
  prior = c(
    prior(normal(5000, 1000), nlpar = "ult"),
    prior(normal(1, 2), nlpar = "omega"),
    prior(normal(45, 10), nlpar = "theta")
  ),
  sample_prior = "only", seed = 1234,
  control = list(adapt_delta = 0.9)
)

I can use the same plots to review the prior predictive distribution as I would have done for the posterior predictive distribution.

conditions <- data.frame(AY = unique(loss$AY))
rownames(conditions) <- unique(loss$AY)
me_loss_prior1 <- marginal_effects(
  m1, conditions = conditions, 
  re_formula = NULL, method = "predict"
)
p0 <- plot(me_loss_prior1, ncol = 5, points = TRUE, plot = FALSE)
p0$dev + ggtitle("Prior predictive distributions")

Perhaps this doesn’t look too bad, but the model also generates many negative claims payments, since I have assumed a Gaussian distribution. Yet customers rarely pay claims back to the insurance company.
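One way to quantify this issue is to look at the share of negative values among the prior predictive draws. A minimal check, assuming the m1 object from above:

# Draws from the prior predictive distribution of m1
yrep_prior1 <- posterior_predict(m1)
# Proportion of simulated cumulative payments below zero
mean(yrep_prior1 < 0)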

In addition, this model simplifies the variance structure to a constant \(\sigma^2\). In Guszcza (2008), Jim Guszcza suggested a variance that is proportional to the claims amount instead.

Review model code and priors

Let’s look at a part of the brm-generated Stan code that describes the priors to better understand the model:

  // priors including all constants 
  target += normal_lpdf(b_ult | 5000, 1000); 
  target += normal_lpdf(b_omega | 1, 2); 
  target += normal_lpdf(b_theta | 45, 10); 
  target += student_t_lpdf(sigma | 3, 0, 1964)
    - 1 * student_t_lccdf(0 | 3, 0, 1964); 
  target += student_t_lpdf(sd_1 | 3, 0, 1964)
    - 1 * student_t_lccdf(0 | 3, 0, 1964); 
  target += normal_lpdf(z_1[1] | 0, 1);
  // likelihood including all constants 
  if (!prior_only) { 
    target += normal_lpdf(Y | mu, sigma);
  } 

The code shows the default priors that brm set for sigma and sd_1, which I didn’t specify explicitly. In addition, the final if (!prior_only) block is the switch that drops the likelihood contribution when sampling from the priors only.
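By the way, the complete Stan program behind m1 can be printed with stancode(); alternatively, make_stancode() generates the code without compiling or sampling.

# Print the full Stan code that brms generated for m1
stancode(m1)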

Putting the model in ‘Greek’ makes it more readable:

\[
\begin{align}
loss_{AY}(t) & \sim \mathcal{N}\left( \mu_{AY}(t), \sigma^2\right) \\
\mu_{AY}(t) & = \gamma_{AY} \cdot G(t; \omega, \theta)\\
G(t; \omega, \theta) & = 1 - \exp\left(-\left({t/\theta}\right)^\omega\right)\\
\gamma_{AY} & = \gamma + \gamma_{AY}^0 \\
\mbox{Priors}&:\\
\gamma & \sim \mathcal{N}\left(5000, 1000^2\right)\\
\omega & \sim \mathcal{N}\left(1, 2^2\right)\\
\theta & \sim \mathcal{N}\left(45, 10^2\right) \\
\sigma & \sim \mbox{Student-t}\left(3, 0, 1964\right)^+\\
\gamma_{AY}^0 & \sim \mathcal{N}(0, \sigma_{\gamma^0}^2) \\
\sigma_{\gamma^0} & \sim \mbox{Student-t}\left(3, 0, 1964\right)^+\\
\end{align}
\]

Utilise domain knowledge

There are a few aspects that I would like to change to the original model:

  • Use the loss ratio instead of the loss amounts to make data from the different years more comparable and to bring values closer to 1
  • Change the distribution family from Gaussian to lognormal to avoid negative payments being generated
  • Assume a constant \(\sigma\) parameter, which means a constant coefficient of variation in case of the lognormal distribution, i.e. the standard deviation is proportional to the mean
  • Use development year instead of development month to reduce the scale of \(t\)
  • Use the parameter \(\phi=1/\theta\), since I believe \(0 < \phi < 1\)
  • Shift \(\omega\) and shrink its uncertainty, as I believe \(\omega\) will be around 1.25 (see the quick simulation after this list)
  • With the move to loss ratios the standard deviations \(\sigma_{\gamma^0}\) and \(\sigma\) will be much smaller. Therefore, I reduce the scale parameter and increase the degrees of freedom from 3 to 5 for the Student-t distribution
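
Before turning to brms, a crude simulation of the growth curve \(G(t) = 1 - \exp\left(-(t\cdot\phi)^\omega\right)\) under the new \(\omega\) and \(\phi\) priors gives me a feel for the development patterns they imply. This is a standalone sketch, not part of the brms workflow, with abs() as a rough stand-in for the truncation at zero:

set.seed(1234)
n_sims <- 500
# abs() is a crude stand-in for the normal priors truncated at zero
omega <- abs(rnorm(n_sims, 1.25, 0.25))
phi   <- abs(rnorm(n_sims, 0.25, 0.25))
dev_t <- seq(0.5, 10, by = 0.5)   # development years
# n_sims x length(dev_t) matrix of simulated growth curves
G <- sapply(dev_t, function(x) 1 - exp(-(x * phi)^omega))
# Plot the 10%, 50% and 90% quantiles of G(t) across the simulations
matplot(dev_t, t(apply(G, 2, quantile, probs = c(0.1, 0.5, 0.9))),
        type = "l", lty = c(2, 1, 2), col = 1,
        xlab = "Development year", ylab = "G(t)",
        main = "Growth curves implied by the priors")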

My new model looks like this now:

\[
\begin{align}
\ell_{AY}(t) & \sim \log \mathcal{N}\left(\log(\mu_{AY}(t)), \sigma^2\right) \\
\mu_{AY}(t) & = \gamma_{AY} \cdot G(t; \omega, \phi)\\
G(t;\omega, \phi) & = 1 - \exp\left(-\left(t\cdot \phi\right)^\omega\right)\\
\gamma_{AY} & = \gamma + \gamma_{AY}^0 \\
\mbox{Priors}&:\\
\gamma & \sim \log \mathcal{N}\left(\log(0.5), \log(1.2)^2\right)\\
\omega & \sim \mathcal{N}\left(1.25, 0.25^2\right)^+\\
\phi & \sim \mathcal{N}\left(0.25, 0.25^2\right)^+ \\
\sigma & \sim \mbox{Student-t}\left(5, 0, 0.25\right)^+\\
\gamma_{AY}^0 & \sim \mathcal{N}(0, \sigma_{\gamma^0}^2) \\
\sigma_{\gamma^0} & \sim \mbox{Student-t}\left(5, 0, 0.25\right)^+\\
\end{align}
\]

Assuming a constant coefficient of variation across development time seems broadly okay:

# Loss ratio = cumulative paid claims / premium;
# convert development months (6, 18, ..., 114) to development years (1, 2, ..., 10)
loss <- loss[, `:=`(loss_ratio = cum/premium,
                    dev_year = (dev + 6)/12)]
print(
  loss[, list(`CV(loss_ratio)` = sd(loss_ratio)/mean(loss_ratio)), by=dev_year], 
  digits=1)
##     dev_year CV(loss_ratio)
##  1:        1           0.14
##  2:        2           0.08
##  3:        3           0.08
##  4:        4           0.15
##  5:        5           0.13
##  6:        6           0.09
##  7:        7           0.11
##  8:        8           0.14
##  9:        9           0.21
## 10:       10             NA
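
Note that for a lognormal distribution the coefficient of variation is \(\sqrt{\exp(\sigma^2) - 1}\), which is approximately \(\sigma\) when \(\sigma\) is small. Hence, observed CVs of around 0.1 to 0.2 sit comfortably within a Student-t(5, 0, 0.25) prior for \(\sigma\):

# CV of a lognormal with sigma = 0.1 and 0.2 is close to sigma itself
sqrt(exp(c(0.1, 0.2)^2) - 1)  # roughly 0.100 and 0.202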

To review if my new model makes sense I start by generating samples from the priors again:

nlform2 <- bf(loss_ratio ~ log(ulr * (1 - exp(-(dev_year*phi)^omega))),
             ulr ~ 1 + (1|AY), omega ~ 1, phi ~ 1, 
             nl = TRUE)
m2 <- brm(nlform2, data = loss, 
  family = lognormal(link = "identity", link_sigma = "log"),
  prior = c(
    prior(lognormal(log(0.5), log(1.2)), nlpar = "ulr", lb=0),
    prior(normal(1.25, 0.25), nlpar = "omega", lb=0),
    prior(normal(0.25, 0.25), nlpar = "phi", lb=0),
    prior(student_t(5, 0, 0.25), class = "sigma"),
    prior(student_t(5, 0, 0.25), class = "sd", nlpar="ulr")
    ), 
    sample_prior = "only", seed = 1234
)

Prior parameter distributions

library(bayesplot)
theme_update(text = element_text(family = "sans"))
mcmc_areas(
   as.array(m2), 
   pars = c("b_ulr_Intercept", "b_omega_Intercept",
            "b_phi_Intercept",
            "sd_AY__ulr_Intercept", "sigma"),
   prob = 0.8, # 80% intervals
   prob_outer = 0.99, # 99%
   point_est = "mean"
) + ggplot2::labs(
  title = "Prior parameter distributions",
  subtitle = "with medians and 80% intervals"
)

The prior parameter distributions in the plot look very much as I would expect, given the priors I set above.

Prior predictive distributions

me_loss_prior2 <- marginal_effects(
  m2, conditions = conditions, 
  re_formula = NULL, method = "predict"
)
p1 <- plot(me_loss_prior2, ncol = 5, points = TRUE, plot = FALSE)
p1$dev + ggtitle("Prior predictive distributions")

The prior predictive distributions also look more in line with the data.

I am happy with the model as it is. In my next step, I start using the data to fit the model.

Generate posterior samples

To fit my model to the data I call update and set the parameter sample_prior = "no". Note that the model doesn’t need to be recompiled.

(fit_m1 <- update(m2, sample_prior="no", seed = 1234))
##  Family: lognormal 
##   Links: mu = identity; sigma = identity 
## Formula: loss_ratio ~ log(ulr * (1 - exp(-(dev_year * phi)^omega))) 
##          ulr ~ 1 + (1 | AY)
##          omega ~ 1
##          phi ~ 1
##    Data: loss (Number of observations: 55) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~AY (Number of levels: 10) 
##                   Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(ulr_Intercept)     0.04      0.01     0.02     0.07       1117 1.00
## 
## Population-Level Effects: 
##                 Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## ulr_Intercept       0.42      0.02     0.38     0.47       1759 1.00
## omega_Intercept     1.86      0.05     1.76     1.95       2558 1.00
## phi_Intercept       0.26      0.01     0.23     0.28       2101 1.00
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma     0.10      0.01     0.08     0.12       2379 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The model looks well behaved at first glance.

Posterior parameter distributions

The plot of the posterior parameter distributions shows nicely how the data have shrunk my priors:

mcmc_areas(
   as.array(fit_m1), 
   pars = c("b_ulr_Intercept", "b_omega_Intercept",
            "b_phi_Intercept",
            "sd_AY__ulr_Intercept", "sigma"),
   prob = 0.8, # 80% intervals
   prob_outer = 0.99, # 99%
   point_est = "mean"
) + ggplot2::labs(
  title = "Posterior parameter distributions",
  subtitle = "with medians and 80% intervals"
)
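
To see the shrinkage in numbers, the prior (m2) and posterior (fit_m1) point estimates of the same parameters can be put side by side. A small helper, not part of the original workflow:

pars <- c("b_ulr_Intercept", "b_omega_Intercept", "b_phi_Intercept",
          "sd_AY__ulr_Intercept", "sigma")
round(cbind(prior     = posterior_summary(m2)[pars, "Estimate"],
            posterior = posterior_summary(fit_m1)[pars, "Estimate"]), 3)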

Posterior predictive distributions

Finally, it is time to plot the posterior predictive distributions:

me_loss_posterior <- marginal_effects(
  fit_m1, conditions = conditions, 
  re_formula = NULL, method = "predict"
)
p2 <- plot(me_loss_posterior, ncol = 5, points = TRUE, plot = FALSE)
p2$dev + ggtitle("Posterior predictive distributions")

Conclusions

The plot looks good, but the model underestimates the claims development for the 1992 and 1993 accident years (a per-year check is sketched after the questions below).

  • Are those years and the data outliers?
  • Should the model allow the parameters \(\omega\) and \(\phi\) to vary by year?
  • Should I consider a different growth curve family?
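
One way to examine the 1992 and 1993 accident years more closely is a posterior predictive interval check grouped by accident year. A sketch using bayesplot (the choice of plot type and arguments is mine, not from the original workflow):

# Posterior predictive intervals by accident year
yrep <- posterior_predict(fit_m1)
ppc_intervals_grouped(
  y = loss$loss_ratio, yrep = yrep,
  x = loss$dev_year, group = loss$AY
)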

Tapping into expert knowledge can be invaluable. However, many domain knowledge experts will not be statisticians and will find it difficult to form an opinion on the ‘Greek’ model or the parameters.

Yet, often experts will have a view on the data generated by the model. Showing them the output of the prior and posterior predictive distributions can be a lot more fruitful.

And as I said earlier, don’t forget they are experts, and hence likely to disagree with the non-expert. Use this bias to your advantage!

Session Info

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2      latticeExtra_0.6-28 RColorBrewer_1.1-2 
##  [4] lattice_0.20-38     data.table_1.11.8   bayesplot_1.6.0    
##  [7] brms_2.6.0          Rcpp_1.0.0          rstan_2.18.2       
## [10] StanHeaders_2.18.0  ggplot2_3.1.0      
## 
## loaded via a namespace (and not attached):
##  [1] Brobdingnag_1.2-6    gtools_3.8.1         threejs_0.3.1       
##  [4] shiny_1.2.0          assertthat_0.2.0     stats4_3.5.1        
##  [7] yaml_2.2.0           pillar_1.3.0         backports_1.1.2     
## [10] glue_1.3.0           digest_0.6.18        promises_1.0.1      
## [13] colorspace_1.3-2     htmltools_0.3.6      httpuv_1.4.5        
## [16] Matrix_1.2-15        plyr_1.8.4           dygraphs_1.1.1.6    
## [19] pkgconfig_2.0.2      bookdown_0.8         purrr_0.2.5         
## [22] xtable_1.8-3         mvtnorm_1.0-8        scales_1.0.0        
## [25] processx_3.2.0       later_0.7.5          tibble_1.4.2        
## [28] DT_0.5               shinyjs_1.0          withr_2.1.2         
## [31] lazyeval_0.2.1       cli_1.0.1            magrittr_1.5        
## [34] crayon_1.3.4         mime_0.6             evaluate_0.12       
## [37] ps_1.2.1             nlme_3.1-137         xts_0.11-2          
## [40] pkgbuild_1.0.2       blogdown_0.9         colourpicker_1.0    
## [43] rsconnect_0.8.12     tools_3.5.1          loo_2.0.0           
## [46] prettyunits_1.0.2    matrixStats_0.54.0   stringr_1.3.1       
## [49] munsell_0.5.0        callr_3.0.0          compiler_3.5.1      
## [52] rlang_0.3.0.1        debugme_1.1.0        grid_3.5.1          
## [55] ggridges_0.5.1       htmlwidgets_1.3      crosstalk_1.0.0     
## [58] igraph_1.2.2         miniUI_0.1.1.1       labeling_0.3        
## [61] base64enc_0.1-3      rmarkdown_1.10       codetools_0.2-15    
## [64] gtable_0.2.0         curl_3.2             inline_0.3.15       
## [67] abind_1.4-5          reshape2_1.4.3       markdown_0.8        
## [70] R6_2.3.0             gridExtra_2.3        rstantools_1.5.1    
## [73] zoo_1.8-4            knitr_1.20           bridgesampling_0.6-0
## [76] dplyr_0.7.8          shinythemes_1.1.2    bindr_0.1.1         
## [79] shinystan_2.5.0      rprojroot_1.3-2      stringi_1.2.4       
## [82] parallel_3.5.1       tidyselect_0.2.5     xfun_0.4            
## [85] coda_0.19-2

References

Bürkner, Paul-Christian. 2017. “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (1): 1–28. https://doi.org/10.18637/jss.v080.i01.

Guszcza, James. 2008. “Hierarchical Growth Curve Models for Loss Reserving.” In Casualty Actuarial Society Fall Forum, 146–73. https://www.casact.org/pubs/forum/08fforum/7Guszcza.pdf.
