Going Bayes #rstats
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Some time ago I started working with Bayesian methods, using the great rstanarm-package. Beside the fantastic package-vignettes, and books like Statistical Rethinking or Doing Bayesion Data Analysis, I also found the ressources from Tristan Mahr helpful to both better understand Bayesian analysis and rstanarm. This motivated me to implement tools for Bayesian analysis into my packages, as well.
Due to the latest tidyr-update, I had to update some of my packages, in order to make them work again, so – beside some other features – some Bayes-stuff is now avaible in my packages on CRAN.
Finding shape or location parameters from distributions
The following functions are included in the sjstats-package. Given some known quantiles or percentiles, or a certain value or ratio and its standard error, the functions find_beta()
, find_normal()
or find_cauchy()
help finding the parameters for a distribution. Taking the example from here, the plot indicates that the mean value for the normal distribution is somewhat above 50. We can find the exact parameters with find_normal()
, using the information given in the text:
library(sjstats) find_normal(x1 = 30, p1 = .1, x2 = 90, p2 = .8) #> $mean #> [1] 53.78387 #> #> $sd #> [1] 30.48026
High Density Intervals for MCMC samples
The hdi()
-function computes the high density interval for posterior samples. This is nothing special, since there are other packages with such functions as well – however, you can use this function not only on vectors, but also on stanreg
-objects (i.e. the results from models fitted with rstanarm). And, if required, you can also transform the HDI-values, e.g. if you need these intervals on an expontiated scale.
library(rstanarm) fit <- stan_glm(mpg ~ wt + am, data = mtcars, chains = 1) hdi(fit) #> term hdi.low hdi.high #> 1 (Intercept) 32.158505 42.341421 #> 2 wt -6.611984 -4.022419 #> 3 am -2.567573 2.343818 #> 4 sigma 2.564218 3.903652 # fit logistic regression model fit <- stan_glm( vs ~ wt + am, data = mtcars, family = binomial("logit"), chains = 1 ) hdi(fit, prob = .89, trans = exp) #> term hdi.low hdi.high #> 1 (Intercept) 4.464230e+02 3.725603e+07 #> 2 wt 6.667981e-03 1.752195e-01 #> 3 am 8.923942e-03 3.747664e-01
Marginal effects for rstanarm-models
The ggeffects-package creates tidy data frames of model predictions, which are ready to use with ggplot (though there’s a plot()
-method as well). ggeffects supports a wide range of models, and makes it easy to plot marginal effects for specific predictors, includinmg interaction terms. In the past updates, support for more model types was added, for instance polr
(pkg MASS), hurdle
and zeroinfl
(pkg pscl), betareg
(pkg betareg), truncreg
(pkg truncreg), coxph
(pkg survival) and stanreg
(pkg rstanarm).
ggpredict()
is the main function that computes marginal effects. Predictions for stanreg
-models are based on the posterior distribution of the linear predictor (posterior_linpred()
), mostly for convenience reasons. It is recommended to use the posterior predictive distribution (posterior_predict()
) for inference and model checking, and you can do so using the ppd
-argument when calling ggpredict()
, however, especially for binomial or poisson models, it is harder (and much slower) to compute the „confidence intervals“. That’s why relying on posterior_linpred()
is the default for stanreg
-models with ggpredict()
.
Here is an example with two plots, one without raw data and one including data points:
library(sjmisc) library(rstanarm) library(ggeffects) data(efc) # make categorical efc$c161sex <- to_label(efc$c161sex) # fit model m <- stan_glm(neg_c_7 ~ c160age + c12hour + c161sex, data = efc) dat <- ggpredict(m, terms = c("c12hour", "c161sex")) dat #> # A tibble: 128 x 5 #> x predicted conf.low conf.high group #> <dbl> <dbl> <dbl> <dbl> <fctr> #> 1 4 10.80864 10.32654 11.35832 Male #> 2 4 11.26104 10.89721 11.59076 Female #> 3 5 10.82645 10.34756 11.37489 Male #> 4 5 11.27963 10.91368 11.59938 Female #> 5 6 10.84480 10.36762 11.39147 Male #> 6 6 11.29786 10.93785 11.61687 Female #> 7 7 10.86374 10.38768 11.40973 Male #> 8 7 11.31656 10.96097 11.63308 Female #> 9 8 10.88204 10.38739 11.40548 Male #> 10 8 11.33522 10.98032 11.64661 Female #> # ... with 118 more rows plot(dat) plot(dat, rawdata = TRUE)
As you can see, if you work with labelled data, the model-fitting functions from the rstanarm-package preserves all value and variable labels, making it easy to create annotated plots. The „confidence bands“ are actually hidh density intervals, computed with the above mentioned hdi()
-function.
Next…
Next I will integrate ggeffects into my sjPlot-package, making sjPlot more generic and supporting more models types. Furthermore, sjPlot shall get a generic plot_model()
-function which will replace former single functions like sjp.lm()
, sjp.glm()
, sjp.lmer()
or sjp.glmer()
. plot_model()
should then produce a plot, either marginal effects, forest plots or interaction terms and so on, and accept (m)any model class. This should help making sjPlot more convenient to work with, more stable and easier to maintain…
Tagged: Bayes, data visualization, ggplot, R, rstanarm, rstats, sjPlot, Stan
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.