R’s Garden of Probability Distributions
by Joseph Rickert
If you type ?Distributions at the R console you get a list of the 21 probability distributions included in the stats package that ships with base R. The same list appears in the An Introduction to R manual on CRAN and in most of the many fine introductory books available for the R language. These are indeed fundamental distributions, sufficient for most elementary work in probability and statistics. The fact that the R functions implementing these distributions all follow the same syntax (a d, p, q or r prefix for the density, cumulative distribution, quantile and random-draw functions, followed by the distribution's name) greatly eases a beginner's task of trying to get some useful work done with a minimum of memorization.
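As a small illustration of that shared syntax, here is a minimal sketch using only functions from the stats package. The particular values, the seed and the variable names are arbitrary choices made for this example.

```r
# The four prefixes, shown for the normal distribution:
#   d = density, p = cumulative probability, q = quantile, r = random draws
dens  <- dnorm(0)        # density of a standard normal at 0
prob  <- pnorm(1.96)     # P(X <= 1.96)
quant <- qnorm(0.975)    # value with 97.5% of the probability mass below it

set.seed(42)             # arbitrary seed, only for reproducibility
draws <- rnorm(5)        # five random draws

# The same pattern carries over to other families, for example the gamma.
gamma_prob  <- pgamma(3, shape = 2, scale = 2)
gamma_quant <- qgamma(gamma_prob, shape = 2, scale = 2)  # inverts pgamma()
```

Because q functions invert p functions, gamma_quant recovers the value 3 that was passed to pgamma(), up to numerical tolerance.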
The following figure shows plots of the cumulative distribution function pgamma() and the probability density function dgamma(), along with a histogram of random draws generated by rgamma() from a gamma distribution with shape and scale parameters both set to 2.
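Plots of this kind can be produced along the following lines. This is only a sketch and not necessarily the code behind the figure: the number of draws, the seed and the plotting range are assumptions. Note that the third positional argument of the gamma functions is rate, so scale is passed by name here.

```r
set.seed(123)        # arbitrary seed, only for reproducibility
n     <- 1000        # number of draws (assumed; not stated in the text)
shape <- 2
scale <- 2

# Name `scale` explicitly: rgamma(n, 2, 2) would set rate = 2 instead.
draws <- rgamma(n, shape = shape, scale = scale)

par(mfrow = c(1, 3))  # three panels side by side
curve(pgamma(x, shape = shape, scale = scale), from = 0, to = 20,
      ylab = "pgamma(x)", main = "Cumulative distribution")
curve(dgamma(x, shape = shape, scale = scale), from = 0, to = 20,
      ylab = "dgamma(x)", main = "Density")
hist(draws, freq = FALSE, main = "Random draws from rgamma()")
```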
However, if a person isn’t familiar with how information about R is organized on CRAN, he or she might conclude that this is it, or most of it anyway, with respect to R and probability distributions. Imagine the surprise, then, of a person with such modest expectations about R’s probability distributions accidentally stumbling into the overgrown garden of R’s Probability Distributions Task View. I think my first reaction was a kind of glazed-over inability to take it all in.
However, if you just let your eyes relax and pick out a flower with which you are familiar, binomial for example, you can see that the chief gardener Christophe Dutang, listed as the maintainer of the Task View, and the eight individuals whom he acknowledges have done a remarkable job of organizing the distributions according to their genus (discrete or continuous), species (binomial in this case) and variety (truncated binomial and zero-inflated binomial). I can’t imagine the number of volunteer hours it took to assemble this page, and keeping it up to date can’t be easy either. I spent a half hour or so just trying to count the distributions. Not counting copulas, random matrices and other exotica, I came up with 31 discrete, 133 continuous and 9 mixture distributions. Others may count more or fewer depending on how they group things together. It seems as if few people outside of the folks at Wikipedia have given much thought to the taxonomy of probability distributions, and only Mathematica 9, which includes 130 probability distributions, comes close to cultivating so many distributions in one coherent system. (To be fair, the online documentation for SAS, Matlab and SPSS is so distributed that it is difficult to determine how many probability distributions have been implemented in these software packages.)
While the Probability Distributions Task View may be the place to start for information about probability distributions, the complete R documentation is itself an open-ended, organic system that depends on the communication style of package authors and the experiences of everyone who leaves a record of their attempts to work with probability distributions.
The entire ecosystem of R documentation for a probability distribution function starts with the command line help (e.g. ?pgamma) and the PDF reference manual on CRAN for the package that includes the function, but may also include vignettes, external web pages, blog posts, and questions and discussions on help bulletin boards such as the R mailing lists and StackOverflow. For some typical examples, consider that the actuar package from Vincent Goulet et al., which provides a number of distributions of interest to actuaries, has six vignettes, while Thomas Yee's VGAM package for Vector Generalized Linear and Additive Models, a source for many R probability distributions, has a web page as well as a vignette.
John D. Cook’s clickable diagram for elementary probability distributions is hosted on his private website, while the paper by Delignette-Muller et al. on fitting distributions with R’s fitdistrplus package is hosted on an academic website. Mage's post from December 2011 on fitting distributions in R is an example of the many blog posts that deserve a second look.
As a final example of how the community comes to play a part in the extended documentation for R, consider my attempt to get a handle on the Cauchy distribution. I ran the code below and got four very different looking plots. This is not unexpected given that I’m working with random draws from a probability distribution for which neither the mean nor the variance is defined. But why only two bins for the histograms?
Well, I wasn’t the first person to pause for a moment over this. Someone recently asked this question on StackOverflow and received some good advice.
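The underlying issue is that a handful of extreme draws stretch the range of the data, so the default breaks chosen by hist() lump nearly all of the observations into a few very wide bins. One common workaround, which may or may not be the advice given in that thread, is to plot only a central window of the draws. The sketch below is a minimal version of that idea; the 90% window, the seed, the number of breaks and the variable names are all arbitrary choices made for this example.

```r
set.seed(2013)       # arbitrary seed, only for reproducibility
n        <- 10000
location <- -1
scale    <- 4
y <- rcauchy(n, location, scale)

# Number of bins hist() chooses by default over the full range of the draws
default_bins <- length(hist(y, plot = FALSE)$counts)

# Keep only draws inside the middle 90% of the theoretical distribution
win     <- qcauchy(c(0.05, 0.95), location, scale)
central <- y[y >= win[1] & y <= win[2]]
kept    <- length(central) / n   # fraction of draws that survive the cut

h <- hist(central, breaks = 50, freq = FALSE,
          main = "Central 90% of draws from rcauchy(-1,4)")

# The histogram is normalized over the kept draws only, so the theoretical
# density is rescaled by the kept fraction before it is overlaid.
curve(dcauchy(x, location, scale) / kept, add = TRUE, lwd = 2)
rug(central, col = "grey")
```

Comparing default_bins with length(h$counts) shows how much resolution the default histogram gives up to accommodate the extreme values.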
Hats off and thank you to everyone involved in cultivating R’s garden of probability distributions.
# Cauchy plots
n <- 10000
location <- -1
scale <- 4
par(mfrow = c(2, 2))  # Make four plots
for (i in 1:4) {
  y <- rcauchy(n, location, scale)
  hist(y, freq = FALSE, col = rainbow(6), main = "random draw from rcauchy(-1,4)")
  # Overlay the theoretical Cauchy density with the same location and scale
  fd <- function(y) dcauchy(y, location, scale)
  curve(fd, col = "black", add = TRUE, lwd = 2)
  rug(y, col = "grey")
}