Except for maybe the t test, a contender for the title “most used and abused statistical test” is Pearson’s correlation test. Whenever someone wants to check whether two variables relate somehow, it is a safe bet (at least in psychology) that the first thing to be tested is the strength of a Pearson’s correlation. Only if that doesn’t work is a more sophisticated analysis attempted (“That p-value is still too big, maybe a mixed-models logistic regression will make it smaller…”). One reason for this is that Pearson’s correlation test is conceptually quite simple and has assumptions that make it applicable in many situations (but it is definitely also used in many situations where the underlying assumptions are violated).
Since I’ve converted to “Bayesianism” I’ve been trying to figure out which Bayesian analyses correspond to the classical ones. For t tests, chi-square tests and Anovas I’ve found Bayesian versions that, at least conceptually, test the same thing. Here are links to Bayesian versions of the t test, a chi-square test and an Anova, if you’re interested. But for some reason I’ve never encountered a discussion of what a Pearson’s correlation test would correspond to in a Bayesian context. Maybe this is because regression modeling can often fill the same role as correlation testing (quantifying relations between continuous variables), or perhaps I’ve just been looking in the wrong places.
The aim of this post is to explain how one can run the Bayesian counterpart of a Pearson’s correlation test using R and JAGS. The model that a classical Pearson’s correlation test assumes is that the data follow a bivariate normal distribution. That is, if we have a list $x$ of pairs of data points $[[x_{1,1},x_{1,2}],[x_{2,1},x_{2,2}],[x_{3,1},x_{3,2}],…]$, then the $x_{i,1} \text{s}$ and the $x_{i,2} \text{s}$ are each assumed to be normally distributed, with a possible linear dependency between them. This dependency is quantified by the correlation parameter $\rho$, which is what we want to estimate in a correlation analysis. A good visualization of a bivariate normal distribution with $\rho = 0.3$ can be found on the Wikipedia page on the multivariate normal distribution:
We will assume the same model, and our Bayesian correlation analysis then reduces to estimating the parameters of a bivariate normal distribution given some data. One problem is that the bivariate normal distribution (and, more generally, the multivariate normal distribution) isn’t parameterized using $\rho$, so we cannot estimate $\rho$ directly. The bivariate normal is parameterized using $\mu_1$ and $\mu_2$, the means of the two marginal distributions (the red and blue normal distributions in the graph above), and a covariance matrix $\Sigma$ which defines $\sigma_1^2$ and $\sigma_2^2$, the variances of the two marginal distributions, as well as the covariance, a measure of how much the two variables vary together. The covariance corresponding to a correlation of $\rho$ can be calculated as $\rho \cdot \sigma_1 \cdot \sigma_2$. So here is the model we want to estimate:

$$[x_{i,1}, x_{i,2}] \sim MultivariateNormal([\mu_1, \mu_2], \Sigma)$$

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho \cdot \sigma_1 \cdot \sigma_2 \\ \rho \cdot \sigma_1 \cdot \sigma_2 & \sigma_2^2 \end{bmatrix}$$
Add some flat priors on this (which could, of course, be made more informative) and we’re ready to roll:
$$\mu_1,\mu_2 \sim Normal(0, 1000)$$
$$\sigma_1,\sigma_2 \sim Uniform(0, 1000)$$
$$\rho \sim Uniform(-1, 1)$$
Implementation
So, how do we implement this model? I’m going to do it with R and the JAGS sampler, interfaced with R using the rjags package. First I’m going to simulate some data with a correlation of -0.7 to test the model with.
library(rjags)
library(mvtnorm) # to generate correlated data with rmvnorm.
library(car) # To plot the estimated bivariate normal distribution.
set.seed(31415)
# "True" parameter values used to simulate the data.
mu <- c(10, 30)
sigma <- c(20, 40)
rho <- -0.7
# Build the covariance matrix from the standard deviations and the correlation.
cov_mat <- rbind(c(sigma[1]^2,                sigma[1] * sigma[2] * rho),
                 c(sigma[1] * sigma[2] * rho, sigma[2]^2))
# Simulate 30 data points from the bivariate normal distribution and plot them.
x <- rmvnorm(30, mu, cov_mat)
plot(x, xlim = c(-125, 125), ylim = c(-100, 150))
The simulated data looks quite correlated and a classical Pearson’s correlation test confirms this:
cor.test(x[, 1], x[, 2])
##
## Pearson’s product-moment correlation
##
## data: x[, 1] and x[, 2]
## t = -6.6, df = 28, p-value = 3.704e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8902 -0.5841
## sample estimates:
## cor
## -0.7802
The JAGS model below implements the bivariate normal model described above. One difference is that JAGS parameterizes the normal and multivariate normal distributions with precision instead of standard deviation or variance. The precision is the inverse of the variance, so in order to use a covariance matrix as a parameter to dmnorm we have to invert it first using the inverse function.
model_string <- "
model {
  for(i in 1:n) {
    x[i, 1:2] ~ dmnorm(mu[], prec[ , ])
  }

  # Constructing the covariance matrix and the corresponding precision matrix.
  prec[1:2, 1:2] <- inverse(cov[ , ])
  cov[1, 1] <- sigma[1] * sigma[1]
  cov[1, 2] <- sigma[1] * sigma[2] * rho
  cov[2, 1] <- sigma[1] * sigma[2] * rho
  cov[2, 2] <- sigma[2] * sigma[2]

  # Flat priors on all parameters which could, of course, be made more informative.
  sigma[1] ~ dunif(0, 1000)
  sigma[2] ~ dunif(0, 1000)
  rho ~ dunif(-1, 1)
  mu[1] ~ dnorm(0, 0.001)
  mu[2] ~ dnorm(0, 0.001)

  # Generate random draws from the estimated bivariate normal distribution.
  x_rand ~ dmnorm(mu[], prec[ , ])
}
"
An extra feature is that the model above generates random samples (x_rand) from the estimated bivariate normal distribution. These samples can be compared to the actual data in order to get a sense of how well the model fits the data. Now let’s use JAGS to sample from this model. I’m using the textConnection trick (described here) in order to run the model without having to first save the model string to a file.
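The exact call isn’t reproduced here, but a minimal sketch, assuming the simulated data matrix x and the model_string defined above (the monitored variables, number of chains and number of iterations match the summary output below), could look like this:

data_list <- list(x = x, n = nrow(x))
# Compile the model and run 3 chains with 1000 adaptation iterations,
# then draw 5000 samples per chain of the monitored parameters.
jags_model <- jags.model(textConnection(model_string), data = data_list,
                         n.chains = 3, n.adapt = 1000)
mcmc_samples <- coda.samples(jags_model,
                             variable.names = c("mu", "rho", "sigma", "x_rand"),
                             n.iter = 5000)
# Trace and density plots for a quick convergence check.
plot(mcmc_samples)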
The trace plots look sufficiently furry and stationary, and the density plots suggest that the model has captured the “true” parameters (those used when we simulated the data) well. Looking at point estimates and credible intervals confirms this:
summary(mcmc_samples)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu[1] 9.621 4.3713 0.035692 0.07107
## mu[2] 31.988 8.1653 0.066669 0.13381
## rho -0.755 0.0816 0.000666 0.00142
## sigma[1] 24.531 3.3368 0.027245 0.05816
## sigma[2] 46.230 6.2924 0.051377 0.11245
## x_rand[1] 9.513 24.9313 0.203563 0.21112
## x_rand[2] 32.353 47.1290 0.384806 0.39614
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu[1] 0.970 6.779 9.640 12.485 18.178
## mu[2] 15.671 26.551 31.969 37.371 48.264
## rho -0.881 -0.814 -0.766 -0.708 -0.564
## sigma[1] 19.042 22.181 24.174 26.517 32.003
## sigma[2] 35.662 41.750 45.583 49.980 60.509
## x_rand[1] -39.002 -6.852 9.658 25.964 58.649
## x_rand[2] -60.836 1.619 32.223 63.050 125.009
The median of the rho parameter is -0.76 (95% credible interval: [-0.88, -0.56]), close to the true parameter value of -0.7. This is not that surprising, since we used a bivariate normal distribution both to generate and to model the data, but it is nice when things work out 🙂 Now let’s calculate the probability that there actually is a negative correlation.
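One way to do this, sketched here under the assumption that mcmc_samples is the object created in the sampling step above, is to count the proportion of posterior draws of rho that fall below zero:

samples_mat <- as.matrix(mcmc_samples)  # collapse the three chains into one matrix
mean(samples_mat[, "rho"] < 0)          # proportion of posterior draws with rho < 0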
Given the model and the data, the probability that there is a negative correlation appears to be 100%. Accounting for MCMC error, I think it is fair to downgrade this to at least 99.9%. We can also use the random samples from the bivariate normal distribution estimated by the model to check how well the model fits the data. The following code plots the original data with two superimposed ellipses, where the inner ellipse covers 50% of the density of the estimated distribution and the outer ellipse covers 95%.
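A sketch of such a plot, assuming the samples_mat matrix from above and using dataEllipse from the car package loaded at the start (the ellipses are based on the posterior predictive draws x_rand):

plot(x, xlim = c(-125, 125), ylim = c(-100, 150))
# Superimpose 50% and 95% data ellipses computed from the x_rand draws.
dataEllipse(samples_mat[, "x_rand[1]"], samples_mat[, "x_rand[2]"],
            levels = c(0.5, 0.95), plot.points = FALSE, add = TRUE)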
So let’s use this model on some real data. The data.frame below contains the names, weights in kg and finishing times for all participants of the men’s 100 m semi-finals in the 2013 World Championships in Athletics. Well, those I could find the weights of anyway…
So, I know nothing about running (and I’m not sure this is a very representative data set…), but my hypothesis is that there should be a positive correlation between weight and finishing time. That is, the more you weigh, the slower you run. Sounds logical, right? Let’s look at the data…
plot(d$time, d$weight)
At first glance it seems like my hypothesis is not supported by the data. I wonder what our model has to say about that?
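Re-running the same JAGS model on the runner data is straightforward; here is a sketch, assuming a data frame d with the columns time and weight used in the plot above, this time monitoring only rho:

runner_list <- list(x = cbind(d$time, d$weight), n = nrow(d))
jags_model <- jags.model(textConnection(model_string), data = runner_list,
                         n.chains = 3, n.adapt = 1000)
mcmc_samples <- coda.samples(jags_model, variable.names = "rho", n.iter = 5000)
summary(mcmc_samples)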
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## -0.09905 0.22054 0.00180 0.00243
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## -0.5084 -0.2548 -0.1053 0.0498 0.3455
Seems like there is no support for my hypothesis: the posterior distribution of rho is centered around zero and, if anything, there might be a tiny negative correlation. So your weight doesn’t seem to influence how fast you run (at least not if you are a runner in the 100 m semi-finals).