More on Exploring Correlations in R
[This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
About a year ago I wrote a post about producing scatterplot matrices in R. These are handy for quickly getting a sense of the correlations that exist in your data. Recently someone asked me to pull out some relevant statistics (correlation coefficient and p-value) into tabular format to publish beside a scatterplot matrix. The built-in cor() function will produce a correlation matrix, but what if you want p-values for those correlation coefficients? Also, instead of a matrix, how might you get these statistics in tabular format (variable i, variable j, r, and p, for each i–j combination)? Here’s the code (you’ll need the PerformanceAnalytics package to produce the plot).Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
## Correlation matrix with p-values. See http://goo.gl/nahmV for documentation of this function | |
cor.prob <- function (X, dfr = nrow(X) - 2) { | |
R <- cor(X, use="pairwise.complete.obs") | |
above <- row(R) < col(R) | |
r2 <- R[above]^2 | |
Fstat <- r2 * dfr/(1 - r2) | |
R[above] <- 1 - pf(Fstat, 1, dfr) | |
R[row(R) == col(R)] <- NA | |
R | |
} | |
## Use this to dump the cor.prob output to a 4 column matrix | |
## with row/column indices, correlation, and p-value. | |
## See StackOverflow question: http://goo.gl/fCUcQ | |
flattenSquareMatrix <- function(m) { | |
if( (class(m) != "matrix") | (nrow(m) != ncol(m))) stop("Must be a square matrix.") | |
if(!identical(rownames(m), colnames(m))) stop("Row and column names must be equal.") | |
ut <- upper.tri(m) | |
data.frame(i = rownames(m)[row(m)[ut]], | |
j = rownames(m)[col(m)[ut]], | |
cor=t(m)[ut], | |
p=m[ut]) | |
} | |
# get some data from the mtcars built-in dataset | |
mydata <- mtcars[, c(1,3,4,5,6)] | |
# correlation matrix | |
cor(mydata) | |
# correlation matrix with p-values | |
cor.prob(mydata) | |
# "flatten" that table | |
flattenSquareMatrix(cor.prob(mydata)) | |
# plot the data | |
library(PerformanceAnalytics) | |
chart.Correlation(mydata) |
The cor() function will produce a basic correlation matrix. 12 years ago Bill Venables provided a function on the R help mailing list for replacing the upper triangle of the correlation matrix with the p-values for those correlations (based on the known relationship between t and r). The cor.prob() function will produce this matrix.
Finally, the flattenSquareMatrix() function will “flatten” this matrix to four columns: one column for variable i, one for variable j, one for their correlation, and another for their p-value (thanks to Chris Wallace on StackOverflow for helping out with this one).
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
> cor(mydata) | |
mpg disp hp drat wt | |
mpg 1.0000000 -0.8475514 -0.7761684 0.6811719 -0.8676594 | |
disp -0.8475514 1.0000000 0.7909486 -0.7102139 0.8879799 | |
hp -0.7761684 0.7909486 1.0000000 -0.4487591 0.6587479 | |
drat 0.6811719 -0.7102139 -0.4487591 1.0000000 -0.7124406 | |
wt -0.8676594 0.8879799 0.6587479 -0.7124406 1.0000000 | |
> cor.prob(mydata) | |
mpg disp hp drat wt | |
mpg NA 9.380327e-10 1.787835e-07 1.776240e-05 1.293958e-10 | |
disp -0.8475514 NA 7.142679e-08 5.282022e-06 1.222322e-11 | |
hp -0.7761684 7.909486e-01 NA 9.988772e-03 4.145827e-05 | |
drat 0.6811719 -7.102139e-01 -4.487591e-01 NA 4.784260e-06 | |
wt -0.8676594 8.879799e-01 6.587479e-01 -7.124406e-01 NA | |
> flattenSquareMatrix(cor.prob(mydata)) | |
i j cor p | |
1 mpg disp -0.8475514 9.380327e-10 | |
2 mpg hp -0.7761684 1.787835e-07 | |
3 disp hp 0.7909486 7.142679e-08 | |
4 mpg drat 0.6811719 1.776240e-05 | |
5 disp drat -0.7102139 5.282022e-06 | |
6 hp drat -0.4487591 9.988772e-03 | |
7 mpg wt -0.8676594 1.293958e-10 | |
8 disp wt 0.8879799 1.222322e-11 | |
9 hp wt 0.6587479 4.145827e-05 | |
10 drat wt -0.7124406 4.784260e-06 |
Finally, the chart.Correlation() function from the PerformanceAnalytics package produces a very nice scatterplot matrix, with histograms, kernel density overlays, absolute correlations, and significance asterisks (0.05, 0.01, 0.001):
To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.