Create a publication-ready correlation matrix, with significance levels, in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In most (observational) research papers you read, you will probably run into a correlation matrix. Often it looks something like this:
[Image: example correlation matrix (alt text: "FACTOR ANALYSIS")]
In the social sciences, such as psychology, researchers like to denote the statistical significance levels of the correlation coefficients, often using asterisks (i.e., *). The table then looks more like this:
[Image: correlation table with significance asterisks (alt text: "Table 4 from Family moderators of relation between community ...")]
Regardless of my personal preferences and opinions, I had to make many of these tables for the scientific (non-)publications of my Ph.D.
I remember that, when I first started using R, I found it quite difficult to generate these correlation matrices automatically.
Yes, there is the `cor` function, but it does not include significance levels.
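To make the gap concrete, here is a minimal base-R sketch. The built-in `mtcars` data and the three columns are my own choice for illustration, not something from the original post. `cor` returns the coefficients only, while `cor.test` does give a p-value but handles just one pair of variables at a time.

```r
# Built-in example data; the three columns are an arbitrary choice for illustration
df <- mtcars[, c("mpg", "hp", "wt")]

# cor() returns a numeric matrix of coefficients, with no p-values attached
r <- cor(df)
print(round(r, 3))

# cor.test() reports a p-value, but only for a single pair of variables
test <- cor.test(df$mpg, df$hp)
print(test$estimate)  # Pearson's r for mpg vs. hp
print(test$p.value)   # the p-value for that one pair
```

Getting a full matrix of p-values this way would mean looping over every pair of columns yourself.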
Then there is the (in)famous `Hmisc` package, with its `rcorr` function. But this tool brings a whole new range of issues.
What’s this `storage.mode`, and what are we trying to coerce again?
Soon you figure out that `Hmisc::rcorr` only takes in matrices (thus with only numeric values). Hurray, now you can run a correlation analysis on your dataframe, you think…
Yet, the output is anything but publication-ready!
You wanted one correlation matrix, but now you have two… Double the trouble?
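As a rough sketch of what that looks like, assuming the `Hmisc` package is installed (the `mtcars` columns are again just an illustrative choice): the coefficients and the p-values come back as two separate matrices inside one list.

```r
# Hmisc is not part of base R; this sketch only runs if it is installed
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # rcorr() wants a numeric matrix, so convert the (all-numeric) data frame first
  x <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
  res <- Hmisc::rcorr(x, type = "pearson")

  print(names(res))        # coefficients and p-values live in separate elements
  print(round(res$r, 3))   # matrix of correlation coefficients
  print(signif(res$P, 3))  # matrix of p-values
}
```

Merging `res$r` and `res$P` into a single formatted table is the manual step that the function in this post is meant to take off your hands.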
To spare future scholars the struggles of early-day R programming, I would like to share my custom function `correlation_matrix`.
My `correlation_matrix` takes in a dataframe, selects only the numeric (and boolean/logical) columns, calculates the correlation coefficients and p-values, and outputs a fully formatted publication-ready correlation matrix!
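The column-selection step can be sketched in base R as follows. This is a simplified illustration on the built-in `iris` data, not the full function; the variable names `keep` and `numeric_part` are made up for this example.

```r
# iris has four numeric columns and one factor column (Species)
keep <- vapply(iris, function(col) is.numeric(col) || is.logical(col), logical(1))
print(keep)

# Keep only the numeric/logical columns and convert them to a matrix
numeric_part <- iris[keep]
x <- as.matrix(numeric_part)
print(dim(x))
```

The factor column is dropped before the correlations are computed, which is why a mixed dataframe can be passed in without preparing it first.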
You can specify many formatting options in `correlation_matrix`.
For instance, you can use only 2 decimals. You can focus on the lower triangle (as the lower and upper triangle values are identical). And you can drop the diagonal values:
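What these options do can be illustrated with a small base-R sketch. It mimics the formatting steps on a plain `cor` matrix and is not the function itself; the `mtcars` columns are an arbitrary choice.

```r
R <- cor(mtcars[, c("mpg", "hp", "wt")])

# Format with exactly 2 decimals; formatC() keeps the matrix shape and names
out <- formatC(R, format = "f", digits = 2)

# Blank the upper triangle and the diagonal, leaving only the lower triangle
out[upper.tri(out, diag = TRUE)] <- ""
print(out, quote = FALSE)
```

Going by the parameters documented in the source code at the end of this post, the corresponding call would presumably be `correlation_matrix(df, digits = 2, use = "lower", replace_diagonal = TRUE)`, which additionally appends the significance stars.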
Or maybe you are interested in a different type of correlation coefficients, and not so much in significance levels:
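As a quick illustration of a different coefficient type, here is a base-R sketch comparing Pearson and Spearman correlations on a few `mtcars` columns (my own choice of data). It uses `cor` directly, so it shows coefficients without any significance markers.

```r
df <- mtcars[, c("mpg", "hp", "wt")]

pearson  <- cor(df, method = "pearson")
spearman <- cor(df, method = "spearman")
print(round(pearson, 3))
print(round(spearman, 3))

# Spearman's rho is Pearson's r computed on the ranks of each variable
ranks <- as.data.frame(lapply(df, rank))
print(all.equal(spearman, cor(ranks)))
```

Per the documented parameters, the matching call with the function from this post would presumably be `correlation_matrix(df, type = "spearman", show_significance = FALSE)`.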
For other formatting options, do have a look at the source code below.
Now, to make matters even easier, I wrote a second function (`save_correlation_matrix`) to directly save any created correlation matrices:
Once you open your new correlation matrix file in Excel, it is immediately ready to be copy-pasted into Word!
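To see what ends up in such a file, here is a small base-R sketch that writes a made-up character matrix (standing in for a formatted correlation matrix) with `write.csv2`, the function used in the source code at the end of this post. The values and the temporary file are purely illustrative.

```r
# A tiny character matrix standing in for a formatted correlation matrix (made-up values)
m <- matrix(c("1.000***", "0.500** ", "0.500** ", "1.000***"), ncol = 2,
            dimnames = list(c("a", "b"), c("a", "b")))

# write.csv2() writes semicolon-separated values
path <- tempfile(fileext = ".csv")
write.csv2(m, file = path)

# Read the raw lines back to inspect the file, then clean up
lines <- readLines(path)
print(lines)
unlink(path)
```

Semicolon-separated files tend to open cleanly in Excel under locales that use a comma as decimal mark. If your Excel expects comma-separated files, `write.csv` may be the better fit.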
If you are looking for ways to visualize your correlations, do have a look at the packages `corrr` and `corrplot`.
I hope my functions are of help to you!
Do reach out if you get to use them in any of your research papers!
I would be super interested and feel honored.
correlation_matrix
```r
#' correlation_matrix
#' Creates a publication-ready / formatted correlation matrix, using `Hmisc::rcorr` in the backend.
#'
#' @param df dataframe; containing numeric and/or logical columns to calculate correlations for
#' @param type character; specifies the type of correlations to compute; gets passed to `Hmisc::rcorr`; options are `"pearson"` or `"spearman"`; defaults to `"pearson"`
#' @param digits integer/double; number of decimals to show in the correlation matrix; gets passed to `formatC`; defaults to `3`
#' @param decimal.mark character; which decimal.mark to use; gets passed to `formatC`; defaults to `.`
#' @param use character; which part of the correlation matrix to display; options are `"all"`, `"upper"`, `"lower"`; defaults to `"all"`
#' @param show_significance boolean; whether to add `*` to represent the significance levels for the correlations; defaults to `TRUE`
#' @param replace_diagonal boolean; whether to replace the correlations on the diagonal; defaults to `FALSE`
#' @param replacement character; what to replace the diagonal and/or upper/lower triangles with; defaults to `""` (empty string)
#'
#' @return a correlation matrix
#' @export
#'
#' @examples
#' correlation_matrix(iris)
#' correlation_matrix(mtcars)
correlation_matrix <- function(df,
                               type = "pearson",
                               digits = 3,
                               decimal.mark = ".",
                               use = "all",
                               show_significance = TRUE,
                               replace_diagonal = FALSE,
                               replacement = "") {

  # check arguments (one condition per argument, so each one is tested)
  stopifnot(
    is.numeric(digits),
    digits >= 0,
    use %in% c("all", "upper", "lower"),
    is.logical(replace_diagonal),
    is.logical(show_significance),
    is.character(replacement)
  )

  # we need the Hmisc package for this
  if (!requireNamespace("Hmisc", quietly = TRUE)) {
    stop("Package 'Hmisc' is required; install it with install.packages('Hmisc')")
  }

  # retain only numeric and boolean columns
  isNumericOrBoolean <- vapply(df, function(x) is.numeric(x) || is.logical(x), logical(1))
  if (any(!isNumericOrBoolean)) {
    cat('Dropping non-numeric/-boolean column(s):',
        paste(names(isNumericOrBoolean)[!isNumericOrBoolean], collapse = ', '), '\n\n')
  }
  df <- df[isNumericOrBoolean]

  # transform input data frame to matrix
  x <- as.matrix(df)

  # run correlation analysis using Hmisc package
  correlation_matrix <- Hmisc::rcorr(x, type = type)
  R <- correlation_matrix$r # Matrix of correlation coefficients
  p <- correlation_matrix$P # Matrix of p-values

  # transform correlations to specific character format
  Rformatted <- formatC(R, format = 'f', digits = digits, decimal.mark = decimal.mark)

  # if there are any negative numbers, we want to put a space before the non-negatives to align all
  if (any(R < 0, na.rm = TRUE)) {
    Rformatted <- ifelse(!is.na(R) & R >= 0, paste0(' ', Rformatted), Rformatted)
  }

  # add significance levels if desired
  if (show_significance) {
    # define notations for significance levels; spacing is important:
    # every label is padded to three characters so the columns stay aligned
    stars <- ifelse(is.na(p), "   ",
                    ifelse(p < .001, "***",
                           ifelse(p < .01, "** ",
                                  ifelse(p < .05, "*  ", "   "))))
    Rformatted <- paste0(Rformatted, stars)
  }

  # build a new matrix that includes the formatted correlations and their significance stars
  Rnew <- matrix(Rformatted, ncol = ncol(x))
  rownames(Rnew) <- colnames(x)
  colnames(Rnew) <- paste(colnames(x), "", sep = " ")

  # replace undesired values
  if (use == 'upper') {
    Rnew[lower.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (use == 'lower') {
    Rnew[upper.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (replace_diagonal) {
    diag(Rnew) <- replacement
  }

  return(Rnew)
}
```
save_correlation_matrix
```r
#' save_correlation_matrix
#' Creates and saves to file a fully formatted correlation matrix, using `correlation_matrix` and `Hmisc::rcorr` in the backend
#' @param df dataframe; passed to `correlation_matrix`
#' @param filename either a character string naming a file or a connection open for writing. "" indicates output to the console; passed to `write.csv2`
#' @param ... any other arguments passed to `correlation_matrix`
#'
#' @return NULL
#'
#' @examples
#' save_correlation_matrix(df = iris, filename = 'iris-correlation-matrix.csv')
#' save_correlation_matrix(df = mtcars, filename = 'mtcars-correlation-matrix.csv', digits = 3, use = 'lower')
save_correlation_matrix <- function(df, filename, ...) {
  # write.csv2 produces a semicolon-separated file
  write.csv2(correlation_matrix(df, ...), file = filename)
}
```