Stop using bivariate correlations for variable selection
Something I've never understood is the widespread calculation and reporting of univariate and bivariate statistics in applied work, especially when it comes to model selection. Bivariate statistics are, at best, useless for multivariate model selection and, at worst, harmful. Since nearly all questions of interest are high-dimensional, the use of bivariate statistics is worrisome.
What typically happens is that a researcher sits down with their statistical software of choice and computes the correlation between their response variable and each of their candidate predictors. From there, they toss out potential predictors that either have a low correlation with the response or whose correlation is not significant.
The concern is that the correlation between the marginal distributions of the response and a predictor can be nearly zero, or non-significant, even when that predictor is an important element of the data generating process.
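As a quick illustration of this point (a toy example of my own, separate from the simulation below), take two strongly negatively correlated predictors that both enter the true model with a coefficient of 1. The marginal correlation between the response and x1 is close to zero, so a correlation-based filter can easily discard x1, yet the joint regression recovers its effect:

set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- -0.99 * x1 + sqrt(1 - 0.99^2) * rnorm(n)  # x2 almost exactly cancels x1
y  <- x1 + x2 + rnorm(n, sd = 0.5)              # both predictors have a true coefficient of 1

cor(y, x1)                # close to zero; a marginal filter can easily drop x1
summary(lm(y ~ x1 + x2))  # yet x1's coefficient is estimated near 1 with a large t statistic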
Consider the following simulation, which illustrates the difference between the marginal relationship between the response and a predictor and the true multivariable relationship. The function genY generates the response using user-provided values for \( \beta \) and adds in a little white noise.
genY <- function(betas, X, stdPercent = 0.25) {
    modelMatrix <- cbind(1, X)
    yhat <- modelMatrix %*% betas
    # add white noise scaled to the mean of the linear predictor
    y <- yhat + rnorm(nrow(yhat), 0, stdPercent * mean(yhat))
    y
}
The function cor returns the correlation between two variables and works with a vector and a matrix argument, but cor.test requires two vectors. For this, I wrote a simple function, sigCor, that returns TRUE if the correlation between the vectors x and y is significant at a user-provided threshold and FALSE otherwise. This function is designed to work with apply to filter our predictor set.
sigCor <- function(x, y, threshold = 0.2) {
    p <- cor.test(x, y)$p.value
    p <= threshold
}
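As a quick sanity check (an illustrative example, not part of the simulation), sigCor returns TRUE for a clearly related pair of vectors, while an unrelated predictor only clears the default threshold at roughly the rate implied by its p value:

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
sigCor(x, y)           # TRUE: the correlation is strong and clearly significant
sigCor(rnorm(100), y)  # an unrelated predictor passes the 0.2 threshold only about 20% of the time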
The main function then generates a random covariance matrix and draws from a multivariate normal centered at 0 with that covariance matrix. The columns of the matrix X are then checked for correlation with the response y. If the absolute value of the correlation exceeds some threshold, or if the p value for the significance of the correlation is less than some threshold, the variable is kept; otherwise it is dropped. The function returns the number of coefficients (retained predictors plus an intercept) for the full model and for each selection method.
simulateSelection <- function(nx, n, corLim = 0.2, pLim = 0.2) {
    require(MASS)
    require(clusterGeneration)
    betas <- rep(1, nx + 1)
    mu <- rep(0, nx)
    # random covariance matrix and multivariate normal draw for the predictors
    sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
    X <- mvrnorm(n, mu, sigma)
    y <- genY(betas, X)
    # keep predictors that pass the correlation-size filter or the significance filter
    xcor <- cor(y, X)
    limitedX <- X[, abs(xcor) >= corLim]
    pX <- X[, apply(X, 2, sigCor, y, pLim)]
    # number of coefficients (predictors plus intercept) for the full model and each filter
    nCoef <- c(ncol(X) + 1, ncol(limitedX) + 1, ncol(pX) + 1)
    nCoef
}
If we keep the defaults of corLim = 0.2 and pLim = 0.2 (keep any predictor whose correlation with the response exceeds 0.2 in absolute value and any predictor whose correlation has a p value less than 0.2) and use replicate to run the function 1000 times with 10 predictors and \( n = 1000 \), we can calculate the probability that each method selects the correct model. (Note that we set all of the values of betas to 1, so all 11 coefficient estimates should be non-zero and all of the predictors are correlated with the outcome.)
results <- replicate(1000, simulateSelection(10, 1000))
correctModelFraction <- apply(results == 11, 1, mean)
correctModelFraction

## [1] 1.000 0.026 0.573
Using a filter based only on the size of the correlation yields the correct model just 2.6% of the time. Using a relatively high p value as a threshold does better, returning the correct model 57.3% of the time.
We can explore the effect of varying the value of pLim using a few simple modifications to the above code.
simulateSelectionP <- function(nx, n, pLim) {
    require(MASS)
    require(clusterGeneration)
    betas <- rep(1, nx + 1)
    mu <- rep(0, nx)
    sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
    X <- mvrnorm(n, mu, sigma)
    y <- genY(betas, X)
    pX <- X[, apply(X, 2, sigCor, y, pLim)]
    # TRUE when the p value filter retains every predictor
    correct <- ncol(pX) == nx
    correct
}

varyP <- function(nx, n, p) {
    fractionCorrect <- numeric(100 * length(p))
    j <- 1
    for (i in 1:length(p)) {
        set.seed(42)
        fractionCorrect[j:(j + 99)] <- replicate(100, simulateSelectionP(nx, n, p[i]))
        j <- j + 100
    }
    fractionCorrect
}
Using these new functions, we can calculate the fraction of the time that a threshold of p yields the correct model. If we vary p from \( 0.01 \) to \( 0.50 \) and plot the resulting curve,
prCorrect <- varyP(10, 1000, seq(0.01, 0.5, length.out = 20))
results <- data.frame(prCorrect = prCorrect,
                      limit = rep(seq(0.01, 0.5, length.out = 20), each = 100))
meanByLimit <- aggregate(results$prCorrect, list(results$limit), mean)
sdByLimit <- aggregate(results$prCorrect, list(results$limit), sd)
meanByLimit$sd <- sdByLimit$x
meanByLimit$upper <- meanByLimit$x + qnorm(0.975) * meanByLimit$sd/sqrt(100)
meanByLimit$lower <- meanByLimit$x - qnorm(0.975) * meanByLimit$sd/sqrt(100)

library(ggplot2)
ggplot(meanByLimit, aes(x = Group.1, y = x, ymin = lower, ymax = upper)) +
    geom_line() +
    geom_ribbon(alpha = 0.25) +
    theme_bw() +
    scale_x_continuous("Threshold for p value") +
    scale_y_continuous("Fraction of models in simulation that are correct")
we see that, as expected, the probability of getting the correct model goes up as we increase the threshold value. We have about a 75% shot at getting the right model using this method when we set the limit on the p value to 0.50, compared to under 30% at a p value of 0.01.
However, this raises an interesting question: if we increase the threshold for the p value, what is the value of using the bivariate correlation at all?
randomRate <- function(n, pLim) {
    # probability that pure noise passes the significance filter
    x <- rnorm(n)
    y <- rnorm(n)
    sigCor(x, y, pLim)
}

p <- seq(0.01, 0.5, length.out = 20)
rate <- numeric(length(p))
for (i in 1:length(p)) {
    rate[i] <- mean(replicate(1000, randomRate(1000, p[i])))
}
rbind(p, rate)

##        [,1]    [,2]    [,3]    [,4]   [,5]   [,6]   [,7]   [,8]   [,9]
## p     0.010 0.03579 0.06158 0.08737 0.1132 0.1389 0.1647 0.1905 0.2163
## rate  0.009 0.03900 0.06400 0.10100 0.1110 0.1290 0.1660 0.1870 0.2210
##       [,10]  [,11]  [,12]  [,13]  [,14]  [,15]  [,16]  [,17]  [,18]  [,19]
## p    0.2421 0.2679 0.2937 0.3195 0.3453 0.3711 0.3968 0.4226 0.4484 0.4742
## rate 0.2250 0.2920 0.3180 0.3180 0.3210 0.4010 0.3720 0.4560 0.4130 0.4660
##      [,20]
## p      0.5
## rate   0.5
As the threshold increases, the value of the bivariate correlation as a filter goes down, and it goes down fast. At the threshold required for a 50% chance that the selected model contains all of the relevant predictors, the rate at which pure noise variables are improperly included rises to nearly 20%.
The real problem comes into play here: the bivariate comparison selects the wrong variables by over-emphasizing the relationship between the marginal distributions. Because the selected model is incomplete and important variables are omitted, the resulting parameter estimates are biased and inaccurate.
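Here is a minimal sketch of that omitted-variable bias (my own toy example, separate from the simulation above): when two correlated predictors both belong in the model and one of them is dropped, the coefficient on the retained predictor absorbs part of the omitted variable's effect.

set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.5)  # x2 is correlated with x1
y  <- 1 + x1 + x2 + rnorm(n)         # both predictors have a true coefficient of 1

coef(lm(y ~ x1 + x2))  # both slopes estimated near their true value of 1
coef(lm(y ~ x1))       # with x2 omitted, the x1 slope is biased upward (toward roughly 1.7)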
The bivariate comparison is a terrible way to select relevant variables for a high-dimensional model because the function of interest relates all of the predictors to the outcome. It can neither rule a predictor into a model nor rule one out. A better approach is to specify the model using domain knowledge and a model of the expected data generating process. If the process is to remain strictly data driven, methods like cross-validation or AIC/BIC provide better measures of model quality and predictor importance than the bivariate correlation coefficient.
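For instance (a rough sketch only, reusing genY from above; comparing models by AIC is just one possible data-driven approach), we can fit the full model and the correlation-filtered model and let an information criterion adjudicate between them instead of relying on the marginal correlations:

library(MASS)
library(clusterGeneration)

set.seed(1)
nx    <- 10
n     <- 1000
sigma <- genPositiveDefMat(nx, covMethod = "unifcorrmat")$Sigma
X     <- mvrnorm(n, rep(0, nx), sigma)
y     <- as.numeric(genY(rep(1, nx + 1), X))

dat     <- data.frame(y = y, X)
fullFit <- lm(y ~ ., data = dat)

keep        <- as.vector(abs(cor(y, X)) >= 0.2)
filteredFit <- lm(y ~ ., data = data.frame(y = y, X[, keep, drop = FALSE]))

AIC(fullFit, filteredFit)  # the full (true) model should do at least as well as the filtered one
# stepAIC(fullFit) from MASS is another option if a search over candidate models is wanted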