Your strongly correlated data is probably nonsense
Use of the Pearson correlation coefficient is common in genomics and bioinformatics, which is OK as far as it goes (I have used it extensively myself), but it has some major drawbacks. The biggest is that Pearson can produce large coefficients when the data contain a few very large measurements, even just one.
This is best shown via example in R:
# let's correlate some random data
g1 <- rnorm(50)
g2 <- rnorm(50)
cor(g1, g2)
# [1] -0.1486646
So we get a small, negative correlation from correlating two sets of 50 random values. If we ran this 1000 times we would get a distribution of coefficients centred around zero, as expected.
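A quick way to check that claim is to simulate it. This is only a sketch: the seed, the replicate count and the name null_r are arbitrary choices for illustration, not part of the original analysis.

```r
# Assumption: seed and replicate count are arbitrary, chosen only for illustration
set.seed(1)
n_reps <- 1000

# correlate two fresh sets of 50 random values, n_reps times
null_r <- replicate(n_reps, cor(rnorm(50), rnorm(50)))

# the coefficients should be centred near zero
summary(null_r)
quantile(null_r, c(0.025, 0.975))
```

The quantiles give a rough idea of how large a coefficient can get by chance alone with 50 points.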
Let's add in a single, large value:
# let's correlate some random data with the addition of a single, large value
g1 <- c(g1, 10)
g2 <- c(g2, 11)
cor(g1, g2)
# [1] 0.6040776
Holy smokes, all of a sudden my random datasets are positively correlated with r > 0.6!
It's also significant.
> cor.test(g1, g2, method="pearson")

        Pearson's product-moment correlation

data:  g1 and g2
t = 5.3061, df = 49, p-value = 2.687e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3941015 0.7541199
sample estimates:
      cor
0.6040776
So if you have used Pearson in large datasets, you will almost certainly have some of these spurious correlations in your data.
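To see that the jump is not a fluke of one particular draw, the experiment can be repeated many times. The sketch below assumes the same outlier pair (10, 11) is appended to every replicate; the helper name pearson_with_outlier and the seed are hypothetical choices made for this illustration.

```r
# Assumption: arbitrary seed; every replicate gets the same appended pair (10, 11)
set.seed(1)

# Hypothetical helper: Pearson r for 50 random pairs plus one large pair
pearson_with_outlier <- function() {
  g1 <- c(rnorm(50), 10)
  g2 <- c(rnorm(50), 11)
  cor(g1, g2)
}

outlier_r <- replicate(1000, pearson_with_outlier())

# distribution of r once the single large pair is present
summary(outlier_r)

# fraction of replicates with r above 0.5
mean(outlier_r > 0.5)
```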
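How can you solve this? By using Spearman, of course. Spearman correlates the ranks of the values rather than the values themselves, so a huge measurement counts only as the highest rank: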
> cor(g1, g2, method="spearman")
[1] -0.0961086
> cor.test(g1, g2, method="spearman")

        Spearman's rank correlation rho

data:  g1 and g2
S = 24224, p-value = 0.5012
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho
-0.0961086
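Why the outlier loses its grip is easier to see by computing Spearman by hand as Pearson on the ranks. This is a sketch on freshly simulated data (arbitrary seed, illustrative names x and y), so the numbers will differ from the output above.

```r
# Assumption: arbitrary seed; x and y are simulated here, not the g1/g2 used above
set.seed(1)
x <- c(rnorm(50), 10)
y <- c(rnorm(50), 11)

# Spearman's rho is Pearson's r computed on the ranks
rho_builtin <- cor(x, y, method = "spearman")
rho_by_rank <- cor(rank(x), rank(y))
all.equal(rho_builtin, rho_by_rank)

# the appended pair is just the top-ranked observation in each vector
rank(x)[51]
rank(y)[51]
```

However large the appended values are, after ranking they sit one step above the next-largest observation, which is why they cannot dominate the coefficient.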