Correlation is not transitive, in general at least
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Update Aug 10, 2019: I wrote a new blog post about the same as below but using a simulation approach.
Update Aug 27, 2019: Minor change in how equations are solved (from version 0.9.0.9122
).
Let ρXY be the correlation between the stochastic variables X and Y and similarly for ρXZ and ρYZ. If we know two of these, can we say anything about the third?
Yes, sometimes, but not always. Say we have ρXZ and ρYZ and they are both positive. Intuition would then make us believe that ρXY is probably also positive then. But that is not always the case.
To see why, we need the partial correlation of X and Y given Z:
ρXY∣Z=ρXY−ρXZρYZ√1−ρ2XZ√1−ρ2YZ
The partial correlation between X and Y given Z means the correlation between X and Y with the effect of Z removed. Let us make a small example using the mtcars
data:
x <- mtcars$mpg y <- mtcars$drat z <- mtcars$wt cor(x, y) ## [1] 0.6811719 x_without_z <- residuals(lm(x ~ z)) y_without_z <- residuals(lm(y ~ z)) cor(x_without_z, y_without_z) ## [1] 0.1806278 (cor(x, y) - cor(x, z)*cor(y, z))/(sqrt(1 - cor(x, z)^2)*sqrt(1 - cor(y, z)^2)) ## [1] 0.1806278
So the correlation between mpg
(miles per gallon) and drat
(rear axle ratio) is 0.6811719.
And the partial correlation between mpg
and draw
given wt
(weight in 1000 lbs) is 0.1806278.
We can rearrange the formula using Ryacas
(see e.g. this blog post, note that it I’m using the development version that is currently only available via GitHub):
library(Ryacas) pcor <- "rXYZ == (rXY - rXZ*rYZ)/(Sqrt(1 - rXZ^2)*Sqrt(1 - rYZ^2))" rXY <- pcor %>% y_fn("Solve", "rXY") %>% yac_str() rXY ## [1] "{rXY==(rXYZ-(-rXZ*rYZ)/(Sqrt(1-rXZ^2)*Sqrt(1-rYZ^2)))*Sqrt(1-rXZ^2)*Sqrt(1-rYZ^2)}" rXY %>% y_rmvars() %>% yac_str() ## [1] "{(rXYZ-(-rXZ*rYZ)/(Sqrt(1-rXZ^2)*Sqrt(1-rYZ^2)))*Sqrt(1-rXZ^2)*Sqrt(1-rYZ^2)}" rXY %>% y_fn("TeXForm") %>% yac_str() ## [1] "\\left( \\mathrm{ rXY } = \\left( \\mathrm{ rXYZ } - \\frac{ - \\mathrm{ rXZ } \\mathrm{ rYZ }}{\\sqrt{1 - \\mathrm{ rXZ } ^{2}} \\sqrt{1 - \\mathrm{ rYZ } ^{2}}} \\right) \\sqrt{1 - \\mathrm{ rXZ } ^{2}} \\sqrt{1 - \\mathrm{ rYZ } ^{2}}\\right) "
ρXY=(ρXY∣Z−−ρXZρYZ√1−ρ2XZ√1−ρ2YZ)√1−ρ2XZ√1−ρ2YZ=ρXY∣Z√1−ρ2XZ√1−ρ2YZ+ρXZρYZ
The partial correlation takes values between −1 and 1 so ρXY must lie in the interval
ρXZρYZ±√1−ρ2XZ√1−ρ2YZ.
Going back to the mtcars
example:
rXZ <- cor(x, z) rYZ <- cor(y, z) rXZ*rYZ + c(-1, 1)*sqrt(1 - rXZ^2)*sqrt(1 - rYZ^2) ## [1] 0.2692831 0.9670285 cor(x, y) ## [1] 0.6811719
We can illustrate this
library(tidyverse) grid <- expand.grid( rXZ = seq(-1, 1, len = 100), rYZ = seq(-1, 1, len = 100)) d <- grid %>% as_tibble() %>% mutate( rXY_l = rXZ*rYZ - sqrt(1 - rXZ^2)*sqrt(1 - rYZ^2), rXY_u = rXZ*rYZ + sqrt(1 - rXZ^2)*sqrt(1 - rYZ^2), ) %>% mutate(rXY_sign = case_when( rXY_l < 0 & rXY_u < 0 ~ "-1", rXY_l > 0 & rXY_u > 0 ~ "1", TRUE ~ "?" )) p <- ggplot(d, aes(rXZ, rYZ, fill = rXY_sign)) + geom_tile() + scale_fill_manual(expression(paste("Sign of ", rho[XY])), values = c("-1" = "red", "1" = "blue", "?" = "grey")) + labs(x = expression(rho[XZ]), y = expression(rho[YZ])) + theme_bw() + coord_equal() p
Interesting! We actually just found the circle with radius 1!
p + annotate("path", x = cos(seq(0, 2*pi, length.out = 100)), y = sin(seq(0, 2*pi, length.out = 100)), color = "green", size = 2)
We can also argue mathematically: Before we found that ρXY must lie in the interval
ρXZρYZ±√1−ρ2XZ√1−ρ2YZ.
If ρXY must be positive, then we must have that both limits are greater than zero, namely that
ρXZρYZ−√1−ρ2XZ√1−ρ2YZ>0ρXZρYZ+√1−ρ2XZ√1−ρ2YZ>0.
In other words, if we have two known correlations, ρXZ and ρYZ, then
sign(ρXY)={sign(ρXZ)sign(ρYZ)ρ2XZ+ρ2YZ>1not knownρ2XZ+ρ2YZ≤1.
Again, going back to the mtcars
example:
rXZ <- cor(x, z) rYZ <- cor(y, z) rXZ^2 + rYZ^2 ## [1] 1.260404
So the sign of ρXY can be known and is positive because
sign(rXZ) * sign(rYZ) ## [1] 1
This was also known from calculating the limits earlier.
Terence Tao has a mathematical treatment of the topic in his blog post “When is correlation transitive?”.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.