The Gaussian vector is extremely interesting since it remains Gaussian when conditioning. More precisely, if

$$\boldsymbol{X}=\begin{pmatrix}\boldsymbol{X}_1\\\boldsymbol{X}_2\end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix}\boldsymbol{\mu}_1\\\boldsymbol{\mu}_2\end{pmatrix},\begin{pmatrix}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\\boldsymbol{\Sigma}_{21}&\boldsymbol{\Sigma}_{22}\end{pmatrix}\right),$$

then

$$\boldsymbol{X}_1\mid\boldsymbol{X}_2=\boldsymbol{x}_2\sim\mathcal{N}\left(\boldsymbol{\mu}_1+\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\boldsymbol{x}_2-\boldsymbol{\mu}_2),\boldsymbol{\Sigma}^{\star}\right),$$

where the covariance matrix of the conditional random vector is

$$\boldsymbol{\Sigma}^{\star}=\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21},$$

i.e. it is the Schur complement of block $\boldsymbol{\Sigma}_{22}$ in $\boldsymbol{\Sigma}$. Take $\boldsymbol{X}=(X_1,X_2,X_3)$ and condition on the third component: it is then possible to derive explicitly the correlation of $X_1$ and $X_2$ given $X_3=x_3$ (which, remarkably, does not depend on $x_3$),
> library(mnormt)
> SIGMA=matrix(c(1,.6,.7,.6,1,.8,.7,.8,1),3,3)
> SIGMA11=SIGMA[1:2,1:2]    # block of (X1,X2)
> SIGMA21=SIGMA[3,1:2]
> SIGMA12=SIGMA[1:2,3]
> SIGMA22=SIGMA[3,3]        # block of X3
> S=SIGMA11-SIGMA12%*%solve(SIGMA22)%*%SIGMA21    # Schur complement
> S[1,2]/sqrt(S[1,1]*S[2,2])                      # conditional correlation
[1] 0.09335201
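As a quick sanity check (this snippet is mine, not from the original post), point conditioning can be approximated by keeping only the simulations where $X_3$ falls in a narrow band around a given value, say 0; the empirical correlation of $(X_1,X_2)$ on that subset should then be close to the 0.0933 obtained above,

set.seed(1)
X=rmnorm(n=5000000,varcov=SIGMA)
band=abs(X[,3])<.01          # keep only draws with X3 close to 0
cor(X[band,1],X[band,2])     # should be close to 0.0933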
This concept is discussed in the paper Partial Correlation and Conditional Correlation as Measures of Conditional Independence by Kunihiro Baba, Ritei Shibata and Masaaki Sibuya. But what about statistical inference? In order to see what happens, let us run some simulations, and condition on the deciles of $X_3$,
> k=10
> X=rmnorm(n=10000000,varcov=SIGMA)
> QZ=qnorm((0:k)/k)     # decile bounds of the N(0,1) distribution of X3
> ZQ=cut(X[,3],QZ)      # decile bin in which each X3 falls
> LZ=levels(ZQ)
> correl=rep(NA,k)
> for(i in 1:k){
+ correl[i]=cor(X[ZQ==LZ[i],1:2])[1,2]    # empirical correlation within the bin
+ }
> barplot(correl,names.arg=LZ,col="light blue",ylim=range(correl))
or some percentiles,
> k=100
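(the rest of the code above being re-run with this new value of k; wrapping it in a small helper, whose name is my own and not from the original post, makes the re-run explicit)

# hypothetical helper: conditional correlation per bin of X3
cond_correl=function(k,X){
  QZ=qnorm((0:k)/k)        # bin bounds on the X3 scale
  ZQ=cut(X[,3],QZ)
  sapply(levels(ZQ),function(l) cor(X[ZQ==l,1:2])[1,2])
}
correl=cond_correl(100,X)
barplot(correl,col="light blue",ylim=range(correl))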
In the middle it looks fine (we can look at it more carefully and fit a spline curve, see the sketch below), but for very large values of $X_3$ (and symmetrically for very small ones), the estimated correlation is quite different.
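Here is one way to look at it more carefully (a sketch of my own, not code from the original post): plot each bin's estimate against the Gaussian quantile at the middle of the bin, and add a smoothing spline,

mid=qnorm(((1:k)-.5)/k)     # X3 value at the middle of each percentile bin
plot(mid,correl,pch=19,cex=.5)
lines(smooth.spline(mid,correl),col="red",lwd=2)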
Note that all estimates (each bar) are based on roughly the same number of observations, since when conditioning we keep only the observations such that $X_3$ lies between two consecutive Gaussian quantiles, and each of these bins has probability $1/k$.
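This is easy to check (snippet of mine): with theoretical Gaussian quantiles as bounds, each bin should receive a proportion $1/k$ of the simulated values, up to sampling noise,

ZQ100=cut(X[,3],qnorm((0:100)/100))   # percentile bins of X3 (name is mine)
range(table(ZQ100))/nrow(X)           # smallest and largest bin proportions, both close to 1/k = 1%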
If this were only a small-sample issue, the extreme estimates should improve as the sample grows, so let us go back to the deciles ($k=10$) and plot each estimator as a function of the number of simulations (up to 1 million simulations, i.e. 100,000 conditional observations used to estimate the correlation). If we look at the third decile, it converges,
for the second one, it might be a bit slower,
but for the first decile… either it is extremely slow, or it has a strong bias. But I don’t get why… (One possibility: within each decile we condition on an interval of values of $X_3$, not on a point value, and the two extreme intervals are unbounded; the correlation given $X_3\in I$ need not equal the Schur-complement value, and the gap is largest where $\text{Var}(X_3\mid X_3\in I)$ is largest, i.e. in the tails. If so, the estimator converges, but not to 0.0934.)
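For reference, the convergence curves above can be reproduced with a sketch along these lines (the helper running_correl is a name of my own):

k=10
QZ=qnorm((0:k)/k)
ns=seq(1e4,1e6,by=1e4)
# running estimate of the conditional correlation in decile d, using the first n simulations
running_correl=function(d,ns){
  sapply(ns,function(n){
    Xn=X[1:n,]
    Zn=cut(Xn[,3],QZ)
    cor(Xn[Zn==levels(Zn)[d],1:2])[1,2]
  })
}
plot(ns,running_correl(1,ns),type="l")             # first decile
abline(h=S[1,2]/sqrt(S[1,1]*S[2,2]),col="red")     # Schur-complement value 0.0934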