[This article was first published on Adventures in Statistical Computing, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Let me preface this post by saying I am getting frustrated with R. The syntax is not intuitive and the performance for matrix operations is slow. Using Octave, a free Matlab clone, I can get over 6 Gflops on things that R is doing at less than 2. After this post, I will focus on the statistical functions of R and will do any matrix operations in Octave, or maybe NumPy.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
First, let’s look at a situation where you find your self with a non-positive semi-definite (PSD) matrix. Remember that the default chol() requires a positive definite (PD) matrix and we created a function that can handle a PSD matrix. Imagine if you have 3 time series with different sampling intervals — in our example we will have one that samples every day, another that samples M-W-F-Sa-Su, and one that samples T-R (Thursday) – Sa. Further we only have a limited set of data – 3 weeks in our sample. To create a correlation matrix, we would be forced to either only consider Saturdays (where all 3 series intersect), or to compute the correlations pairwise. With only 3 intersection points, we would probably choose the pairwise method to attempt to get a better handle on the true underlying process.
We will use SAS IML to construct the series and PROC CORR from SAS to calculate the pairwise correlations.
proc iml;
< o:p>/*True covariance of the system*/
C1 = {1 .5 0,
< o:p> .5 1 .25,
< o:p> 0 .25 1};
< o:p>
r = root(C1)`;n=21;
< o:p>out = J(n,3,.);
/*Generate 21 days worth of data*/
do i=1 to n;
< o:p> hold = J(1,3);
< o:p> do j=1 to 3;
< o:p> hold[j] = rannor(2);
< o:p> end;
< o:p>
hold = r * hold`;
out[i,] = hold`;
end;< o:p>< o:p>
create out from out;
< o:p>append from out;
< o:p>close out;
< o:p>quit;
/*Filter out the unobserved values*/
data out;
< o:p>format date date9.;
< o:p>set out;
< o:p>retain date “04DEC2011″d;
< o:p>if _n_ ^=1 then date = date + 1;
< o:p>
if ^(weekday(date) in (2, 4, 6, 7, 1)) then col2 = .;
< o:p>
if ^(weekday(date) in (3, 5, 7)) then col3 = .;
< o:p>run;
title “Observed Data”;
< o:p>proc print data=out noobs;
< o:p>run;
title “Pairwise Correlation, observed”;
< o:p>proc corr data=out(drop=date)
< o:p> out=corr_out(where=(_type_ = “CORR”))
< o:p> ;run;
data corr_out;
< o:p>set corr_out(drop=_type_ _name_);
< o:p>run;
title “Correlation Eigenvalues”;
< o:p>proc iml;
< o:p>use corr_out;
< o:p>read all into corr;
< o:p>close corr_out;< o:p>< o:p> e = eigval(corr);< o:p> print e;
< o:p>quit;
< o:p>
Here is the output:
Observed Data< o:p> |
date< o:p> | COL1< o:p> | COL2< o:p> | COL3< o:p> |
04DEC2011< o:p> | 1.31183< o:p> | 0.08083< o:p> | .< o:p> |
05DEC2011< o:p> | 2.04198< o:p> | 2.26740< o:p> | .< o:p> |
06DEC2011< o:p> | 0.55341< o:p> | .< o:p> | -0.86957< o:p> |
07DEC2011< o:p> | -0.29515< o:p> | 0.40556< o:p> | .< o:p> |
08DEC2011< o:p> | -0.11603< o:p> | .< o:p> | 0.24319< o:p> |
09DEC2011< o:p> | 1.24763< o:p> | 0.92012< o:p> | .< o:p> |
10DEC2011< o:p> | 0.65947< o:p> | -0.30911< o:p> | -0.28241< o:p> |
11DEC2011< o:p> | 0.69131< o:p> | 0.72022< o:p> | .< o:p> |
12DEC2011< o:p> | 1.83849< o:p> | 0.31602< o:p> | .< o:p> |
13DEC2011< o:p> | -0.35605< o:p> | .< o:p> | -0.16218< o:p> |
14DEC2011< o:p> | -0.96497< o:p> | -2.40601< o:p> | .< o:p> |
15DEC2011< o:p> | -0.42313< o:p> | .< o:p> | -0.16670< o:p> |
16DEC2011< o:p> | 0.36461< o:p> | -0.44278< o:p> | .< o:p> |
17DEC2011< o:p> | -1.97964< o:p> | -0.47143< o:p> | 1.70729< o:p> |
18DEC2011< o:p> | 0.56940< o:p> | 0.68279< o:p> | .< o:p> |
19DEC2011< o:p> | -0.72632< o:p> | -1.30090< o:p> | .< o:p> |
20DEC2011< o:p> | 0.51960< o:p> | .< o:p> | -0.15246< o:p> |
21DEC2011< o:p> | -0.73427< o:p> | -0.13240< o:p> | .< o:p> |
22DEC2011< o:p> | 0.34385< o:p> | .< o:p> | -0.79128< o:p> |
23DEC2011< o:p> | 0.00112< o:p> | -1.70289< o:p> | .< o:p> |
24DEC2011< o:p> | -0.38676< o:p> | -0.74608< o:p> | -0.16191< o:p> |
Pairwise Correlation, observed< o:p> |
The CORR Procedure< o:p>
3 Variables:< o:p> | COL1 COL2 COL3< o:p> |
Simple Statistics< o:p> | ||||||
Variable< o:p> | N< o:p> | Mean< o:p> | Std Dev< o:p> | Sum< o:p> | Minimum< o:p> | Maximum< o:p> |
COL1< o:p> | 21< o:p> | 0.19811< o:p> | 0.96281< o:p> | 4.16039< o:p> | -1.97964< o:p> | 2.04198< o:p> |
COL2< o:p> | 15< o:p> | -0.14124< o:p> | 1.14684< o:p> | -2.11865< o:p> | -2.40601< o:p> | 2.26740< o:p> |
COL3< o:p> | 9< o:p> | -0.07067< o:p> | 0.74955< o:p> | -0.63602< o:p> | -0.86957< o:p> | 1.70729< o:p> |
Pearson Correlation Coefficients Prob > |r| under H0: Rho=0 Number of Observations< o:p> | ||||||||||||
COL1< o:p> | COL2< o:p> | COL3< o:p> | ||||||||||
COL1< o:p> |
|
|
| |||||||||
COL2< o:p> |
|
|
| |||||||||
COL3< o:p> |
|
|
|
Correlation Eigenvalues |
e |
---|
2.0606421 |
1.0896653 |
-0.150307 |
The Matrix package from R has a function called nearPD(). This function has some nice features such as requiring the diagonal to remain constant (useful for a covariance matrix) and a flag to make sure the output is still a correlation matrix. Not sure what the difference between those two would be if I used a correlation matrix as the input… The output also include the new eigenvalues, and the Frobenius Norm (a measure of closeness for matrices).
Rebonato and Jackel (1999) define 2 methods in their paper for finding a PSD matrix. The first is rather involved and the second, the Spectral Decomposition, is very compact. The Spectral Decomposition is what we will emulate. Take a second to go read that section (pp 7 – 9).
Here it is in R:
nearPSD = function(c){Using the example from the paper, we see this duplicates their results:
n = dim(c)[1]
e = eigen(c,sym=TRUE)
val = e$values * (e$values > 0)
vec = e$vectors
T = sqrt(diag( as.vector(1/(vec^2 %*% val)),n,n))
B = T %*% vec %*% diag(as.vector(sqrt(val)),n,n)
out = B %*% t(B)
return(out)
}
> c
[,1] [,2] [,3]
[1,] 1.0 0.9 0.7
[2,] 0.9 1.0 0.3
[3,] 0.7 0.3 1.0
> nearPSD(c)
[,1] [,2] [,3]
[1,] 1.0000000 0.8940244 0.6963191
[2,] 0.8940244 1.0000000 0.3009690
[3,] 0.6963191 0.3009690 1.0000000
The nearPD() function gives us something very close to the above:
> nearPD(c,corr=TRUE)
$mat
3 x 3 Matrix of class “dpoMatrix”
[,1] [,2] [,3]
[1,] 1.0000000 0.8945753 0.6966207
[2,] 0.8945753 1.0000000 0.3025436
[3,] 0.6966207 0.3025436 1.0000000
$eigenvalues
[1] 2.291883e+00 7.081174e-01 2.291883e-08
$corr
[1] TRUE
$normF
[1] 0.009727988
$iterations
[1] 13
$rel.tol
[1] 7.667644e-08
$converged
[1] TRUE
attr(,”class”)
[1] “nearPD”
So what is more efficient, called nearPD() and using the built in chol() function, or calling our nearPSD() and calling the my_chol() function from before?
n = 1000So the nearPSD method is much much faster. This is due to the fact that the nearPD method is iterative. However the “difference” between the output matrix is larger. Qualitatively, the nearPD drastically increases the correlation between 1 and 2 — from 0 to .79. The nearPSD keeps that correlation lower, .285, at the expense of the correlation between 1 and 2 and the rest of the values.
c = matrix(.9,n,n)
for( i in 1:n){
c[i,i] = 1
}
c[1,2] = 0
c[2,1] = 0
c[1:5,1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0 0.0 0.9 0.9 0.9
[2,] 0.0 1.0 0.9 0.9 0.9
[3,] 0.9 0.9 1.0 0.9 0.9
[4,] 0.9 0.9 0.9 1.0 0.9
[5,] 0.9 0.9 0.9 0.9 1.0
e = eigen(c,sym=TRUE)
min(e$values)
[1] -0.7982018
s = Sys.time()
pd = nearPD(c,corr=TRUE)
pd = as.matrix(pd$mat)
rpd = chol(pd)
e = Sys.time()
e – s
Time difference of 45.659 secs
pd[1:5,1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 0.7934317 0.8984058 0.8984058 0.8984058
[2,] 0.7934317 1.0000000 0.8984058 0.8984058 0.8984058
[3,] 0.8984058 0.8984058 1.0000000 0.9000032 0.9000032
[4,] 0.8984058 0.8984058 0.9000032 1.0000000 0.9000032
[5,] 0.8984058 0.8984058 0.9000032 0.9000032 1.0000000
s = Sys.time()
psd = nearPSD(c)
rpsd = my_chol(psd,50)
e = Sys.time()
e – s
Time difference of 7.031 secs
psd[1:5,1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 0.2848481 0.7604250 0.7604250 0.7604250
[2,] 0.2848481 1.0000000 0.7604250 0.7604250 0.7604250
[3,] 0.7604250 0.7604250 1.0000000 0.9000002 0.9000002
[4,] 0.7604250 0.7604250 0.9000002 1.0000000 0.9000002
[5,] 0.7604250 0.7604250 0.9000002 0.9000002 1.0000000
norm(c – pd ,type=”F”)
[1] 1.126598
norm(c – psd,type=”F”)
[1] 8.827865
To leave a comment for the author, please follow the link and comment on their blog: Adventures in Statistical Computing.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.