A data analysis surprise party.
Simple question
If I have correlation matrices each estimated with a month of daily returns, how much worse is the average of six of those compared to the estimate with six months of daily data?
Expected answer
Do a statistical bootstrap with the returns and compare the standard deviations across bootstrap samples for each correlation. This will show how much bigger the monthly standard deviations are compared to the daily standard deviations.
Return data
We use 126 daily (essentially 2012 H2) returns of 100 large cap US stocks. Figure 1 is a scatter plot of the 4950 correlations estimated with the two methods.
Figure 1: Correlations of returns estimated by mean monthly versus daily.
Surprise #1: Figure 1 is not boring.
Figure 2 compares the standard deviations of bootstraps of the two estimation methods.
Figure 2: Standard deviations across bootstrap samples of the monthly estimation versus daily estimation.
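Surprise #2: The wrong estimator tends to be more variable. It is the daily estimate, not the mean monthly one, that tends to have the larger bootstrap standard deviation.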
Random normal data
We can experiment: Create new data shaped like our original data but that is randomly generated as multivariate normal with correlation equal to the daily estimate.
Figure 3 is like Figure 1 but for the normal data.
Figure 3: Correlation estimates: monthly versus daily for the randomly generated normal data.
Semi-surprise #1: Figure 3 is boring, unlike Figure 1.
We know that the bootstrap is appropriate for the normal data: the observations are independent and identically distributed — perfect for the bootstrap. Figure 4 is the normal data version of Figure 2.
Figure 4: Standard deviations across bootstrap samples of the monthly estimation versus daily estimation of the randomly generated normal data.
Now that we have fake data for which we actually know the true value of the correlations, we can look at the mean squared error of the estimates. This is done in Figure 5.
Figure 5: Mean squared errors (MSE) across bootstrap samples of the monthly estimation versus daily estimation of the randomly generated normal data.
Surprise #3: There’s hardly any difference at all between the quality of the two estimates for the normal data.
Back to reality
We can highlight the correlations where the two methods differ substantially, as shown in Figure 6.
Figure 6: Correlations estimated by mean monthly versus daily with strange points highlighted (in magenta).
Figure 7 shows where those highlighted points show up in the plot of bootstrap standard deviations.
Figure 7: Standard deviations across bootstrap samples of the monthly estimation versus daily estimation with correlation outliers highlighted.
Surprise #4: The disagreement over negative daily correlations does not explain the relatively low variability of the monthly estimates — at least not entirely.
Note that both standard deviations for the real data tend to be larger than for the normal data.
We can define weird the other way around: define extreme in terms of the bootstrap standard deviations and then see where those points show up in the correlation scatterplot. The results are Figures 8 and 9.
Figure 8: Standard deviations across bootstrap samples of the monthly estimation versus daily estimation with standard deviation outliers highlighted.
Figure 9: Correlations estimated by mean monthly versus daily with standard deviation outliers highlighted.
Summary
I don’t know what is going on. There is some sort of dependence that is showing up in the correlations of the returns. Part of the puzzle is probably that return correlations (and means) are not constant through time.
Epilogue
Complications in the air
Complications in the forecast now
Complication everywhere
from “Complications” by Steve Forbert
Appendix R
daily correlation estimate
Getting the correlation matrix using daily data is easy enough:
corDay <- cor(retmat6m)
mean monthly estimate
A little more goes into our monthly estimate. The first thing we do is create a three dimensional array to hold the six monthly estimates:
corMons <- array(NA, c(100, 100, 6), c(dimnames(corDay), list(NULL)))
Note that c() is used to put two lists together.
Then we compute the correlations in each month:
dseq <- 1:21
for(i in 1:6) {
  corMons[,,i] <- cor(retmat6m[dseq + (i-1) * 21,])
}
Finally, we average the correlations, ending up with a matrix:
corMM <- apply(corMons, 1:2, mean)
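The same average can also be computed without apply. The sketch below uses a small made-up array (the name demoCors and its dimensions are illustrative, not part of the analysis) to check that rowMeans with dims=2 gives the same matrix as the apply call.

```r
# small stand-in for corMons: 4 x 4 matrices in 6 slices (made-up numbers)
set.seed(42)
demoCors <- array(runif(4 * 4 * 6, -1, 1), c(4, 4, 6))

# the approach used in the text: average over the third dimension
viaApply <- apply(demoCors, 1:2, mean)

# with dims=2, rowMeans treats the first two dimensions as "rows"
# and averages over whatever is left
viaRowMeans <- rowMeans(demoCors, dims=2)

all.equal(viaApply, viaRowMeans)
```

For 100 by 100 by 6 either form is fast enough; rowMeans mainly saves time on much larger arrays.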
(There is a chapter in S Poetry about higher dimensional arrays that may be of use if you start working with them.)
scatter plot
We just want one copy of the off-diagonal values in the correlation matrix. We can get a logical matrix selecting the lower triangle with:
clt <- lower.tri(corDay, diag=FALSE)
(The default value of diag is FALSE, but I typically include diag in my calls to lower.tri just to make sure I have it the right way around without looking.)
This is used like:
plot(corDay[clt], corMM[clt])
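As a sanity check on the selection, here is a minimal sketch that runs on made-up noise rather than the real returns (fakeRet, fakeDay, fakeMM and flt are illustrative names, since retmat6m is not available here). It confirms that the lower triangle of a 100 by 100 matrix holds 4950 values, and adds a line to the plot where the two estimates agree.

```r
# made-up stand-in for retmat6m: 126 days by 100 assets of pure noise
set.seed(7)
fakeRet <- matrix(rnorm(126 * 100), 126, 100)
fakeDay <- cor(fakeRet)

# lower.tri gives a logical matrix; subscripting with it keeps one copy
# of each off-diagonal correlation
flt <- lower.tri(fakeDay, diag=FALSE)
sum(flt)          # number of distinct pairs among 100 assets
choose(100, 2)    # the same count

# mean of six 21-day correlation matrices, as in the text
fakeMons <- array(NA, c(100, 100, 6))
for(i in 1:6) {
  fakeMons[,,i] <- cor(fakeRet[1:21 + (i-1) * 21,])
}
fakeMM <- apply(fakeMons, 1:2, mean)

# scatter plot with a line where the two estimates are equal
plot(fakeDay[flt], fakeMM[flt], xlab="daily", ylab="mean monthly")
abline(0, 1, col="blue")
```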
write a simple function
We can write a little function that does the mean monthly correlation estimate:
pp.cor6m <- function(x) {
  mc <- array(NA, c(ncol(x), ncol(x), 6))
  dseq <- 1:21
  for(i in 1:6) {
    mc[,,i] <- cor(x[dseq + (i-1) * 21,])
  }
  apply(mc, 1:2, mean)
}
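The function above hard-codes six blocks of 21 days. If other splits are of interest, one possible variation (pp.corBlocks and its arguments nblocks and blocklen are hypothetical names, not from the original analysis) makes those two numbers arguments. The sketch runs on made-up data.

```r
# hypothetical generalization of pp.cor6m: the number of blocks and the
# block length are arguments rather than fixed at 6 and 21
pp.corBlocks <- function(x, nblocks=6, blocklen=21) {
  stopifnot(nrow(x) >= nblocks * blocklen)
  mc <- array(NA, c(ncol(x), ncol(x), nblocks))
  dseq <- seq_len(blocklen)
  for(i in seq_len(nblocks)) {
    mc[,,i] <- cor(x[dseq + (i-1) * blocklen, , drop=FALSE])
  }
  apply(mc, 1:2, mean)
}

# made-up data: 126 days by 5 assets
set.seed(11)
x <- matrix(rnorm(126 * 5), 126, 5)

# default arguments reproduce the six-month, 21-day setup
est <- pp.corBlocks(x)

# a single block of 126 days is just the daily estimate
oneBlock <- pp.corBlocks(x, nblocks=1, blocklen=126)
all.equal(oneBlock, cor(x))
```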
bootstrap correlations
We start off again by creating arrays that we will then populate:
bootcord <- bootcorm <- array(NA, c(100, 100, 200))
Now do the actual bootstrapping:
for(i in 1:200) {
  bx <- retmat6m[sample(126, 126, replace=TRUE),]
  bootcord[,,i] <- cor(bx)
  bootcorm[,,i] <- pp.cor6m(bx)
}
Finally, get the standard deviation across the bootstrap estimates for each correlation:
bootd.sd <- apply(bootcord, 1:2, sd)
bootm.sd <- apply(bootcorm, 1:2, sd)
generate normal data
There is a function in the MASS package for generating random multivariate normal data.
require(MASS)
This is used like:
normmat <- mvrnorm(126, mu=rep(0,100), Sigma=corDay)
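Because the draw is random, the figures will change from run to run unless the seed is fixed. A minimal sketch, using a made-up 3 by 3 correlation matrix (demoCor) in place of corDay and an arbitrary seed:

```r
library(MASS)  # provides mvrnorm

# made-up 3 x 3 correlation matrix standing in for corDay
demoCor <- matrix(c(1, .5, .2,
                    .5, 1, .3,
                    .2, .3, 1), 3, 3)

set.seed(2012)  # arbitrary seed so the draw can be reproduced
demoNorm <- mvrnorm(126, mu=rep(0, 3), Sigma=demoCor)

dim(demoNorm)   # 126 rows (days) by 3 columns (assets)

# the sample correlation is near demoCor but not equal to it
round(cor(demoNorm) - demoCor, 2)
```

The gap between the sample correlation and the matrix that was asked for is the estimation error that the mean squared error calculation measures.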
bootstrap MSE
The bootstrapping is:
nbtcord <- nbtcorm <- array(NA, c(100, 100, 200))
for(i in 1:200) {
  bx <- normmat[sample(126, 126, replace=TRUE),]
  nbtcord[,,i] <- cor(bx)
  nbtcorm[,,i] <- pp.cor6m(bx)
}
Then the mean squared errors are calculated like:
nbtd.mse <- apply(nbtcord - as.vector(corDay), 1:2, function(x) mean(x^2))
The as.vector is used because there is an error otherwise. When arithmetic is done on arrays in R, there is a check to see if they are conformable. A matrix and a three dimensional array are not conformable. Removing the matrixness of corDay allows the computation to proceed, and it is the right computation in this case: the resulting vector of 10,000 values is recycled, so it is subtracted from each of the 200 slices in turn.
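A tiny example with made-up numbers (est and truth are illustrative names) shows the error, the recycling that as.vector makes possible, and sweep as a more explicit way to write the same subtraction.

```r
# made-up numbers: three 2 x 2 "estimates" and a 2 x 2 "truth"
est <- array(1:12, c(2, 2, 3))
truth <- matrix(1:4, 2, 2)

# array minus matrix fails because the dimensions do not conform
failed <- inherits(try(est - truth, silent=TRUE), "try-error")

# a plain vector of length 4 is recycled, so it is subtracted from
# each of the three slices in turn
err <- est - as.vector(truth)

# sweep states the same subtraction explicitly
err2 <- sweep(est, 1:2, truth)
all.equal(err, err2)

# mean squared error for each cell across the three slices
mse <- apply(err, 1:2, function(x) mean(x^2))
```

Presumably the mean squared error of the monthly estimate is computed the same way with nbtcorm in place of nbtcord; that line is not shown above.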
define strange
When looking at Figure 1 we have a sense of which points are weird. There are multiple ways to translate our intuition into a definition. Here is the one I used:
corMM[clt] - corDay[clt] > .1 & corDay[clt] < 0
That is, the monthly estimate needs to be at least 0.1 bigger than the daily estimate and the daily estimate must be negative.
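One way to draw a plot like Figure 6 is to overplot the selected points in a second color. This is only a sketch on made-up numbers: dayCor and monCor stand in for corDay[clt] and corMM[clt], and the plotting choices are assumptions rather than the code behind the actual figure.

```r
# made-up correlation pairs standing in for corDay[clt] and corMM[clt]
set.seed(3)
dayCor <- runif(200, -0.3, 0.8)
monCor <- dayCor + rnorm(200, sd=0.08)

# the definition from the text: monthly exceeds daily by more than 0.1
# and the daily estimate is negative
strange <- monCor - dayCor > .1 & dayCor < 0

# draw everything first, then overplot the strange points in magenta
plot(dayCor, monCor, xlab="daily", ylab="mean monthly")
points(dayCor[strange], monCor[strange], col="magenta", pch=19)
abline(0, 1)
```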
The definitions of weird in the other direction are:
wei <- bootd.sd[clt] - bootm.sd[clt] > .02
wei2 <- bootd.sd[clt] - bootm.sd[clt] > .04