Third, and Hopefully Final, Post on Correlated Random Normal Generation (Cholesky Edition)


André-Louis Cholesky is my homeboy

When I did a brief post three days ago I had no plans to write two more posts on correlated random number generation. But I’ve gotten a couple of emails, a few comments, and some Twitter feedback. In response to my first post, Gappy calls me out and says, “the way mensches do multivariate (log)normal variates is via Cholesky. It’s simple, instructive, and fast.” And I think we’re all smart enough to read through Mr. Gappy’s comment and see that he’s saying I’m a complicated, opaque, and slow goy. My wife called and said his list would be more accurate if he added ‘emotionally detached.’ I have no idea what she means.

At any rate, in response to Gappy’s comment, here is the third verse (same as the first). The crux of the change is the following lines:


# shift the mean of ourData to zero
ourData0 <- as.data.frame(sweep(ourData,2,colMeans(ourData),"-"))

#Cholesky Decomposition of the covariance matrix
C <- chol(nearPD(cov(ourData0))$mat)

#create a matrix of random standard normals: one row per variable, one column per draw
Z <- matrix(rnorm(drawCount * ncol(ourData)), nrow = ncol(ourData))

#multiply the standard normals by the transpose of the Cholesky
X <- t(C) %*% Z

myDraws <- data.frame(as.matrix(t(X)))
names(myDraws) <- names(ourData)

# we still need to shift the means of the samples.

# shift the mean of the draws over to match the starting data
myDraws <- as.data.frame(sweep(myDraws,2,colMeans(ourData),"+"))

Edit: When I first published this example, I didn’t shift the means prior to taking the cov(). I’ve since corrected that. Also, thanks to @fdaapproved on Twitter, who pointed out that I can replace the loop above with myDraws <- as.data.frame(sweep(t(X),2,colMeans(ourData),"+"))
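For reference, the loop that sweep() replaced isn’t shown in the final script; here’s a rough sketch of what such a column-by-column mean shift might look like (my reconstruction, not the original loop), alongside what the single sweep() call accomplishes:

# NOTE: reconstruction for illustration only; the original loop was removed from the post
myDrawsLoop <- data.frame(as.matrix(t(X)))
names(myDrawsLoop) <- names(ourData)
for (i in 1:ncol(myDrawsLoop)) {
  # add each column's original mean back, one column at a time
  myDrawsLoop[, i] <- myDrawsLoop[, i] + mean(ourData[[i]])
}
# ...which is exactly what the one-line sweep() call above does in a single shot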

This method, which uses Cholesky decomposition, is how I initially learned to create correlated random draws. I think it’s comparable to the mvrnorm() method. mvrnorm() is handy because it wraps everything above into a single line of code, but the method above depends only on the Matrix package, and only for the nearPD() function. If you are familiar with the guts of the mvrnorm() function and the chol() function, I’d love for you to comment on any technical differences. I looked briefly at the code for both and quickly realized my matrix math was rusty enough that it was going to take a while for me to sort through it.
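For comparison, here’s roughly what the mvrnorm() one-liner looks like; this is a sketch rather than the exact code from the earlier post, and it assumes the MASS package is installed:

# the MASS::mvrnorm() equivalent (sketch): mean vector and covariance matrix come from the
# starting data; with the default empirical = FALSE the sample moments match only approximately
require(MASS)
mvrDraws <- as.data.frame(mvrnorm(n, mu = colMeans(ourData), Sigma = cov(ourData)))
cor(mvrDraws)  # should be close to cor(ourData)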

If you want the whole script, you can find it embedded below and on GitHub.

# multivariate normal example using Cholesky decomposition
require(Matrix)
#make this reproducible
set.seed(2)
#how many draws in our starting data set?
n <- 1e4
# how many draws do we want from this distribution?
drawCount <- 1e4
myData <- rnorm(n, 5, .6)
yourData <- myData + rnorm(n, 8, .25)
hisData <- myData + rnorm(n, 6, .4)
herData <- hisData + rnorm(n, 8, .35)
ourData <- data.frame(myData, yourData, hisData, herData)
# now we have raw correlations in the 70s, 80s, and 90s. Funny how that
# works out
cor(ourData)
# shift the mean of ourData to zero
ourData0 <- as.data.frame(sweep(ourData,2,colMeans(ourData),"-"))
#Cholesky Decomposition of the covariance matrix
C <- chol(nearPD(cov(ourData0))$mat)
#create a matrix of random standard normals: one row per variable, one column per draw
Z <- matrix(rnorm(drawCount * ncol(ourData)), nrow = ncol(ourData))
#multiply the standard normals by the transpose of the Cholesky
X <- t(C) %*% Z
#look at the covariances
cov(as.matrix(t(X)))
cov(ourData0)
#woot!
myDraws <- data.frame(as.matrix(t(X)))
names(myDraws) <- names(ourData)
# this method handles the covariance (correlation and standard deviation)
# but we still need to shift the means of the samples.
# shift the mean of the draws over to match the starting data
myDraws <- as.data.frame(sweep(myDraws,2,colMeans(ourData),"+"))
#check the mean and sd
apply(myDraws, 2, mean)
apply(myDraws, 2, sd)
#let's look at the mean and standard dev of the starting data
apply(ourData, 2, mean)
apply(ourData, 2, sd)
# so myDraws contains the final draws
# let's check Kolmogorov-Smirnov between the starting data
# and the final draws
for (i in 1:ncol(ourData)){
print(ks.test(myDraws[[i]], ourData[[i]]))
}
#look at the correlation matrices
cor(myDraws)
cor(ourData)
#it's fun to plot the variables and see if the PDFs line up
#It's a good sanity check. Using ggplot2 to plot
require(ggplot2)
require(reshape2) # for melt()
# rearrange the data to be "tall" not "wide"
meltDraws <- melt(myDraws)
meltDraws$dataSource <- "simulated"
meltData <- melt(ourData)
meltData$dataSource <- "original"
plotData <- rbind(meltData, meltDraws)
qplot(value, colour=dataSource, data=plotData, geom="density") + facet_wrap(~variable)
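One more sanity check, if you’re curious why the Cholesky trick works: chol() returns an upper-triangular C with t(C) %*% C equal to the (nearest positive-definite) covariance matrix, so t(C) %*% Z inherits that covariance. After running the script, something like this should show the two matrices agree (a sketch, using the script’s variable names):

# t(C) %*% C should reproduce the covariance matrix we decomposed
range(as.matrix(t(C) %*% C) - as.matrix(nearPD(cov(ourData0))$mat)) # differences ~ 0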
