Even Simpler Multivariate Correlated Simulations
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
So after yesterday’s post on Simple Simulation using Copulas I got a very nice email that basically begged the question, “Dude, why are you making this so hard?” The author pointed out that if what I really want is a Gaussian correlation structure for Gaussian distributions then I could simply use the mvrnorm() function from the MASS package. Well I did a quick
?mvrnorm
and, I’ll be damned, he’s right! The advantage of using a copula is the ability to simulate correlation structures where the correlation is different for different levels of values. So that gives the flexibility to make the tails of the distributions more correlated, for example. But my example yesterday was purposefully simple… so simple that a copula was not even needed.
After creating my sample data all I really needed to do was this:
myDraws <- mvrnorm(1e5, mu=mean(ourData), Sigma=cov(ourData))
So I took my example from yesterday and updated it using the mvrnorm() code and, as is my custom, put up a Github gist. The code is embedded below as well. I added a little ggplot2 code at the end that will create a facet plot of the 4 distributions showing the shape of the distributions of both the starting data and the simulated data. The plot in the upper left of this post is the ggplot output.
EDIT: The email hipping me to this was sent by Dirk Eddelbuettel who’s been very helpful to me more times than I can count. I had omitted his name initially. However after confirming with Dirk, he told me it was OK to mention him by name in this post.
# multivariate normal example using the MASS package | |
require(MASS) | |
#make this reproducible | |
set.seed(2) | |
#how many draws in our starting data set? | |
n <- 1e4 | |
# how many draws do we want from this distribution? | |
drawCount <- 1e4 | |
myData <- rnorm(n, 5, .6) | |
yourData <- myData + rnorm(n, 8, .25) | |
hisData <- myData + rnorm(n, 6, .4) | |
herData <- hisData + rnorm(n, 8, .35) | |
ourData <- data.frame(myData, yourData, hisData, herData) | |
# now we have raw correlations in the 70s, 80s, and 90s. Funny how that | |
# works out | |
cor(ourData) | |
#build the mvrnorm and take draws | |
#this replaces all my normalizing and | |
#copula building in the previous example | |
myDraws <- mvrnorm(1e5, mu=mean(ourData), | |
Sigma=cov(ourData) | |
) | |
myDraws <- data.frame(myDraws) | |
#check the mean and sd | |
apply(myDraws, 2, mean) | |
apply(myDraws, 2, sd) | |
#let's look at the mean and standard dev of the starting data | |
apply(ourData, 2, mean) | |
apply(ourData, 2, sd) | |
# so myDraws contains the final draws | |
# let's check Kolmogorov-Smirnov between the starting data | |
# and the final draws | |
for (i in 1:ncol(ourData)){ | |
print(ks.test(myDraws[[i]], ourData[[i]])) | |
} | |
#look at the correlation matrices | |
cor(myDraws) | |
cor(ourData) | |
#it's fun to plot the variables and see if the PDFs line up | |
#It's a good sanity check. Using ggplot2 to plot | |
require(ggplot2) | |
# rearrange the data to be "tall" not "wide" | |
meltDraws <-melt(myDraws) | |
meltDraws$dataSource <- "simulated" | |
meltData <- melt(ourData) | |
meltData$dataSource <- "original" | |
plotData <- rbind(meltData, meltDraws) | |
qplot(value, colour=dataSource, data=plotData, geom="density")+ facet_wrap(~variable) | |
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.