
How to Generate Correlated Data in R

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers.]

Sometimes we need to generate correlated data for demonstrations, technical assessments, testing and so on. We have provided a walk-through example of how to generate correlated data in Python using the scikit-learn library. In R, as far as I know, there is no library with an equivalent ready-made function. For that reason, we will simulate the data from the Multivariate Normal Distribution. I would suggest having a look at the variance-covariance matrix and the relationship between correlation and covariance.
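The relationship between correlation and covariance can be written as Sigma = D R D, where R is the correlation matrix and D is the diagonal matrix of standard deviations. Below is a minimal sketch of that conversion. The correlation values are the ones used later in this post, while the standard deviations in `sds` are made up purely for illustration.

```r
# Correlation matrix (same values as the example in this post)
R <- rbind(c( 1.0, -0.8, -0.7),
           c(-0.8,  1.0,  0.9),
           c(-0.7,  0.9,  1.0))

# Hypothetical standard deviations, chosen only for illustration
sds <- c(2, 0.5, 3)

# Covariance matrix: D %*% R %*% D, with D = diag(sds)
Sigma <- diag(sds) %*% R %*% diag(sds)

# cov2cor() goes the other way and recovers the correlation matrix
R_back <- cov2cor(Sigma)
```

When every variance equals 1, D is the identity matrix, so the covariance matrix and the correlation matrix coincide. That is the case in the example that follows.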

Generate Correlated Data

We will generate 1000 observations from a 3-dimensional Multivariate Normal Distribution. Every variance is set to 1, so the variance-covariance matrix below is also the correlation matrix:

library(MASS)
library(tidyverse)
library(GGally)


set.seed(5)

# create the variance covariance matrix
sigma <- rbind(c(1, -0.8, -0.7), c(-0.8, 1, 0.9), c(-0.7, 0.9, 1))

# create the mean vector
mu <- c(10, 5, 2)

# generate the multivariate normal distribution
df <- as.data.frame(mvrnorm(n = 1000, mu = mu, Sigma = sigma))

ggpairs(df)

As the pairs plot shows, the generated data has the expected outcome: the sample means, variances, and correlations are close to the values we specified.
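The same check can be done numerically rather than visually. The sketch below regenerates the data so that it runs on its own, then compares the sample statistics with `mu` and `sigma`. It also shows the `empirical = TRUE` argument of `mvrnorm()`, which makes the sample mean and covariance match the inputs exactly instead of approximately. The object name `exact` is just an illustrative choice.

```r
library(MASS)

set.seed(5)
sigma <- rbind(c(1, -0.8, -0.7), c(-0.8, 1, 0.9), c(-0.7, 0.9, 1))
mu <- c(10, 5, 2)
df <- as.data.frame(mvrnorm(n = 1000, mu = mu, Sigma = sigma))

# Sample means: should be close to mu
round(colMeans(df), 2)

# Sample variances: should be close to 1
round(sapply(df, var), 2)

# Sample correlations: should be close to sigma
round(cor(df), 2)

# With empirical = TRUE the sample moments equal mu and sigma exactly
exact <- mvrnorm(n = 1000, mu = mu, Sigma = sigma, empirical = TRUE)
round(cor(exact), 2)
```

The default behaviour (sampling noise included) is usually what we want for realistic test data. The exact version can be handy when a demonstration needs the correlations to hit specific values.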

Generate Categorical Correlated Data

In the case where we want to generate categorical data, we work in two steps. First, we generate the continuous correlated data as we did above, and then we transform it to categorical by creating bins.

Binary Variables

Let’s see how we can create a Binary variable taking values 0 and 1:

df <- df %>% mutate(MyBinary = ifelse(V1 > median(V1), 1, 0))
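To see that the binary variable carries over the correlation structure, we can compare it with the other columns. This is a small sketch in base R (an equivalent of the `mutate()` call above, not the original code), with the data regenerated so the block is self-contained.

```r
library(MASS)

set.seed(5)
sigma <- rbind(c(1, -0.8, -0.7), c(-0.8, 1, 0.9), c(-0.7, 0.9, 1))
mu <- c(10, 5, 2)
df <- as.data.frame(mvrnorm(n = 1000, mu = mu, Sigma = sigma))

# Base R equivalent of the median split: 1 above the median, 0 otherwise
df$MyBinary <- as.integer(df$V1 > median(df$V1))

# Share of ones: a median split puts half of the rows in each class
mean(df$MyBinary)

# Correlation of the binary variable with the continuous columns:
# positive with V1, negative with V2 and V3, mirroring the signs in sigma
round(cor(df$MyBinary, df[, c("V1", "V2", "V3")]), 2)
```

The correlations are weaker than those of V1 itself, because cutting a continuous variable into two classes discards information.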

Binary Variables with Noise

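In the above example, we set the value “1” when the V1 variable is greater than the median and “0” otherwise. Let’s say that we want to create noisier data, i.e. not fully correlated, by applying the following rule: when V1 is greater than the median, the variable is “1” with probability 75% and “0” with probability 25%; otherwise it is “1” with probability 25% and “0” with probability 75%.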

df <- df %>% mutate(MyNoisyBinary = ifelse(V1 > median(V1),
                                           sample(c(0, 1), n(), replace = TRUE, prob = c(0.25, 0.75)),
                                           sample(c(0, 1), n(), replace = TRUE, prob = c(0.75, 0.25))))
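An alternative way to express the same rule is a single Bernoulli draw per row, with a success probability that depends on V1. This is a sketch of that idea using `rbinom()`, not the original code; the helper names `above` and `p_one` are illustrative.

```r
library(MASS)

set.seed(5)
sigma <- rbind(c(1, -0.8, -0.7), c(-0.8, 1, 0.9), c(-0.7, 0.9, 1))
mu <- c(10, 5, 2)
df <- as.data.frame(mvrnorm(n = 1000, mu = mu, Sigma = sigma))

# TRUE for rows where V1 is above its median
above <- df$V1 > median(df$V1)

# Probability of drawing a 1 for each row: 0.75 above the median, 0.25 below
p_one <- ifelse(above, 0.75, 0.25)

# One Bernoulli draw per row (rbinom is vectorised over prob)
df$MyNoisyBinary <- rbinom(nrow(df), size = 1, prob = p_one)

# Share of ones in each half: roughly 0.25 (FALSE) and 0.75 (TRUE)
share_of_ones <- tapply(df$MyNoisyBinary, above, mean)
share_of_ones
```

Changing the two probabilities controls how noisy the variable is: 1 and 0 reproduce the clean median split, while 0.5 and 0.5 make the variable independent of V1.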

Categorical Variables

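Similarly, we can generate categorical variables with many levels. Let’s say that we want to create an age group from V1 by splitting it at its quartiles: values below the first quartile go to “Group 1”, values between the first quartile and the median to “Group 2”, values between the median and the third quartile to “Group 3”, and the rest to “Group 4”: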

df<-df%>%mutate(AgeGroup= case_when(V1<quantile(V1,0.25)~"Group 1",
                                V1<quantile(V1,0.5)~"Group 2",
                                V1<quantile(V1,0.75)~"Group 3",
                                TRUE~"Group 4"))
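For comparison, base R's `cut()` can produce the same quartile groups in one call. The sketch below is an alternative, not the original code. It uses `right = FALSE` so that the intervals are closed on the left, which mirrors the strict `<` comparisons in the `case_when()` version, and `include.lowest = TRUE` so that the maximum value falls in the last group.

```r
library(MASS)

set.seed(5)
sigma <- rbind(c(1, -0.8, -0.7), c(-0.8, 1, 0.9), c(-0.7, 0.9, 1))
mu <- c(10, 5, 2)
df <- as.data.frame(mvrnorm(n = 1000, mu = mu, Sigma = sigma))

# Break points at the minimum, the three quartiles, and the maximum
breaks <- quantile(df$V1, probs = c(0, 0.25, 0.5, 0.75, 1))

# Intervals [min, Q1), [Q1, Q2), [Q2, Q3), [Q3, max]
df$AgeGroup <- cut(df$V1,
                   breaks = breaks,
                   labels = paste("Group", 1:4),
                   right = FALSE,
                   include.lowest = TRUE)

# Each quartile group should hold a quarter of the rows
table(df$AgeGroup)
```

Note that `cut()` returns a factor, whereas the `case_when()` version returns a character column. Wrap the result in `as.character()` if the two need to be interchangeable.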

Categorical Variables with Noise

Similarly, we can add some noise to our generated categorical variables as follows:

df <- df %>% mutate(MyNoisyCat = case_when(
  V1 < quantile(V1, 0.25) ~ sample(c("G1", "G2", "G3", "G4"), n(), replace = TRUE, prob = c(0.7, 0.1, 0.1, 0.1)),
  V1 < quantile(V1, 0.5) ~ sample(c("G1", "G2", "G3", "G4"), n(), replace = TRUE, prob = c(0.1, 0.7, 0.1, 0.1)),
  V1 < quantile(V1, 0.75) ~ sample(c("G1", "G2", "G3", "G4"), n(), replace = TRUE, prob = c(0.1, 0.1, 0.7, 0.1)),
  TRUE ~ sample(c("G1", "G2", "G3", "G4"), n(), replace = TRUE, prob = c(0.1, 0.1, 0.1, 0.7))))
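The noise rule above amounts to a 4 x 4 matrix of probabilities: each row is the "true" quartile group and each column is the label that gets drawn, with 0.7 on the diagonal and 0.1 elsewhere. The sketch below writes the rule in that form and then checks the result with a row-normalised cross-table. It is an alternative formulation in base R, and the names `probs`, `group`, and `confusion` are illustrative.

```r
library(MASS)

set.seed(5)
sigma <- rbind(c(1, -0.8, -0.7), c(-0.8, 1, 0.9), c(-0.7, 0.9, 1))
mu <- c(10, 5, 2)
df <- as.data.frame(mvrnorm(n = 1000, mu = mu, Sigma = sigma))

labels <- c("G1", "G2", "G3", "G4")

# Quartile group of each row as an integer code 1..4
group <- cut(df$V1,
             breaks = quantile(df$V1, probs = c(0, 0.25, 0.5, 0.75, 1)),
             labels = FALSE, right = FALSE, include.lowest = TRUE)

# Row g holds the label probabilities for rows whose true group is g
probs <- matrix(0.1, nrow = 4, ncol = 4, dimnames = list(labels, labels))
diag(probs) <- 0.7

# Draw one label per row, using the probabilities of that row's group
df$MyNoisyCat <- vapply(group,
                        function(g) sample(labels, size = 1, prob = probs[g, ]),
                        character(1))

# Row-wise proportions: the diagonal should be roughly 0.7
confusion <- prop.table(table(true = labels[group], noisy = df$MyNoisyCat),
                        margin = 1)
round(confusion, 2)
```

Editing `probs` is then enough to change the noise pattern, for example to make neighbouring groups more likely to be confused than distant ones.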


ggpairs(df)
