simstudy update: two new functions that generate correlated observations from non-normal distributions

Keith Goldfeld

5 years ago

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In an earlier post, I described in a fair amount of detail an algorithm to generate correlated binary or Poisson data. I mentioned that I would be updating simstudy with functions that would make generating these kind of data relatively painless. Well, I have managed to do that, and the updated package (version 0.1.3) is available for download from CRAN. There are now two additional functions to facilitate the generation of correlated data from binomial, poisson, gamma, and uniform distributions: genCorGen and addCorGen. Here’s a brief intro to these functions.

Generate generally correlated data

genCorGen is an extension of genCorData, which was provided in earlier versions of simstudy to generate multivariate normal data. In the first example below, we are generating data from a multivariate Poisson distribution. To do this, we need to specify the mean of the Poisson distribution for each new variable, and then we specify the correlation structure, just as we did with the normal distribution.

l <- c(8, 10, 12) # lambda for each new variable

dp <- genCorGen(1000, nvars = 3, params1 = l, dist = "poisson", 
                rho = 0.3, corstr = "cs", wide = TRUE)
dp
##         id V1 V2 V3
##    1:    1  7 13 12
##    2:    2  7 11 13
##    3:    3  7  8 14
##    4:    4  7 12  9
##    5:    5  8 13 18
##   ---              
##  996:  996  8 14 15
##  997:  997 10  5 11
##  998:  998  4  9  9
##  999:  999  5 10  9
## 1000: 1000  6 12 17

Here is the the estimated correlation (we would expect an estimate close to 0.3):

round(cor(as.matrix(dp[, .(V1, V2, V3)])), 2)
##      V1   V2   V3
## V1 1.00 0.29 0.26
## V2 0.29 1.00 0.31
## V3 0.26 0.31 1.00

Similarly, we can generate correlated binary data by specifying the probabilities:

db<- genCorGen(1000, nvars = 3, params1 = c(.3, .5, .7), dist = "binary", 
          rho = 0.8, corstr = "cs", wide = TRUE)
db
##         id V1 V2 V3
##    1:    1  1  1  1
##    2:    2  0  0  1
##    3:    3  1  1  1
##    4:    4  0  0  0
##    5:    5  1  1  1
##   ---              
##  996:  996  0  1  1
##  997:  997  0  0  0
##  998:  998  0  1  1
##  999:  999  1  1  1
## 1000: 1000  0  0  0

In the case of the binary outcome, the observed correlation will be lower that what is specified, which in this case was 0.8. I tried to provide some intuition about this in the earlier post:

round(cor(as.matrix(db[, .(V1, V2, V3)])), 2)
##     V1   V2   V3
## V1 1.0 0.50 0.40
## V2 0.5 1.00 0.56
## V3 0.4 0.56 1.00

The gamma distribution requires two parameters – the mean and dispersion. (These are converted into shape and rate parameters more commonly used.)

dg <- genCorGen(1000, nvars = 3, params1 = c(3,5,7), params2 = c(1,1,1),
                dist = "gamma", rho = .7, corstr = "cs", 
                wide = TRUE, cnames="a, b, c")
dg
##         id         a          b         c
##    1:    1 0.1957971  0.9902398  2.299307
##    2:    2 0.2566630  2.4271728  1.217599
##    3:    3 1.9550985 13.9248696  5.178042
##    4:    4 3.5525418  2.5711661  7.848605
##    5:    5 6.6981281  8.7494117 12.478329
##   ---                                    
##  996:  996 2.2059693  6.3474811  3.054551
##  997:  997 2.3571427  7.7841085  7.887417
##  998:  998 5.5326638  7.3273337 15.965228
##  999:  999 5.6284681 13.3574118 17.215722
## 1000: 1000 0.3749373  1.1480452  0.696243
round(cor(as.matrix(dg[, .(a, b, c)])), 2)
##      a    b    c
## a 1.00 0.65 0.67
## b 0.65 1.00 0.62
## c 0.67 0.62 1.00

These data sets can be generated in either wide or long form. So far, we have generated wide form data, where there is one row per unique id. The long form, where the correlated data are on different rows, is useful for plotting or fitting models, because there are repeated measurements for each id:

dgl <- genCorGen(1000, nvars = 3, params1 = l, params2 = c(1,1,1), 
                 dist = "gamma", rho = .7, corstr = "cs", wide = FALSE, 
                 cnames="NewCol")
dgl
##         id period    NewCol
##    1:    1      0  1.066558
##    2:    1      1  5.666802
##    3:    1      2  5.366408
##    4:    2      0  1.419593
##    5:    2      1  9.318227
##   ---                      
## 2996:  999      1 21.821011
## 2997:  999      2 21.800972
## 2998: 1000      0 12.082063
## 2999: 1000      1 18.541231
## 3000: 1000      2 12.063846

Here is a plot of a subset of the data:

ids <- sample(1000,50, replace = FALSE)
ggplot(data=dgl[id %in% ids,], aes(x=factor(period), y=NewCol, group=id)) +
  geom_line(aes(color=factor(id)))+
  theme(legend.position = "none") +
  scale_x_discrete(expand = c(0,0.1))

Generate data based on values from existing data set

addCorGen allows us to create correlated data from an existing data set, as one can already do using addCorData, but with non-normal data. In the case of addCorGen, the parameter(s) used to define the distribution is a field (or fields) in the data set. The correlated data are added to the existing data set. In the example below, we are going to generate three sets (Poisson, binary, and gamma) of correlated data with means that are a function of the variable xbase, which varies by id.

First we define the data and generate a data set:

def <- defData(varname = "xbase", formula = 5, variance = 0.2, 
               dist = "gamma", id = "cid")
def <- defData(def, varname = "lambda", formula = "0.5 + 0.1 * xbase", 
               dist="nonrandom", link = "log")
def <- defData(def, varname = "p", formula = "-2.0 + 0.3 * xbase", 
               dist="nonrandom", link = "logit")
def <- defData(def, varname = "gammaMu", formula = "0.5 + 0.2 * xbase", 
               dist="nonrandom", link = "log")
def <- defData(def, varname = "gammaDis", formula = 1, 
               dist="nonrandom")

dt <- genData(10000, def)
dt
##          cid      xbase   lambda         p   gammaMu gammaDis
##     1:     1 12.1128232 5.536056 0.8366960 18.588900        1
##     2:     2  4.9148342 2.695230 0.3715554  4.405998        1
##     3:     3 11.5550282 5.235712 0.8125261 16.626630        1
##     4:     4  3.0802596 2.243475 0.2542785  3.052778        1
##     5:     5  0.9767811 1.817893 0.1535577  2.004423        1
##    ---                                                       
##  9996:  9996  6.0564517 3.021173 0.4543613  5.536100        1
##  9997:  9997  3.1298866 2.254636 0.2571119  3.083229        1
##  9998:  9998 12.4642670 5.734076 0.8505956 19.942505        1
##  9999:  9999  4.6559318 2.626345 0.3536072  4.183660        1
## 10000: 10000  3.4314285 2.323658 0.2747666  3.274895        1

The Poisson distribution has a single parameter, lambda:

dtX1 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = 0.1, 
                  corstr = "cs", dist = "poisson", param1 = "lambda", 
                  cnames = "a, b, c")

dtX1[, .(cid, xbase, lambda, a, b, c)]
##          cid      xbase   lambda a b c
##     1:     1 12.1128232 5.536056 4 6 7
##     2:     2  4.9148342 2.695230 2 4 1
##     3:     3 11.5550282 5.235712 5 6 4
##     4:     4  3.0802596 2.243475 1 3 1
##     5:     5  0.9767811 1.817893 2 1 0
##    ---                                
##  9996:  9996  6.0564517 3.021173 1 3 3
##  9997:  9997  3.1298866 2.254636 2 3 1
##  9998:  9998 12.4642670 5.734076 4 6 8
##  9999:  9999  4.6559318 2.626345 2 3 5
## 10000: 10000  3.4314285 2.323658 0 0 3

The Bernoulli (binary) distribution has a single parameter, p:

dtX2 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 4, rho = .4, 
                  corstr = "ar1", dist = "binary", param1 = "p")

dtX2[, .(cid, xbase, p, V1, V2, V3, V4)]
##          cid      xbase         p V1 V2 V3 V4
##     1:     1 12.1128232 0.8366960  1  1  1  1
##     2:     2  4.9148342 0.3715554  0  0  0  0
##     3:     3 11.5550282 0.8125261  1  1  1  1
##     4:     4  3.0802596 0.2542785  0  1  0  0
##     5:     5  0.9767811 0.1535577  0  0  0  1
##    ---                                       
##  9996:  9996  6.0564517 0.4543613  0  0  0  0
##  9997:  9997  3.1298866 0.2571119  1  0  0  0
##  9998:  9998 12.4642670 0.8505956  0  1  1  1
##  9999:  9999  4.6559318 0.3536072  1  1  0  0
## 10000: 10000  3.4314285 0.2747666  1  0  1  1

And here is the the Gamma distribution, with its two parameters (mean and dispersion):

dtX3 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = .4, 
                  corstr = "cs", dist = "gamma", 
                  param1 = "gammaMu", param2 = "gammaDis")

dtX3[, .(cid, xbase, gammaMu, gammaDis, 
         V1 = round(V1,2), V2 = round(V2,2), V3 = round(V3,2))]
##          cid      xbase   gammaMu gammaDis    V1    V2   V3
##     1:     1 12.1128232 18.588900        1 11.24  3.44 9.11
##     2:     2  4.9148342  4.405998        1  0.91  3.77 0.76
##     3:     3 11.5550282 16.626630        1 68.47 12.91 1.72
##     4:     4  3.0802596  3.052778        1  2.54  3.63 2.98
##     5:     5  0.9767811  2.004423        1  0.39  0.14 0.42
##    ---                                                     
##  9996:  9996  6.0564517  5.536100        1  0.29  4.84 1.80
##  9997:  9997  3.1298866  3.083229        1  4.81  0.38 0.81
##  9998:  9998 12.4642670 19.942505        1 17.10  3.56 4.04
##  9999:  9999  4.6559318  4.183660        1  1.17  0.21 1.47
## 10000: 10000  3.4314285  3.274895        1  1.02  1.61 2.24

Long form data

If we have data in long form (e.g. longitudinal data), the function will recognize the structure:

def <- defData(varname = "xbase", formula = 5, variance = .4, 
               dist = "gamma", id = "cid")
def <- defData(def, "nperiods", formula = 3, 
               dist = "noZeroPoisson")

def2 <- defDataAdd(varname = "lambda", 
                   formula = "0.5 + 0.5 * period + 0.1 * xbase", 
                   dist="nonrandom", link = "log")

dt <- genData(1000, def)

dtLong <- addPeriods(dt, idvars = "cid", nPeriods = 3)
dtLong <- addColumns(def2, dtLong)

dtLong
##        cid period     xbase nperiods timeID   lambda
##    1:    1      0  6.693980        1      1 3.220053
##    2:    1      1  6.693980        1      2 5.308971
##    3:    1      2  6.693980        1      3 8.753013
##    4:    2      0 10.008645        2      4 4.485565
##    5:    2      1 10.008645        2      5 7.395447
##   ---                                               
## 2996:  999      1  6.753605        2   2996 5.340720
## 2997:  999      2  6.753605        2   2997 8.805359
## 2998: 1000      0  2.006781        4   2998 2.015119
## 2999: 1000      1  2.006781        4   2999 3.322369
## 3000: 1000      2  2.006781        4   3000 5.477661
### Generate the data 

dtX3 <- addCorGen(dtOld = dtLong, idvar = "cid", nvars = 3, 
                  rho = .6, corstr = "cs", dist = "poisson", 
                  param1 = "lambda", cnames = "NewPois")
dtX3
##        cid period     xbase nperiods timeID   lambda NewPois
##    1:    1      0  6.693980        1      1 3.220053       3
##    2:    1      1  6.693980        1      2 5.308971       5
##    3:    1      2  6.693980        1      3 8.753013       9
##    4:    2      0 10.008645        2      4 4.485565       2
##    5:    2      1 10.008645        2      5 7.395447       4
##   ---                                                       
## 2996:  999      1  6.753605        2   2996 5.340720       6
## 2997:  999      2  6.753605        2   2997 8.805359      11
## 2998: 1000      0  2.006781        4   2998 2.015119       2
## 2999: 1000      1  2.006781        4   2999 3.322369       4
## 3000: 1000      2  2.006781        4   3000 5.477661       7

We can fit a generalized estimating equation (GEE) model and examine the coefficients and the working correlation matrix. As we would expect, they match closely to the data generating parameters:

geefit <- gee(NewPois ~ period + xbase, data = dtX3, id = cid, 
              family = poisson, corstr = "exchangeable")
## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
## running glm to get initial regression estimate
## (Intercept)      period       xbase 
##  0.52045259  0.50354885  0.09746544
round(summary(geefit)$working.correlation, 2)
##      [,1] [,2] [,3]
## [1,] 1.00 0.58 0.58
## [2,] 0.58 1.00 0.58
## [3,] 0.58 0.58 1.00

In the future, I plan on adding other distributions. Some folks have suggested the negative binomial distribution, which I will do. If you have other suggestions/requests, let me know.

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.