simstudy update: two new functions that generate correlated observations from non-normal distributions
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In an earlier post, I described in a fair amount of detail an algorithm to generate correlated binary or Poisson data. I mentioned that I would be updating simstudy
with functions that would make generating these kind of data relatively painless. Well, I have managed to do that, and the updated package (version 0.1.3) is available for download from CRAN. There are now two additional functions to facilitate the generation of correlated data from binomial, poisson, gamma, and uniform distributions: genCorGen
and addCorGen
. Here’s a brief intro to these functions.
Generate data based on values from existing data set
addCorGen
allows us to create correlated data from an existing data set, as one can already do using addCorData
, but with non-normal data. In the case of addCorGen
, the parameter(s) used to define the distribution is a field (or fields) in the data set. The correlated data are added to the existing data set. In the example below, we are going to generate three sets (Poisson, binary, and gamma) of correlated data with means that are a function of the variable xbase
, which varies by id.
First we define the data and generate a data set:
def <- defData(varname = "xbase", formula = 5, variance = 0.2, dist = "gamma", id = "cid") def <- defData(def, varname = "lambda", formula = "0.5 + 0.1 * xbase", dist="nonrandom", link = "log") def <- defData(def, varname = "p", formula = "-2.0 + 0.3 * xbase", dist="nonrandom", link = "logit") def <- defData(def, varname = "gammaMu", formula = "0.5 + 0.2 * xbase", dist="nonrandom", link = "log") def <- defData(def, varname = "gammaDis", formula = 1, dist="nonrandom") dt <- genData(10000, def) dt ## cid xbase lambda p gammaMu gammaDis ## 1: 1 12.1128232 5.536056 0.8366960 18.588900 1 ## 2: 2 4.9148342 2.695230 0.3715554 4.405998 1 ## 3: 3 11.5550282 5.235712 0.8125261 16.626630 1 ## 4: 4 3.0802596 2.243475 0.2542785 3.052778 1 ## 5: 5 0.9767811 1.817893 0.1535577 2.004423 1 ## --- ## 9996: 9996 6.0564517 3.021173 0.4543613 5.536100 1 ## 9997: 9997 3.1298866 2.254636 0.2571119 3.083229 1 ## 9998: 9998 12.4642670 5.734076 0.8505956 19.942505 1 ## 9999: 9999 4.6559318 2.626345 0.3536072 4.183660 1 ## 10000: 10000 3.4314285 2.323658 0.2747666 3.274895 1
The Poisson distribution has a single parameter, lambda:
dtX1 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = 0.1, corstr = "cs", dist = "poisson", param1 = "lambda", cnames = "a, b, c") dtX1[, .(cid, xbase, lambda, a, b, c)] ## cid xbase lambda a b c ## 1: 1 12.1128232 5.536056 4 6 7 ## 2: 2 4.9148342 2.695230 2 4 1 ## 3: 3 11.5550282 5.235712 5 6 4 ## 4: 4 3.0802596 2.243475 1 3 1 ## 5: 5 0.9767811 1.817893 2 1 0 ## --- ## 9996: 9996 6.0564517 3.021173 1 3 3 ## 9997: 9997 3.1298866 2.254636 2 3 1 ## 9998: 9998 12.4642670 5.734076 4 6 8 ## 9999: 9999 4.6559318 2.626345 2 3 5 ## 10000: 10000 3.4314285 2.323658 0 0 3
The Bernoulli (binary) distribution has a single parameter, p:
dtX2 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 4, rho = .4, corstr = "ar1", dist = "binary", param1 = "p") dtX2[, .(cid, xbase, p, V1, V2, V3, V4)] ## cid xbase p V1 V2 V3 V4 ## 1: 1 12.1128232 0.8366960 1 1 1 1 ## 2: 2 4.9148342 0.3715554 0 0 0 0 ## 3: 3 11.5550282 0.8125261 1 1 1 1 ## 4: 4 3.0802596 0.2542785 0 1 0 0 ## 5: 5 0.9767811 0.1535577 0 0 0 1 ## --- ## 9996: 9996 6.0564517 0.4543613 0 0 0 0 ## 9997: 9997 3.1298866 0.2571119 1 0 0 0 ## 9998: 9998 12.4642670 0.8505956 0 1 1 1 ## 9999: 9999 4.6559318 0.3536072 1 1 0 0 ## 10000: 10000 3.4314285 0.2747666 1 0 1 1
And here is the the Gamma distribution, with its two parameters (mean and dispersion):
dtX3 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = .4, corstr = "cs", dist = "gamma", param1 = "gammaMu", param2 = "gammaDis") dtX3[, .(cid, xbase, gammaMu, gammaDis, V1 = round(V1,2), V2 = round(V2,2), V3 = round(V3,2))] ## cid xbase gammaMu gammaDis V1 V2 V3 ## 1: 1 12.1128232 18.588900 1 11.24 3.44 9.11 ## 2: 2 4.9148342 4.405998 1 0.91 3.77 0.76 ## 3: 3 11.5550282 16.626630 1 68.47 12.91 1.72 ## 4: 4 3.0802596 3.052778 1 2.54 3.63 2.98 ## 5: 5 0.9767811 2.004423 1 0.39 0.14 0.42 ## --- ## 9996: 9996 6.0564517 5.536100 1 0.29 4.84 1.80 ## 9997: 9997 3.1298866 3.083229 1 4.81 0.38 0.81 ## 9998: 9998 12.4642670 19.942505 1 17.10 3.56 4.04 ## 9999: 9999 4.6559318 4.183660 1 1.17 0.21 1.47 ## 10000: 10000 3.4314285 3.274895 1 1.02 1.61 2.24
Long form data
If we have data in long form (e.g. longitudinal data), the function will recognize the structure:
def <- defData(varname = "xbase", formula = 5, variance = .4, dist = "gamma", id = "cid") def <- defData(def, "nperiods", formula = 3, dist = "noZeroPoisson") def2 <- defDataAdd(varname = "lambda", formula = "0.5 + 0.5 * period + 0.1 * xbase", dist="nonrandom", link = "log") dt <- genData(1000, def) dtLong <- addPeriods(dt, idvars = "cid", nPeriods = 3) dtLong <- addColumns(def2, dtLong) dtLong ## cid period xbase nperiods timeID lambda ## 1: 1 0 6.693980 1 1 3.220053 ## 2: 1 1 6.693980 1 2 5.308971 ## 3: 1 2 6.693980 1 3 8.753013 ## 4: 2 0 10.008645 2 4 4.485565 ## 5: 2 1 10.008645 2 5 7.395447 ## --- ## 2996: 999 1 6.753605 2 2996 5.340720 ## 2997: 999 2 6.753605 2 2997 8.805359 ## 2998: 1000 0 2.006781 4 2998 2.015119 ## 2999: 1000 1 2.006781 4 2999 3.322369 ## 3000: 1000 2 2.006781 4 3000 5.477661 ### Generate the data dtX3 <- addCorGen(dtOld = dtLong, idvar = "cid", nvars = 3, rho = .6, corstr = "cs", dist = "poisson", param1 = "lambda", cnames = "NewPois") dtX3 ## cid period xbase nperiods timeID lambda NewPois ## 1: 1 0 6.693980 1 1 3.220053 3 ## 2: 1 1 6.693980 1 2 5.308971 5 ## 3: 1 2 6.693980 1 3 8.753013 9 ## 4: 2 0 10.008645 2 4 4.485565 2 ## 5: 2 1 10.008645 2 5 7.395447 4 ## --- ## 2996: 999 1 6.753605 2 2996 5.340720 6 ## 2997: 999 2 6.753605 2 2997 8.805359 11 ## 2998: 1000 0 2.006781 4 2998 2.015119 2 ## 2999: 1000 1 2.006781 4 2999 3.322369 4 ## 3000: 1000 2 2.006781 4 3000 5.477661 7
We can fit a generalized estimating equation (GEE) model and examine the coefficients and the working correlation matrix. As we would expect, they match closely to the data generating parameters:
geefit <- gee(NewPois ~ period + xbase, data = dtX3, id = cid, family = poisson, corstr = "exchangeable") ## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27 ## running glm to get initial regression estimate ## (Intercept) period xbase ## 0.52045259 0.50354885 0.09746544 round(summary(geefit)$working.correlation, 2) ## [,1] [,2] [,3] ## [1,] 1.00 0.58 0.58 ## [2,] 0.58 1.00 0.58 ## [3,] 0.58 0.58 1.00
In the future, I plan on adding other distributions. Some folks have suggested the negative binomial distribution, which I will do. If you have other suggestions/requests, let me know.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.