Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In an earlier post, I described in a fair amount of detail an algorithm to generate correlated binary or Poisson data. I mentioned that I would be updating simstudy
with functions that would make generating these kind of data relatively painless. Well, I have managed to do that, and the updated package (version 0.1.3) is available for download from CRAN. There are now two additional functions to facilitate the generation of correlated data from binomial, poisson, gamma, and uniform distributions: genCorGen
and addCorGen
. Here’s a brief intro to these functions.
Generate generally correlated data
genCorGen
is an extension of genCorData
, which was provided in earlier versions of simstudy
to generate multivariate normal data. In the first example below, we are generating data from a multivariate Poisson distribution. To do this, we need to specify the mean of the Poisson distribution for each new variable, and then we specify the correlation structure, just as we did with the normal distribution.
l <- c(8, 10, 12) # lambda for each new variable dp <- genCorGen(1000, nvars = 3, params1 = l, dist = "poisson", rho = 0.3, corstr = "cs", wide = TRUE) dp ## id V1 V2 V3 ## 1: 1 7 13 12 ## 2: 2 7 11 13 ## 3: 3 7 8 14 ## 4: 4 7 12 9 ## 5: 5 8 13 18 ## --- ## 996: 996 8 14 15 ## 997: 997 10 5 11 ## 998: 998 4 9 9 ## 999: 999 5 10 9 ## 1000: 1000 6 12 17
Here is the the estimated correlation (we would expect an estimate close to 0.3):
round(cor(as.matrix(dp[, .(V1, V2, V3)])), 2) ## V1 V2 V3 ## V1 1.00 0.29 0.26 ## V2 0.29 1.00 0.31 ## V3 0.26 0.31 1.00
Similarly, we can generate correlated binary data by specifying the probabilities:
db<- genCorGen(1000, nvars = 3, params1 = c(.3, .5, .7), dist = "binary", rho = 0.8, corstr = "cs", wide = TRUE) db ## id V1 V2 V3 ## 1: 1 1 1 1 ## 2: 2 0 0 1 ## 3: 3 1 1 1 ## 4: 4 0 0 0 ## 5: 5 1 1 1 ## --- ## 996: 996 0 1 1 ## 997: 997 0 0 0 ## 998: 998 0 1 1 ## 999: 999 1 1 1 ## 1000: 1000 0 0 0
In the case of the binary outcome, the observed correlation will be lower that what is specified, which in this case was 0.8. I tried to provide some intuition about this in the earlier post:
round(cor(as.matrix(db[, .(V1, V2, V3)])), 2) ## V1 V2 V3 ## V1 1.0 0.50 0.40 ## V2 0.5 1.00 0.56 ## V3 0.4 0.56 1.00
The gamma distribution requires two parameters – the mean and dispersion. (These are converted into shape and rate parameters more commonly used.)
dg <- genCorGen(1000, nvars = 3, params1 = c(3,5,7), params2 = c(1,1,1), dist = "gamma", rho = .7, corstr = "cs", wide = TRUE, cnames="a, b, c") dg ## id a b c ## 1: 1 0.1957971 0.9902398 2.299307 ## 2: 2 0.2566630 2.4271728 1.217599 ## 3: 3 1.9550985 13.9248696 5.178042 ## 4: 4 3.5525418 2.5711661 7.848605 ## 5: 5 6.6981281 8.7494117 12.478329 ## --- ## 996: 996 2.2059693 6.3474811 3.054551 ## 997: 997 2.3571427 7.7841085 7.887417 ## 998: 998 5.5326638 7.3273337 15.965228 ## 999: 999 5.6284681 13.3574118 17.215722 ## 1000: 1000 0.3749373 1.1480452 0.696243 round(cor(as.matrix(dg[, .(a, b, c)])), 2) ## a b c ## a 1.00 0.65 0.67 ## b 0.65 1.00 0.62 ## c 0.67 0.62 1.00
These data sets can be generated in either wide or long form. So far, we have generated wide form data, where there is one row per unique id. The long form, where the correlated data are on different rows, is useful for plotting or fitting models, because there are repeated measurements for each id:
dgl <- genCorGen(1000, nvars = 3, params1 = l, params2 = c(1,1,1), dist = "gamma", rho = .7, corstr = "cs", wide = FALSE, cnames="NewCol") dgl ## id period NewCol ## 1: 1 0 1.066558 ## 2: 1 1 5.666802 ## 3: 1 2 5.366408 ## 4: 2 0 1.419593 ## 5: 2 1 9.318227 ## --- ## 2996: 999 1 21.821011 ## 2997: 999 2 21.800972 ## 2998: 1000 0 12.082063 ## 2999: 1000 1 18.541231 ## 3000: 1000 2 12.063846
Here is a plot of a subset of the data:
ids <- sample(1000,50, replace = FALSE) ggplot(data=dgl[id %in% ids,], aes(x=factor(period), y=NewCol, group=id)) + geom_line(aes(color=factor(id)))+ theme(legend.position = "none") + scale_x_discrete(expand = c(0,0.1))
Generate data based on values from existing data set
addCorGen
allows us to create correlated data from an existing data set, as one can already do using addCorData
, but with non-normal data. In the case of addCorGen
, the parameter(s) used to define the distribution is a field (or fields) in the data set. The correlated data are added to the existing data set. In the example below, we are going to generate three sets (Poisson, binary, and gamma) of correlated data with means that are a function of the variable xbase
, which varies by id.
First we define the data and generate a data set:
def <- defData(varname = "xbase", formula = 5, variance = 0.2, dist = "gamma", id = "cid") def <- defData(def, varname = "lambda", formula = "0.5 + 0.1 * xbase", dist="nonrandom", link = "log") def <- defData(def, varname = "p", formula = "-2.0 + 0.3 * xbase", dist="nonrandom", link = "logit") def <- defData(def, varname = "gammaMu", formula = "0.5 + 0.2 * xbase", dist="nonrandom", link = "log") def <- defData(def, varname = "gammaDis", formula = 1, dist="nonrandom") dt <- genData(10000, def) dt ## cid xbase lambda p gammaMu gammaDis ## 1: 1 12.1128232 5.536056 0.8366960 18.588900 1 ## 2: 2 4.9148342 2.695230 0.3715554 4.405998 1 ## 3: 3 11.5550282 5.235712 0.8125261 16.626630 1 ## 4: 4 3.0802596 2.243475 0.2542785 3.052778 1 ## 5: 5 0.9767811 1.817893 0.1535577 2.004423 1 ## --- ## 9996: 9996 6.0564517 3.021173 0.4543613 5.536100 1 ## 9997: 9997 3.1298866 2.254636 0.2571119 3.083229 1 ## 9998: 9998 12.4642670 5.734076 0.8505956 19.942505 1 ## 9999: 9999 4.6559318 2.626345 0.3536072 4.183660 1 ## 10000: 10000 3.4314285 2.323658 0.2747666 3.274895 1
The Poisson distribution has a single parameter, lambda:
dtX1 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = 0.1, corstr = "cs", dist = "poisson", param1 = "lambda", cnames = "a, b, c") dtX1[, .(cid, xbase, lambda, a, b, c)] ## cid xbase lambda a b c ## 1: 1 12.1128232 5.536056 4 6 7 ## 2: 2 4.9148342 2.695230 2 4 1 ## 3: 3 11.5550282 5.235712 5 6 4 ## 4: 4 3.0802596 2.243475 1 3 1 ## 5: 5 0.9767811 1.817893 2 1 0 ## --- ## 9996: 9996 6.0564517 3.021173 1 3 3 ## 9997: 9997 3.1298866 2.254636 2 3 1 ## 9998: 9998 12.4642670 5.734076 4 6 8 ## 9999: 9999 4.6559318 2.626345 2 3 5 ## 10000: 10000 3.4314285 2.323658 0 0 3
The Bernoulli (binary) distribution has a single parameter, p:
dtX2 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 4, rho = .4, corstr = "ar1", dist = "binary", param1 = "p") dtX2[, .(cid, xbase, p, V1, V2, V3, V4)] ## cid xbase p V1 V2 V3 V4 ## 1: 1 12.1128232 0.8366960 1 1 1 1 ## 2: 2 4.9148342 0.3715554 0 0 0 0 ## 3: 3 11.5550282 0.8125261 1 1 1 1 ## 4: 4 3.0802596 0.2542785 0 1 0 0 ## 5: 5 0.9767811 0.1535577 0 0 0 1 ## --- ## 9996: 9996 6.0564517 0.4543613 0 0 0 0 ## 9997: 9997 3.1298866 0.2571119 1 0 0 0 ## 9998: 9998 12.4642670 0.8505956 0 1 1 1 ## 9999: 9999 4.6559318 0.3536072 1 1 0 0 ## 10000: 10000 3.4314285 0.2747666 1 0 1 1
And here is the the Gamma distribution, with its two parameters (mean and dispersion):
dtX3 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = .4, corstr = "cs", dist = "gamma", param1 = "gammaMu", param2 = "gammaDis") dtX3[, .(cid, xbase, gammaMu, gammaDis, V1 = round(V1,2), V2 = round(V2,2), V3 = round(V3,2))] ## cid xbase gammaMu gammaDis V1 V2 V3 ## 1: 1 12.1128232 18.588900 1 11.24 3.44 9.11 ## 2: 2 4.9148342 4.405998 1 0.91 3.77 0.76 ## 3: 3 11.5550282 16.626630 1 68.47 12.91 1.72 ## 4: 4 3.0802596 3.052778 1 2.54 3.63 2.98 ## 5: 5 0.9767811 2.004423 1 0.39 0.14 0.42 ## --- ## 9996: 9996 6.0564517 5.536100 1 0.29 4.84 1.80 ## 9997: 9997 3.1298866 3.083229 1 4.81 0.38 0.81 ## 9998: 9998 12.4642670 19.942505 1 17.10 3.56 4.04 ## 9999: 9999 4.6559318 4.183660 1 1.17 0.21 1.47 ## 10000: 10000 3.4314285 3.274895 1 1.02 1.61 2.24
Long form data
If we have data in long form (e.g. longitudinal data), the function will recognize the structure:
def <- defData(varname = "xbase", formula = 5, variance = .4, dist = "gamma", id = "cid") def <- defData(def, "nperiods", formula = 3, dist = "noZeroPoisson") def2 <- defDataAdd(varname = "lambda", formula = "0.5 + 0.5 * period + 0.1 * xbase", dist="nonrandom", link = "log") dt <- genData(1000, def) dtLong <- addPeriods(dt, idvars = "cid", nPeriods = 3) dtLong <- addColumns(def2, dtLong) dtLong ## cid period xbase nperiods timeID lambda ## 1: 1 0 6.693980 1 1 3.220053 ## 2: 1 1 6.693980 1 2 5.308971 ## 3: 1 2 6.693980 1 3 8.753013 ## 4: 2 0 10.008645 2 4 4.485565 ## 5: 2 1 10.008645 2 5 7.395447 ## --- ## 2996: 999 1 6.753605 2 2996 5.340720 ## 2997: 999 2 6.753605 2 2997 8.805359 ## 2998: 1000 0 2.006781 4 2998 2.015119 ## 2999: 1000 1 2.006781 4 2999 3.322369 ## 3000: 1000 2 2.006781 4 3000 5.477661 ### Generate the data dtX3 <- addCorGen(dtOld = dtLong, idvar = "cid", nvars = 3, rho = .6, corstr = "cs", dist = "poisson", param1 = "lambda", cnames = "NewPois") dtX3 ## cid period xbase nperiods timeID lambda NewPois ## 1: 1 0 6.693980 1 1 3.220053 3 ## 2: 1 1 6.693980 1 2 5.308971 5 ## 3: 1 2 6.693980 1 3 8.753013 9 ## 4: 2 0 10.008645 2 4 4.485565 2 ## 5: 2 1 10.008645 2 5 7.395447 4 ## --- ## 2996: 999 1 6.753605 2 2996 5.340720 6 ## 2997: 999 2 6.753605 2 2997 8.805359 11 ## 2998: 1000 0 2.006781 4 2998 2.015119 2 ## 2999: 1000 1 2.006781 4 2999 3.322369 4 ## 3000: 1000 2 2.006781 4 3000 5.477661 7
We can fit a generalized estimating equation (GEE) model and examine the coefficients and the working correlation matrix. As we would expect, they match closely to the data generating parameters:
geefit <- gee(NewPois ~ period + xbase, data = dtX3, id = cid, family = poisson, corstr = "exchangeable") ## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27 ## running glm to get initial regression estimate ## (Intercept) period xbase ## 0.52045259 0.50354885 0.09746544 round(summary(geefit)$working.correlation, 2) ## [,1] [,2] [,3] ## [1,] 1.00 0.58 0.58 ## [2,] 0.58 1.00 0.58 ## [3,] 0.58 0.58 1.00
In the future, I plan on adding other distributions. Some folks have suggested the negative binomial distribution, which I will do. If you have other suggestions/requests, let me know.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.