Implementing a one-step GEE algorithm for very large cluster sizes in R
Very large data sets can present estimation problems for some statistical models, particularly those that cannot avoid matrix inversion. For example, generalized estimating equations (GEE) models, which are used when individual observations are correlated within groups, can run into severe computational challenges when the cluster sizes get too large. GEE is often used when repeated measures for an individual are collected over time; in that analysis, the individual is the cluster. Estimation is not really an issue in that case because the cluster sizes are typically quite small. However, when individuals are clustered within larger groups (such as sites or clinics), we also need to account for that within-group correlation. Unfortunately, if these group/cluster sizes are too large (perhaps bigger than 1,000), traditional GEE estimation techniques may simply not be feasible.
An approach to GEE that is feasible has been described by Lipsitz et al. and implemented using a SAS macro (which is available on the journal’s website). I am not much of a SAS user, so I searched for an implementation in R. Since I didn’t come across anything, I went ahead and implemented it myself. I am undecided about creating an R package for this, but in the meantime I thought I would compare it to a standard package in R and provide a link to the code if you’d like to implement it yourself. (And, if it does already exist in an R package, definitely let me know, because I certainly don’t want to duplicate anything.)
The one-step GEE algorithm
Traditional GEE models (such as those fit with the R packages gee and geepack) allow for flexibility in specifying the within-cluster correlation structure (we generally still assume that individuals in different clusters are uncorrelated). For example, one could assume that the correlation across individuals is constant within a cluster. We call this exchangeable or compound symmetry correlation, and the intra-cluster correlation (ICC) is the measure of that correlation. Alternatively, if measurements are collected over time, we might assume that measurements taken closer together are more highly correlated; this is called auto-regressive correlation.
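To make the exchangeable assumption concrete, here is what the working correlation matrix looks like for a single cluster of four individuals (the cluster size and the value of rho are purely illustrative):

# Exchangeable (compound symmetry) working correlation matrix for a
# cluster of 4 individuals with ICC = 0.15:

rho <- 0.15
R <- matrix(rho, nrow = 4, ncol = 4)
diag(R) <- 1
R

##      [,1] [,2] [,3] [,4]
## [1,] 1.00 0.15 0.15 0.15
## [2,] 0.15 1.00 0.15 0.15
## [3,] 0.15 0.15 1.00 0.15
## [4,] 0.15 0.15 0.15 1.00

With thousands of individuals per cluster, this matrix becomes enormous, which is exactly why estimation approaches that need to invert it become infeasible.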
The proposed algorithm implemented here is called the one-step GEE, and it operates under the assumption of exchangeable correlation. To provide a little more detail on the algorithm while keeping things simple, let me quote directly from the paper’s abstract:
We propose a one-step GEE estimator that (1) matches the asymptotic efficiency of the fully iterated GEE; (2) uses a simpler formula to estimate the [intra-cluster correlation] ICC that avoids summing over all pairs; and (3) completely avoids matrix multiplications and inversions. These three features make the proposed estimator much less computationally intensive, especially with large cluster sizes. A unique contribution of this article is that it expresses the GEE estimating equations incorporating the ICC as a simple sum of vectors and scalars.
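I won’t reproduce the algebra from the paper here, but the flavor of point (2) can be conveyed with a small sketch. For exchangeable correlation, a moment-type estimate of the ICC is built from within-cluster cross-products of Pearson residuals, and since the sum of r_j*r_k over all pairs j < k equals ((sum of r_j)^2 - sum of r_j^2)/2, those cross-products can be computed from simple cluster-level sums, with no loop over pairs and no matrices. The function below is just my illustration of that trick, not the exact estimator from the paper:

# A sketch of a moment-type ICC estimate based on Pearson residuals that
# avoids looping over all within-cluster pairs (illustrative only, not the
# exact estimator from Lipsitz et al.). In practice, p would be the fitted
# probabilities given the current coefficient estimates.

icc_from_residuals <- function(y, p, cluster) {
  
  r <- (y - p) / sqrt(p * (1 - p))     # Pearson residuals
  
  s1 <- tapply(r, cluster, sum)        # sum of residuals, by cluster
  s2 <- tapply(r^2, cluster, sum)      # sum of squared residuals, by cluster
  m <- tapply(r, cluster, length)      # cluster sizes
  
  num <- sum((s1^2 - s2) / 2)          # total of within-cluster cross-products
  den <- sum(m * (m - 1) / 2)          # total number of within-cluster pairs
  
  num / den
}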
The rest of the way, I will simulate data and fit the models using the traditional estimation approach as well as the one-step approach.
Comparing standard GEE with one-step GEE
To start, I am simulating a simple data set with 100 clusters that average 100 individuals per cluster.
library(simstudy)
library(data.table)
library(geepack)
First, here are the definitions used in the data generation. Each individual has three covariates, and the probability of the binary outcome is a function of two of them:
d1 <- defData(varname = "n", formula = 100, dist = "noZeroPoisson")

d2 <- defDataAdd(varname = "x1", formula = 0, variance = .1, dist = "normal")
d2 <- defDataAdd(d2, varname = "x2", formula = 0, variance = .1, dist = "normal")
d2 <- defDataAdd(d2, varname = "x3", formula = 0, variance = .1, dist = "normal")
d2 <- defDataAdd(d2, varname = "p", formula = "-0.7 + 0.7*x1 - 0.4*x2", 
  dist = "nonrandom", link="logit")
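Because of the logit link, the probability p is just the inverse logit of the linear predictor. For example, an individual with x1 = x2 = 0 has a probability of about 0.33:

plogis(-0.7 + 0.7 * 0 - 0.4 * 0)

## [1] 0.3318122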
And then the data generation - the final step creates the within-site correlated outcomes with an ICC of 0.15:
set.seed(1234)

ds <- genData(100, d1, id = "site")
dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", level1ID = "id")
dc <- addColumns(d2, dc)
dd <- addCorGen(dc, idvar = "site", param1 = "p", rho = 0.15, corstr = "cs", 
  dist = "binary", cnames = "y", method = "ep")
Here are a few records from the data set, which has just under 10,000 observations across the 100 clusters:
dd

##       site   n   id          x1          x2          x3         p y
##    1:    1  88    1 -0.57111723  0.21022042 -0.31201547 0.2343570 0
##    2:    1  88    2 -0.18406857  0.68915628 -0.72801775 0.2488957 0
##    3:    1  88    3 -0.35066169 -0.10884256  0.28141278 0.2886548 0
##    4:    1  88    4 -0.32095917 -0.03676544 -0.33847489 0.2870070 0
##    5:    1  88    5 -0.05132678  0.07095837 -0.09308927 0.3177108 1
##   ---                                                              
## 9784:  100 106 9784 -0.26912777 -0.22893853  0.16303770 0.3107074 1
## 9785:  100 106 9785  0.16399013  0.26200677 -0.03187061 0.3340309 0
## 9786:  100 106 9786  0.14450843 -0.24399797 -0.28108422 0.3772482 1
## 9787:  100 106 9787  0.29307929  0.03889844 -0.37495296 0.3750989 0
## 9788:  100 106 9788 -0.06246070 -0.41041326  0.43236084 0.3590345 1
We can fit a regular GEE model here, since the cluster sizes are relatively small:
system.time(geefit <- geese(y ~ x1 + x2 + x3, id = site, data = dd, 
  family = binomial, corstr = "exchangeable"))

##    user  system elapsed 
##   2.264   0.102   1.946

summary(geefit)

## 
## Call:
## geese(formula = y ~ x1 + x2 + x3, id = site, data = dd, family = binomial, 
##     corstr = "exchangeable")
## 
## Mean Model:
##  Mean Link:                 logit 
##  Variance to Mean Relation: binomial 
## 
##  Coefficients:
##               estimate     san.se        wald            p
## (Intercept) -0.7275165 0.08459309  73.9632404 0.000000e+00
## x1           0.7396758 0.06584516 126.1929384 0.000000e+00
## x2          -0.3191633 0.06073004  27.6196887 1.476680e-07
## x3          -0.0477172 0.06113708   0.6091727 4.350995e-01
## 
## Scale Model:
##  Scale Link:                identity 
## 
##  Estimated Scale Parameters:
##              estimate     san.se     wald p
## (Intercept) 0.9954265 0.03464657 825.4633 0
## 
## Correlation Model:
##  Correlation Structure:     exchangeable 
##  Correlation Link:          identity 
## 
##  Estimated Correlation Parameters:
##        estimate     san.se     wald            p
## alpha 0.1441719 0.02168204 44.21411 2.943523e-11
## 
## Returned Error Value:    0 
## Number of clusters:   100   Maximum cluster size: 125
The one-step GEE function (which I’ve called gee1step) runs quite a bit faster than the standard GEE model (more than 10 times faster), but the results are virtually identical.
system.time(fit1 <- gee1step(y ~ x1 + x2 + x3, data = dd, cluster = "site"))

##    user  system elapsed 
##   0.133   0.019   0.087

fit1

## $estimates
##                   est     se.err          z      p.value
## Intercept -0.72743563 0.08549337 -8.5086793 8.796283e-18
## x1         0.73943837 0.06704127 11.0296000 1.375430e-28
## x2        -0.31903588 0.06136514 -5.1989760 1.001947e-07
## x3        -0.04758544 0.06165700 -0.7717768 2.201233e-01
## 
## $rho
## [1] 0.1440159
## 
## $clusters
## $clusters$n_clusters
## [1] 100
## 
## $clusters$avg_size
## [1] 97.88
## 
## $clusters$min_size
## [1] 77
## 
## $clusters$max_size
## [1] 125
## 
## 
## $outcome
## [1] "y"
## 
## $model
## y ~ x1 + x2 + x3
## 
## attr(,"class")
## [1] "gee1step"
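To see just how similar the two fits are, here is one way to do a quick side-by-side comparison, pulling the beta and alpha components from the geese fit and the estimates and rho elements shown in the gee1step output above (how I extract the gee1step estimates is an assumption based on that printed output):

# Compare the point estimates and ICC estimates from the two approaches:

round(cbind(geese = geefit$beta, gee1step = as.data.frame(fit1$estimates)$est), 3)
c(geese_icc = geefit$alpha, gee1step_icc = fit1$rho)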
The one-step algorithm with very large cluster sizes
Obviously, in the previous example, gee1step is unnecessary because geese handled the data set just fine. But, in the next example, with an average of 10,000 observations per cluster, geese will not run - at least not on my MacBook Pro. gee1step does just fine. I’m generating the data slightly differently here, since simstudy doesn’t do well with extremely large correlation matrices; I’m using a random effect instead to induce correlation:
vicc <- iccRE(0.15, dist = "binary")

d1 <- defData(varname = "n", formula = 10000, dist = "noZeroPoisson")
d1 <- defData(d1, varname = "b", formula = 0, variance = vicc)

d2 <- defDataAdd(varname = "x1", formula = 0, variance = .1, dist = "normal")
d2 <- defDataAdd(d2, varname = "x2", formula = 0, variance = .1, dist = "normal")
d2 <- defDataAdd(d2, varname = "x3", formula = 0, variance = .1, dist = "normal")
d2 <- defDataAdd(d2, varname = "y", formula = "-0.7 + 0.7*x1 - 0.4*x2 + b", 
  dist = "binary", link="logit")

### generate data

set.seed(1234)

ds <- genData(100, d1, id = "site")
dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", level1ID = "id")
dd <- addColumns(d2, dc)
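As an aside, the random effect variance returned by iccRE should correspond to the target ICC through the usual latent-variable relationship for a logistic model, ICC = var_b / (var_b + pi^2/3). Assuming that is the formula iccRE inverts for a binary outcome, a quick sanity check should return 0.15:

# Sanity check of the implied ICC (assumes iccRE uses the standard
# latent-variable formula for a binary outcome on the logit scale):

vicc / (vicc + pi^2 / 3)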
Now, we have almost one million observations:
dd

##         site     n          b     id          x1            x2          x3 y
##      1:    1  9879 -1.3761022      1 -0.11929302  0.4136057653 -0.16119383 1
##      2:    1  9879 -1.3761022      2  0.03086998  0.3085905336  0.46055431 0
##      3:    1  9879 -1.3761022      3  0.51821656  0.2047086274  0.16721059 0
##      4:    1  9879 -1.3761022      4 -0.27688665  0.0030742246 -1.08113944 0
##      5:    1  9879 -1.3761022      5  0.03850389 -0.0991678903 -0.09447011 0
##     ---                                                                     
## 997872:  100 10065  0.2648166 997872  0.46022770  0.1588510133 -0.11851768 0
## 997873:  100 10065  0.2648166 997873  0.10696608  0.0002424567 -0.10632926 1
## 997874:  100 10065  0.2648166 997874  0.32317258 -0.1626614787  0.37855380 1
## 997875:  100 10065  0.2648166 997875 -0.17992641 -0.0333636043 -0.20060539 0
## 997876:  100 10065  0.2648166 997876  0.65942651  0.1960137135 -0.05687481 1
Despite the very large cluster sizes, the one-step algorithm still runs very fast. In addition to what is shown here, I have conducted experiments with repeated data sets to confirm that the coefficient estimates are unbiased and the standard error estimates are correct.
system.time(fit1 <- gee1step(y ~ x1 + x2 + x3, data = dd, cluster = "site"))

##    user  system elapsed 
##   2.226   0.329   1.611

fit1

## $estimates
##                    est      se.err          z      p.value
## Intercept -0.569908988 0.069360637  -8.216605 1.046722e-16
## x1         0.615260991 0.010232501  60.128117 0.000000e+00
## x2        -0.349871546 0.008694526 -40.240439 0.000000e+00
## x3         0.008132196 0.006470784   1.256756 1.044210e-01
## 
## $rho
## [1] 0.1016716
## 
## $clusters
## $clusters$n_clusters
## [1] 100
## 
## $clusters$avg_size
## [1] 9978.76
## 
## $clusters$min_size
## [1] 9766
## 
## $clusters$max_size
## [1] 10242
## 
## 
## $outcome
## [1] "y"
## 
## $model
## y ~ x1 + x2 + x3
## 
## attr(,"class")
## [1] "gee1step"
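If you want to run the kind of repeated-data-set check I mentioned above yourself, here is a sketch of what that might look like, reusing the data-generating definitions from the first example. The helper function, the 500 replications, and the way I pull the estimates out of the fit object (based on the structure of the printed output) are all my own choices:

# A sketch of a replication experiment to assess bias of the gee1step
# coefficient estimates: regenerate the (smaller) data set many times,
# refit the model, and average the estimates across replications.

def_n <- defData(varname = "n", formula = 100, dist = "noZeroPoisson")

def_x <- defDataAdd(varname = "x1", formula = 0, variance = .1, dist = "normal")
def_x <- defDataAdd(def_x, varname = "x2", formula = 0, variance = .1, dist = "normal")
def_x <- defDataAdd(def_x, varname = "x3", formula = 0, variance = .1, dist = "normal")
def_x <- defDataAdd(def_x, varname = "p", formula = "-0.7 + 0.7*x1 - 0.4*x2", 
  dist = "nonrandom", link="logit")

replicate_fit <- function() {
  ds <- genData(100, def_n, id = "site")
  dc <- genCluster(dtClust = ds, cLevelVar = "site", numIndsVar = "n", level1ID = "id")
  dc <- addColumns(def_x, dc)
  dx <- addCorGen(dc, idvar = "site", param1 = "p", rho = 0.15, corstr = "cs", 
    dist = "binary", cnames = "y", method = "ep")
  
  fit <- gee1step(y ~ x1 + x2 + x3, data = dx, cluster = "site")
  as.data.frame(fit$estimates)$est
}

res <- replicate(500, replicate_fit())
rowMeans(res)   # should be close to the true values -0.7, 0.7, -0.4, and 0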
And please, if someone thinks it would be valuable for me to create a package for this, let me know. It would certainly help motivate me :).
Reference:
Lipsitz, Stuart, Garrett Fitzmaurice, Debajyoti Sinha, Nathanael Hevelone, Jim Hu, and Louis L. Nguyen. “One-step generalized estimating equations with large cluster sizes.” Journal of Computational and Graphical Statistics 26, no. 3 (2017): 734-737.