Back in January 2013 I wrote a blog post showing how to implement a basic cluster/block bootstrap in R. One drawback of the cluster bootstrap is the length of time it takes to sample with replacement and build the bootstrap samples. Thankfully, some of the comments on my previous post illustrated simple ways to get speed gains. However, even with these gains the procedure is extremely time consuming.
I have been using the cluster bootstrap in some of my research and have found another way to speed things up: parallel processing. I appreciate I might be somewhat late to the multicore functions, but hopefully somebody who has been having a similar issue can take solace from this post.
In the code below I demonstrate how the function “clusterApply” from the package “snow” can be used as a replacement for the regular “apply” function. Note that the “cluster” in clusterApply refers to multicore processing clusters rather than the clusters in the data frame. My code sets up a simple regression problem in which the standard error of the regressor is 0.4. To demonstrate the clustering phenomenon, I duplicate the data frame of 10,000 observations 20 times. As a result, the standard error falls to 0.09 based on the naive estimate of the variance-covariance matrix.
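To make the analogy concrete, here is a minimal toy sketch (not from the original post) of clusterApply farming work out to two worker processes, just as apply or sapply would do the same work on a single core:

library(snow)
cl <- makeCluster(2)                      # start two worker processes
clusterApply(cl, 1:4, function(x) x^2)    # returns list(1, 4, 9, 16), computed on the workers
stopCluster(cl)                           # always shut the workers down when finished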
The clustering problem can easily be corrected using the “felm” function from (what I consider the best R package) “lfe”. However, there are many occasions where researchers want to use econometric techniques that, unlike the OLS or 2SLS estimators, do not lend themselves to a simple variance-covariance correction. These are the situations where you want to use the cluster bootstrap.
The code below demonstrates how this can be done with and without parallel processing. The only difference is that the parallel version requires the user to set the number of processing clusters (again, not clusters in the data!) and to use clusterApply instead of apply. In this application, parallel processing reduces the cluster bootstrap time from 5 minutes 42 seconds to 4 minutes 6 seconds. This might seem a fairly modest gain; however, this simple application uses a relatively small number of observations (10,000). The parallel approach will become relatively faster as the number of observations grows, and the gains will also be larger on a computer with more cores.
I appreciate any comments or criticism people might have on the code below. If anybody can think of a way to speed this up even more, I would be delighted to hear it.
# cluster bootstrap with parallel processing
rm(list = ls())

# packages for cluster standard errors
library(lmtest)
library(lfe)
# use multicore functions
library(snow)

# set up simulation
n <- 10000 # number of observations
x <- rnorm(n)
y <- 5 + 2*x + rnorm(n, 0, 40)

# regression
m1 <- lm(y ~ x)
summary(m1) # standard error is 0.4

# duplicate data
dpt <- 20 # dpt times
dat <- data.frame(x = rep(x, dpt), y = rep(y, dpt), g = rep(1:n, dpt))

# regression with no clustering
m2 <- lm(y ~ x, data = dat) # smaller StErrs
summary(m2) # naive standard error falls to 0.09

# now cluster
summary(felm(y ~ x | 0 | 0 | g, data = dat))
# standard error is back to 0.4, as in m1

# lets do this with a regular cluster bootstrap
reps <- 50 # 50 reps; in practice do more
clusters <- unique(dat$g)
boot.res1 <- matrix(NA, nrow = reps, ncol = 1)

# open time stamp
t1 <- Sys.time()
# set the seed
set.seed(12345)
# do in loop
for(i in 1:reps){
  # sample the clusters with replacement
  units <- sample(clusters, size = length(clusters), replace = TRUE)
  # create bootstrap sample with sapply
  df.bs <- sapply(units, function(x) which(dat[, "g"] == x))
  df.bs <- dat[unlist(df.bs), ]
  boot.res1[i] <- coef(lm(y ~ x, data = df.bs))[2]
}
# close time stamp
t2 <- Sys.time()
t2 - t1

sd(boot.res1) # good: bootstrap standard error is 0.4

# now lets speed up the sapply step from the previous example
boot.res2 <- matrix(NA, nrow = reps, ncol = 1)
# set the seed
set.seed(12345)

cl <- makeCluster(10)
# open time stamp
t3 <- Sys.time()
# do in loop
for(i in 1:reps){
  # sample the clusters with replacement
  units <- sample(clusters, size = length(clusters), replace = TRUE)
  # now use the 10 cores instead of 1!
  clusterExport(cl, c("dat", "units"))
  # clusterApply instead of regular apply
  df.bs <- clusterApply(cl, units, function(x) which(dat$g == x))
  df.bs <- dat[unlist(df.bs), ]
  boot.res2[i] <- coef(lm(y ~ x, data = df.bs))[2]
}
# close time stamp
t4 <- Sys.time()
t4 - t3
stopCluster(cl)

sd(boot.res2) # good: bootstrap standard error is 0.4
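As an aside not in the original code, makeCluster(10) above hard-codes ten workers, which assumes your machine has that many cores available. Something like the following, using the “parallel” package that ships with base R, can pick the worker count automatically:

library(parallel)
cl <- makeCluster(detectCores() - 1) # use all but one of the available cores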