Back in January 2013 I wrote a blog post showing how to implement a basic cluster/block bootstrap in R. One drawback of the cluster bootstrap is the length of time it takes to sample with replacement and build the bootstrap samples. Thankfully, some of the comments on my previous post illustrated simple ways to get speed gains. However, even with these gains the procedure is extremely time consuming.
I have been using the cluster bootstrap in some of my research and have found another way to speed things up: parallel processing. I appreciate I might be somewhat late to the multicore functions, but hopefully somebody who has been having a similar issue can take solace from this post.
In the code below I demonstrate how the function “clusterApply” from the package “snow” can be used as a replacement for the regular “apply” function. Note that the “cluster” in clusterApply refers to multicore processing clusters rather than the clusters in the data frame. My code sets up a simple regression problem in which the standard error of the regressor is 0.4. To demonstrate the clustering phenomenon, I duplicate the data frame of 10,000 observations 20 times. As a result, the standard error falls to 0.09 based on the naive estimate of the variance-covariance matrix.
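To make the analogy concrete, here is a minimal toy sketch (not from the original post) of clusterApply farming work out to two worker processes, just as apply or sapply would do the same work on a single core:

library(snow)
cl <- makeCluster(2)                      # start two worker processes
clusterApply(cl, 1:4, function(x) x^2)    # returns list(1, 4, 9, 16), computed on the workers
stopCluster(cl)                           # always shut the workers down when finished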
The clustering problem can easily be corrected using the “felm” function from (what I consider the best R package) “lfe”. However, there are many occasions where researchers want to use econometric techniques that, unlike the OLS or 2SLS estimators, do not lend themselves to a simple variance-covariance correction. These are the situations where you want to use the cluster bootstrap.
The code below demonstrates how this can be done with and without parallel processing. The only difference is that the parallel version requires the user to set the number of processing clusters (again, not clusters in the data!) and to use clusterApply instead of apply. In this application, parallel processing reduces the cluster bootstrap time from 5 minutes 42 seconds to 4 minutes 6 seconds. This might seem a fairly modest gain; however, this simple application uses a relatively small number of observations (10,000). The parallel approach will become relatively faster as the number of observations grows, and the gains will also be larger on a computer with more cores.
I appreciate any comments or criticism people might have on the code below. If anybody can think of a way to speed this up even more, I would be delighted to hear it.
# cluster bootstrap with parallel processing
rm(list = ls())

# packages for cluster standard errors
library(lmtest)
library(lfe)
# use multicore functions
library(snow)

# set up simulation
n <- 10000 # number of observations
x <- rnorm(n)
y <- 5 + 2*x + rnorm(n, 0, 40)

# regression
m1 <- lm(y ~ x)
summary(m1) # standard error is 0.4

# duplicate data
dpt <- 20 # dpt times
dat <- data.frame(x = rep(x, dpt), y = rep(y, dpt), g = rep(1:n, dpt))

# regression with no clustering
m2 <- lm(y ~ x, data = dat) # smaller StErrs
summary(m2) # naive standard error falls to 0.09

# now cluster
summary(felm(y ~ x | 0 | 0 | g, data = dat))
# standard error is back to 0.4, as in m1

# lets do this with a regular cluster bootstrap
reps <- 50 # 50 reps; in practice do more
clusters <- unique(dat$g)
boot.res1 <- matrix(NA, nrow = reps, ncol = 1)

# open time stamp
t1 <- Sys.time()
# set the seed
set.seed(12345)
# do in loop
for(i in 1:reps){
  # sample the clusters with replacement
  units <- sample(clusters, size = length(clusters), replace = TRUE)
  # create bootstrap sample with sapply
  df.bs <- sapply(units, function(x) which(dat[, "g"] == x))
  df.bs <- dat[unlist(df.bs), ]
  boot.res1[i] <- coef(lm(y ~ x, data = df.bs))[2]
}
# close time stamp
t2 <- Sys.time()
t2 - t1

sd(boot.res1) # good: bootstrap standard error is 0.4

# now lets speed up the sapply step from the previous example
boot.res2 <- matrix(NA, nrow = reps, ncol = 1)
# set the seed
set.seed(12345)

cl <- makeCluster(10)
# open time stamp
t3 <- Sys.time()
# do in loop
for(i in 1:reps){
  # sample the clusters with replacement
  units <- sample(clusters, size = length(clusters), replace = TRUE)
  # now use the 10 cores instead of 1!
  clusterExport(cl, c("dat", "units"))
  # clusterApply instead of regular apply
  df.bs <- clusterApply(cl, units, function(x) which(dat$g == x))
  df.bs <- dat[unlist(df.bs), ]
  boot.res2[i] <- coef(lm(y ~ x, data = df.bs))[2]
}
# close time stamp
t4 <- Sys.time()
t4 - t3
stopCluster(cl)

sd(boot.res2) # good: bootstrap standard error is 0.4
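As an aside not in the original code, makeCluster(10) above hard-codes ten workers, which assumes your machine has that many cores available. Something like the following, using the “parallel” package that ships with base R, can pick the worker count automatically:

library(parallel)
cl <- makeCluster(detectCores() - 1) # use all but one of the available cores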