Site icon R-bloggers

Speeding up model bootstrapping in GNU R

[This article was first published on R snippets, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
After my last post I have recurringly received two questions: (a) is it worthwhile to analyze GNU R speed in simulations and (b) how would simulation speed compare between GNU R and Python. In this post I want to address the former question and next time I am going to tackle the latter.

An area in which I use simulation in GNU R on a daily basis is bootstrapping. Therefore I have decided to check out how much speedup one can expect with tuning of standard model estimation procedures.

I started with linear regression: a simple model with two explanatory variables and 100 observations. I wanted to compare speed of generation of 10000 bootstrap replications using standard lm function and bare matrix calculus formula for b solving the equation: X’Xb=X’y. Here is the simulation:

set.seed(1)< o:p>
n <- 100< o:p>
x1 <- runif(n)< o:p>
x2 <- runif(n)< o:p>
y <- x1 + x2 + 1 + rexp(n) # add non-normal error distribution< o:p>
data.m <- cbind(“(Intercept)” = 1, x1, x2, y)< o:p>
data.f <- data.frame(x1, x2, y)< o:p>

boot.bare <- function() {< o:p>
    dm <- data.m[sample.int(n, replace = T),]< o:p>
    X <- dm[, 1:3]< o:p>
    tX <- t(X)< o:p>
    y <- dm[, 4]< o:p>
    solve(tX %*% X, tX %*% y)< o:p>
}< o:p>

boot.lm <- function() {< o:p>
    lm(y~., data=data.f[sample.int(n, replace = T),])$coef< o:p>
}< o:p>

set.seed(2)< o:p>
system.time(rb <- replicate(10000, boot.bare())[, 1,])< o:p>
#  0.73 seconds< o:p>

set.seed(2)< o:p>
system.time(rl <- replicate(10000, boot.lm()))< o:p>
# 16.00 seconds< o:p>

# check if both procedures produce the same results< o:p>
range(rb rl)< o:p>
# OK difference ranges from -5.373479e-14 to 4.751755e-14

Both formulas yield practically identical results but using lm produces over 20 times slower execution. Of course lm does much more work than simple calculation of parameter estimates – boot for bootstrapping I need only the parameters.

So next I thought to check out a more complex model. Therefore I switched to logistic regression.
I compared glm estimation with two alternatives: brute-force optimization of log-likelihood using nlm and direct computation of estimates using iteratively reweighted least squares (IRLS, the method actually used by glm internally). Here is the code:

set.seed(1)< o:p>
n <- 100< o:p>
x1 <- runif(n)< o:p>
x2 <- runif(n)< o:p>
y <- x1 + x2 + runif(n) < 1.5 # add non-logistic error distribution< o:p>
data.m <- cbind(1, x1, x2, y)< o:p>
data.f <- data.frame(x1, x2, y)< o:p>

boot.nlm <- function() {< o:p>
    llik <- function(a) {< o:p>
        prob <- 1 / (1 + exp(-X %*% a))< o:p>
        prob[y == F] <- 1 prob[y == F]< o:p>
        sum(log(prob))< o:p>
    }< o:p>

    dm <- data.m[sample.int(n, replace = T),]< o:p>
    X <- dm[, 1:3]< o:p>
    tX <- t(X)< o:p>
    y <- dm[, 4]< o:p>
    nlm(llik, c(1, 2, 3))$estimate< o:p>
}< o:p>

boot.irls <- function() {< o:p>
    dm <- data.m[sample.int(n, replace = T),]< o:p>
    X <- dm[, 1:3]< o:p>
    tX <- t(X)< o:p>
    y <- dm[, 4]< o:p>
    b <- rep(0, 3)< o:p>
    while(TRUE) {< o:p>
        prob <- 1 / (1 + exp(-X %*% b))< o:p>
        pp <- as.vector(prob * (1 prob))< o:p>
        pp <- rbind(pp, pp, pp)< o:p>
        grad <- tX %*% (y prob)< o:p>
        b <- b + solve((tX * pp) %*% X, grad)< o:p>
        if (sum(abs(grad)) < 1e6) {< o:p>
            return(b)< o:p>
        }< o:p>
    }< o:p>
}< o:p>

boot.glm <- function() {< o:p>
    x <- data.f[sample.int(n, replace = T),]< o:p>
    glm(y~., data=x, family = “binomial”)$coef
}< o:p>

set.seed(2)< o:p>
system.time(rn <- replicate(10000, boot.nlm()))< o:p>
# 35.48 seconds< o:p>
# warnings related to procedure convergence produced< o:p>

set.seed(2)< o:p>
system.time(ri <- replicate(10000, boot.irls())[, 1,])< o:p>
# 5.49 seconds< o:p>

set.seed(2)< o:p>
system.time(rg <- replicate(10000, boot.glm()))< o:p>
# 34.68 seconds< o:p>

# check if all procedures produce the same results< o:p>
range(rn ri) # -0.0002572192  0.0002470589< o:p>
range(rn rg) # -0.0002572191  0.0002470588< o:p>
range(ri rg)# -4.196647e-07  3.602730e-07

We can see that using nlm is the slowest and produces unstable results and that IRLS and glm give identical results but IRLS is 6 times faster.

What are the conclusions? If you really need it is is possible to substantially reduce computational time even for “core” functions.
However, simplest means are not always guaranteed to be efficient (like nlm in logistic regression) and there is a question if the speedup pays of, as there is a substantial additional development effort needed.

To leave a comment for the author, please follow the link and comment on their blog: R snippets.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.