Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
As I noted here, enclosing sub-expressions in parentheses is slower in R than enclosing them in curly brackets. I now know why, and I’ve modified R to reduce (but not eliminate) the slowness of parentheses. The modification speeds up many other operations in R as well, for an average speedup of something like 5% for programs that aren’t dominated by large built-in operations like matrix multiplies.
I looked at the source code for the latest version of R, 2.11.1, and figured out what’s going on. The difference between parentheses and curly brackets comes about because R treats curly brackets as a “special” operator, whose arguments are not automatically evaluated, but it treats parentheses as a “built in” operator, whose arguments (just one for parentheses) are evaluated automatically, with the results of this evaluation stored in a LISP-style list. Creating this list requires allocation of memory and other operations which seem to be slow enough to cause the difference, since the curly bracket operator just evaluates the expressions inside and returns the last of them, without creating such a list. Furthermore, the implementation allocates one more “CONS” cell than is necessary when creating this list, apparently because the programmer found this slightly easier than writing code to avoid creating the extra cell.
Since this is time critical code, not just for parentheses but for most operators, allocating a cell unnecessarily is a bad idea. I modified the program to avoid this, changing the evalList and evalListKeepMissing procedures in the eval.c source file in the src/main directory. Here are the old and new versions of these procedures (probably of interest only to R hackers).
Here are some timing results. First with the R 2.11.1 unmodified, on an Intel Linux system:
> a <- 5; b <- 1; c <- 4 > f <- function (n) for (i in 1:n) d <- 1/{a*{b+c}} > g <- function (n) for (i in 1:n) d <- 1/(a*(b+c)) > system.time(f(1000000)) user system elapsed 1.830 0.010 1.842 > system.time(g(1000000)) user system elapsed 1.990 0.000 1.988
Parentheses are 8.7% slower than curly brackets.
Now here are the results with my modified version:
> a <- 5; b <- 1; c <- 4 > f <- function (n) for (i in 1:n) d <- 1/{a*{b+c}} > g <- function (n) for (i in 1:n) d <- 1/(a*(b+c)) > system.time(f(1000000)) user system elapsed 1.780 0.000 1.783 > system.time(g(1000000)) user system elapsed 1.820 0.000 1.828
Now parentheses are only 2.2% slower than curly brackets. Furthermore, the curly bracket version is 2.7% faster than before, and the parenthesis version is 8.5% faster.
Here’s a test on a program that implements EM for a simple model. First, with the unmodified version of R (running far more iterations of EM than really needed):
> system.time(print(EM.censored.poisson(5,3.2,20,iterations=1000000))) [1] 1.049699 user system elapsed 13.690 0.020 13.703
And now my modified version of R:
> system.time(print(EM.censored.poisson(5,3.2,20,iterations=1000000))) [1] 1.049699 user system elapsed 12.860 0.020 12.878
My new version is 6% faster than the old version of R.
The size of the improvement will vary a lot from one R program to another, of course. (The nature of the change is such that I don’t think any programs will be slower.) But one can see that the simple modification I made has produced a significant improvement — a 6% speedup is a lot for such a small change. From looking at the source code for R, I think a number of other improvements can be made without great effort. In particular, arithmetic on large vectors can be speeded up substantially, including the squaring operation discussed here. The changes to do this will also speed up arithmetic on short vectors, including scalars, but I don’t yet know how significant the improvement will be in that context.
The fundamental reason for the slowness of the current R implementation seems to be a lack of attention by the implementors to efficiency in code that is on the critical interpretive pathway. This is the opposite fault to what is often seen among hackers — an obsessive focus on efficiency even where it doesn’t matter. One should focus on efficiency where it matters, and focus on other things where it doesn’t matter (which is typically 95% of a program). On the plus side, the R source code is fairly readable, even though it lacks much in the way of informative comments. It was readable enough that I was able to figure out what’s going on and modify the program in less than a day. My next task is to figure out how to get the modification into the next release of R…
UPDATE: I forgot to mention that the real solution to speeding up parentheses is to get rid of them entirely. The only reason for them to be preserved in R’s internal expression tree is so that they can be printed when an expression is deparsed. But for that, one could just have a field in each expression node saying how many pairs of parentheses directly enclose it (usually 0 or 1, except for people who really like redundant parentheses), which can be accessed when deparsing the expression, but which would not slow down execution of the program.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.