One often hears that R cannot be fast (false), or more correctly that for fast code in R you may have to consider “vectorizing.” Many knowledgeable R users are not comfortable with the term “vectorize,” and not really familiar with the method. “Vectorize” is just a slightly high-handed way of saying: R naturally stores data in columns (or in column-major order), so if you are not coding to that pattern you are fighting the language.
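As a tiny toy illustration of our own (not part of the timings later in this article): both snippets below compute the same squared values, but the second works with the whole vector at once, which is the pattern R is optimized for.

```r
x <- 1:5

# scalar-oriented: visit one element at a time
squares <- numeric(length(x))
for(k in seq_along(x)) {
  squares[k] <- x[k]^2
}

# vectorized: operate on the whole vector at once
squares <- x^2
```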
In this article we will make the above clear by working through a non-trivial example of writing vectorized code.
For our example problem we will take on the task of computing the PRESS statistic (a statistic we use in our new RcppDynProg package). The “predicted residual error sum of squares,” or PRESS, statistic is an estimate of the out-of-sample quality of a fit. We motivate the PRESS statistic as follows.
Suppose we are fitting a simple linear model. In such a case it is natural to examine the model residuals (differences between the actual values and the matching predictions):
```r
d <- data.frame(
  x = 1:6,
  y = c(1, 1, 2, 2, 3, 3))

lm(y ~ x, data = d)$residuals
#          1          2          3          4          5          6 
#  0.1428571 -0.3142857  0.2285714 -0.2285714  0.3142857 -0.1428571 
```
However, because the model has seen the data it is being applied to, these residuals are not representative of the residuals we would see on new data (there is a towards-zero bias in this estimate). An improved estimate is the PRESS statistic: for each point the model is fit on all points except the point in question, and that held-out model is used to compute the point’s residual. This is a form of cross-validation and is easy to calculate:
```r
xlin_fits_lm <- function(x, y) {
  n <- length(y)
  d <- data.frame(x = x, y = y)
  vapply(
    seq_len(n),
    function(i) {
      m <- lm(y ~ x, data = d[-i, ])
      predict(m, newdata = d[i, ])
    }, numeric(1))
}

d$y - xlin_fits_lm(d$x, d$y)
# [1]  0.3000000 -0.4459459  0.2790698 -0.2790698  0.4459459 -0.3000000
```
Notice these values tend to be further from zero. They also better represent how an overall model might perform on new data.
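For completeness, the PRESS statistic itself is the standard summary of these leave-one-out residuals: writing $\hat{y}_{(i)}$ for the prediction at the $i$-th point from the model fit with the $i$-th point withheld, it is

$$\mathrm{PRESS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2 .$$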
At this point, from a statistical point of view, we are done. However, re-building an entire linear model for each point is computationally inefficient. Each of the models we are calculating shares most of its training points with the others, so we should be able to build a much faster hand-rolled calculation.
That faster calculation is given in the rather formidable function xlin_fits_R() found here. This function computes the summary statistics needed to solve the linear regression on all of the data. Then, for each point in turn, it subtracts that point’s contribution out of the summaries and performs the algebra needed to solve for, and apply, the held-out model. The advantage of this approach is that removing a point and re-solving the model takes only a very small (constant) number of steps, independent of how many points are in the summary. So this code is, in principle, very efficient.
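To give the flavor of the method, here is a minimal sketch of that summary-downdating idea. The function name xlin_fits_R_sketch() and the simplified algebra are ours; the real xlin_fits_R() adds guards for degenerate cases (such as a near-zero determinant).

```r
xlin_fits_R_sketch <- function(x, y) {
  n <- length(y)
  # summary statistics of the full-sample regression y ~ a + b*x
  sx <- sum(x)
  sy <- sum(y)
  sxx <- sum(x * x)
  sxy <- sum(x * y)
  preds <- numeric(n)
  for(k in seq_len(n)) {
    # subtract point k's contribution out of the summaries
    nk <- n - 1
    sxk <- sx - x[k]
    syk <- sy - y[k]
    sxxk <- sxx - x[k] * x[k]
    sxyk <- sxy - x[k] * y[k]
    # solve the 2-parameter least squares problem from the summaries
    det <- nk * sxxk - sxk * sxk
    b <- (nk * sxyk - sxk * syk) / det
    a <- (syk - b * sxk) / nk
    # apply the held-out model to point k
    preds[k] <- a + b * x[k]
  }
  preds
}
```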
And in fact, timings on a small problem (300 observations) show that while the simple “xlin_fits_lm() calls lm() a bunch of times” approach takes 0.28 seconds, the more complicated xlin_fits_R() takes 0.00043 seconds: a speedup of over 600 times!
However, this code performs a separate calculation for each scalar data point. As we mentioned above, this is fighting R, which is specialized for performing calculations over large vectors. The exact same algorithm written in C++, instead of R, takes 0.000055 seconds, almost another multiple of 10 faster!
The timings are summarized below.

implementation                             time (seconds)
xlin_fits_lm() (repeated lm() calls, R)    0.28
xlin_fits_R() (scalar-oriented R)          0.00043
same algorithm (scalar-oriented C++)       0.000055
This sort of difference, scalar-oriented C++ being so much faster than scalar-oriented R, is often distorted into “R is slow.” This is just not the case. If we adapt the algorithm to be vectorized, we get an R implementation with performance comparable to the C++ implementation!
Not all algorithms can be vectorized, but this one can, and in an incredibly simple way. The original algorithm (xlin_fits_R()) is a bit complicated, but the vectorized version (xlin_fits_V()) is literally derived from the earlier one by crossing out the indices. That is: in this case we can move from working over very many scalars (slow in R) to working over a small number of vectors (fast in R).
Let’s take a look at the code transformation.
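Applying that crossing-out to our sketch above gives the following (again our own simplified sketch, not the production xlin_fits_V()): the for-loop disappears, and every “[k]”-subscripted scalar becomes a whole vector.

```r
xlin_fits_V_sketch <- function(x, y) {
  n <- length(y)
  # same full-sample summaries as before
  sx <- sum(x)
  sy <- sum(y)
  sxx <- sum(x * x)
  sxy <- sum(x * y)
  # leave-one-out summaries for all points at once
  nk <- n - 1
  sxk <- sx - x
  syk <- sy - y
  sxxk <- sxx - x * x
  sxyk <- sxy - x * y
  # solve and apply all n held-out models in a few vector operations
  det <- nk * sxxk - sxk * sxk
  b <- (nk * sxyk - sxk * syk) / det
  a <- (syk - b * sxk) / nk
  a + b * x
}

# confirm it matches the lm()-based leave-one-out residuals
d$y - xlin_fits_V_sketch(d$x, d$y)
# [1]  0.3000000 -0.4459459  0.2790698 -0.2790698  0.4459459 -0.3000000
```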
We are not saying that xlin_fits_R() or xlin_fits_V() are easy to understand; we felt pretty slick when we derived them, and we added a lot of tests to confirm they calculate the same thing as xlin_fits_lm(). What we are saying is that the transform from xlin_fits_R() to xlin_fits_V() is simple: just cross out the for-loop and all of the “[k]” indices!
Performing the exact same operation on every entry of a structure (but with different values) is the essence of “vectorized code.” When we wrote a for-loop in xlin_fits_R() to perform the same steps for each index, we were in fact fighting the language. Crossing out the for-loop and the indices mechanically turned the scalar code into faster vector code.
And that is our example of how and why to vectorize code in R.