R and the Collatz Conjecture: Part 2
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Seth Mottaghinejad, Analytic Consultant for Revolution Analytics
In the last article, we showed two separate R implementations of the Collatz conjecture: 'nonvec_collatz' and 'vec_collatz', with the latter being more efficient than the former because of the way it takes advantage of vectorization in R. Let's once again take a look at 'vec_collatz':
vec_collatz <- function(ints) { + niter <- integer(length(ints)) + while (abs(sum(ints - 1)) > .01) { + niter <- niter + ifelse(ints == 1, 0, 1) + ints <- ifelse(ints == 1, ints, ifelse(ints %% 2 == 0, ints / 2, 3*ints + 1)) + } + + niter + }
Today we will learn a thrid, and far more efficient way of implementing the Collatz conjecture. It involves rewriting the function in C++ and using the 'Rcpp' package in R to compile and run the function without ever leaving the R environment.
library("Rcpp")
One important difference between R and C++ is that when you write a C++ function, you need to declare your variable types. The C++ code chunk shown below creates a function called 'cpp_collatz' which takes an input of type 'IntegerVector' and whose output is of type 'IntegerVector'. Unlike in R, where explicit loops can slow your code down, loops in C++ are usually very efficient, even though they are tedious to write.
cpptxt <- ' + IntegerVector cpp_collatz(IntegerVector ints) { + IntegerVector iters(ints.size()); + for (int i=0; i<ints.size(); i++) { + int nn = ints(i); + while (nn != 1) { + if (nn % 2 == 0) nn /= 2; + else nn = 3 * nn + 1; + iters(i) += 1; + } + } + return iters; + }' cpp_collatz <- cppFunction(cpptxt) set.seed(20) cpp_collatz(sample(20)) [1] 20 17 8 19 4 20 1 0 2 5 3 16 9 7 17 7 12 6 14 9
set.seed(20) a <- vec_collatz(sample(2000)) set.seed(20) b <- cpp_collatz(sample(2000)) all.equal(a, b) # should return TRUE [1] TRUE
Let's now redo our C++ implementation in a slightly different way. We would rather not have C++ code interspersed with R code: Not only does it make it hard to read, but we also won't be able to take advantage of syntax-hightlighting specific to C++ (among other annoyances). So let's store the C++ code in a file we call 'collatz.cpp' and use the 'sourceCpp' function in R to call it. Here is the content of 'collatz.cpp':
cat(paste(readLines(file("collatz.cpp")), collapse = "\n")) #include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] int collatz(int nn) { int ii = 0; while (nn != 1) { if (nn % 2 == 0) nn /= 2; else nn = 3 * nn + 1; ii += 1; } return ii; } // [[Rcpp::export]] IntegerVector cpp_collatz(IntegerVector ints) { IntegerVector iters(ints.size()); for (int i=0; i<ints.size(); i++) { iters(i) = collatz(ints(i)); } return iters; } // [[Rcpp::export]] IntegerVector sug_collatz(IntegerVector ints) { return sapply(ints, collatz); }
There are three things worth mentioning about the above code chunk:
- We broke up the function in two: one function 'collatz' that operates on a single integer and returns its stopping time, and another function 'cpp_collatz' that runs the 'collatz' function on a vector of integers and returns a vector of stopping times. This makes the code easier to read (no more nested loops) and more modular.
- There are two extra lines at the top, which tell C++ what namespace to use, as well as an extra line '// [[Rcpp::export]]' before each function definition. The latter is used to let Rcpp know that theses functions should be made available to R once they have been compiled.
- The third function, called 'sug_collatz', does the same thing as 'cpp_collatz' but uses the Rcpp-sugar extension, which is an attempt to take common R functions and rewrite them in C++ so the C++ code can look like its R counterpart when possible. This can sometimes come at a very small cost but it saves us a lot of hassle, as you can see.
To compile the C++ code, just type the following in R:
sourceCpp("collatz.cpp")
Assuming that we have a C++ compiler istalled, it will take a few seconds to run. We can now type the following into the R console:
cpp_collatz function (ints) .Primitive(".Call")(<pointer: 0x000000006e042180>, ints) cpp_collatz(1:10) # seems to be working fine. [1] 0 1 7 2 5 8 16 3 19 6 sug_collatz(1:10) # same as above. [1] 0 1 7 2 5 8 16 3 19 6
And now let's get to the reason we bothered with C++ in the first place: efficiency. There are four comparisons we're interested in:
- the 'cpp_collatz' function
- the 'sug_collatz' function, which uses 'sapply' in C++ instead of an explicit loop
- the 'vec_collatz' function we wrote earlier which is our R implementation
- the 'collatz' function we wrote earlier in C++, which works only on a single integer, but we can vectorize it by wrapping it in 'sapply' in R.
We can think of the last case as being a hybrid approach. One reason to include it is because someone may share a complicated piece of C++ code that works but is not vectorized and vectorizing it may turn out to be a non-trivial task (programmers are supposed to be lazy after all!).
collatz_benchmark <- function(nums, ...) { + require(rbenchmark) + benchmark( + cpp_collatz(nums), # runs completely in C++ + sug_collatz(nums), # runs completely in C++ + vec_collatz(nums), # runs completely in R in a vectorized fashion + sapply(nums, collatz), # runs collatz in C++ but sapply in R + columns = c("test", "replications", "elapsed", "relative"), + ... + ) + }
Let's compare the two functions for all integers from 1 to 10^4:
collatz_benchmark(1:10^4, replications = 20) test replications elapsed relative 1 cpp_collatz(nums) 20 0.03 1.000 4 sapply(nums, collatz) 20 0.53 17.667 2 sug_collatz(nums) 20 0.03 1.000 3 vec_collatz(nums) 20 51.36 1712.000
And for all integers from 1 to 10^5:
collatz_benchmark(1:10^5, replications = 20) test replications elapsed relative 1 cpp_collatz(nums) 20 0.45 1.071 4 sapply(nums, collatz) 20 5.76 13.714 2 sug_collatz(nums) 20 0.42 1.000 3 vec_collatz(nums) 20 753.08 1793.048
As we can see 'cpp_collatz' and 'sug_collatz' are almost identical when it comes to efficiency and both are far more efficient than 'vec_collatz', and increasingly so for larger sequences of integers. Also notice the relative efficiency of the hybrid approach compared to 'vec_collatz'.
Let's benchmark on a sample of 1000 integers from 1 to 10^6 for a more realistic comparison of four approaches:
set.seed(20) collatz_benchmark(sample(1:10^6, 1000), replications = 100) # will take a while to run! test replications elapsed relative 1 cpp_collatz(nums) 100 0.05 1.667 4 sapply(nums, collatz) 100 0.23 7.667 2 sug_collatz(nums) 100 0.03 1.000 3 vec_collatz(nums) 100 31.15 1038.333
The results are remarkable: the C++ functions 'sug_collatz' is the winner with 'cpp_collatz' a close second. The slight advantage of 'sug_collatz' may be due to 'sapply' in C++ using a more efficient method for running the loop (such as iterators). Moreover, both functions are about 1000 times faster than 'vec_collatz'! Even though 'vec_collatz' was specifically written to take advantage of vectorization in R, it still pales in comparison to the might of C++. Even more surprising is that the hybrid approach is only about 8 times slower than C++ approach, which is not bad at all. But if your "vectorization" consists of wrapping a function in 'sapply', then it's best to do it in C++ as we did with 'sug_collatz' than to do it in R.
One take-away lesson for us is that istead of spending a lot of time and effort making our R code as efficient as possible, it may be worth investing that time in learning how to code in a language like C++. Especially since packages like 'Rcpp' and the 'Rcpp-sugar' extension give us the best of both worlds. Another take away is that hybrid solutions are a good alternative in cases when our C++ code is efficient but not vectorized.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.