Code Optimization: One R Problem, Thirteen Solutions – Now Sixteen!
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
The old r-wiki optimisation challenge describes a string generation problem which I have bloged about previously both here and here.
The Objective
To code the most efficient algorithm, using R, to produce a sequence of strings based on a single integer input, e.g.:
# n = 4 [1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004" # n = 5 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004" [9] "i003.005" "i004.005" # n = 6 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005" [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"
Solutions One Through Thirteen
A variety of different approaches are illustrated on the r-wiki page which show the performance benefits of things like vectorisation, variable initialisation, linking through to a compiled programming language, reducing a problem to its component parts, etc.
The Fourteenth Solution
The main speed improvement here comes from replacing the function “paste” by “file.path”. This use of “file.path” with parameter fsep=”” only works correctly here because there is never a character vector of length 0 for it to deal with. I only learned about this approach when I happened to see this tweet on twitter with hashtag #rstats and reading the associated help file where it says that it is faster than paste.
generateIndex14 <- function(n) { # initialise vectors s <- (mode = "character", length = n) # set up n unique strings s <- sprintf("%03d", seq_len(n)) # paste strings together unlist(lapply(1:(n-1), function(i) file.path("i", s[i], ".", s[(i+1):n], fsep = "") ), use.names = FALSE) }
Timings:
test elapsed n replications generateIndex14(n) 27.27500 2000 50 generateIndex13(n) 33.09300 2000 50 generateIndex12(n) 35.31344 2000 50 generateIndex11(n) 36.32900 2000 50
The Fifteenth Solution: Rcpp
This solution comes from Romain Francois and is based on the tenth solution but implemented in C++ using the R package Rcpp. See his blog for the implementation. This is the sort of thing I would love to learn to do myself but just need to find the time to re-learn C++, though I doubt that’ll happen any time soon as I’m hoping to start my MSc in Statistics next year. This is a great solution though.
Timings:
test elapsed n replications generateIndex15(n) 23.30100 2000 50 generateIndex14(n) 27.27500 2000 50 generateIndex13(n) 33.09300 2000 50 generateIndex12(n) 35.31344 2000 50 generateIndex11(n) 36.32900 2000 50
The Sixteenth Solution
When I was writing up this post I thought up a sixteenth solution (as seems to be the pattern with me on this blog!). This solution gets its speed up by generating the largest set of strings which start i001.xxx first and then replacing the “001” part with “002”, “003”, “004”, etc., for each increment up to and including n-1.
generateIndex16 <- function(n) { # initialise vectors str <- vector("list", length = n-1) s <- vector(mode = "character", length = n) # set up strings s <- sprintf("%03d", seq_len(n)) str[[1]] <- file.path("i", s[1], ".", s[-1], fsep = "") # generate string sequences str[2:(n-1)] <- lapply(2:(n-1), function(i) sub("001", s[i], str[[1]][i:(n-1)], fixed=TRUE)) unlist(str) }
The above requires matching the “001” part first and then replacing it. However, we know that “001” will ALWAYS be in character positions 2, 3 and 4, and so there may be a way to avoid the matching part altogether (i.e. replace a fixed position substring with another string of equal or larger length) but I could not work out how to do that outside of a regular expression. Sadface.
Timings:
test elapsed n replications generateIndex16(n) 20.77200 2000 50 generateIndex15(n) 23.30100 2000 50 generateIndex14(n) 27.27500 2000 50 generateIndex13(n) 33.09300 2000 50 generateIndex12(n) 35.31344 2000 50 generateIndex11(n) 36.32900 2000 50
Solutions Comparisons For Different N
I like ggplot2 charts and so ran my computer overnight to generate data for the speed performance of the last several solutions over different N:
Final Thoughts
I’m pretty sure that any more speed improvements will come from some or all of the follwing:
- doing the heavy lifting in a compiled language and interfacing with R
- running in parallel (I actually got this to work on linux by replacing lapply with mclapply from the parallel R package but the downside was that one has to use much more memory for larger values of N, plus it’s only works in serial fashion on Windows
- working out an efficient way of replacing a fixed positioned substring with a string of equal or great length
- compiling the function into R bytecodes using the compiler package function cmpfun
It would also be interesting to profile the memory usage of each funciton.
This was a fun challenge – if you find some spare time why not try your hand at it, you might come up with something even better! 🙂
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.