Code Optimization: One R Problem, Ten Solutions – Now Eleven!
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Earlier this year I came across a rather interesting page about optimisation in R from rwiki. The goal was to find the most efficient code to produce strings which follow the pattern below given a single integer input n:
# n = 4 [1] "i001.002" "i001.003" "i001.004" "i002.003" "i002.004" "i003.004" # n = 5 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i002.003" "i002.004" "i002.005" "i003.004" [9] "i003.005" "i004.005" # n = 6 [1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i002.003" "i002.004" "i002.005" [9] "i002.006" "i003.004" "i003.005" "i003.006" "i004.005" "i004.006" "i005.006"
From this we can see that the general pattern for n is:
# pseudo code
FOR j = 1 to n-1
FOR k = j+1 to n
PRINT "i" + format(j, digits = 3) + "." + format(k, digits = 3)
k = k + 1
END FOR
j = j + 1
END FOR
# this produces (n-1)*n/2 strings.
It is rather heart warming to go though that rwiki page and see how we can sequentially optimise the algorithm in R to more efficiently produce the desired string sequence. I learned quite a lot from this page about R and how fun these types of challenges can be!
Looking at the tenth solution, it achieves its speed by recognising that there are a total of n unique strings (e.g. “001″, “002″) to the pattern. These can be assigned to a character vector which we subsequently call by working out the sequence of indices we require and passing them to the character vector:
generateIndex10 <- function(n) {
s <- sprintf("%03d", seq_len(n));
paste(
"i",
rep(s[1:(n-1)], (n-1):1),
".",
unlist(lapply(2:n, function(k) s[k:n]), use.names=FALSE),
sep=""
)
}
generateIndex10(7)
[1] "i001.002" "i001.003" "i001.004" "i001.005" "i001.006" "i001.007" "i002.003" "i002.004"
[9] "i002.005" "i002.006" "i002.007" "i003.004" "i003.005" "i003.006" "i003.007" "i004.005"
[17] "i004.006" "i004.007" "i005.006" "i005.007" "i006.007"
Playing around with the solution above, I noticed that a speed up would be possible if the following were implemented:
repon a string vector is slower than simply usingrep.intto work out the indices first and then passing those into the character vector.- Initialise the vectors to the correct size. This seems to produce a speed up though I’m not sure why it would in a vectorised solution. I could see the benefit if I were creating the vectors using an explicit R loop and concatenating the results but that is not the case here.
- The functions could be compiled into bytecodes using the new compiler package (NB given that tenth solution is from 2007, this option would not have been available back then).
- This problem could be set up to run in parallel. I didn’t do this because I’m still within running distance of being an R noob
![]()
Given the first three points above, this is the solution I came up with:
generateIndex11 <- function(n) {
# initialise vectors
len <- (n-1)*n/2
s <- vector(mode = "character", length = n)
ind.1 <- vector(mode = "integer", length = len)
ind.2 <- vector(mode = "integer", length = len)
# set up strings
s <- sprintf("%03d", seq_len(n))
# calculate indices
ind.1 <- rep.int(1:(n-1), (n-1):1)
ind.2 <- unlist(lapply(2:n, ":", n), use.names = FALSE)
# paste strings together
return(paste("i", s[ind.1], ".", s[ind.2], sep = ""))
}
But is it faster? We can benchmark it and see!
# load packages
library(compiler)
library(rbenchmark)
# compile functions
compiled_generateIndex10 <- cmpfun(generateIndex10)
compiled_generateIndex11 <- cmpfun(generateIndex11)
# set size
n = 5000
# run benchmark
benchmark(generateIndex10(n), compiled_generateIndex10(n), generateIndex11(n), compiled_generateIndex11(n),
columns = c("test", "replications", "elapsed", "relative"),
order = "relative",
replications = 10)
###--- TIMINGS ---###
test replications elapsed relative
4 compiled_generateIndex11(n) 10 51.46 1.000000
3 generateIndex11(n) 10 51.67 1.004081
2 compiled_generateIndex10(n) 10 54.35 1.056160
1 generateIndex10(n) 10 61.20 1.189273
So it seems that there is a speed up between generateIndex10 and generateIndex11. If we call the benchmark for different values of n and plot using the ggplot2 pack we get the following:
library(ggplot2)
get_bm <- function(n) {
gc()
df = benchmark(generateIndex10(n), compiled_generateIndex10(n), generateIndex11(n), compiled_generateIndex11(n),
columns = c("test", "elapsed"),
order = "relative",
replications = 10)
df$n = n
return(df)
}
df.list <- lapply(seq(1000, 9000, by = 250), get_bm)
df <- do.call("rbind", df.list)
ggplot(df, aes(x=n, y=elapsed, colour=test, shape=test)) + geom_point() + stat_smooth() + xlab("N") + ylab("Time (seconds)")
Which produces the graph below. I don’t know what’s happening at the extreme values but I suspect that it has something to do with memory because on my 8GB machine, the RAM was mostly at 99% for the largest values of n. I also like how stable the compiled versions of the functions are:

In conclusion, whist this compiled eleventh solution is faster for larger values of n, for values below 1000 the tenth solution is sometimes faster.
Writing this post, I noticed something and actually came up with an even faster twelfth solution. When I get time in the next couple of days I’ll make a post about it and also submit it to rwiki.
Thanks for reading, this was my first proper R blog post, constructive comments are most welcome! Also, please kindly bare in mind that I am not an R expert but rather just trying to improve my knowledge of the language
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.