My heart sinks a little when I check on my laptop in the morning and the computation I started the night before still hasn’t finished. Even when the data I’m playing with isn’t particularly…. large… (I’m not going to say it), I have a knack for choosing expensive algorithms late at night.
Because of my reluctance to get remote servers and tmux involved at ungodly hours, I’ve been exploring ways to speed up my code without too much trouble (I take the first of the Three virtues of programming very seriously) and without having to translate from my native tongue: R.
- writing idiomatic R: taking advantage of the way R stores numerics in contiguous blocks of RAM and uses C code for vectorized functions (a quick illustration follows this list)
- using Rcpp or inline: identifying bottlenecks and replacing time-critical/innermost loops with C or C++ code
- distributing processing over machines: it's relatively easy to set up a computing cluster using Amazon Web Services and the snow package, or a similar one
- taking advantage of multiple cores
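To make the first point concrete, here is a tiny, contrived illustration (nothing to do with airports yet) of an interpreted loop versus a single vectorized call; the speedup comes from the vectorized version running in C over a contiguous block of memory:

```r
x <- runif(1e6)

# squaring a vector with an explicit, interpreted R loop
loop.squares <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) {
    out[i] <- v[i]^2
  }
  out
}

system.time(loop.squares(x))  # interpreted loop over each element
system.time(x^2)              # one vectorized call backed by C
```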
Besides writing idiomatic R (which is the subject of a future post), I’ve found that the easiest way to get a speedup is to use parallel processing on a single machine with the multicore package. This usually gives me a very substantial speedup, even if the code is less than elegant, as we’ll see.
Testing it:
Just for the sake of choosing a problem that exhibits combinatorial explosion, let’s use the multicore package to get the mean distance between all airports in the U.S.
If we think of the network of air travel as a complete graph, where the vertices are airports and the bee-line between any two airports is an edge, we want to find the average length of the edges. The number of edges in a complete graph with n vertices is n(n − 1)/2 (that is, n choose 2).
The dataset of US airport codes and their longitude and latitude is available here. There are 13,429 airport codes, which makes the total number of combinations 90,162,306. Clearly, extreme measures have to be taken to make this a tractable problem on commodity hardware.
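Just to sanity-check that count of pairs in R:

```r
# number of unordered pairs ("n choose 2") of 13,429 airports
choose(13429, 2)
# [1] 90162306
```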
Since the Earth isn’t flat (it’s not even, strictly speaking, a perfect sphere), the ground distance covered by a degree of longitude or latitude isn’t constant. Luckily, there’s a great package, Imap, whose gdist function converts a pair of longitude/latitude points into a distance in miles or kilometers.
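For instance, a one-off call looks like this; the JFK and LAX coordinates below are rough values I’m typing in by hand for illustration, not rows from the dataset:

```r
library(Imap)

# gdist takes longitude/latitude pairs: gdist(lon.1, lat.1, lon.2, lat.2, units)
gdist(-73.78, 40.64, -118.41, 33.94, units = "km")
# roughly 3,980 km between (approximately) JFK and LAX
```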
The code:
```r
library(multicore)
library(Imap)

# great-circle distance (km) between the airports in rows ind1 and ind2
calc.distance.two.rows <- function(ind1, ind2){
  return(gdist(air.locs[ind1, 3], air.locs[ind1, 2],
               air.locs[ind2, 3], air.locs[ind2, 2],
               units="km"))
}

# sum the distances over every pair of indices (one pair per column)
sum.of.distances <- function(a.mat){
  return(sum(apply(a.mat, 2, function(x){
    calc.distance.two.rows(x[1], x[2])
  })))
}

# read airport long/lat data set
air.locs <- read.csv("airportcodes.csv", stringsAsFactors=FALSE)

# read command-line args
args <- commandArgs(TRUE)

# read the number of airports to use (sample size) from the command-line
smp.size <- as.numeric(args[1])

# choose a random sample of airports
sampling <- sample((1:nrow(air.locs)), smp.size)

# shrink dataframe
air.locs <- air.locs[sampling, ]

# create a matrix of unique subsets of indices from the
# data frame that stores the airport information
combos <- combn(1:nrow(air.locs), 2)
num.of.comps <- ncol(combos)

# use single core
single <- function(){
  the.sum <- sum.of.distances(combos)
  result <- the.sum / num.of.comps
  return(result)
}

# use two cores
mult <- function(){
  half <- floor(num.of.comps/2)
  f1 <- parallel(sum.of.distances(combos[, 1:half]))
  f2 <- parallel(sum.of.distances(combos[, (half+1):num.of.comps]))
  the.sum <- sum(as.vector(unlist(collect(list(f1, f2)))))
  result <- the.sum / num.of.comps
  return(result)
}

# compare the execution times (in time elapsed)
perform <- function(){
  first <- system.time(res1 <- single())
  second <- system.time(res2 <- mult())
  cat(smp.size); cat(",first,"); cat(first[3]); cat(","); cat(res1); cat("\n");
  cat(smp.size); cat(",second,"); cat(second[3]); cat(","); cat(res2); cat("\n");
}

perform()
```
Then I wrote a wrapper program in Python that compared the speeds using sample sizes from 10 to 800, in increments of ten.
The results:
The parallelized solution is much faster, but it is not twice as fast, as one might expect. This is because there is overhead involved in forking the processes and collecting their results, and because part of the program (namely, the reading of the dataset) is not parallelized. (For more information, check out Amdahl’s law.)
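To put a rough number on that, here is Amdahl’s law in R; the 90% parallelizable fraction below is just an assumption for illustration, not something I measured for this script:

```r
# Amdahl's law: theoretical speedup when a fraction p of the work
# can be spread over n cores
amdahl <- function(p, n) {
  1 / ((1 - p) + p / n)
}

# assumed (not measured) example: 90% of the work parallelized over 2 cores
amdahl(p = 0.9, n = 2)
# [1] 1.818182   (noticeably less than a 2x speedup)
```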
Some cursory curve-fitting suggests that the single-core solution’s execution time is fairly well modeled by the function
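For the curious, that kind of fit can be sketched as follows; the quadratic form is just my assumption here (the number of pairs grows quadratically with the sample size), and the timings data frame, with columns smp.size and elapsed, is hypothetical:

```r
# hypothetical data frame of single-core results from the wrapper:
# one row per run, holding the sample size and the elapsed time in seconds
fit <- lm(elapsed ~ poly(smp.size, 2, raw = TRUE), data = timings)
summary(fit)

# extrapolate (very roughly) to the full set of 13,429 airports
predict(fit, newdata = data.frame(smp.size = 13429))
```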
After about 1 hour, I psychosomatically smelled melting plastic from my laptop, so I called off the full distance calculation. But repeated trials with sample sizes of 300 suggested that the sampling distribution of the sample mean (whew) had a mean of about 1,874 km and a standard deviation of 64 km (this is the standard error of the mean). This would suggest that the mean distance between any two US airports is about 1,874 km ± 125 (1,164 miles ± 78), with a standard deviation of around 1,109 km (689 miles).
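Those repeated trials amount to something like the sketch below; estimate.mean.distance is a hypothetical helper that repackages the sampling and averaging steps from the script above, and the 30 replications are an arbitrary choice:

```r
# assumes air.locs and the Imap package are already loaded, as in the script above
estimate.mean.distance <- function(smp.size) {
  sampled <- air.locs[sample(1:nrow(air.locs), smp.size), ]
  combos  <- combn(1:nrow(sampled), 2)
  dists   <- apply(combos, 2, function(x) {
    gdist(sampled[x[1], 3], sampled[x[1], 2],
          sampled[x[2], 3], sampled[x[2], 2], units = "km")
  })
  mean(dists)
}

estimates <- replicate(30, estimate.mean.distance(300))
mean(estimates)  # estimate of the mean distance (km)
sd(estimates)    # standard error of that estimate
```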
As this example shows, even a kludgy solution that uses more than one core can give you pretty substantial speed increases. In future posts, I plan to cover applying Rcpp to this problem and setting up a cluster to perform this calculation. In the meantime, I’ll still be starting things at night and regretting it in the morning.
Footnote: If you’re interested in this topic, I suggest you check out this site from CRAN. Also, everything I know about parallel R, I learned from this book.