by Thomas Dinsmore
Regular readers of this blog may be familiar with our ongoing effort to benchmark Revolution R Enterprise (RRE) across a range of use cases and on different platforms. We take these benchmarks seriously at Revolution Analytics, and constantly seek to improve the performance of our software.
Previously, we shared results from a performance test conducted by Allstate. In that test, RRE ran a GLM analysis in five minutes; SAS took five hours to complete the same task. A reader objected that the test was unfair because SAS ran on a single machine, while RRE ran on a five-node cluster. It's a fair point, except that given the software in question (PROC GLM in SAS/STAT) the performance would be the same on five nodes or a million nodes, since PROC GLM can scale up but not out.
Arguing that the Allstate benchmark was "apples to oranges", SAS responded by publishing its own apples-to-oranges benchmark. In this benchmark, SAS demonstrated that its new HPGENSELECT procedure is very fast when it runs on a 144-node grid with 2,304 cores. As noted in the paper, this performance is only possible if you license more software, since HPGENSELECT can only run in Distributed mode if the customer licenses SAS High Performance Statistics.
We will be happy to stipulate that PROC HPGENSELECT runs faster on 2,304 cores than RRE on 20 cores.
As a matter of best practice, software benchmarks should run in comparable hardware environments, so that we can attribute performance differences to the software alone and not to differences in available computing resources. Consequently, we engaged an outside vendor with experience running SAS in clustered environments to perform an "apples to apples" benchmark of RRE vs. SAS. The consultant used a clustered computing environment consisting of five four-core commodity servers (with 16 GB of RAM each) running CentOS, with Ethernet connections and a separate NFS server.
We tested RRE 7 versus SAS Release 9.4, with Base SAS, SAS/STAT and SAS Grid Manager. (We did not test with SAS High Performance Statistics because we could find no vendors with experience using this new software. We note that more than two years into General Availability, SAS appears to have no public reference customers for this software.) In our experience, when customers ask how we perform compared to SAS, they are most interested in how we compare with the SAS software they already use.
To test Revolution R Enterprise ScaleR, we first deployed IBM Platform LSF and Platform MPI Release 9 on the grid, then installed Revolution R Enterprise Release 7 on each node. SAS Grid Manager uses an OEM version of IBM Platform LSF that cannot run concurrently with the standard version from IBM, so we configured the environment and ran the tests sequentially.
To simplify test replication across different environments, we used data manufactured through a random process. The time needed to manufacture the data is not included in the benchmark results. Prior to running the actual tests, we loaded the randomized data into each software product’s native file format: for SAS, a SAS Data Set; for Revolution R Enterprise, an XDF file.
Although we have benchmarked Revolution R Enterprise on data sets as large as a billion rows, typical data sets used by even the largest enterprises tend to be much smaller. We chose to perform the tests on wide files of 591 columns and row counts ranging from 100,000 to 5,000,000, file sizes that represent what we consider to be typical for many analysts. We also ran scoring tests on “narrow” files of 21 columns with row counts ranging up to 50,000,000.
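For readers who want a feel for what "manufactured" data of this shape looks like, here is a minimal Python sketch. It is not the script from our repository: the function names, the column names (`x1`, `x2`, ...), the normal distribution and the seed are all assumptions made for illustration. The only figure taken from the text is the 591-column width of the wide files. Fixing the seed is what makes the data reproducible from one environment to the next.

```python
import csv
import io
import random

N_COLS = 591  # width of the "wide" files described above


def make_rows(n_rows, n_cols=N_COLS, seed=42):
    """Yield n_rows rows of n_cols pseudo-random floats, reproducibly.

    The seed and the standard normal distribution are illustrative
    assumptions, not the settings used in the published benchmark.
    """
    rng = random.Random(seed)
    for _ in range(n_rows):
        yield [round(rng.gauss(0.0, 1.0), 6) for _ in range(n_cols)]


def write_csv(stream, n_rows, n_cols=N_COLS, seed=42):
    """Write a header line plus n_rows data rows to a text stream."""
    writer = csv.writer(stream)
    # Hypothetical column names: x1, x2, ..., x591
    writer.writerow([f"x{i + 1}" for i in range(n_cols)])
    for row in make_rows(n_rows, n_cols, seed):
        writer.writerow(row)


if __name__ == "__main__":
    # A tiny in-memory file; pass an open file and a larger n_rows
    # (for example 100_000) to produce something closer to the test sizes.
    buf = io.StringIO()
    write_csv(buf, n_rows=5)
    lines = buf.getvalue().splitlines()
    print(len(lines), "lines;", len(lines[0].split(",")), "columns")
```

Because rows are generated one at a time, the same function can stream millions of rows to disk without holding the whole table in memory.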
Rather than comparing performance on a single task, we prepared a list of multiple tasks, then wrote programs in SAS and RRE to implement each one. Readers will find the benchmarking scripts here, on Git, together with a script to produce the manufactured data.
To implement a fair test, we asked the SAS consultant to review the SAS programs and enable them for best performance in the clustered computing environment.
Detailed results of the benchmark test are shown here, in our published white paper. In summary:
- RRE ran the tasks forty-two times faster than SAS on the larger data set
- RRE outperformed SAS on every task
- The RRE performance advantage ranged from 10X to 300X
- The RRE advantage increased when we tested on larger data sets
- SAS’ new HP PROC, where available, only marginally improved SAS performance
We invite you to run the scripts in your own environment and let us know the results you achieve.
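If you do report results, a speedup ratio (baseline time divided by candidate time) is an easy figure to compare across environments. The Python sketch below shows one way to compute it; `time_task` and `speedup` are hypothetical helper names, not part of our benchmark scripts, and taking the best of several runs is simply one common convention. The worked example uses the Allstate figures quoted earlier: five hours against five minutes.

```python
import time


def time_task(task, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Five hours versus five minutes, both expressed in seconds.
    print(speedup(5 * 60 * 60, 5 * 60))  # prints 60.0
```

Whatever convention you choose, apply it to both products and state the hardware each one ran on, so the ratio reflects the software rather than the machines.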
Revolution Analytics Whitepapers: Revolution R Enterprise: Faster Than SAS