
SAS Big Data Analytics Benchmark (Part One)

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Thomas Dinsmore

On April 26, SAS published on its website an undated technical paper entitled Big Data Analytics: Benchmarking SAS, R and Mahout. In the paper, the authors (Allison J. Ames, Ralph Abbey and Wayne Thompson) describe a recent project comparing model quality, product completeness, and ease of use across two SAS products, open source R, and Apache Mahout.

Today and next week, I will post a two-part review of the SAS paper. In today's post I will cover simple factual errors and disclosure issues; next week's post will cover the authors' methodology and findings.

Mistakes and Errors

This section covers simple mistakes by the authors.

(1) In Table 2, the authors claim to have tested Mahout "7.0".  I assume they mean Mahout 0.7, the most current release.

(2) In the "Overall Completeness" section, the authors write that "R uses the change in Akaike's Information Criteria (AIC) when it evaluates variable importance in stepwise logistic regression whereas SAS products use a change in R-squared as a default." This statement is wrong. SAS products use R-squared to evaluate variable importance in stepwise linear models, but not for logistic regression (where the R-squared concept does not apply). Review of SAS documentation confirms that SAS products use the Wald chi-square statistic to evaluate variable importance in stepwise logistic regression.
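For reference, stepwise selection by AIC is what base R's step() function does out of the box. A minimal sketch on simulated data (the variable names and coefficients below are illustrative, not taken from the benchmark):

set.seed(42)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x3 is pure noise
y  <- rbinom(n, 1, plogis(0.8 * x1 - 0.5 * x2))  # outcome depends on x1 and x2 only

full <- glm(y ~ x1 + x2 + x3, family = binomial)
step(full, direction = "backward")  # candidate models are ranked by AIC;
                                    # the noise term x3 should be dropped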

(3) Table 3 in the paper states that R does not support ensemble models.  This is incorrect.  See, for example, the randomForest, gbm and ipred packages on CRAN, all of which implement ensemble methods.
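A minimal sketch of two of these packages in action, on simulated data (the data frame and variable names are illustrative):

library(randomForest)  # ensembles of bagged classification trees
library(gbm)           # stochastic gradient boosting

set.seed(42)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- factor(ifelse(df$x1 + 0.5 * df$x2 + rnorm(200) > 0, "yes", "no"))

rf  <- randomForest(y ~ x1 + x2, data = df, ntree = 500)  # an ensemble of 500 trees
df$y01 <- as.numeric(df$y == "yes")                       # gbm's bernoulli loss wants a 0/1 response
bst <- gbm(y01 ~ x1 + x2, data = df, distribution = "bernoulli", n.trees = 100)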

(4) The "Overall Modeler Effort" section includes this statement: "Bewerunge (2011) found that R could not model a data set larger than 1.3 GB because of the object-oriented programming environment within R." The cited paper (linked here) makes no such general statement about R; it simply notes the memory capacity of one small PC and does not demonstrate any link between R's object-oriented approach and its use of memory. The authors also fail to note that in Bewerunge's tests R ran faster than SAS in every test where it was able to run, and that Bewerunge (a long-time SAS Alliance Partner) drew no conclusions about the relative merits of SAS and R.

Disclosure Issues

Benchmarking studies should provide sufficient information about how the testing was performed; this makes it possible for readers to make informed decisions about how well the results generalize to everyday experience. For tests of model quality, publishing the actual data used in the benchmark ensures that the results are replicable.

As we know from the debate over the Reinhart-Rogoff findings, even the best-trained and credentialed individuals can commit simple coding errors. We invite the authors to make the data used in the benchmark study available to the SAS and R communities.

In addition, we think that additional disclosures by the authors will help readers evaluate the methodology and interpret findings from the paper. These include:

(1) Additional detail about the testing environment.  I'll remark on the obvious differences in the hardware provisioning in next week's post, but for now I will simply note that the HPA environment described in the paper does not appear to match any existing Greenplum production appliances;

(2) Actual R packages used for the benchmark;

(3) Size of the data sets (in gigabytes);

(4) Actual sample sizes for the training and validation sets for each method, together with more detail about the sampling methods used;

(5) Details of the model parameter settings used for each method and product;

(6) The value of "priors" used for each model run, which alone may explain the observed differences in event precision (see the sketch after this list);

(7) In the results tables, detailed model quality statistics for each test, including sensitivity, specificity, precision and accuracy, the actual confusion matrices and method-specific diagnostics;

(8) Detailed model quality tables for the Marketing and Telecom problems, which are not disclosed in the paper.
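On point (6), a minimal sketch of why the priors matter: holding the fitted model's scores constant and changing only the decision cutoff (used here as a simple stand-in for a prior adjustment) is enough to move event precision. The scores and cutoffs below are simulated, purely for illustration:

set.seed(42)
score <- runif(1000)             # hypothetical model scores
event <- rbinom(1000, 1, score)  # events are more likely at high scores

precision_at <- function(cutoff) {
  flagged <- score >= cutoff
  sum(event[flagged]) / sum(flagged)  # precision = TP / (TP + FP)
}
precision_at(0.3)  # lower cutoff: more cases flagged as events, lower precision
precision_at(0.7)  # higher cutoff: fewer cases flagged, higher precision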

We invite readers to review the paper and share their thoughts in the Comments section below.

Derek Norton, Joe Rickert, Bill Jacobs, Mario Inchiosa, Lee Edlefson and David Smith all contributed to this post.
