A Data Scientist’s Perspective on Microsoft R
by Lixun Zhang, Data Scientist at Microsoft
As a data scientist, I have experience with R. Naturally, when I was first exposed to Microsoft R Open (MRO, formerly Revolution R Open) and Microsoft R Server (MRS, formerly Revolution R Enterprise), I wanted answers to three questions:
- What do R, MRO, and MRS have in common?
- What’s new in MRO and MRS compared with R?
- Why should I use MRO or MRS instead of R?
The publicly available information on MRS either describes it at a high level or explains specific functions and their underlying algorithms. Materials that compare R, MRO, and MRS tend to stay at a high level and say little about functions and packages, the level at which data scientists are most familiar with R. They also don't answer the questions above in a comprehensive way. So I designed my own tests (the code behind them is available on GitHub). Below are my answers to the three questions. MRO has an optional MKL library; unless noted otherwise, the observations hold whether or not MKL is installed on MRO.
What do R, MRO, and MRS have in common?
After installing R, MRO, and MRS, you'll notice that everything you can do in R can be done in MRO or MRS. For example, you can use glm() to fit a logistic regression and kmeans() to carry out cluster analysis. As another example, you can install packages from CRAN. In fact, a package installed in R can be used in MRO or MRS and vice versa if the package is installed in a library tree that's shared among them. You can use the command .libPaths() to set and get library trees for R, MRO and MRS. Finally, you can use your favorite IDEs such as RStudio and Visual Studio with RTVS for R, MRO or MRS. In other words, MRO and MRS are 100% compatible with R in terms of functions, packages, and IDEs.
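The base R calls mentioned above can be sketched as follows. This is a minimal illustration, not the code from my tests: it uses the built-in mtcars and iris datasets, and the variable names (logit_fit, km_fit, lib_trees) and the "shared_lib" path are placeholders chosen for this example. The same script should run unchanged in R, MRO, or MRS.

```r
# Logistic regression with glm() on the built-in mtcars data:
# model transmission type (am: 0 = automatic, 1 = manual) from weight.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial())
print(coef(logit_fit))

# Cluster analysis with kmeans() on the four numeric iris columns.
set.seed(42)  # kmeans() picks random starting centers
km_fit <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
print(table(km_fit$cluster))

# Get the library trees this installation searches for packages.
lib_trees <- .libPaths()
print(lib_trees)

# To set a shared tree, prepend an existing directory. "shared_lib" is a
# hypothetical path; point each installation at the same folder.
# .libPaths(c("shared_lib", .libPaths()))
```

Running the script once in each installation and comparing the printed coefficients is a quick way to convince yourself that the base functions behave the same.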
What’s new in MRO and MRS compared with R?
While everything you do in R can be done in MRO and MRS, the reverse is not true, because MRO and MRS include additional components. MRO lets users install an optional math library, MKL, for multithreaded performance. This library shows up as a package named “RevoUtilsMath” in MRO.
MRS comes with more packages and functions than R. Most of the additional packages are not on CRAN and are available only after installing MRS. One such example is the RevoScaleR package. MRS also installs the MKL library by default. As for functions, MRS has High Performance Analysis (HPA) versions of many base R functions, which are included in the RevoScaleR package. For example, the HPA version of glm() is rxGlm(), and the HPA version of kmeans() is rxKmeans(). These HPA functions can be used in the same way as their base R counterparts, with additional options. In addition, they can work with a special data format (XDF) that's customized for MRS.
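To make the parallel between base R and HPA functions concrete, here is a hedged sketch. It assumes RevoScaleR is installed (which is only the case on MRS) and guards the HPA calls so the script still runs on plain R or MRO. The datasets and variable names are illustrative choices of mine, and only the basic arguments are shown.

```r
# Base R fit, which works in R, MRO, and MRS alike.
base_fit <- glm(am ~ wt, data = mtcars, family = binomial())
print(coef(base_fit))

# RevoScaleR ships with MRS only, so guard the HPA calls.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  # HPA counterpart of glm(): same formula, data, and family arguments.
  hpa_fit <- RevoScaleR::rxGlm(am ~ wt, data = mtcars, family = binomial())
  print(hpa_fit)

  # HPA counterpart of kmeans(): the clustering variables are given
  # as a one-sided formula instead of a numeric matrix.
  hpa_km <- RevoScaleR::rxKmeans(~ Sepal.Length + Sepal.Width,
                                 data = iris, numClusters = 3)
  print(hpa_km)
} else {
  message("RevoScaleR not found; only the base R fit was run.")
}
```

On a machine without MRS the script prints the base R coefficients and a message, which also illustrates the one-way compatibility described above.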
Why should I use MRO or MRS instead of R?
In a nutshell, MRS solves two problems associated with using R: capacity (handling the size of datasets and models) and speed. MRO addresses speed only.
The following table summarizes the performance comparisons for R, MRO, and MRS. In terms of capacity, using the HPA functions in MRS increases the size of data that can be analyzed. In terms of speed, certain matrix-related base R functions can perform better in MRO and MRS than in R because of MKL, and the HPA functions in MRS perform better than their base R counterparts on large datasets. More details on this comparison can be found in the notebook on GitHub.
Note that packages such as “bigmemory” and “ff” help address some big data problems in R, but they were not included in the benchmark tests.
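The kind of matrix-related comparison described above can be sketched with system.time(). This is not the benchmark code from the notebook, only a small illustration: the matrix size n and the three operations are my own choices, and the RevoUtilsMath call is guarded because that package exists only where MKL is installed. Run the same script in R and in MRO with MKL to compare the elapsed times.

```r
set.seed(1)
n <- 500  # modest size so the sketch finishes quickly; increase to widen the gap
a <- matrix(rnorm(n * n), nrow = n)

# If the optional MKL package is present, report its thread count.
if (requireNamespace("RevoUtilsMath", quietly = TRUE)) {
  cat("MKL threads:", RevoUtilsMath::getMKLthreads(), "\n")
}

# Time three matrix operations that rely on the underlying BLAS/LAPACK library.
timings <- c(
  crossprod = system.time(cp <- crossprod(a))[["elapsed"]],
  solve     = system.time(inv <- solve(cp + diag(n)))[["elapsed"]],
  svd       = system.time(sv <- svd(a))[["elapsed"]]
)
print(timings)
```

Timings vary by machine and by run, so repeat the measurement a few times before drawing conclusions.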
The takeaway for data scientists
For data scientists deciding which of these platforms to use in a given scenario, the following table can serve as a reference. It summarizes where R, MRO, and MRS can be used, depending on the amount of data and on whether MRS's HPA functions are available for the task. Whenever R can be used, MRO can be used as well, with the added benefit of multithreaded execution for certain matrix-related computations. MRS can be used whenever R or MRO can be used, and it also offers HPA functions that perform better in terms of both speed and capacity.
Follow the link below for my in-depth comparison of R, MRO and MRS.
Lixun Zhang: Introduction to Microsoft R Open and Microsoft R Server