If you had asked me two years ago if Microsoft was a serious vendor for data science and analytics infrastructure and tools, I would have laughed. At the time their offering seemed to me to consist of Excel against SQL Server. There is nothing really wrong (or exciting) about SQL Server, but friends don’t let friends use Excel for data analysis or indeed for anything that matters at all, so that whole proposition was a non-starter. But things have moved on, so how does Microsoft stack up now (mid-2016)? (You can skip right to our conclusions if you are impatient.)
In 2015 Microsoft acquired Revolution Analytics, which gave them a serious and scalable tool for data science, visualization, and statistical analysis. This is based on the open-source R implementation, which has massive mind-share, making it easily the most popular serious data science tool, but it adds easy scalability and enterprise features.
And they announced integration with Apache Spark (Microsoft Azure HDInsight), Azure Machine Learning, and SQL Server, which suddenly looks like a serious platform. Add a 75% reduction in the price of Power BI (and a reasonable free version) and various other enhancements, and it is looking really interesting.
So is it ready for prime-time? I had the opportunity this week to spend four days with the Microsoft teams to find out. We focused on the R environments and integrations and received plenty of hands-on experience.
Solutions components
We covered the following solutions components:
- The scalable `rx*` model and data processing functions using the XDF file format.
- The integration with the `dplyr` package through `dplyrXdf` for streaming data manipulation.
- The Apache Spark / HDInsight integration: using essentially the same syntax, but automatically distributed over a Spark cluster.
- The Microsoft SQL Server integration: using the same `rx*` functions against database tables, either from the client or running on the SQL Server.
- Running R inside SQL Server using stored procedures (and `sp_execute_external_script`).
1. Scalable model and data processing functions
This is the venerable technology from Revolution Analytics. It provides functions that work transparently on chunks of data, which means memory limitations are not an issue. For data processing the key function is `rxDataStep`, which does all the transformations you would expect and is supported by `rxMerge` for merging data and `rxImport` for importing data files. This all works well and was the core offering from Revolution Analytics, along with the XDF file storage format, which allows efficient processing of large data.
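To make this concrete, here is a minimal sketch of the XDF workflow (the file and column names are hypothetical):

```r
library(RevoScaleR)

# Import a CSV into the chunked XDF format (hypothetical file names)
rxImport(inData = "flights.csv", outFile = "flights.xdf", overwrite = TRUE)

# Transform in chunks; RevoScaleR streams through the file,
# so the data never has to fit in memory
rxDataStep(inData = "flights.xdf", outFile = "flights2.xdf",
           transforms = list(gain = dep_delay - arr_delay),
           overwrite = TRUE)

# Quick chunked summary of the new variable
rxSummary(~ gain, data = "flights2.xdf")
```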
For data exploration you have comprehensive summaries (including approximate medians), contingency tables, and cross tabulations, and there is a small but useful selection of data visualization functions. Models are currently limited to:

- Linear Regression Models (`rxLinMod`)
- Logistic Regression (`rxLogit`)
- Generalized Linear Models (`rxGlm`)
- Classification and Regression Trees (`rxDTree`)
- Stochastic Gradient Boosted Decision Trees (`rxBTrees`)
- Classification and Regression Decision Forests (`rxDForest`)
- K-Means Clustering (`rxKmeans`)
- Naive Bayes Classifier (`rxNaiveBayes`)
More model types are apparently in the works. I hope to see at least association rules (the Apriori and Eclat algorithms) and anomaly detection (eg one-class support vector machines). I hear rumours that support vector machines (SVM) are coming soon, and of course the infrastructure is there for you to implement your own models.
Nothing new here compared to the Revolution Analytics offering when you are working on XDF files. What is new is the seamless integration with Spark clusters, meaning that you can easily run any of these models in an efficient distributed environment.
The code we went through is here: MRS for SAS Users
2. Streaming data manipulation the `dplyr` way
This is the `dplyrXdf` package, which was announced about three months ago. At the moment you have to install it from the dplyrXdf GitHub page.
I was skeptical at the start but quickly convinced. If you want to quickly compare the syntax you can see the course material here. So you get the `dplyr` syntax, that's nice. But you get it distributed on a Spark cluster with (almost) no change of code: that's awesome! (And you don't have to keep track of your temporary files: that is sometimes useful.)
3. Microsoft R with Spark on HDInsight
It is easy to set up (though not quite as easy as Amazon EMR yet); the instructions are here, but note that this is rapidly changing. The quick summary is that it works really well. The key function here, and for the SQL Server integration, is `rxSetComputeContext`, which is a general function that defines where your computation takes place. On Spark you have many options, but you end up with code along the lines of:
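(This is my sketch rather than the exact course code; the `RxSpark` arguments are illustrative and depend on your cluster.)

```r
library(RevoScaleR)

# Define a Spark compute context (arguments are illustrative;
# tune executors and memory to your HDInsight cluster)
sparkContext <- RxSpark(consoleOutput = TRUE,
                        nExecutor = 4,
                        executorCores = 2,
                        executorMem = "4g")

# From here on, rx* functions run distributed on the cluster
rxSetComputeContext(sparkContext)

# A hypothetical XDF data source living in HDFS
hdfsXdf <- RxXdfData("/share/flightsXdf",
                     fileSystem = RxHdfsFileSystem())
model <- rxLinMod(arr_delay ~ dep_delay + carrier, data = hdfsXdf)

# Switch back to local execution when done
rxSetComputeContext("local")
```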
There is some good documentation here: Overview of R Server on HDInsight
Bottom line: it is easy to use, it works, it scales. But you still have to write good R code, of course.
4. Microsoft R and SQL Server
For an overview and to really understand the workflow for this and the following component, this document is the best: Data Science End-to-End Walkthrough.
But briefly, you can work in two modes. The first is transparent use of SQL tables as yet another XDF-compatible data source, all processed in chunks on your local machine. It works well, with the obvious limitations of data type compatibility etc. The downside is that all the data has to move from the server to the R instance. The second mode fixes that: with a suitable invocation of our old friend `rxSetComputeContext` you can move the computation to the database server.
At the moment, at least, the compute context interface is rather clunky and you end up with lots of local files that are created as part of the communication with the server. (See the `shareDir=` argument of the `RxInSqlServer` function.) This is ugly and error-prone.
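A sketch of both modes (the connection string and table names are hypothetical):

```r
library(RevoScaleR)

conStr <- "Driver=SQL Server;Server=myserver;Database=mydb;Trusted_Connection=True"

# Mode 1: treat a SQL table as just another chunked data source,
# processed locally (all data travels to the client)
flightsSql <- RxSqlServerData(table = "flights", connectionString = conStr)
rxSummary(~ dep_delay, data = flightsSql)

# Mode 2: push the computation into the SQL Server itself
sqlContext <- RxInSqlServer(connectionString = conStr,
                            shareDir = "C:/temp/shareDir",  # local exchange directory
                            wait = TRUE)
rxSetComputeContext(sqlContext)
model <- rxLogit(arr_late ~ dep_delay, data = flightsSql)
rxSetComputeContext("local")
```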
Currently this only really works with the (Enterprise) SQL Server and not, for example, with the Azure SQL service. Very much work in progress, but give it time.
5. Running R as stored procedure in SQL Server
For its final trick, SQL Server is able to run R directly in the server. The key function is `sp_execute_external_script`, which you can include in stored procedures and similar.
In a typical use case, the analyst builds the model using the `rx*` model functions described above, using one of the standard user interfaces and an appropriate compute context. She then stores the (serialized) model in the database, and the DBA sets up a stored procedure to calculate scores for new data (eg in a trigger), for changed data (eg via a scheduled stored procedure), or when the model changes. In other words, this is mainly a deployment option. It works well.
Conclusions
There is much to like here. Microsoft is building an infrastructure and services platform that is robust and scalable, with a good (or even great) partner ecosystem. They are one of the most interesting technology companies in this space, and I am not the only one who is excited by their vision; Gartner says something similar.
Who is it for? This is a technical platform for technically savvy data scientists. If you want something easy to use, go to Oracle Data Miner, where everything is simple and works almost as if by magic, though you sacrifice some control for that convenience. Oracle's set of parallel algorithms is more comprehensive than the `rx*` models described above, which in the current implementation in particular lack association rules and integration with text mining, and the R integration in the Oracle database appears more mature and complete.
But Oracle does not have a simple and consistent interface that works across files, databases, and Spark clusters. And in particular they do not have anything really comparable to Azure and the ecosystem of integrated services you find there to help you build your solution fast and cost-effectively. In that sense the nearest competitor is Amazon Web Services, which has even more integrated services, many of which are more mature. (For example, it is easy to have RStudio Server pre-configured on the EMR Spark service at AWS, while a bit of a hack is needed on Azure.) But AWS does not have the same integration with Microsoft R Server for the large-data-set models, so you will have some more work to do to use that.
Microsoft is by no means the only player in town, but they are certainly credible right now for your data and data science infrastructure, and they are an interesting company to watch.