Site icon R-bloggers

SQL Server, Power BI, and R

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Sheri Gilley, Microsoft Senior Software Engineer

SQL Server 2016 has reached general availability recently and one of the top new capabilities it features is SQL Server R Services —  advanced analytics in-database with the R language.   R has also been integrated into Power BI, allowing you to create fully integrated visualizations with the power of the R language.  In this blog post I will show an example using R in SQL Server to create a model and  batch score testing data, then use Power BI Desktop to create visualizations of the scored data.

Rather than moving your data from the database to an external machine running R,  you can now run R scripts directly inside the SQL Server database – train your model inside the database with the full power of CRAN R packages plus the additional features of speed and scalability from the RevoScaleR package.  And once you have a model, you can perform batch scoring inside the database as well.

But what do you do when it comes to visualizing this newly scored data?  How would you create an interactive experience that supports sharing and embedding in applications? One great option is using PowerBi which gives you a quick and easy way to create visualizations that interact to help you explore your data.   It also easily allows you to share your dashboards and reports, and you can embed the PowerBI visuals in your applications. And PowerBI can access your SQL data.   

Here is an example of model building and scoring for Fraud Detection performed in SQL Server.  The data contains 200,000 online transactions, with a second spreadsheet containing  information for 9000 known fraud transactions.  This template  details the steps to prepare data, create a model and perform batch scoring.  The template includes R code inside stored procedures that allows you to train and save the model and score test data using the model. 

While this example covers a fairly small dataset, R Services can scale to billions of rows – learn more about performance tuning for these operations here.

Once the scoring is complete, you can access this data in either PowerBI or PowerBI Desktop to examine the scores. In PowerBI Desktop, use Get Data>SQL Server Database, and you can connect directly to a database table or can add a query to access the data you wish to visualize.

In this example the scored data is saved in the table sql_predict_score with just the account Id and the predicted scores, so we will add a query to join this with the rest of the input data.

This particular scoring was performed on test data so we also have the known values available to help us see how well the model has performed. We'll use these values in a few different displays to learn more about the scoring of the test data.

Note the first two bar charts – the first shows the known values, the second shows the predicted values. Since we are predicting a rare event (in our sample, only 372 of the 58,924 transaction) it would be difficult to view counts on the bar heights. So instead, the bar heights here represent the average transaction amount, an important attribute to track when dealing with fraud. Right away we see that the average transactions are slightly higher in the fraud transactions than the non-fraud.

While we can see some interesting information in this display, the real power (and fun) of Power BI is in the interactions. When you click on a bar in the graph, the rest of the display is filtered by that selection. For example, click on the first bar to display only the transactions predicted as fraud in all the other displays:

Likewise, click on the first bar of the second graph to view the displays for the 372 known fraud transactions.

In the first barchart the light bars show all transactions, while the darker bars in the middle now show just the fraud transactions. Note from the table that the miss-classified non-fraud transactions represent just 17 cases, while 355 cases were correctly classified In both instances, the average amount for fraud transactions are higher than the average transaction amounts for all data.

In Power BI Desktop, you can also create visualizations with R directly in the report – in this example I've used R to create both the Box & Whisker plot and the Density Plot using trellis. For example:

The integration of R into SQL Server 2016 and Power BI makes it easy for you to create and share rich visualizations with advanced insights and predictions for your business use-case. 

Editors Note: The Power BI Service now also supports reports and dashboards with R visuals.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.