Building a machine learning model with the MicrosoftML package
Microsoft R Server 9 includes a new R package for machine learning: MicrosoftML. (So do the Data Science Virtual Machine and the free Microsoft R Client edition, incidentally.) This package includes a suite of fast predictive modeling functions implemented by Microsoft Research, including:
- Linear (rxFastLinear) and logistic (rxLogisticRegression) model functions based on the Stochastic Dual Coordinate Ascent method;
- Classification/regression trees (rxFastTrees) and random forests (rxFastForest) based on FastRank, an efficient implementation of the MART gradient boosting algorithm;
- A neural network algorithm (rxNeuralNet) with support for custom, multilayer network topologies; and
- One-class anomaly detection (rxOneClassSvm) based on support vector machines.
As the function names suggest, the implementations are tuned for speed: most use multiple CPUs, and some will even use the GPU (if available). Not all of the implementations scale to unlimited data sizes, however; all but the linear and logistic regression routines are bound by available RAM.
If you want to give these routines a try, the Microsoft R Server Tiger Team has prepared a walkthrough analyzing the famous NYC Taxi data set. Once you have access to Microsoft R Server (or Client), this R script walks you through the process of:
- Loading the MicrosoftML package
- Importing the NYC Taxi Data from SQL Server (it comes preinstalled on the Data Science Virtual Machine)
- Splitting the data into a test set and a training set, with the binary value “tipped” (whether or not the driver was tipped) as the response
- Fitting several predictive models: logistic regression, linear model, fast forest, and neural network
- Making predictions on the test data
- Evaluating model performance by comparing AUC (area under the ROC curve)
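The first few steps above can be sketched roughly as follows. This is a minimal illustration, not the Tiger Team's script: it substitutes a small synthetic data frame for the NYC Taxi data (the columns `trip_distance` and `passenger_count` and the simulated tipping relationship are made up for this sketch), and it only fits models if MicrosoftML is installed. The names of the score columns returned by rxPredict are not assumed here; the sketch prints them so you can inspect them.

```r
# Synthetic stand-in for the taxi data (hypothetical columns, simulated values).
set.seed(42)
n <- 1000
taxi <- data.frame(
  trip_distance   = rexp(n, rate = 1 / 3),
  passenger_count = sample(1:4, n, replace = TRUE)
)
# Simulated binary response: longer trips are more likely to be tipped.
taxi$tipped <- runif(n) < plogis(-0.5 + 0.4 * taxi$trip_distance)

# Split into a training set (75%) and a test set (25%).
train_idx <- sample(n, size = 0.75 * n)
train <- taxi[train_idx, ]
test  <- taxi[-train_idx, ]

# Fit and score only when MicrosoftML is available.
if (requireNamespace("MicrosoftML", quietly = TRUE)) {
  library(MicrosoftML)
  form <- tipped ~ trip_distance + passenger_count

  logit_model <- rxLogisticRegression(form, data = train, type = "binary")
  trees_model <- rxFastTrees(form, data = train, type = "binary")

  # Keep the true label next to the predictions for later evaluation.
  logit_scores <- rxPredict(logit_model, data = test, extraVarsToWrite = "tipped")
  trees_scores <- rxPredict(trees_model, data = test, extraVarsToWrite = "tipped")

  # Inspect the returned columns rather than assuming their names.
  print(names(logit_scores))
  print(head(trees_scores))
}
```

The other model functions mentioned earlier take a formula and a data set in the same way, so adding them to a comparison like this is mostly a matter of repeating the fit-and-predict pair.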
The ROC curves are shown below. As you'd expect, the linear model performs poorly compared to the others, since it's being applied here to a binary variable.
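To make the AUC comparison concrete, here is a small base-R sketch of how AUC can be computed from predicted scores and true labels using the rank-based (Mann-Whitney) formulation. The function name `auc_from_scores` and the toy vectors are illustrative only; they are not part of MicrosoftML or of the walkthrough. Because AUC depends only on how the scores rank the cases, it can be computed for a linear model's raw outputs even though they are not probabilities.

```r
# AUC = probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative case (ties count half).
auc_from_scores <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), is.logical(labels))
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  stopifnot(n_pos > 0, n_neg > 0)
  r <- rank(scores)  # average ranks handle tied scores
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example with made-up scores and labels.
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.3, 0.1)
auc_from_scores(scores, labels)           # 8/9, about 0.889

# A monotone rescaling of the scores leaves the AUC unchanged.
auc_from_scores(10 * scores - 3, labels)  # still 8/9
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation, which is why it works as a single number for comparing models that produce different kinds of scores.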
To try it out yourself, follow the walkthrough linked below, which also provides instructions for running the logistic regression model in SQL Server Management Studio.
Microsoft R Server Tiger Team: Predicting NYC Taxi Tips using MicrosoftML