
A first look at rxBTrees


by Joseph Rickert

The gradient boosting machine, as developed by Friedman, Hastie, Tibshirani and others, has become an extremely successful algorithm for dealing with both classification and regression problems and is now an essential feature of any machine learning toolbox. R's gbm() function (gbm package) is a particularly well crafted implementation of the gradient boosting machine that served as a model for the rxBTrees() function, which was released last year as a feature of Revolution R Enterprise 7.3. You can think of rxBTrees() as a scaled-up version of gbm() that is designed to work with massive data sets in distributed environments.

The basic idea underlying the gbm is that weak learners may be combined in an additive, iterative fashion to produce highly accurate classifiers that are resistant to overfitting. While in theory any weak learner can be incorporated into a gbm, classification and regression trees, or decision trees as they are called in RRE, are particularly convenient (a short gbm() sketch follows the list below). As Kuhn and Johnson (p. 205) point out:

  1. They have the flexibility to be weak learners by simply restricting their depth
  2. Separate trees can be easily added together
  3. Trees can be generated very quickly
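
To make those properties concrete, here is a minimal, self-contained sketch of boosting depth-1 stumps with gbm() on simulated data; the toy data and parameter values are my own illustration, not part of the mortgage example discussed below.

library(gbm)

# Toy classification data: y depends (noisily) on x1 only
set.seed(42)
df <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
df$y <- as.integer(df$x1 + rnorm(1000) > 0)

# 100 boosting iterations, each adding one deliberately weak stump
fit <- gbm(y ~ x1 + x2, data = df,
           distribution = "bernoulli",
           n.trees = 100,          # number of weak learners added together
           interaction.depth = 1,  # stumps: quick to grow, easy to add
           shrinkage = 0.1)        # small steps guard against overfitting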

rxBTrees() scales gbm() by building the boosting algorithm around RRE's rxDTree() decision tree implementation. As I described in a previous post, this algorithm scales to large data sets by building trees on histograms, a summary of the data, rather than on the data itself. rxBTrees(), like rxDTree(), is a proper parallel external memory algorithm (PEMA). This means two things:

  1. It uses a “chunking” strategy to work with a block of data at a time that has been read from disk. Since there is no need to keep all of the data in memory at one time, rxBTrees() can build a gbm on arbitrarily large files.
  2. PEMAs are parallel algorithms built on top of RRE’s distributed computing infrastructure. This allows them to take full advantage of the opportunities for parallelism implicit in the underlying hardware by setting up a “compute context”. In the example below, the compute context is set to local parallel to take advantage of the cores on my PC. RRE also allows establishing compute contexts that enable PEMAs to run in parallel on clusters (a short sketch of this mechanism follows the list).
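
As a rough sketch of how compute contexts are switched (the constructor names below come from RevoScaleR, the package underlying RRE; treat the details as illustrative):

# Run subsequent PEMAs in parallel across the local cores
rxSetComputeContext(RxLocalParallel())

# ... calls to rxBTrees(), rxDTree(), etc. now use the parallel context ...

# Switch back to sequential execution
rxSetComputeContext(RxLocalSeq())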

The following function call fits an rxBTrees() model to a training subset of a mortgage data set. rxBTrees() takes many input parameters. Some are common to all RevoScaleR PEMAs, others are inherited from rxDTree(), and still others, such as nTree and learningRate, have direct counterparts in R's gbm() function. In this function call, I have explicitly called out the parameters that pertain to fitting a gbm model.

# Fit a Boosted Trees Model on the training data set
form <- formula(default ~ houseAge + creditScore + yearsEmploy + ccDebt)
 
BT_model <- rxBTrees(formula = form, 
                     data= "mdTrain",
                     lossFunction = "bernoulli",
                     minSplit = NULL,    # Min num of obs at node before a split is attempted
                     minBucket = NULL,   # Min num of obs at a terminal node
                     maxDepth = 1,       # 2^maxDepth = number of leaves on tree
                     cp = 0,             # Complexity parameter
                     maxSurrogate = 0,   # Max surrogate splits retained, 0 improves perf
                     useSurrogate = 2,   # Determines how surrogates are used
                     surrogateStyle = 0, # Sets surrogate algorithm
                     nTree = 100,        # Num of trees = num of boosting iterations
                     mTry = NULL,        # Num vars to sample as a split candidate 
                     sampRate = NULL,    # % of observations to sample for each tree
                                         # default = 0.632
                     importance = TRUE,  # Importance of predictors to be assessed
                     learningRate = 0.1, # Also known as shrinkage
                     maxNumBins = NULL,  # Max num bins in histograms
                                         # default: min(1001,max(sqrt(num of obs)))
                     verbose = 1,        # Amount of status information sent to the console
                     computeContext="RxLocalParallel",  # use all cores available on the PC
                     overwrite = TRUE)   # Allow xdf output file to be overwritten

Some especially noteworthy parameters are maxDepth, maxNumBins and computeContext. maxDepth accomplishes the same purpose as interaction.depth in the gbm() function, but they are not the same parameter. In gbm(), setting interaction.depth = 1 produces stump trees with two leaves each because the number of leaves equals interaction.depth + 1. In rxBTrees(), stumps are also produced by setting maxDepth = 1, but this is because 2^maxDepth gives the number of leaves. The arithmetic is spelled out in the short snippet below.
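A few lines of R make the difference in parameterization explicit:

# Number of leaves implied by each depth parameter
depth <- 1:4
data.frame(depth,
           gbm_leaves  = depth + 1,  # gbm(): interaction.depth + 1
           rxBT_leaves = 2^depth)    # rxBTrees(): 2^maxDepth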

maxNumBins determines the number of bins used in making histograms of the data. If you ever wanted to compare gbm() directly with rxBTrees() on a small data set, for example, setting maxNumBins to the number of points in the data set would mimic gbm(). Note, however, that this would saddle rxBTrees() with all of the overhead of computing histograms while receiving none of the benefits.
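Concretely, such a comparison might look like this hypothetical snippet, where smallData and the formula are placeholders of my own:

# One bin per observation: the histograms reproduce the raw data exactly,
# so the splits rxBTrees() finds should match what gbm() would find
smallFit <- rxBTrees(formula = default ~ creditScore + ccDebt,
                     data = smallData,
                     maxNumBins = nrow(smallData),
                     nTree = 100,
                     learningRate = 0.1)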

The computeContext parameter is the mechanism that connects a PEMA to the underlying hardware and enables functions such as rxBTrees() to run in parallel on various clusters using distributed data. Some of the details of how this happens are described in a previous post, but I hope to follow up this post with an example of rxBTrees() running on Hadoop in the not too distant future.
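To give a flavor of what that switch looks like, here is a hedged sketch using RevoScaleR's Hadoop compute context; the connection details are placeholders, not a working configuration:

# Point the same model at a Hadoop cluster; only the context changes
hadoopCC <- RxHadoopMR(sshUsername = "analyst",      # placeholder credentials
                       sshHostname = "cluster-head")
rxSetComputeContext(hadoopCC)
# an unchanged rxBTrees() call would now run as a distributed job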

The data set used to build the model above is available on Revolution Analytics' data set web page, both as an xdf file and as multiple csv files. The following command shows what the data looks like.

> rxGetInfo(mortData,getVarInfo=TRUE,numRows=3)

File name: C:\DATA\Mortgage Data\mortDefault\mortDefault2.xdf 
Number of observations: 1e+07 
Number of variables: 6 
Number of blocks: 20 
Compression type: none 
Variable information: 
Var 1: creditScore, Type: integer, Low/High: (432, 955)
Var 2: houseAge, Type: integer, Low/High: (0, 40)
Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
Var 5: year, Type: integer, Low/High: (2000, 2009)
Var 6: default, Type: integer, Low/High: (0, 1)
Data (3 rows starting with row 1):
  creditScore houseAge yearsEmploy ccDebt year default
1         615       10           5   2818 2000       0
2         780       34           5   3575 2000       0
3         735       12           1   3184 2000       0

In building the model, I divided the mortgage default data randomly into training and test data sets. Then I fit an rxBTrees() model on the training data and used the test data to compute the ROC curve shown below; a sketch of the scoring step follows.
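
The scoring step can be sketched with RevoScaleR's prediction and ROC helpers; the test-set file names, and the "_Pred" suffix on the predicted variable, are assumptions on my part:

# Score the held-out data with the fitted boosted trees model
rxPredict(BT_model, data = "mdTest", outData = "mdTestScored",
          overwrite = TRUE)

# Plot sensitivity against 1 - specificity for the scored test set
rxRocCurve(actualVarName = "default",
           predVarName = "default_Pred",
           data = "mdTestScored")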

[Figure: ROC curve for the rxBTrees() model, computed on the test data]

Along the way, the model fit also produced the out-of-bag error versus number of trees plot, which shows that for this data set one could probably get by with building only 70 or 80 trees.

The code I used to build the model is available here: Download Post_model. If you do not have Revolution R Enterprise available at your company, you can try out this model for yourself in the AWS cloud by taking advantage of Revolution Analytics' 14-day free trial.

Take the first step towards building a gbm on big data.

