
Improving Adaboosting with decision stumps in R

[This article was first published on R – My Data Atelier, and kindly contributed to R-bloggers.]

Adaboosting is proven to be one of the most effective class prediction algorithms. It consists mainly of an ensemble of simpler models (known as "weak learners") that, although not very accurate individually, perform very well when combined.

The process by which these weak learners are combined is, however, more complex than simply averaging results. Very briefly, the Adaboosting training process can be described as follows:

For each weak learner:

1) Train the weak learner so that the weighted classification error is minimised.

2) Update the weights, so that correctly classified cases have their weights reduced and misclassified cases have their weights increased.

3) Determine the weak learner's weight, i.e., the total contribution of the weak learner's result to the overall score. This is known as alpha and is calculated as 0.5 * ln((1 - error.rate) / error.rate).

As the weights are updated on each iteration, each weak learner tends to focus more on the cases that were misclassified in previous iterations.
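To make this loop concrete, here is a rough sketch of it in plain R (only an illustration, not ada's actual implementation: the function name adaboost_sketch is made up, and it assumes the class y is coded as -1/+1 with the positive class being the second factor level):

library(rpart)

# Minimal sketch of the Adaboosting loop described above.
# Note: alpha (step 3) is computed before the weight update (step 2)
# because the update uses it. A small epsilon could be added to guard
# against a zero error rate.
adaboost_sketch <- function(formula, data, y, iter = 10) {
  n      <- nrow(data)
  w      <- rep(1 / n, n)          # start with uniform case weights
  stumps <- vector("list", iter)
  alphas <- numeric(iter)

  for (m in seq_len(iter)) {
    # 1) Fit a depth-1 tree (a stump) minimising the weighted error
    stump <- rpart(formula, data = data, weights = w,
                   control = rpart.control(maxdepth = 1, cp = -1,
                                           minsplit = 0, xval = 0))
    pred <- ifelse(predict(stump, data)[, 2] > 0.5, 1, -1)

    # 3) The stump's contribution: alpha = 0.5 * ln((1 - error.rate) / error.rate)
    err   <- sum(w * (pred != y)) / sum(w)
    alpha <- 0.5 * log((1 - err) / err)

    # 2) Re-weight: misclassified cases gain weight, correct ones lose it
    w <- w * exp(-alpha * y * pred)
    w <- w / sum(w)

    stumps[[m]] <- stump
    alphas[m]   <- alpha
  }
  list(stumps = stumps, alphas = alphas)
}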

For further information about the Adaboosting algorithm, this article by Schapire provides very useful high-level guidance: http://rob.schapire.net/papers/explaining-adaboost.pdf

Decision stumps as weak learners

The most common weak learner used in Adaboosting is known as a decision stump and is basically a decision tree of depth 1, i.e., a model that returns an output based on a single condition, which can be summarised as "if (condition) then A else B".
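For instance, a single stump can be fitted with rpart by restricting the tree depth to 1 (the same control settings that appear later in this post). The snippet below uses the built-in iris data purely as an illustration:

library(rpart)

# A decision stump is just a decision tree with a single split:
# "if (condition) then A else B"
iris$isSetosa <- factor(ifelse(iris$Species == "setosa", "Yes", "No"))

stump <- rpart(isSetosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = iris,
               control = rpart.control(maxdepth = 1, cp = -1,
                                       minsplit = 0, xval = 0))

print(stump)  # shows the single splitting condition, e.g. Petal.Length < 2.45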

ada package in R

Although the implementation provides very good results in terms of model performance, the "ada" package has two main problems:

1) The fitted object is extremely large (its size grows with the bag fraction, the sample size and the number of iterations).

2) Prediction over new data is slow.

Introducing adaStump

Considering that each stump consists of just a condition, two probability estimates (one for when the condition is TRUE and one for when it is FALSE) and a multiplier (alpha), I thought the ensemble could be stored and evaluated much more efficiently, both in terms of size and timing, even in plain R.
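To make this concrete, such an ensemble can be thought of as a small table with one row per stump, which can be scored with nothing more than vectorised comparisons. The sketch below is only a hypothetical illustration of that idea (the table, its values and the simplified way the contributions are combined are made up; this is not adaStump's internal format):

# Hypothetical compact representation: one row per stump, with its
# condition, the two probability estimates and its multiplier (alpha).
stumps <- data.frame(
  variable   = c("x2ybr", "ybar", "x2ybr"),
  threshold  = c(2.5, 7.5, 2.5),
  prob_true  = c(0.80, 0.65, 0.75),   # estimate when the condition holds
  prob_false = c(0.10, 0.30, 0.15),   # estimate when it does not
  alpha      = c(0.9, 0.4, 0.6)
)

# Scoring new data only requires evaluating each condition and adding up
# the alpha-weighted contributions (the final transformation into a
# probability is omitted here for simplicity).
score_stumps <- function(stumps, newdata) {
  score <- numeric(nrow(newdata))
  for (i in seq_len(nrow(stumps))) {
    cond  <- newdata[[stumps$variable[i]]] >= stumps$threshold[i]
    score <- score + stumps$alpha[i] *
      ifelse(cond, stumps$prob_true[i], stumps$prob_false[i])
  }
  score
}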

For that reason I developed adaStump, which is a package that takes advantage of ada’s excellent training process implementation, but creates a smaller output object and generates faster predictions.

An Example

Before running the example, you need to install the package from my Github repo. In order to do that, you need to have devtools installed:


install.packages("devtools")
devtools::install_github("nivangio/adaStump")

After the package is installed, let's start with the example! In this case, the letters dataset is loaded, its variable names are assigned and the variable isA is created, which takes the value "Yes" if the letter is an A and "No" otherwise. That will be the class we will predict.


library(adaStump)

#Load Letters Dataset

letters.data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data", header = F)

names(letters.data) <- c("lettr", "xbox", "ybox", "width",
                         "high", "onpix", "xbar", "ybar",
                         "x2bar", "y2bar", "xybar", "x2ybr",
                         "xy2br", "xege", "xegvy", "yege", "yegvx")

#Create a binary class

letters.data$isA <- ifelse(letters.data$lettr == "A","Yes","No")

#Create model formula. Exclude letter var for obvious reasons

antec <- paste0(setdiff(names(letters.data),c("isA","lettr")), collapse = "+")

fla <- as.formula(paste("isA", antec, sep = "~"))

Now we are going to create two identical models. As AdaBoosting is usually run with bagging, the bag.frac parameter is set to 1 so that all cases are included. Since what we want to demonstrate is that we can get the same results with much less memory usage and in less time, we need to ensure that both models are identical.


#ada model fit passing the corresponding control function to create Stumps

fit_with.ada <- ada(formula = fla, data = letters.data, type = "real",
                    control = rpart.control(maxdepth = 1, cp = -1,
                                            minsplit = 0, xval = 0),
                    iter = 40, nu = 0.05, bag.frac = 1)

#adaStump

fit_with.adaStump <- adaStump(formula = fla, data = letters.data,
                              type = "real", iter = 40, nu = 0.05, bag.frac = 1)

Let us take a look now at the object sizes:

> format(object.size(fit_with.ada), units = "KB")
[1] "56006.3 Kb"
> format(object.size(fit_with.adaStump), units = "KB")
[1] "4.7 Kb"

Looking at these numbers, we can definitely say that the first problem is solved. Moreover, the size of an ada object depends on the bag fraction, the sample size and the number of iterations, while an adaStump object depends only on the number of iterations. There is almost no scenario where the size of the adaStump object could become a resource problem, even on old personal computers. Now, let's see if the new model can also predict more quickly:

 
> system.time(predictions_ada <- predict(fit_with.ada, letters.data, type = "prob"))
   user  system elapsed
  1.100   0.000   1.114
> system.time(predictions_adaStump <- predict(fit_with.adaStump, letters.data))
   user  system elapsed
   0.42    0.00    0.42

Indeed 🙂. Let us check whether the results are the same. Unlike ada, adaStump's prediction method returns only the probability estimate of the positive class ("Yes" in this example). Its complement can easily be calculated by subtracting the vector from 1.
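For example, the probability of the negative class can be recovered with (prob_negative is just an illustrative name):

# Complement of the returned probability vector
prob_negative <- 1 - predictions_adaStump

Back to the comparison with ada: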

 > summary(abs(predictions_ada[,2] - predictions_adaStump))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.000e+00 0.000e+00 0.000e+00 8.391e-18 1.041e-17 1.110e-16 

Although there is some difference in some cases, it is minimal. We can now say that we can produce almost the same outcome using well under 1% of the memory and more than 2.5 times faster. However, we can do better.

AdaBoosting usually generates several stumps based on the same condition. In this case, for example, by accessing the model frame (fit_with.adaStump$model) we can see that the condition x2ybr >= 2.5 is repeated constantly. Of course, all these rows can be collapsed so that the condition is evaluated only once; for that purpose you can use pruneTree().
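Conceptually, the reduction amounts to grouping the rows of the stump table that share the same condition and pre-aggregating their contributions, so that each condition is evaluated only once at prediction time. Here is a hypothetical sketch of that idea, reusing the illustrative stumps table from earlier (this is not pruneTree's actual implementation):

# Collapse stumps that share the same condition so that each condition
# is evaluated a single time when predicting.
collapse_stumps <- function(stumps) {
  stumps$contrib_true  <- stumps$alpha * stumps$prob_true
  stumps$contrib_false <- stumps$alpha * stumps$prob_false
  aggregate(cbind(contrib_true, contrib_false) ~ variable + threshold,
            data = stumps, FUN = sum)
}

collapse_stumps(stumps)  # the two "x2ybr >= 2.5" rows are merged into one

Applying the actual pruneTree() to our fitted model: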


> fit_with.prunedadaStump <- pruneTree(fit_with.adaStump)
> system.time(predictions_prunedadaStump <- predict(fit_with.prunedadaStump, letters.data))
   user  system elapsed
  0.124   0.000   0.121

Now it is more than 9 times faster than the original ada prediction. As a final check, let us see whether the results are still the same:

 > summary(abs(predictions_ada[,2] - predictions_prunedadaStump))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
6.489e-11 3.317e-10 7.411e-10 1.194e-09 8.965e-10 9.125e-09 

Although the difference between the two methods is much greater in this case (there is a pre-calculation step where some decimals are discarded), it is still acceptable for most use cases. So, whether or not to apply the reduction is a decision based on the level of precision needed.

You can download the full script used in this post here.

Final considerations

Some observations before closing this post: bear in mind that this is a first release, and I will be completing and crossing out items of its to-do list over time.

As usual, feel free to drop me a line with comments, observations, criticism, etc.

