
Tuning xgboost in R: Part I

By Gabriel Vasconcelos
Before we begin, I would like to thank Anuj for kindly including our blog in his list of the top 40 R blogs! Check out the full list at his page, FeedSpot!

Introduction

Tuning a Boosting algorithm for the first time can be a very confusing task. There are many parameters to choose, each affecting the results in a different way, and the best choice often depends on the data. Every time I get a new dataset I learn something new. A good understanding of classification and regression trees (CART) is also helpful, because we will be boosting trees; you can start here if you have no idea what a CART is.

My favourite Boosting package is xgboost, which will be used in all examples below. Before going to the data, let's talk about some of the parameters I believe to be the most important; each one is explored in its own section further down. These parameters mostly control how closely the model is allowed to fit the data. We want a fit that captures the structure of the data, but only the real structure; in other words, we do not want the model to fit noise, because that translates into poor out-of-sample performance.
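
For reference, here is a short summary of those parameters as they will appear in the xgboost calls below. The one-line descriptions are mine, following the xgboost documentation; the values shown are the "standard model" defined in the next section.

# Summary sketch of the tuning parameters explored in this post,
# set to the values of the standard model used in the experiments.
params = list(
  eta = 0.1,                # shrinkage: weight given to each new tree in the ensemble
  colsample_bylevel = 2/3,  # fraction of variables sampled at each tree level
  max_depth = 6,            # maximum depth allowed for each tree
  subsample = 0.5,          # fraction of observations sampled to grow each tree
  min_child_weight = 1      # minimum sum of instance weights required in a child node
)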

Example

The data will be generated from the following equation:

y_i = \left( \sum_{j=1}^{k} x_{ij}\,\beta_j \right)^2 + e_i

where x_{ij} \sim N(0,1), \beta_j = (-1)^j and e_i \sim N(0,1). The number of variables, k, will be set to 10 and the number of instances to 1000.

The experiment will be to change each Boosting parameter while keeping all the others constant, to try to isolate its effect. The standard model will have the following parameters:

eta = 0.1, colsample_bylevel = 2/3, max_depth = 6, sub_sample = 0.5, min_child_weight = 1

I will change each of these parameters to the candidate values in the code. We will analyse the convergence and the root mean squared error (RMSE) in the test sample. The code below prepares everything to run the models.

library(xgboost)
library(ggplot2)
library(reshape2)
library(Ecdat)

set.seed(1)
N = 1000
k = 10
x = matrix(rnorm(N*k),N,k)
b = (-1)^(1:k)
yaux=(x%*%b)^2
e = rnorm(N)
y=yaux+e

# = select train and test indexes = #
train=sample(1:N,800)
test=setdiff(1:N,train)

# = parameters = #
# = eta candidates = #
eta=c(0.05,0.1,0.2,0.5,1)
# = colsample_bylevel candidates = #
cs=c(1/3,2/3,1)
# = max_depth candidates = #
md=c(2,4,6,10)
# = sub_sample candidates = #
ss=c(0.25,0.5,0.75,1)

# = standard model uses the 2nd value of eta, cs and ss and the 3rd value of md = #
standard=c(2,2,3,2)

# = train and test data = #
xtrain = x[train,]
ytrain = y[train]
xtest = x[test,]
ytest = y[test]

eta

We will start by analysing the eta parameter. The code below estimates Boosting models for each candidate eta. First we have the convergence plot, which shows that bigger values of eta converge faster. However, the test RMSE just below the plot shows that faster convergence does not translate into good out-of-sample performance. Smaller values of eta such as 0.05 and 0.1 produce the smallest errors. My opinion is that in this case eta = 0.1 gives us a good out-of-sample performance with acceptable convergence speed. Note that the results for eta = 0.5 and eta = 1 are much worse than the others.

set.seed(1)
conv_eta = matrix(NA,500,length(eta))
pred_eta = matrix(NA,length(test), length(eta))
colnames(conv_eta) = colnames(pred_eta) = eta
for(i in 1:length(eta)){
  params=list(eta = eta[i], colsample_bylevel=cs[standard[2]],
              subsample = ss[standard[4]], max_depth = md[standard[3]],
              min_child_weight = 1)
  xgb=xgboost(xtrain, label = ytrain, nrounds = 500, params = params)
  conv_eta[,i] = xgb$evaluation_log$train_rmse
  pred_eta[,i] = predict(xgb, xtest)
}

conv_eta = data.frame(iter=1:500, conv_eta)
conv_eta = melt(conv_eta, id.vars = "iter")
ggplot(data = conv_eta) + geom_line(aes(x = iter, y = value, color = variable))

(RMSE_eta = sqrt(colMeans((ytest-pred_eta)^2)))

##      0.05       0.1       0.2       0.5         1
##  9.964462 10.052367 10.223738 13.691344 20.929690
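
Since smaller values of eta need more boosting rounds, a natural complement is to let cross-validation choose the number of rounds for a given eta. The sketch below uses xgb.cv with early stopping; the nfold and early_stopping_rounds values are illustrative choices, not part of the original experiment.

set.seed(1)
# Sketch: 5-fold cross-validation with early stopping for eta = 0.1.
# Training stops when the CV test RMSE has not improved for 20 rounds.
cv = xgb.cv(params = list(eta = 0.1, colsample_bylevel = 2/3, subsample = 0.5,
                          max_depth = 6, min_child_weight = 1),
            data = xgb.DMatrix(xtrain, label = ytrain),
            nrounds = 500, nfold = 5,
            early_stopping_rounds = 20, verbose = 0)
cv$best_iteration  # suggested number of boosting rounds for this eta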

colsample_bylevel

The next parameter controls the fraction of variables (features) tested in each new node. Recall that if we use 1, all variables are tested; values smaller than 1 test only the corresponding fraction of variables. The convergence plot shows that the model is much less sensitive to colsample_bylevel: the curves for smaller values are only slightly above the curves for bigger values. The most accurate model here is the one that samples 2/3 of the variables at each level. This result may change with different types of data. For example, the data we generated used the same distribution for all explanatory variables, and the betas gave the variables similar weights, which makes the column sampling less relevant.

set.seed(1)
conv_cs = matrix(NA,500,length(cs))
pred_cs = matrix(NA,length(test), length(cs))
colnames(conv_cs) = colnames(pred_cs) = cs
for(i in 1:length(cs)){
  params = list(eta = eta[standard[1]], colsample_bylevel = cs[i],
              subsample = ss[standard[4]], max_depth = md[standard[3]],
              min_child_weight = 1)
  xgb=xgboost(xtrain, label = ytrain,nrounds = 500, params = params)
  conv_cs[,i] = xgb$evaluation_log$train_rmse
  pred_cs[,i] = predict(xgb, xtest)
}

conv_cs = data.frame(iter=1:500, conv_cs)
conv_cs = melt(conv_cs, id.vars = "iter")
ggplot(data = conv_cs) + geom_line(aes(x = iter, y = value, color = variable))

(RMSE_cs = sqrt(colMeans((ytest-pred_cs)^2)))

## 0.333333333333333 0.666666666666667                 1
##          10.29836          10.05237          10.20938
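
One way to check the claim that the simulated variables carry similar weights (which makes column sampling less relevant here) is to look at the feature importances of a fitted model. A quick sketch, assuming it is run right after the loop above so that xgb still holds the last fitted model:

# Features appear as f0, f1, ... because xtrain has no column names.
# Roughly comparable gain across the ten features would support the point above.
xgb.importance(model = xgb)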

max_depth

The third parameter is the maximum depth allowed for each tree. Naturally, the model converges faster if we grow bigger trees (see the figure below). However, the best out-of-sample performance is for max_depth = 4. Remember that bigger trees are more likely to result in over-fitting. In my experience there is no need to use values bigger than 4 to 6, but there may be exceptions.

set.seed(1)
conv_md=matrix(NA,500,length(md))
pred_md=matrix(NA,length(test),length(md))
colnames(conv_md)=colnames(pred_md)=md
for(i in 1:length(md)){
  params=list(eta=eta[standard[1]],colsample_bylevel=cs[standard[2]],
              subsample=ss[standard[4]],max_depth=md[i],
              min_child_weight=1)
  xgb=xgboost(xtrain, label = ytrain,nrounds = 500,params=params)
  conv_md[,i] = xgb$evaluation_log$train_rmse
  pred_md[,i] = predict(xgb, xtest)
}

conv_md=data.frame(iter=1:500,conv_md)
conv_md=melt(conv_md,id.vars = "iter")
ggplot(data=conv_md)+geom_line(aes(x=iter,y=value,color=variable))

(RMSE_md=sqrt(colMeans((ytest-pred_md)^2)))

##         2         4         6        10
## 12.502733  9.374257 10.134965 10.100691
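
To watch the over-fitting happen during training, one can pass the test set as a watchlist to xgb.train, which logs both train and test RMSE at every round. A minimal sketch for the deepest tree (max_depth = 10), using the objects created in the setup code:

dtrain = xgb.DMatrix(xtrain, label = ytrain)
dtest  = xgb.DMatrix(xtest, label = ytest)
# Both errors are recorded each round; the test RMSE flattens out (or worsens)
# while the train RMSE keeps falling.
deep = xgb.train(params = list(eta = 0.1, colsample_bylevel = 2/3, subsample = 0.5,
                               max_depth = 10, min_child_weight = 1),
                 data = dtrain, nrounds = 500,
                 watchlist = list(train = dtrain, test = dtest),
                 verbose = 0)
tail(deep$evaluation_log)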

sub_sample

The next parameter determines whether we are estimating a Boosting or a Stochastic Boosting model. Smaller values result in bigger in-sample errors, but may produce more robust out-of-sample estimates. However, as mentioned before, the biggest improvements come from samples that have many variables compared to the number of observations, or from very noisy data. That is not the case here. Nevertheless, there is some improvement compared to the deterministic case, which I believe is mostly caused by outliers in the response variable (boxplot the variable y to see).
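
A quick check of those outliers, making the boxplot suggestion above concrete:

# The squared term in the data-generating equation produces a long right tail.
boxplot(y)
summary(as.numeric(y))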

set.seed(1)
conv_ss=matrix(NA,500,length(ss))
pred_ss=matrix(NA,length(test),length(ss))
colnames(conv_ss)=colnames(pred_ss)=ss
for(i in 1:length(ss)){
  params=list(eta=eta[standard[1]],colsample_bylevel=cs[standard[2]],
              subsample=ss[i],max_depth=md[standard[3]],
              min_child_weight=1)
  xgb=xgboost(xtrain, label = ytrain,nrounds = 500,params=params)
  conv_ss[,i] = xgb$evaluation_log$train_rmse
  pred_ss[,i] = predict(xgb, xtest)
}

conv_ss=data.frame(iter=1:500,conv_ss)
conv_ss=melt(conv_ss,id.vars = "iter")
ggplot(data=conv_ss)+geom_line(aes(x=iter,y=value,color=variable))

(RMSE_ss=sqrt(colMeans((ytest-pred_ss)^2)))

##      0.25       0.5      0.75         1
##  9.731273 10.052367 11.119535 11.233855

The main conclusion here is that there is no single rule for tuning Boosting models. The best way is to test several configurations, for example with a simple grid search like the one sketched below.
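
Here is one way to test several configurations: a brute-force grid over the candidate values used above, scored by the test RMSE. This is only a sketch of the idea; it fits 240 models, so it is slow, and in practice cross-validation on the training set would be a safer way to score each combination.

set.seed(1)
# Every combination of the candidate values defined earlier.
grid = expand.grid(eta = eta, colsample_bylevel = cs, max_depth = md, subsample = ss)
grid$rmse = NA
for(i in 1:nrow(grid)){
  params = list(eta = grid$eta[i], colsample_bylevel = grid$colsample_bylevel[i],
                max_depth = grid$max_depth[i], subsample = grid$subsample[i],
                min_child_weight = 1)
  xgb = xgboost(xtrain, label = ytrain, nrounds = 500, params = params, verbose = 0)
  grid$rmse[i] = sqrt(mean((ytest - predict(xgb, xtest))^2))
}
head(grid[order(grid$rmse), ])  # best configurations first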

Stay tuned if you liked this article; we will be talking more about Boosting soon.

References

(Boosting) Friedman, Jerome H. "Greedy function approximation: a gradient boosting machine." Annals of Statistics (2001): 1189-1232.

(Stochastic Boosting) Friedman, Jerome H. "Stochastic gradient boosting." Computational Statistics & Data Analysis 38.4 (2002): 367-378.

(CART) Breiman, Leo. Classification and Regression Trees. Routledge, 2017.
