
Is catboost the best gradient boosting R package?


Several R packages implementing gradient boosting methods are available. The three most popular ones are currently xgboost, catboost and lightgbm. I want to compare these three to find out which one performs best in its default mode, without tuning. These algorithms are not pure gradient boosting algorithms but combine gradient boosting with other useful techniques, such as the bagging that is used, for example, in random forests.


Algorithms and Installation (for R)

Algorithm versions

The algorithms are used in their most recent versions as of the first of March. The checkpoint R package is used to obtain the versions of the CRAN packages available on this date. For the others I took the versions available on GitHub on this date.
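A minimal sketch of how the CRAN package versions can be pinned with checkpoint (the exact snapshot date is an assumption based on "first of March"):

library(checkpoint)
# Point the CRAN mirror at the snapshot of the given date and install the
# packages used in the project in the versions from that date
checkpoint("2019-03-01")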

Parameter settings

I use the algorithms mainly with their default parameter settings, apart from the following adjustments:

Multi threading

All algorithms provide parallel execution on several CPU threads. I used 5 threads for each algorithm.
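Each package exposes the thread count under a different parameter name; a minimal sketch (parameter names as documented by the packages; how they are passed in the benchmark code may differ):

# Parameter lists fixing 5 threads for each package
xgb_params = list(nthread = 5)       # xgboost
lgb_params = list(num_threads = 5)   # lightgbm
cat_params = list(thread_count = 5)  # catboost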

Adjustments for categorical data

Unfortunately xgboost does not provide an automatic method for handling categorical input data. As a workaround, I converted categorical variables to dummy variables before training. catboost and lightgbm do provide methods for categorical input data.
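A minimal sketch of such a dummy-variable conversion with base R's model.matrix (the actual conversion in the benchmark code may differ):

library(xgboost)

# Toy data with one numeric and one categorical feature
df = data.frame(
  x_num = rnorm(100),
  x_cat = factor(sample(c("a", "b", "c"), 100, replace = TRUE)),
  y     = rnorm(100)
)

# model.matrix() expands the factor into one 0/1 dummy column per level
X = model.matrix(y ~ . - 1, data = df)
fit = xgboost(data = X, label = df$y, nrounds = 100, verbose = 0)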

Datasets

I took a collection of 31 regression datasets from different domains, which I collected from various sources such as OpenML. The datasets are rather small. They can be found via OpenML under the tag “OpenML-Reg19”:

library(OpenML)
# List all OpenML tasks tagged "OpenML-Reg19"
tasks = listOMLTasks(tag = "OpenML-Reg19")
# Drop three problematic datasets (see below)
tasks = tasks[!(tasks$name %in% c("aloi", "BNG(satellite_image)", "black_friday")),]  

I excluded the datasets “aloi”, “BNG(satellite_image)” and “black_friday” because the calculations took a very long time or they produced errors with some of the algorithms.

Evaluation

I used 10-times repeated 5-fold cross-validation for each algorithm to estimate its performance on the datasets. As measures I use R-Squared and the non-parametric Spearman's rho, which only evaluates whether the observations are predicted in the correct order.
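A sketch of such a resampling setup in mlr, using a built-in task as a stand-in (an assumption for illustration; the actual benchmark code is linked in the Code section below):

library(mlr)

# 10-times repeated 5-fold cross-validation
rdesc = makeResampleDesc("RepCV", reps = 10, folds = 5)

# rsq: R-Squared = 1 - sum((y - pred)^2) / sum((y - mean(y))^2)
# spearmanrho: rank correlation between predictions and true values
lrn = makeLearner("regr.ranger")
res = resample(lrn, bh.task, rdesc, measures = list(rsq, spearmanrho))
res$aggr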

Code

The code is available on GitHub: OpenML-bench.

Results

The results for R-Squared can be seen here:

[Plot: R-Squared per dataset and algorithm]

Of the three gradient boosting algorithms, catboost performs best in general and is outperformed only in very few cases by the other algorithms. The xgboost_best version of xgboost usually provides better results than the default parameter settings (xgboost_def). In some cases, especially when R-Squared is negative (which means that the algorithm is worse than simply predicting the average), random forest (ranger) performs better. Possibly outliers in these datasets lead to the worse performance; this could be examined further.

The results for Spearman's rho are similar:

[Plot: Spearman's rho per dataset and algorithm]

All algorithms achieve values above 0 on every dataset, which means that they provide a better ordering than random predictions. On most datasets catboost provides the best results. On some datasets lightgbm provides clearly worse results than the other algorithms.

Calculation time

In the following we can see the average calculation times:

[Plot: average calculation time per algorithm]

The datasets are rather small, so the calculation times are also rather short (below 30 seconds on 5 CPU threads). catboost and xgboost_best have the longest runtimes; lightgbm is the fastest.

Further work

Results for classification will follow in a future blog post. If you have comments on the analysis, know better defaults for the packages, or can suggest other comparable packages, let me know. Some of the algorithms are still under development, so results may change.

Annex

I also examined liquidSVM, but it produced bad results on several datasets, which differed from results I had obtained some months earlier. So I excluded it from the analysis for the moment.
