Gradient-Boosting anything (alert: high performance)

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers.]

We’ve always been told that decision trees are the best base learners for Gradient Boosting. I’ve always wanted to see for myself. AdaBoostClassifier works well, but is relatively slow (by my own standards). A few days ago, I noticed that my Cython implementation of LSBoost in the Python package mlsauce was already quite generic (I had never noticed before), and I decided to adapt it to any machine learning model with fit and predict methods. It’s worth mentioning that only regression algorithms are accepted as base learners, and classification is regression-based. The results are promising; I’ll let you see for yourself below. All the algorithms, including xgboost and RandomForest, are used with their default hyperparameters, which means there is still room for improvement.
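
To make "classification is regression-based" more concrete, here is a minimal sketch of least-squares boosting with an arbitrary regressor as base learner. It is not mlsauce's Cython implementation: the names `SketchBoostingClassifier` and `RidgeLearner`, the one-hot encoding of the labels, and the hyperparameter names are all assumptions made for illustration.

```python
import copy

import numpy as np


class RidgeLearner:
    """Hypothetical stand-in for any regressor exposing fit/predict."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # Closed-form ridge solution with an unpenalized intercept column
        Xb = np.column_stack([np.ones(len(X)), X])
        penalty = self.alpha * np.eye(Xb.shape[1])
        penalty[0, 0] = 0.0
        self.coef_ = np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)
        return self

    def predict(self, X):
        return np.column_stack([np.ones(len(X)), X]) @ self.coef_


class SketchBoostingClassifier:
    """Illustrative least-squares boosting on one-hot encoded labels."""

    def __init__(self, base_learner, n_estimators=50, learning_rate=0.1):
        self.base_learner = base_learner
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        self.classes_ = np.unique(y)
        # One regression target per class: 1.0 if the sample belongs to it
        Y = (np.asarray(y)[:, None] == self.classes_[None, :]).astype(float)
        self.init_ = Y.mean(axis=0)  # start from the class frequencies
        F = np.tile(self.init_, (len(X), 1))
        self.learners_ = []
        for _ in range(self.n_estimators):
            residuals = Y - F  # what the current ensemble still gets wrong
            stage = []
            for k in range(Y.shape[1]):
                learner = copy.deepcopy(self.base_learner).fit(X, residuals[:, k])
                F[:, k] += self.learning_rate * learner.predict(X)
                stage.append(learner)
            self.learners_.append(stage)
        return self

    def decision_function(self, X):
        X = np.asarray(X, dtype=float)
        F = np.tile(self.init_, (len(X), 1))
        for stage in self.learners_:
            for k, learner in enumerate(stage):
                F[:, k] += self.learning_rate * learner.predict(X)
        return F

    def predict(self, X):
        # The class with the largest boosted regression score wins
        return self.classes_[np.argmax(self.decision_function(X), axis=1)]
```

Because the sketch only calls `fit` and `predict` (and relies on `fit` returning the fitted object, as scikit-learn estimators do), a scikit-learn regressor could in principle be passed instead of `RidgeLearner`.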

Install mlsauce (version 0.20.3) from GitHub:

!pip install git+https://github.com/Techtonique/mlsauce.git --verbose --upgrade --no-cache-dir

import pandas as pd
import mlsauce as ms
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.model_selection import train_test_split
from time import time

# dataset loaders, processed in this order
load_datasets = [load_breast_cancer, load_wine, load_iris]

for load_dataset in load_datasets:

    data = load_dataset()
    X = pd.DataFrame(data.data, columns=data.feature_names)
    y = data.target

    # hold out 20% of the observations for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=13)

    clf = ms.LazyBoostingClassifier(verbose=0, ignore_warnings=True,
                                    custom_metric=None, preprocess=False)

    # fit all the candidate models and time the whole run
    start = time()
    models, predictions = clf.fit(X_train, X_test, y_train, y_test)
    print(f"\nElapsed: {time() - start} seconds\n")

    # leaderboard with one row per model (display is available in notebooks)
    display(models)


2it [00:01,  1.52it/s]
100%|██████████| 30/30 [00:21<00:00,  1.38it/s]


Elapsed: 23.019137859344482 seconds
Results on the breast cancer dataset:

Model                                       Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
GenericBooster(LinearRegression)                0.99               0.99     0.99      0.99        0.35
GenericBooster(Ridge)                           0.99               0.99     0.99      0.99        0.27
GenericBooster(RidgeCV)                         0.99               0.99     0.99      0.99        1.07
GenericBooster(TransformedTargetRegressor)      0.99               0.99     0.99      0.99        0.40
GenericBooster(KernelRidge)                     0.97               0.96     0.96      0.97        2.05
XGBClassifier                                   0.96               0.96     0.96      0.96        0.91
GenericBooster(ExtraTreeRegressor)              0.94               0.94     0.94      0.94        0.25
RandomForestClassifier                          0.92               0.93     0.93      0.92        0.40
GenericBooster(RANSACRegressor)                 0.90               0.86     0.86      0.90       15.22
GenericBooster(DecisionTreeRegressor)           0.87               0.88     0.88      0.87        0.98
GenericBooster(KNeighborsRegressor)             0.87               0.89     0.89      0.87        0.49
GenericBooster(ElasticNet)                      0.85               0.76     0.76      0.84        0.10
GenericBooster(Lasso)                           0.82               0.71     0.71      0.79        0.09
GenericBooster(LassoLars)                       0.82               0.71     0.71      0.79        0.10
GenericBooster(DummyRegressor)                  0.68               0.50     0.50      0.56        0.01

2it [00:00,  8.29it/s]
100%|██████████| 30/30 [00:15<00:00,  1.92it/s]


Elapsed: 15.911818265914917 seconds
Results on the wine dataset:

Model                                       Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
RandomForestClassifier                          1.00               1.00     None      1.00        0.18
GenericBooster(ExtraTreeRegressor)              1.00               1.00     None      1.00        0.16
GenericBooster(KernelRidge)                     1.00               1.00     None      1.00        0.38
GenericBooster(LinearRegression)                1.00               1.00     None      1.00        0.23
GenericBooster(Ridge)                           1.00               1.00     None      1.00        0.17
GenericBooster(RidgeCV)                         1.00               1.00     None      1.00        0.24
GenericBooster(TransformedTargetRegressor)      1.00               1.00     None      1.00        0.26
XGBClassifier                                   0.97               0.96     None      0.97        0.06
GenericBooster(Lars)                            0.94               0.94     None      0.95        0.99
GenericBooster(DecisionTreeRegressor)           0.92               0.92     None      0.92        0.23
GenericBooster(KNeighborsRegressor)             0.92               0.93     None      0.92        0.21
GenericBooster(RANSACRegressor)                 0.81               0.81     None      0.80       12.63
GenericBooster(ElasticNet)                      0.61               0.53     None      0.53        0.04
GenericBooster(DummyRegressor)                  0.42               0.33     None      0.25        0.01
GenericBooster(Lasso)                           0.42               0.33     None      0.25        0.02
GenericBooster(LassoLars)                       0.42               0.33     None      0.25        0.01

2it [00:00,  5.14it/s]
100%|██████████| 30/30 [00:15<00:00,  1.92it/s]


Elapsed: 16.0275661945343 seconds
Results on the iris dataset:

Model                                       Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
GenericBooster(Ridge)                           1.00               1.00     None      1.00        0.23
GenericBooster(RidgeCV)                         1.00               1.00     None      1.00        0.25
RandomForestClassifier                          0.97               0.97     None      0.97        0.26
XGBClassifier                                   0.97               0.97     None      0.97        0.12
GenericBooster(DecisionTreeRegressor)           0.97               0.97     None      0.97        0.27
GenericBooster(ExtraTreeRegressor)              0.97               0.97     None      0.97        0.22
GenericBooster(LinearRegression)                0.97               0.97     None      0.97        0.15
GenericBooster(TransformedTargetRegressor)      0.97               0.97     None      0.97        0.37
GenericBooster(KNeighborsRegressor)             0.93               0.95     None      0.93        1.52
GenericBooster(KernelRidge)                     0.87               0.83     None      0.85        0.63
GenericBooster(RANSACRegressor)                 0.63               0.59     None      0.61       10.86
GenericBooster(Lars)                            0.50               0.46     None      0.48        0.99
GenericBooster(DummyRegressor)                  0.27               0.33     None      0.11        0.01
GenericBooster(ElasticNet)                      0.27               0.33     None      0.11        0.01
GenericBooster(Lasso)                           0.27               0.33     None      0.11        0.01
GenericBooster(LassoLars)                       0.27               0.33     None      0.11        0.01
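
The bottom rows of the last table (accuracy 0.27, balanced accuracy 0.33, F1 score 0.11) are what a classifier that always predicts the same class would obtain on three classes. The sketch below reproduces these numbers by hand. It rests on assumptions that are not stated in the post: a hypothetical 30-sample test set split 8/11/11 across the classes, and an F1 score averaged with class-support weights. The function name `classification_metrics` is also made up for this illustration.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return (accuracy, balanced accuracy, support-weighted F1)."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    recalls, weighted_f1 = [], 0.0
    for label, n_label in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_predicted = sum(p == label for p in y_pred)
        recall = tp / n_label
        precision = tp / n_predicted if n_predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        weighted_f1 += f1 * n_label / n  # weight each class by its support

    # Balanced accuracy is the unweighted mean of the per-class recalls
    balanced_accuracy = sum(recalls) / len(recalls)
    return accuracy, balanced_accuracy, weighted_f1


# Hypothetical test set: 8, 11 and 11 samples in classes 0, 1 and 2
y_true = [0] * 8 + [1] * 11 + [2] * 11
y_pred = [0] * 30  # constant prediction, whatever the input

acc, bal_acc, f1 = classification_metrics(y_true, y_pred)
print(round(acc, 2), round(bal_acc, 2), round(f1, 2))  # 0.27 0.33 0.11
```

Under these assumptions, a balanced accuracy of 0.33 on a three-class problem signals a model that has learned nothing, even when plain accuracy looks higher on an imbalanced split.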

Explaining the predictions of the best model with SHAP:

!pip install shap

import shap

# clf comes from the last loop iteration, so this is the best model on iris
best_model = clf.get_best_model()

# load JS visualization code to notebook
shap.initjs()

# explain all the predictions in the test set
explainer = shap.KernelExplainer(best_model.predict_proba, X_train)
shap_values = explainer.shap_values(X_test)
# this is multiclass, so we only visualize the contributions to the first class (hence index 0)
shap.force_plot(explainer.expected_value[0], shap_values[..., 0], X_test)


WARNING:shap:Using 120 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.



[Interactive SHAP force plot: contributions to the first class on the test set. The visualization is not rendered outside the notebook.]