
Effect Plots in Python and R


Christian and I did some code magic: highly effective plots that help to build and inspect any model.

The functionality is best described by its output:

[Figure: effect plots of the six features, as produced by the Python and the R implementation.]

The plots show different types of feature effects relevant in modeling:

- Average observed: the descriptive effect of the feature on the observed response.
- Average predicted: the combined effect of the feature and of correlated features on the predictions.
- Partial dependence (Friedman 2001): the effect of the feature on the predictions while keeping all other feature values fixed.
- Accumulated local effects (Apley and Zhu 2020): an alternative to partial dependence with a relaxed ceteris paribus clause.

Both implementations…

In the spirit of our “Lost In Translation” series, we provide both high-quality Python and R code. We will use the same data and models as in one of our latest posts on how to build strong GLMs via ML + XAI.

Example

Let’s build a Poisson LightGBM model to explain the claim frequency given six traditional features in a pricing dataset on motor liability claims. 80% of the one million rows are used for training, the other 20% for evaluation. Hyper-parameters have been slightly tuned (not shown; a rough illustration of such a search is sketched after the R training code below).

R

library(OpenML)
library(lightgbm)

dim(df <- getOMLDataSet(data.id = 45106L)$data)  # 1000000 7
head(df)

#   year town driver_age car_weight car_power car_age claim_nb
# 0 2018    1         51       1760       173       3        0
# 1 2019    1         41       1760       248       2        0
# 2 2018    1         25       1240       111       2        0
# 3 2019    0         40       1010        83       9        0
# 4 2018    0         43       2180       169       5        0
# 5 2018    1         45       1170       149       1        1

yvar <- "claim_nb"
xvars <- setdiff(colnames(df), yvar)

# Split the one million rows 80/20 into training and test data
ix <- 1:800000
train <- df[ix, ]
test <- df[-ix, ]
X_train <- data.matrix(train[xvars])
X_test <- data.matrix(test[xvars])

# Training, using slightly optimized parameters found via cross-validation
params <- list(
  learning_rate = 0.05,
  objective = "poisson",
  num_leaves = 7,
  min_data_in_leaf = 50,
  min_sum_hessian_in_leaf = 0.001,
  colsample_bynode = 0.8,
  bagging_fraction = 0.8,
  lambda_l1 = 3,
  lambda_l2 = 5,
  num_threads = 7
)

set.seed(1)

fit <- lgb.train(
  params = params,
  data = lgb.Dataset(X_train, label = train$claim_nb),
  nrounds = 300
)
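
The parameter values above are described as slightly tuned via cross-validation, but the search itself is not shown. As a rough, purely illustrative sketch (not necessarily the authors' actual procedure), one candidate setting could be cross-validated with lgb.cv, reusing params, X_train and train from above; a real search would repeat this over several candidate settings.

# Illustration only: 5-fold CV for one parameter setting to pick nrounds.
# A full tuning run would loop over a grid of candidate settings.
cv <- lgb.cv(
  params = params,
  data = lgb.Dataset(X_train, label = train$claim_nb),
  nrounds = 1000,
  nfold = 5,
  stratified = FALSE,
  early_stopping_rounds = 20
)
cv$best_iter  # suggested number of boosting rounds for this setting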
Python

import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
from sklearn.datasets import fetch_openml

df = fetch_openml(data_id=45106, parser="pandas").frame
df.head()

#   year town driver_age car_weight car_power car_age claim_nb
# 0 2018    1         51       1760       173       3        0
# 1 2019    1         41       1760       248       2        0
# 2 2018    1         25       1240       111       2        0
# 3 2019    0         40       1010        83       9        0
# 4 2018    0         43       2180       169       5        0
# 5 2018    1         45       1170       149       1        1

# Train model on 80% of the data
y = df.pop("claim_nb")
n_train = 800_000
X_train, y_train = df.iloc[:n_train], y.iloc[:n_train]
X_test, y_test = df.iloc[n_train:], y.iloc[n_train:]

params = {
    "learning_rate": 0.05,
    "objective": "poisson",
    "num_leaves": 7,
    "min_child_samples": 50,
    "min_child_weight": 0.001,
    "colsample_bynode": 0.8,
    "subsample": 0.8,
    "reg_alpha": 3,
    "reg_lambda": 5,
    "verbose": -1,
}

model = LGBMRegressor(n_estimators=300, **params, random_state=1)
model.fit(X_train, y_train)
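
Since the last 20% of the rows serve as evaluation data, a quick sanity check is the mean Poisson deviance of the predictions on the test set. Here is a minimal R sketch, reusing fit, X_test and test from above (the authors' own evaluation code is not shown in this post).

# Mean Poisson deviance on the hold-out data (smaller is better)
pred <- predict(fit, X_test)
y <- test$claim_nb
mean_poisson_deviance <- function(y, mu) {
  2 * mean(ifelse(y == 0, mu, y * log(y / mu) - (y - mu)))
}
mean_poisson_deviance(y, pred)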

Let’s inspect the main effects of the model on the test data.

R

library(effectplots)

# 0.3 s
feature_effects(fit, v = xvars, data = X_test, y = test$claim_nb) |>
  plot(share_y = "all")
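
To make the partial dependence component of these plots concrete: it is simply the average prediction when one feature is set to a fixed grid value for all observations (Friedman 2001). A bare-bones R sketch for driver_age, reusing fit and X_test from above and a hypothetical grid (effectplots computes this, plus the other statistics, far more efficiently):

# Manual partial dependence for driver_age: fix the feature at each grid
# value for all test rows, predict, and average.
grid <- seq(18, 80, by = 2)  # hypothetical grid of driver ages
pd <- sapply(grid, function(v) {
  X_mod <- X_test
  X_mod[, "driver_age"] <- v
  mean(predict(fit, X_mod))
})
plot(grid, pd, type = "l", xlab = "driver_age", ylab = "partial dependence")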
Python

from model_diagnostics.calibration import plot_marginal

fig, axes = plt.subplots(3, 2, figsize=(8, 8), sharey=True, layout="tight")

# 2.3 s
for i, (feat, ax) in enumerate(zip(X_test.columns, axes.flatten())):
    plot_marginal(
        y_obs=y_test,
        y_pred=model.predict(X_test),
        X=X_test,
        feature_name=feat,
        predict_function=model.predict,
        ax=ax,
    )
    ax.set_title(feat)
    if i != 1:
        ax.legend().remove()
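
The "average observed" and "average predicted" curves, in contrast, are just grouped means over the (possibly binned) feature. A rough R sketch for car_power, again reusing objects from the R code above; the number of bins is an arbitrary choice here:

# Average observed vs. average predicted claim frequency per car_power bin
df_eval <- data.frame(
  car_power_bin = cut(test$car_power, breaks = 10),
  observed = test$claim_nb,
  predicted = predict(fit, X_test)
)
aggregate(cbind(observed, predicted) ~ car_power_bin, data = df_eval, FUN = mean)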

The output can be seen at the beginning of this blog post.

Here are some model insights:

Final words

References

  1. Apley, Daniel W., and Jingyu Zhu. 2020. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82 (4): 1059–1086. doi:10.1111/rssb.12377.
  2. Friedman, Jerome H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29 (5): 1189–1232. doi:10.1214/aos/1013203451.

R script, Python notebook
