Interpretability and explainability of predictive models
In 400 AD, in his Confessiones, Augustine wrote
quid est ergo tempus? si nemo ex me quaerat, scio; si quaerenti explicare velim, nescio
that can be translated as
What then is time? If no one asks me, I know what it is. If I wish to explain it to him who asks, I do not know.
To go a little further (because often, if we are asked to explain, we have some ideas), in A Study in Scarlet by Sir Arthur Conan Doyle, published in 1887, we have the following exchange, between Sherlock Holmes and Doctor Watson
– “I wonder what that fellow is looking for?” I asked, pointing to a stalwart, plainly-dressed individual who was walking slowly down the other side of the street, looking anxiously at the numbers. He had a large blue envelope in his hand, and was evidently the bearer of a message.
– “You mean the retired sergeant of Marines,” said Sherlock Holmes.
then, as it turns out that the person is indeed a sergeant in the navy (as is another character in the story, someone named Arthur Charpentier), Dr. Watson asks for an explanation: he wants to know how Holmes arrived at this conclusion
– “How in the world did you deduce that?” I asked.
– “Deduce what?” said he, petulantly.
– “Why, that he was a retired sergeant of Marines.”
– “I have no time for trifles,” he answered, brusquely; then with a smile, “Excuse my rudeness. You broke the thread of my thoughts; but perhaps it is as well. So you actually were not able to see that that man was a sergeant of Marines?”
– “No, indeed.”
– “It was easier to know it than to explain why I knew it. If you were asked to prove that two and two made four, you might find some difficulty, and yet you are quite sure of the fact. Even across the street I could see a great blue anchor tattooed on the back of the fellow’s hand. That smacked of the sea. He had a military carriage, however, and regulation side whiskers. There we have the marine. He was a man with some amount of self-importance and a certain air of command. You must have observed the way in which he held his head and swung his cane. A steady, respectable, middle-aged man, too, on the face of him – all facts which led me to believe that he had been a sergeant.”
(to be honest, it is Liu Cixin who mentions this passage in The Three-Body Problem). For the record, this is the first story featuring Holmes and Watson, and it introduces Sherlock Holmes' working method. For those who are familiar with the short stories, this narrative device will be used extensively thereafter: Sherlock Holmes states a fact, Dr. Watson is astonished and asks for an explanation, and Sherlock Holmes explains, point by point, how he arrived at this conclusion. This is a bit like the approach we try to implement when we build a predictive model: on the basis of the Titanic data, if we predict that such and such a person will die, and that such and such a person will survive, we want to understand why the model arrives at this conclusion.
Beyond the general reflections, from “linear models are not as simple to interpret and explain as they seem” to “one can explain without being able to predict, and predict without being able to explain”, I wanted to come back to the classical mathematical notions used when we talk about the explainability of predictive models. On the basis of the Titanic data,
library(DALEX)
data("titanic")
titanic = DALEX::titanic
idx = which((is.na(titanic$age)) + (is.na(titanic$sibsp)) + (is.na(titanic$parch)) == 0)
titanicNA = titanic[idx,]
we will consider six different models to predict survivorship
- a logistic GLM and a GAM version (with a spline on the age)
- a classification tree CART and a random forest RF
- a boosting model GBM and a support vector machine SVM
library(splines)
titanic_glm = glm(survived == "yes" ~ gender + age + class + sibsp + parch + embarked,
                  titanicNA, family = "binomial")
titanic_gam = glm(survived == "yes" ~ gender + bs(age) + class + sibsp + parch + embarked,
                  titanicNA, family = "binomial")
library("rpart")
titanic_cart = rpart(survived == "yes" ~ class + gender + age + sibsp + parch + embarked,
                     data = titanicNA)
library("gbm")
set.seed(1234)
titanic_gbm = gbm(survived == "yes" ~ class + gender + age + sibsp + parch + embarked,
                  data = titanicNA, n.trees = 15000, distribution = "bernoulli")
library("randomForest")
set.seed(1234)
titanic_rf = randomForest(survived ~ class + gender + age + sibsp + parch + embarked,
                          data = titanicNA)
library("e1071")
titanic_svm = svm(survived == "yes" ~ class + gender + age + sibsp + parch + embarked,
                  data = titanicNA, type = "C-classification", probability = TRUE)
and to try to explain, we will see local methods, with two fictitious passengers, Kate and Leonardo (as I had already done in the past).
newbase = data.frame(
  class = factor(c("1st", "3rd"),
                 levels = c("1st", "2nd", "3rd", "deck crew", "engineering crew",
                            "restaurant staff", "victualling crew")),
  gender = factor(c("female", "male"), levels = c("female", "male")),
  age = c(17, 20),
  sibsp = c(1, 0),
  parch = c(2, 0),
  embarked = factor(c("Southampton", "Southampton"),
                    levels = c("Belfast", "Cherbourg", "Queenstown", "Southampton")))
rownames(newbase) = c("Winslet, Miss. Kate", "DiCaprio, Mr. Leonardo")
To illustrate, we will use here the DALEX R package, by Przemyslaw Biecek, detailed in the book written with Tomasz Burzykowski, Explanatory Model Analysis,
titanic_cart_exp = DALEX::explain(model = titanic_cart, data = titanicNA[, -9],
    y = titanicNA$survived == "yes", label = "cart", type = "classification")
titanic_glm_exp = DALEX::explain(model = titanic_glm, data = titanicNA[, -9],
    y = titanicNA$survived == "yes", label = "glm", type = "classification")
titanic_gam_exp = DALEX::explain(model = titanic_gam, data = titanicNA[, -9],
    y = titanicNA$survived == "yes", label = "gam", type = "classification")
titanic_rf_exp = DALEX::explain(model = titanic_rf, data = titanicNA[, -9],
    y = titanicNA$survived == "yes", label = "rf")
titanic_gbm_exp = DALEX::explain(model = titanic_gbm, data = titanicNA[, -9],
    y = titanicNA$survived == "yes", label = "gbm")
titanic_svm_exp = DALEX::explain(model = titanic_svm, data = titanicNA[, -9],
    y = titanicNA$survived == "yes", label = "svm")
But first, let us return to the distinction between ceteris paribus and mutatis mutandis. Ceteris paribus (or rather ceteris paribus sic stantibus) is the Latin phrase that translates as "all other things being equal". Mutatis mutandis translates as "what should be changed has been changed" or "once the necessary changes have been made". This distinction will be important later on. To illustrate, consider a simple model obesity = m(weight, height). In the first case, to understand the impact of the weight on obesity, we will consider m(weight = x + dx, height = y) − m(weight = x, height = y), as in the picture below
In the second case, we want to take into account the fact that an individual of a different weight would probably also have a different height, and so we should look at m(weight = x + dx, height = y + ϵ) − m(weight = x, height = y), as in the picture below
where ϵ would somehow capture the correlation between the two components.
To formalize this, consider a Gaussian model, i.e. two random variables $(X_1,X_2)$ (that might be the height and the weight, for instance) such that
$$\begin{pmatrix}X_1\\X_2\end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix}\mu_1\\\mu_2\end{pmatrix},\begin{pmatrix}\sigma_1^2 & r\sigma_1\sigma_2\\ r\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}\right)$$
In that case, consider $(X_1^\perp,X_2^\perp)$, an independent version of $(X_1,X_2)$ (i.e. with the same marginal distributions, but independent components). The (classical) conditional expectation can be written
$$\mathbb{E}_{X_1}\big[X_2\mid x_1^*\big]=\mu_2+r\frac{\sigma_2}{\sigma_1}(x_1^*-\mu_1)$$
and we will use the notation
$$\mathbb{E}_{X_1^\perp}\big[X_2\mid x_1^*\big]=\mathbb{E}[X_2]=\mu_2$$
in the case where the components are independent.
This notation, with the conditioning random variable appearing as an index, might be surprising, but here $\mathbb{E}_{X_1}$ means that the expected value is not under $\mathbb{P}$, but under $\mathbb{P}_{X_2\mid X_1}$, also sometimes denoted $\mathbb{E}_{\mathbb{P}_{X_2\mid X_1}}$. I will use it here for convenience. From a statistical perspective, the latter one (with independent variables) means that
$$\mathbb{E}_{X_1^\perp}\big[h(X_1,X_2)\mid x_1^*\big]\approx \frac{1}{n}\sum_{i=1}^n h(x_1^*,x_{i,2})$$
while the true (classical) conditional expectation should be understood as an average in the neighborhood of $x_1^*$, and therefore
$$\mathbb{E}_{X_1}\big[h(X_1,X_2)\mid x_1^*\big]\approx \frac{1}{\|\mathcal{V}_\epsilon(x_1^*)\|}\sum_{i\in\mathcal{V}_\epsilon(x_1^*)} h(x_1^*,x_{i,2})$$
where $\mathcal{V}_\epsilon(x_1^*)$ denotes the neighborhood, i.e. $\mathcal{V}_\epsilon(x_1^*)=\{i:|x_{i,1}-x_1^*|\leq\epsilon\}$.
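To make this distinction concrete, here is a small simulation sketch (the parameters are arbitrary, just for illustration): the "independent" version is approximated by averaging over all observations, while the true conditional expectation is approximated by averaging only over the neighbors of $x_1^*$.
set.seed(123)
n = 1e4
r = .7
x1 = rnorm(n)
x2 = r * x1 + sqrt(1 - r^2) * rnorm(n)   # correlated Gaussian pair
x1_star = 1
eps = .1
# independent version: average over all observations (here simply E[X2] = 0)
mean(x2)
# conditional version: average only over the neighborhood of x1*
V = which(abs(x1 - x1_star) <= eps)
mean(x2[V])
# to be compared with the theoretical value mu2 + r*(sigma2/sigma1)*(x1* - mu1) = 0.7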
Before going forward, I want to stress that we will talk about "local" explainability when we try to explain a single prediction, given by model $m$ for some very specific individual (denoted $\boldsymbol{x}^*$), while the "global" approach usually means that we want to quantify variable importance in model $m$ (and possibly consider functions of $x_j^*$).
- Global Approach: variable importance
Variable importance is a first tool to understand which variables are important in a predictive model $m$. Fisher et al. (2019) suggested the following definition: given a loss function $\ell$, define
$$VI_j=\mathbb{E}\big[\ell(Y,m(\boldsymbol{X}_{-j},X_j^\perp))\big]-\mathbb{E}\big[\ell(Y,m(\boldsymbol{X}_{-j},X_j))\big]$$
(the increase of the loss when the $j$-th variable is replaced by an independent copy), and the empirical version is
$$\widehat{VI}_j=\frac{1}{n}\sum_{i=1}^n \ell\big(y_i,m(\boldsymbol{x}_{i,-j},\tilde{x}_{i,j})\big)-\ell\big(y_i,m(\boldsymbol{x}_{i,-j},x_{i,j})\big)$$
for some permutation $\tilde{\boldsymbol{x}}_j$ of vector $\boldsymbol{x}_j$.
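Before calling DALEX's model_parts below, this permutation importance can be computed by hand; a minimal sketch for the logistic regression, with 1−AUC as the loss (the AUC is computed with the rank-based formula, and a single permutation per variable is used).
y = (titanicNA$survived == "yes") * 1
auc_hand = function(y, s){
  r = rank(s)
  n1 = sum(y == 1); n0 = sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
loss = function(y, s) 1 - auc_hand(y, s)
base_loss = loss(y, predict(titanic_glm, titanicNA, type = "response"))
set.seed(1234)
vi = sapply(c("gender", "age", "class", "sibsp", "parch", "embarked"), function(v){
  perturbed = titanicNA
  perturbed[, v] = sample(perturbed[, v])   # permute the j-th column
  loss(y, predict(titanic_glm, perturbed, type = "response")) - base_loss
})
sort(vi, decreasing = TRUE)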
vip_glm_999 = model_parts(explainer = titanic_glm_exp,
    loss_function = loss_one_minus_auc,
    B = 999, type = "difference")
plot(vip_glm_999)
- Local Approach : ICE or ceteris paribus
Goldstein et al. (2015) introduced the concept of ICE (individual conditional expectation), which is simply the ceteris paribus functional
$$z\mapsto m_{\boldsymbol{x}^*,j}(z)=m(\boldsymbol{x}^*_{-j},z)=m(x_1^*,\cdots,x_{j-1}^*,z,x_{j+1}^*,\cdots,x_p^*)$$
at a given point $\boldsymbol{x}^*\in\mathcal{X}$ (with a small abuse of notation on the indices, since we write $(\boldsymbol{x}^*_{-j},x_j^*)$ whatever the position of index $j$).
We can then look at $\delta m_{\boldsymbol{x}^*,j}(z)=m_{\boldsymbol{x}^*,j}(z)-m_{\boldsymbol{x}^*,j}(x_j^*)$, or, more interestingly, at the mean absolute deviation of the $j$-th variable, at $\boldsymbol{x}^*$,
$$dm_j(\boldsymbol{x}^*)=\mathbb{E}\big[|\delta m_{\boldsymbol{x}^*,j}(X_j)|\big]=\mathbb{E}\big[|m(\boldsymbol{x}^*_{-j},X_j)-m(\boldsymbol{x}^*_{-j},x_j^*)|\big]$$
whose empirical version is
$$\widehat{dm}_j(\boldsymbol{x}^*)=\frac{1}{n}\sum_{i=1}^n |m(\boldsymbol{x}^*_{-j},x_{i,j})-m(\boldsymbol{x}^*_{-j},x_j^*)|$$
For example, just for Kate, and just for the logistic regression, the code looks like
cp_titanic_glm = predict_profile(explainer = titanic_glm_exp,
    new_observation = newbase[1,])
plot(cp_titanic_glm, variables = "age")
Here, for our six models (we simply replace the model to be explained in the previous code), we look at how the prediction evolves with the age, for Kate (what would have happened if Kate's age had not been $x_j^*=17$ but $x$, all other things being equal, ceteris paribus)
or for the class (what would the survival probability have been if Kate had not been in class $x_j^*$ = 1st (first class) but in class $x$, everything else remaining unchanged, ceteris paribus).
plot(cp_titanic_glm, variables = "class")
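Under the hood, such a ceteris paribus profile is nothing more than repeated predictions on copies of Kate where only one variable is changed; a minimal by-hand sketch for the age, with the logistic regression (the grid of ages is arbitrary):
kate = newbase[1, ]
ages = seq(0, 80, by = 1)
profile_age = sapply(ages, function(a){
  z = kate
  z$age = a                      # change only the age, ceteris paribus
  predict(titanic_glm, z, type = "response")
})
plot(ages, profile_age, type = "l", xlab = "age", ylab = "predicted survival probability")
abline(v = kate$age, lty = 2)    # Kate's actual age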
- Local Approach : additive (break-down) decomposition
For a standard linear model,
$$\widehat{m}(\boldsymbol{x}^*)=\widehat{\beta}_0+\widehat{\boldsymbol{\beta}}^\top\boldsymbol{x}^*=\widehat{\beta}_0+\sum_{j=1}^p\widehat{\beta}_j x_j^*=\bar{y}+\sum_{j=1}^p\underbrace{\widehat{\beta}_j(x_j^*-\bar{x}_j)}_{=v_j(\boldsymbol{x}^*)}$$
where vj(x∗) will be seen as the contribution of variable j in the prediction, for x∗.
More generally, Robnik-Šikonja and Kononenko (1997, 2003 and 2008) defined the contribution of the $j$-th variable, at $\boldsymbol{x}^*$, as
$$v_j(\boldsymbol{x}^*)=m(x_1^*,\cdots,x_{j-1}^*,x_j^*,x_{j+1}^*,\cdots,x_p^*)-\mathbb{E}_{\boldsymbol{X}_{-j}^\perp}\big[m(x_1^*,\cdots,x_{j-1}^*,X_j,x_{j+1}^*,\cdots,x_p^*)\big]$$
such that
$$m(\boldsymbol{x}^*)=\mathbb{E}[m(\boldsymbol{X})]+\sum_{j=1}^p v_j(\boldsymbol{x}^*)$$
therefore, for a linear model, $v_j(\boldsymbol{x}^*)=\beta_j\big(x_j^*-\mathbb{E}_{\boldsymbol{X}_{-j}^\perp}[X_j]\big)$ and $\widehat{v}_j(\boldsymbol{x}^*)=\widehat{\beta}_j(x_j^*-\bar{x}_j)$.
But more generally,
$$v_j(\boldsymbol{x}^*)=m(\boldsymbol{x}^*)-\mathbb{E}_{\boldsymbol{X}_{-j}^\perp}\big[m(\boldsymbol{x}^*_{-j},X_j)\big]$$
where we can write $m(\boldsymbol{x}^*)$ as $\mathbb{E}_{\boldsymbol{X}}[m(\boldsymbol{X})\mid\boldsymbol{x}^*]$, i.e.
$$v_j(\boldsymbol{x}^*)=\begin{cases}\mathbb{E}_{\boldsymbol{X}}\big[m(\boldsymbol{X})\mid x_1^*,\cdots,x_p^*\big]-\mathbb{E}_{\boldsymbol{X}_{-j}^\perp}\big[m(\boldsymbol{X})\mid x_1^*,\cdots,x_{j-1}^*,x_{j+1}^*,\cdots,x_p^*\big]\\[1mm]\mathbb{E}_{\boldsymbol{X}}\big[m(\boldsymbol{X})\mid \boldsymbol{x}^*\big]-\mathbb{E}_{\boldsymbol{X}_{-j}^\perp}\big[m(\boldsymbol{X})\mid \boldsymbol{x}^*_{-j}\big]\end{cases}$$
The contribution of the $j$-th variable, at $\boldsymbol{x}^*$, is thus
$$\gamma_j^{bd}(\boldsymbol{x}^*)=v_j(\boldsymbol{x}^*)=\mathbb{E}_{\boldsymbol{X}}\big[m(\boldsymbol{X})\mid \boldsymbol{x}^*\big]-\mathbb{E}_{\boldsymbol{X}_{-j}^\perp}\big[m(\boldsymbol{X})\mid \boldsymbol{x}^*_{-j}\big]$$
This is clearly a ceteris paribus approach, as we can see below (where I will simply draw a couple of pictures). Consider a set $\mathcal{X}$ such that $0\leq x_2\leq x_1\leq 1$. Below, points can be in the blue region, but not in the red one,
We want to understand the prediction given by the least squares regression, for one of the points, i.e. to get a break-down decomposition of ˆy
First, we compute the average value of y
Then, we compute averages, ceteris paribus, the first one being
$$\mathbb{E}_{X_1^\perp}\big[m(x_1^*,X_2)\big]\approx\frac{1}{n}\sum_{i=1}^n m(x_1^*,x_{2,i})$$
i.e. we average over the $x_{2,i}$'s while keeping $x_1^*$ fixed (which is perfectly reasonable here),
and the second one is
$$\mathbb{E}_{X_2^\perp}\big[m(X_1,x_2^*)\big]\approx\frac{1}{n}\sum_{i=1}^n m(x_{1,i},x_2^*)$$
i.e. we average while keeping $x_2^*$ fixed, which is rather odd in this example since we consider some pseudo-observations $(x_{1,i},x_2^*)$ that are clearly in the red area, where we should have no points…
In the last two cases, the contribution will be the conditional mean, from which the global mean is subtracted (this is the height that can be seen in yellow). And we can show that the mean, to which we add the two contributions, gives the predicted value.
On the Titanic data, for Kate and just the logistic regression
bd_glm_kate = DALEX::predict_parts(explainer = titanic_glm_exp,
    new_observation = newbase[1,],
    type = "break_down",
    order = c("gender", "class", "age", "parch", "sibsp", "embarked"))
plot(bd_glm_kate)
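To see what is computed here, the break-down contributions can be reconstructed by hand: starting from the average prediction, we fix the variables one after the other (in the order given above, at Kate's values) and record how the average prediction moves. A minimal sketch for the logistic regression:
kate = newbase[1, ]
vars = c("gender", "class", "age", "parch", "sibsp", "embarked")
base = titanicNA
m_bar = mean(predict(titanic_glm, base, type = "response"))   # E[m(X)], the intercept
cum = m_bar
contrib = numeric(length(vars))
for(k in seq_along(vars)){
  base[, vars[k]] = rep(kate[, vars[k]], nrow(base))   # fix one more variable at Kate's value
  new_mean = mean(predict(titanic_glm, base, type = "response"))
  contrib[k] = new_mean - cum
  cum = new_mean
}
data.frame(variable = vars, contribution = round(contrib, 4))
# m_bar plus the sum of the contributions equals the prediction for Kate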
To go a little further, note that we can define the contribution of the $j$-th variable, conditional on a group of variables $S\subset\{1,\cdots,p\}\setminus\{j\}$, at $\boldsymbol{x}^*$, as
$$\Delta_{j|S}(\boldsymbol{x}^*)=\begin{cases}\mathbb{E}_{\boldsymbol{X}_{S}^\perp,X_j^\perp}\big[m(\boldsymbol{X})\mid \boldsymbol{x}^*_S,x_j^*\big]-\mathbb{E}_{\boldsymbol{X}_S^\perp}\big[m(\boldsymbol{X})\mid \boldsymbol{x}^*_S\big]\\[1mm]\mathbb{E}_{\boldsymbol{X}_{S\cup\{j\}}^\perp}\big[m(\boldsymbol{X})\mid \boldsymbol{x}^*_{S\cup\{j\}}\big]-\mathbb{E}_{\boldsymbol{X}_S^\perp}\big[m(\boldsymbol{X})\mid \boldsymbol{x}^*_S\big]\end{cases}$$
so that $v_j(\boldsymbol{x}^*)=\Delta_{j|\{1,2,\cdots,p\}\setminus\{j\}}=\Delta_{j|-j}$.
More generally, we can also define $\Delta_{K|S}(\boldsymbol{x}^*)$, or $\Delta_{i,j|S}(\boldsymbol{x}^*)$ (which would allow further analysis of possible interactions, but we will not dwell on this here).
- Local Approach : Shapley decomposition
As a reminder, in a (very) general context, for all $S\subset\{1,\ldots,p\}$, we have a value function $val(S)$, and we are looking for contributions $\phi_j(val)$ satisfying some criteria
- efficiency: $\sum_{j=1}^p\phi_j(val)=val(\{1,\ldots,p\})$
- symmetry: if $val(S\cup\{j\})=val(S\cup\{k\})$ for all $S\subseteq\{1,\ldots,p\}\setminus\{j,k\}$, then $\phi_j=\phi_k$
- dummy: if $val(S\cup\{j\})=val(S)$ for all $S\subseteq\{1,\ldots,p\}$, then $\phi_j=0$
- additivity: if $val^{(1)}$ and $val^{(2)}$ have decompositions $\phi^{(1)}$ and $\phi^{(2)}$, then $val^{(1)}+val^{(2)}$ has decomposition $\phi^{(1)}+\phi^{(2)}$
Shapley (1953) proved that the only functions satisfying those criteria are
$$\phi_j(val)=\sum_{S\subseteq\{1,\ldots,p\}\setminus\{j\}}\frac{|S|!\,(p-|S|-1)!}{p!}\big(val(S\cup\{j\})-val(S)\big)$$
or, equivalently,
$$\phi_j(val)=\frac{1}{p}\sum_{S\subseteq\{1,\ldots,p\}\setminus\{j\}}\binom{p-1}{|S|}^{-1}\big(val(S\cup\{j\})-val(S)\big)$$
Here, we will use $val(S)=\mathbb{E}_{\boldsymbol{X}_S^\perp}\big[m(\boldsymbol{X})\mid\boldsymbol{x}^*_S\big]$.
The contribution of the $j$-th variable, at $\boldsymbol{x}^*$, is therefore
$$\gamma_j^{shap}(\boldsymbol{x}^*)=\frac{1}{p}\sum_{S\subseteq\{1,\ldots,p\}\setminus\{j\}}\binom{p-1}{|S|}^{-1}\Delta_{j|S}(\boldsymbol{x}^*)$$
- local accuracy: $\sum_{j=1}^p\gamma_j^{shap}(\boldsymbol{x}^*)=m(\boldsymbol{x}^*)-\mathbb{E}[m(\boldsymbol{X})]$
- symmetry: if $j$ and $k$ are exchangeable, $\gamma_j^{shap}(\boldsymbol{x}^*)=\gamma_k^{shap}(\boldsymbol{x}^*)$
- dummy: if $X_j$ does not contribute, $\gamma_j^{shap}(\boldsymbol{x}^*)=0$
- additivity: if $m=m_1+m_2$, $\gamma_j^{shap}(\boldsymbol{x}^*;m)=\gamma_j^{shap}(\boldsymbol{x}^*;m_1)+\gamma_j^{shap}(\boldsymbol{x}^*;m_2)$
Observe that if $p=2$, $\gamma_1^{shap}(\boldsymbol{x}^*)=\Delta_{1|2}(\boldsymbol{x}^*)=\gamma_1^{bd}(\boldsymbol{x}^*)$.
And if $p\gg 2$, the calculations can quickly become heavy. Štrumbelj and Kononenko (2014) proposed a method based on simulations. From $\boldsymbol{x}^*$ and an individual $\boldsymbol{x}_i$ (drawn from the training dataset), construct
$$\tilde{x}_j=\begin{cases}x_j^* & \text{with probability }1/2\\ x_{i,j} & \text{with probability }1/2\end{cases}\quad\text{and}\quad\begin{cases}\boldsymbol{x}_i^{*+}=(\tilde{x}_1,\cdots,x_j^*,\cdots,\tilde{x}_p)\\ \boldsymbol{x}_i^{*-}=(\tilde{x}_1,\cdots,x_{i,j},\cdots,\tilde{x}_p)\end{cases}$$
and observe that $\gamma_j^{shap}(\boldsymbol{x}^*)\approx\mathbb{E}\big[m(\boldsymbol{x}_i^{*+})-m(\boldsymbol{x}_i^{*-})\big]$, and thus
$$\widehat{\gamma}_j^{shap}(\boldsymbol{x}^*)=\frac{1}{s}\sum_{i\in\{1,\cdots,s\}}m(\boldsymbol{x}_i^{*+})-m(\boldsymbol{x}_i^{*-})$$
(at each of the $s$ steps, an individual $i$ is drawn from the training dataset).
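A minimal sketch of this sampling scheme, for one variable (here the age) and the logistic regression, with s = 500 draws (an arbitrary choice):
set.seed(1234)
vars = c("class", "gender", "age", "sibsp", "parch", "embarked")
shap_mc = function(model, x_star, j, data, s = 500){
  d = numeric(s)
  for(k in 1:s){
    xi = data[sample(nrow(data), 1), vars]       # draw one individual from the data
    keep = runif(length(vars)) < .5              # coin flip for each variable
    x_tilde = x_star[, vars]
    if(any(keep)) x_tilde[, keep] = xi[, keep]
    x_plus = x_tilde;  x_plus[, j]  = x_star[, j]
    x_minus = x_tilde; x_minus[, j] = xi[, j]
    d[k] = predict(model, x_plus, type = "response") -
           predict(model, x_minus, type = "response")
  }
  mean(d)
}
shap_mc(titanic_glm, newbase[1, ], j = "age", data = titanicNA)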
shap_glm = DALEX::predict_parts(explainer = titanic_glm_exp,
    new_observation = newbase[1,],
    type = "shap",
    order = c("gender", "class", "age", "parch", "sibsp", "embarked"))
plot(shap_glm, show_boxplots = FALSE)
or with confidence boxplots
plot(shap_glm, show_boxplots = TRUE)
Observe that instead of x∗, we can consider all points in the training set
And we can actually plot the scatterplot $\{(x_{i,j},\gamma_j^{shap}(\boldsymbol{x}_i))\}$ (on the training set), also called a "Shapley dependence plot".
or
Instead of a local vision (at point $\boldsymbol{x}^*$), it is actually possible to get a global one.
Štrumbelj and Kononenko (2014), and then Lundberg and Lee (2017), suggested using the Shapley decomposition to compute a global feature importance function. The Shapley feature importance is simply
$$\gamma_j^{shap}=\frac{1}{n}\sum_{i=1}^n\gamma_j^{shap}(\boldsymbol{x}_i)$$
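In practice this means averaging the Shapley contributions over (a sample of) the training points; a rough sketch below, looping predict_parts over a small subsample (the column names variable_name and contribution are assumed to be those returned by DALEX, and absolute contributions are averaged, as is usually done):
set.seed(1234)
idx = sample(nrow(titanicNA), 50)   # small subsample, to keep computations reasonable
imp = sapply(idx, function(i){
  sh = DALEX::predict_parts(explainer = titanic_glm_exp,
                            new_observation = titanicNA[i, -9],
                            type = "shap")
  tapply(sh$contribution, sh$variable_name, mean)   # average contribution per variable
})
sort(rowMeans(abs(imp)), decreasing = TRUE)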
- Local Approach : LIME (Local Interpretable Model-Agnostic Explanations)
Given a (black box) model $m$ defined on $\mathcal{X}$, Ribeiro, Singh and Guestrin (2016) suggested solving
$$\underset{m_e\in\mathcal{E}}{\text{argmin}}\big\{\ell_{\boldsymbol{x}^*}(m,m_e)+\mathcal{P}(m_e)\big\}$$
where
- $\mathcal{E}$ is a subset of models $\mathcal{X}\to\mathbb{R}$ that are supposed to be "explainable" (like a linear model, or some tree), or possibly $\tilde{\mathcal{X}}\to\mathbb{R}$, where $\tilde{\mathcal{X}}$ is a subspace of $\mathcal{X}$ (called the space of interpretable representations)
- $\ell_{\boldsymbol{x}^*}$ is a loss function, defined on the neighborhood of $\boldsymbol{x}^*$
- $\mathcal{P}$ is a penalty function, increasing in the complexity of the model
(I will mention here some old slides used a few years ago to explain models on pictures.) Here, the pictures are our individuals $\boldsymbol{x}$, in dimension 30,000 (we have 100×100-pixel pictures, in colour, so the true dimension is three times larger). We can clearly see, at the top right, that the first step of the LIME procedure is to simplify our high-dimensional individual, by creating a 6× object. This approach creates surrogate models (locally).
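On tabular data like ours, the idea can be sketched in a few lines: generate perturbed copies of the observation of interest, weight them by their proximity to $\boldsymbol{x}^*$, and fit a small interpretable model on the black-box predictions. Everything below (the perturbation scheme, the kernel bandwidth, the two binary features of the surrogate) is an arbitrary toy choice, not the actual algorithm of the localModel package used next.
set.seed(1234)
kate = newbase[1, ]
n_pert = 1000
pert = kate[rep(1, n_pert), ]
pert$age = pmax(0, kate$age + rnorm(n_pert, sd = 10))          # perturb the age
pert$class = factor(sample(levels(titanicNA$class), n_pert, replace = TRUE),
                    levels = levels(titanicNA$class))          # and the class
y_hat = predict(titanic_rf, pert, type = "prob")[, "yes"]      # black-box (random forest) predictions
w = dnorm(pert$age - kate$age, sd = 10)                        # proximity weights
surrogate = lm(y_hat ~ I(age < 15) + I(class == "1st"), data = pert, weights = w)
coef(surrogate)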
library(localModel)
library(DALEXtra)
localModel_1_glm = predict_surrogate(explainer = titanic_glm_exp,
    new_observation = newbase[1,],
    size = 1000, type = "localModel")
plot_interpretable_feature(localModel_1_glm, "age")
- Local Approach : local-diagnostic plots
From our point x∗, we seek its closest neighbors, and we compare
- the global distribution of residuals
- the distribution of residuals only for the neighbors of x∗
using e.g. Gower's distance
$$d_G(\boldsymbol{x}_i,\boldsymbol{x}^*)=\frac{1}{p}\sum_{j=1}^p d_j(x_{i,j},x_j^*),\quad\text{where}\quad d_j(x_{i,j},x_j^*)=\begin{cases}\dfrac{|x_{i,j}-x_j^*|}{\max\{\boldsymbol{x}_j\}-\min\{\boldsymbol{x}_j\}} & \text{if }j\text{ is continuous}\\ \boldsymbol{1}(x_{i,j}\neq x_j^*) & \text{if }j\text{ is categorical}\end{cases}$$
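This distance is straightforward to code by hand; a small sketch for the variables used here:
gower = function(xi, xs, data){
  vars = c("class", "gender", "age", "sibsp", "parch", "embarked")
  d = sapply(vars, function(v){
    if(is.numeric(data[, v])){
      abs(xi[[v]] - xs[[v]]) / (max(data[, v]) - min(data[, v]))
    } else {
      as.numeric(xi[[v]] != xs[[v]])
    }
  })
  mean(d)   # (1/p) * sum of the d_j's
}
# Gower distance between the first passenger of the dataset and Kate
gower(titanicNA[1, ], newbase[1, ], data = titanicNA)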
res_glm_1 = predict_diagnostics(explainer = titanic_glm_exp,
    new_observation = newbase[1,],
    neighbors = 25)
plot(res_glm_1)
with the overall residuals below, in green, and for the neighbors of x∗ on top, in blue.
We can also plot the ceteris paribus plot of x∗, e.g. for the age, and the ones of the neighbors (here Kate’s neighbors), with residuals either in red (negative) or green (positive)
res_glm_1_age = predict_diagnostics(explainer = titanic_glm_exp,
    new_observation = newbase[1,],
    neighbors = 25, variables = "age")
plot(res_glm_1_age)
- Global Approach : Partial dependence plot
The partial dependence plot of variable $j$ is the function $\mathcal{X}_j\to\mathbb{R}$
$$p_j(x_j^*)=\mathbb{E}_{X_j^\perp}\big[m(\boldsymbol{X})\mid x_j^*\big]$$
and its empirical version is
$$\widehat{p}_j(x_j^*)=\frac{1}{n}\sum_{i=1}^n m(x_j^*,\boldsymbol{x}_{i,-j})=\frac{1}{n}\sum_{i=1}^n\underbrace{m_{\boldsymbol{x}_i}(x_j^*)}_{\text{ceteris paribus}}$$
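The empirical version is easy to compute directly: the partial dependence at a given age is just the average of the predictions when everyone in the dataset is assigned that age. A minimal by-hand sketch for the logistic regression (arbitrary grid):
ages = seq(0, 80, by = 2)
pdp_age = sapply(ages, function(a){
  base = titanicNA
  base$age = a                                  # everyone gets age a, ceteris paribus
  mean(predict(titanic_glm, base, type = "response"))
})
plot(ages, pdp_age, type = "l", xlab = "age", ylab = "average predicted survival probability")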
pdp_glm = model_profile(explainer = titanic_glm_exp, variables = "age")
plot(pdp_glm)
where we add a few ceteris paribus curves (the average curve is still the large blue one)
Note that instead of the overall average
$$\widehat{p}_j(x_j^*)=\frac{1}{n}\sum_{i=1}^n m(x_j^*,\boldsymbol{x}_{i,-j})=\frac{1}{n}\sum_{i=1}^n\underbrace{m_{\boldsymbol{x}_i}(x_j^*)}_{\text{ceteris paribus}}$$
it is possible to consider averages on subsets, e.g. for men and women,
$$\widehat{p}_j(x_j^*)=\frac{1}{n_k}\sum_{i\in\text{group}_k}\underbrace{m_{\boldsymbol{x}_i}(x_j^*)}_{\text{ceteris paribus}}$$
pdp_glm = model_profile(explainer = titanic_glm_exp, variables = "age", groups = "gender")
plot(pdp_glm)
or grouped into a few classes in an unsupervised way (here k = 3; the subgroups are built separately for each model, so there is no coherence between the plots)
pdp_glm = model_profile(explainer = titanic_glm_exp, variables = "age", k = 3)
plot(pdp_glm)
- Global Approach : Local dependence plot
Notice that $p_j(x_j^*)=\mathbb{E}_{X_j^\perp}[m(\boldsymbol{X})\mid x_j^*]$ does not take into account the fact that, generally, $(\boldsymbol{X}_{-j}\mid X_j)\overset{\mathcal{L}}{\neq}\boldsymbol{X}_{-j}$. Therefore, instead of the previous ceteris paribus approach, used to get PDPs, we can consider a mutatis mutandis approach.
Apley and Zhu (2020) defined
$$\ell_j(x_j^*)=\mathbb{E}_{X_j}\big[m(\boldsymbol{X})\mid x_j^*\big]$$
for the theoretical version, the empirical one being
$$\widehat{\ell}_j(x_j^*)=\frac{1}{\text{card}(\mathcal{V}(x_j^*))}\sum_{i\in\mathcal{V}(x_j^*)}m(x_j^*,\boldsymbol{x}_{i,-j})$$
where $\mathcal{V}(x_j^*)=\{i:d(x_{i,j},x_j^*)\leq\epsilon\}$ is a neighborhood of $x_j^*$, or
$$\widehat{\ell}_j(x_j^*)=\frac{1}{\sum_i\omega_i(x_j^*)}\sum_{i=1}^n\omega_i(x_j^*)\,m(x_j^*,\boldsymbol{x}_{i,-j})$$
where $\omega_i(x_j^*)=K_h(x_j^*-x_{i,j})$, for a kernel-smoothed version.
loc_glm = model_profile(explainer = titanic_glm_exp, variables = "age", type = "conditional")
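The kernel-smoothed version can also be coded directly, weighting each observation by how close its own age is to the age at which the curve is evaluated (the bandwidth below is arbitrary):
ages = seq(0, 80, by = 2)
h = 5                                            # arbitrary bandwidth
ld_age = sapply(ages, function(a){
  base = titanicNA
  base$age = a
  w = dnorm(titanicNA$age - a, sd = h)           # kernel weights omega_i(x_j^*)
  weighted.mean(predict(titanic_glm, base, type = "response"), w)
})
plot(ages, ld_age, type = "l", xlab = "age", ylab = "local dependence")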
- Global Approach : Accumulated Local Effects
Define
$$a_j(x_j^*)=\int_{-\infty}^{x_j^*}\mathbb{E}_{X_j}\left[\frac{\partial m(x_j,\boldsymbol{X}_{-j})}{\partial x_j}\,\Big|\, x_j\right]dx_j$$
where $\dfrac{\partial m(x_j,\boldsymbol{X}_{-j})}{\partial x_j}$ describes the local change of the model $m$ at $x_j$ (ceteris paribus), $m(x_j+dx_j,\boldsymbol{x}_{-j})-m(x_j,\boldsymbol{x}_{-j})\approx\dfrac{\partial m(x_j,\boldsymbol{x}_{-j})}{\partial x_j}dx_j$, and we look at its local average value.
The empirical version is
$$\widehat{a}_j(x_j^*)=\alpha+\sum_{k=1}^{k_j^*}\frac{1}{n_k}\sum_{i:x_{i,j}\in(a_{k-1},a_k]}\big[m(a_k,\boldsymbol{x}_{i,-j})-m(a_{k-1},\boldsymbol{x}_{i,-j})\big]$$
(where $\alpha$ denotes a normalization constant ensuring that $\mathbb{E}[a_j(X_j)]=0$, and $k_j^*$ is the index of the bin containing $x_j^*$).
Apley and Zhu (2020) suggested either a discretization of $X_j$ (a partition $\{(a_{k-1},a_k]\}$), or a smooth version (using a kernel)
acc_glm = model_profile(explainer = titanic_glm_exp, variables = "age", type = "accumulated")
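The same accumulated-local-effects logic can be coded by hand; a minimal sketch for the age and the logistic regression, with arbitrary bin boundaries (and a rough centering, standing in for the constant $\alpha$):
breaks = seq(0, 80, by = 5)                      # arbitrary partition of the age
local_effects = sapply(2:length(breaks), function(k){
  in_bin = which(titanicNA$age > breaks[k-1] & titanicNA$age <= breaks[k])
  if(length(in_bin) == 0) return(0)
  hi = lo = titanicNA[in_bin, ]
  hi$age = breaks[k]; lo$age = breaks[k-1]
  mean(predict(titanic_glm, hi, type = "response") -
       predict(titanic_glm, lo, type = "response"))
})
ale_age = cumsum(local_effects)
ale_age = ale_age - mean(ale_age)                # rough centering
plot(breaks[-1], ale_age, type = "b", xlab = "age", ylab = "accumulated local effects")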