Some general thoughts on Partial Dependence Plots with correlated covariates

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The partial dependence plot is a nice tool to analyse the impact of some explanatory variables when using nonlinear models, such as a random forest, or some gradient boosting.The idea (in dimension 2), given a model m(x1,x2) for E[Y|X1=x1,X2=x2]. The partial dependence plot for variable x1 is model m is function p1 defined as x1EPX2[m(x1,X2)]. This can be approximated, using some dataset using ˆp1(x1)=1nni=1m(x1,x2,i)My concern here what the interpretation of that plot when there are some (strongly) correlated covariates. Let us generate some dataset to start with

n=1000
library(mnormt)
r=.7
set.seed(1234)
X = rmnorm(n,mean = c(0,0),varcov = matrix(c(1,r,r,1),2,2))
Y = 1+X[,1]-2*X[,2]+rnorm(n)/2
df = data.frame(Y=Y,X1=X[,1],X2=X[,2])

As we can see, the true model is here is yi=β0+β1x1,i+β2x2,i+εi where β1=1 but the two variables are positively correlated, and the second one has a strong negative impact. Note that here

reg = lm(Y~.,data=df)
summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.01414    0.01601   63.35   <2e-16 ***
X1           1.02268    0.02305   44.37   <2e-16 ***
X2          -2.03248    0.02342  -86.80   <2e-16 ***

If we estimate a wrongly specified model yi=b0+b1x1,i+ηi, we would get

reg1 = lm(Y~X1,data=df)
summary(reg1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.03522    0.04680  22.121   <2e-16 ***
X1          -0.44148    0.04591  -9.616   <2e-16 ***

Thus, on the proper model, ˆβ1+1.02 while ˆb10.44  on the mispecified model.

Now, let us look at the parial dependence plot of the good model, using standard R dedicated packages,

library(pdp) 
pdp::partial(reg, pred.var = "X1", plot = TRUE,
              plot.engine = "ggplot2")

which is the linear line y=1+x, that corresponds to y=β0+β1x.

library(DALEX)
plot(DALEX::single_variable(DALEX::explain(reg,
data=df),variable = "X1",type = "pdp"))

which corresponds to the previous graph. Here, it is also possible to creaste our own function to compute that partial dependence plot,

pdp1 = function(x1){
  nd = data.frame(X1=x1,X2=df$X2)
  mean(predict(reg,newdata=nd))
}

that will be the straight line below (the dotted line is the theoretical one y=1+x,

vx=seq(-3.5,3.5,length=101)
vpdp1 = Vectorize(pdp1)(vx)
plot(vx,vpdp1,type="l")
abline(a=1,b=1,lty=2)

which is very different from the univariate regression on x1

abline(reg1,col="red")

Actually, the later is very consistent with a local regression, only on x1

library(locfit)
lines(locfit(Y~X1,data=df),col="blue")

Now, to get back to the definition of the partial dependence plot, x1EPX2[m(x1,X2)], in the context of correlated variable, I was wondering if it would not make more sense to consider some local version actually, something like x1EPX2|X1[m(x1,X2)]. My intuition was that, somehow, it did not make any sense to consider any X2 while X1 was fixed (and equal to x1). But it would make more sense actually to look at more valid X2‘s given the value of X1. And a natural estimate could be some k neareast-neighbors, i.e. ˜p1(x1)=1kniVk(x)m(x1,x2,i) where Vk(x) is the set of indices of the k xi‘s that are the closest to x, i.e.

lpdp1 = function(x1){
  nd = data.frame(X1=x1,X2=df$X2)
  idx = rank(abs(df$X1-x1))
  mean(predict(reg,newdata=nd[idx<50,]))
}
vlpdp1 = Vectorize(lpdp1)(vx)
lines(vx,vlpdp1,col="darkgreen",lwd=2)

Surprisingly (?), this local partial dependence plot gives a curve that corresponds to the simple regression…

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)