Some general thoughts on Partial Dependence Plots with correlated covariates
The partial dependence plot is a nice tool to analyse the impact of some explanatory variables when using nonlinear models, such as a random forest or some gradient boosting. The idea (in dimension 2): given a model \(m(x_1,x_2)\) for \(\mathbb{E}[Y|X_1=x_1,X_2=x_2]\), the partial dependence plot for variable \(x_1\) in model \(m\) is the function \(p_1\) defined as \(x_1\mapsto\mathbb{E}_{P_{X_2}}[m(x_1,X_2)]\). This can be approximated, using some dataset, by
\[
\hat p_1(x_1)=\frac{1}{n}\sum_{i=1}^n m(x_1,x_{2,i}).
\]
My concern here is the interpretation of that plot when there are some (strongly) correlated covariates. Let us generate some dataset to start with,
n = 1000
library(mnormt)
r = .7
set.seed(1234)
X = rmnorm(n, mean = c(0,0), varcov = matrix(c(1,r,r,1),2,2))
Y = 1 + X[,1] - 2*X[,2] + rnorm(n)/2
df = data.frame(Y = Y, X1 = X[,1], X2 = X[,2])
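As a quick sanity check, the empirical correlation between the two covariates should be close to the target \(r=0.7\),

cor(df$X1, df$X2)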
As we can see, the true model here is \(y_i=\beta_0+\beta_1 x_{1,i}+\beta_2 x_{2,i}+\varepsilon_i\) where \(\beta_1=1\), but the two variables are positively correlated, and the second one has a strong negative impact (\(\beta_2=-2\)). Note that here
reg = lm(Y~., data=df)
summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.01414    0.01601   63.35   <2e-16 ***
X1           1.02268    0.02305   44.37   <2e-16 ***
X2          -2.03248    0.02342  -86.80   <2e-16 ***
If we estimate a wrongly specified model \(y_i=b_0+b_1 x_{1,i}+\eta_i\), we would get
reg1 = lm(Y~X1, data=df)
summary(reg1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.03522    0.04680  22.121   <2e-16 ***
X1          -0.44148    0.04591  -9.616   <2e-16 ***
Thus, on the properly specified model, \(\hat\beta_1\approx +1.02\), while \(\hat b_1\approx -0.44\) on the misspecified one.
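This sign flip is exactly what we would expect: since \((X_1,X_2)\) is bivariate normal with unit variances and correlation \(r\), we have \(\mathbb{E}[X_2|X_1=x_1]=r\,x_1\), so the population slope of the regression of \(Y\) on \(X_1\) alone is \(\beta_1+r\beta_2\). A quick back-of-the-envelope check,

# population slope of the misspecified univariate regression
beta1 = 1 ; beta2 = -2 ; r = .7
beta1 + r*beta2
[1] -0.4

which is consistent with the estimate \(\hat b_1\approx -0.44\).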
Now, let us look at the partial dependence plot of the good model, using dedicated R packages,
library(pdp)
pdp::partial(reg, pred.var = "X1", plot = TRUE, plot.engine = "ggplot2")
which is the straight line \(y=1+x\), corresponding to \(y=\beta_0+\beta_1 x\).
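Since the fitted model is linear, that partial dependence function is exactly affine, with slope \(\hat\beta_1\). As a quick check, we can ask pdp for the values rather than the plot, and fit a line through them,

pd = pdp::partial(reg, pred.var = "X1", plot = FALSE)
coef(lm(yhat ~ X1, data = pd))

and the slope should match coef(reg)["X1"], i.e. +1.02. The same plot can be obtained with the DALEX package,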
library(DALEX)
plot(DALEX::single_variable(DALEX::explain(reg, data=df), variable = "X1", type = "pdp"))
which corresponds to the previous graph. Here, it is also possible to create our own function to compute that partial dependence plot,
pdp1 = function(x1){
  nd = data.frame(X1 = x1, X2 = df$X2)   # fix X1 at x1, keep all observed X2's
  mean(predict(reg, newdata = nd))       # average prediction over the X2 sample
}
that gives the straight line below (the dotted line is the theoretical one, \(y=1+x\)),
vx = seq(-3.5, 3.5, length = 101)
vpdp1 = Vectorize(pdp1)(vx)
plot(vx, vpdp1, type = "l")
abline(a = 1, b = 1, lty = 2)
which is very different from the univariate regression on \(x_1\),
abline(reg1,col="red")
Actually, the latter is very consistent with a local regression, only on \(x_1\),
library(locfit)
lines(locfit(Y~X1, data=df), col = "blue")
Now, to get back to the definition of the partial dependence plot, \(x_1\mapsto\mathbb{E}_{P_{X_2}}[m(x_1,X_2)]\), in the context of correlated variables, I was wondering if it would not make more sense to consider some local version, something like \(x_1\mapsto\mathbb{E}_{P_{X_2|X_1}}[m(x_1,X_2)]\). My intuition was that, somehow, it did not make any sense to consider any \(X_2\) while \(X_1\) was fixed (and equal to \(x_1\)): it would make more sense to look at plausible values of \(X_2\) given that value of \(X_1\). A natural estimate could be based on the \(k\) nearest neighbours, i.e.
\[
\tilde p_1(x_1)=\frac{1}{k}\sum_{i\in\mathcal{V}_k(x_1)} m(x_1,x_{2,i})
\]
where \(\mathcal{V}_k(x_1)\) is the set of indices of the \(k\) \(x_{1,i}\)'s that are the closest to \(x_1\), i.e.
lpdp1 = function(x1){
  nd = data.frame(X1 = x1, X2 = df$X2)
  idx = rank(abs(df$X1 - x1))                   # rank observations by distance of X1 to x1
  mean(predict(reg, newdata = nd[idx < 50,]))   # average over the 49 nearest neighbours
}
vlpdp1 = Vectorize(lpdp1)(vx)
lines(vx, vlpdp1, col = "darkgreen", lwd = 2)
Surprisingly (?), this local partial dependence plot gives a curve that corresponds to the simple regression…
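With hindsight, the algebra predicts exactly that, at least in this linear Gaussian setting: the local version approximates the conditional expectation rather than the marginal one, and here
\[
\mathbb{E}\big[m(x_1,X_2)\mid X_1=x_1\big]
=\beta_0+\beta_1 x_1+\beta_2\,\mathbb{E}[X_2\mid X_1=x_1]
=\beta_0+(\beta_1+r\beta_2)\,x_1,
\]
whose slope \(\beta_1+r\beta_2=-0.4\) is precisely the population slope of the misspecified univariate regression. So the local partial dependence plot recovers the simple regression line, rather than the coefficient of \(x_1\) in the full model.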