Residuals from a logistic regression
I always claim that graphs are important in econometrics and statistics! Of course, it is usually not that simple. Let me come back to a recent experience. I got an email from Sami yesterday, sending me a graph of residuals and asking me what could be done with a graph of residuals obtained from a logistic regression. To get a better understanding, let us consider the following dataset (those are simulated data, but let us assume, as in practice, that we do not know the true model; this is why I decided to embed the code in an R source file)
> source("http://freakonometrics.free.fr/probit.R") > reg=glm(Y~X1+X2,family=binomial)
If we use R's diagnostic plot, the first one is the scatterplot of the residuals against the predicted values (the score, actually)
> plot(reg,which=1)
which is simply
> plot(predict(reg),residuals(reg))
> abline(h=0,lty=2,col="grey")
Why do we have those two lines of points? Because we predict a probability for a variable taking values 0 or 1. If the true value is 0, then we always over-estimate, and the residuals have to be negative (the blue points); if the true value is 1, then we under-estimate, and the residuals have to be positive (the red points). And of course, there is a monotone relationship… We can see more clearly what's going on when we use colors
> plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y])
> abline(h=0,lty=2,col="grey")
Points are exactly on a smooth curve, as a function of the predicted value,
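Those two curves can actually be written down explicitly. Assuming the default deviance residuals, with the score on the link scale (so the predicted probability is plogis(predict(reg))), the residual is sqrt(-2*log(p)) when Y=1 and -sqrt(-2*log(1-p)) when Y=0. A short sketch (my addition) overlays those two theoretical curves on the previous plot,

> # sketch: overlay the theoretical deviance-residual curves on the plot above
> eta=seq(min(predict(reg)),max(predict(reg)),length=200)
> phat=plogis(eta)
> lines(eta, sqrt(-2*log(phat)),   col="red", lty=3)  # curve for Y=1
> lines(eta,-sqrt(-2*log(1-phat)), col="blue",lty=3)  # curve for Y=0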
Now, there is not much we can learn from this graph as it stands. If we want to understand more, we have to run a local regression, to see what's going on,
> lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)
This is exactly what we have with the first diagnostic function. But with this local regression, we do not get confidence intervals. Can't we claim, on the graph above, that the plain dark line is very close to the dotted line?
> library(splines)
> rl=lm(residuals(reg)~bs(predict(reg),8))
> #rl=loess(residuals(reg)~predict(reg))
> y=predict(rl,se=TRUE)
> segments(predict(reg),y$fit+2*y$se.fit,predict(reg),y$fit-2*y$se.fit,col="green")
Yes, we can. And even if we have a guess that something can be done, what would this graph suggest?
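Note that the commented-out loess line would give the same kind of band, since predict() on a loess fit can also return standard errors. A possible sketch (my addition, not in the original code),

> # sketch: local regression with pointwise +/- 2 s.e. bands, via predict(...,se=TRUE)
> d=data.frame(score=predict(reg),res=residuals(reg))
> rl=loess(res~score,data=d)
> xs=seq(min(d$score),max(d$score),length=101)
> ys=predict(rl,newdata=data.frame(score=xs),se=TRUE)
> lines(xs,ys$fit,col="darkgreen",lwd=2)
> lines(xs,ys$fit+2*ys$se.fit,col="darkgreen",lty=2)
> lines(xs,ys$fit-2*ys$se.fit,col="darkgreen",lty=2)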
Actually, that graph is probably not the only way to look at the residuals. Why not plot them against the two explanatory variables? For instance, if we plot the residuals against the second one, we get
> plot(X2,residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(X2,residuals(reg)),col="black",lwd=2)
> lines(lowess(X2[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(X2[Y==1],residuals(reg)[Y==1]),col="red")
> abline(h=0,lty=2,col="grey")
The graph is similar to the one we had earlier, and again, there is not much to say,
If we now look at the relationship with the first variable, things start to get more interesting,
> plot(X1,residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(X1,residuals(reg)),col="black",lwd=2)
> lines(lowess(X1[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(X1[Y==1],residuals(reg)[Y==1]),col="red")
> abline(h=0,lty=2,col="grey")
since we can clearly identify a quadratic effect. This graph suggests that we should add the square of the first variable to the regression. And it turns out to be a significant effect,
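To back up that visual impression (a small sketch, not in the original post), one can fit the model with the squared term and look at the usual Wald z-test on that coefficient,

> # sketch: is the quadratic term significant?
> reg2=glm(Y~X1+I(X1^2)+X2,family=binomial)
> summary(reg2)$coefficients["I(X1^2)",]  # estimate, std. error, z value, p-value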
Now, if we run a regression including this quadratic effect, what do we get?
> reg=glm(Y~X1+I(X1^2)+X2,family=binomial)
> plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(predict(reg)[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(predict(reg)[Y==1],residuals(reg)[Y==1]),col="red")
> lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)
> abline(h=0,lty=2,col="grey")
Actually, it looks like we are back where we were initially… So what is my point? My point is that
- graphs (yes, plural) can be used to see what might go wrong, and to get more intuition about possible non-linear transformations
- graphs are not everything, and they will never be perfect! Here, in theory, the plain line should be a straight, horizontal line. But we also want a model as simple as possible. So, at some stage, we should probably give up, and rely on statistical tests and confidence intervals (one such test is sketched below). Yes, an almost flat line can be interpreted as flat.
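For instance, a natural test here (a sketch, my addition) is the likelihood-ratio test between the models with and without the quadratic term,

> # sketch: likelihood-ratio test, with vs without the quadratic term
> reg0=glm(Y~X1+X2,family=binomial)
> reg1=glm(Y~X1+I(X1^2)+X2,family=binomial)
> anova(reg0,reg1,test="Chisq")  # a small p-value supports keeping the square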