My residuals look weird… aren’t they ?
[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Since I got the same question twice, let us look at it quickly…. Some students show me a graph (from a Poisson regression) which looks like that,
and they asked “isn’t it weird ?“, i.e.”residuals are null or positive… this is not what we should have, right ?“. Actually, residuals are always centered in a glm regression (if you keep the intercept). Let us generate a dataset,> n=500
> set.seed(1)
> X=runif(n)
> E=rnorm(n)*.3
> XB=-5+2*X+E
> lambda=exp(XB)
> Y=rpois(n,lambda)
> reg=glm(Y~X,family=poisson)
> plot(predict(reg),residuals(reg,”pearson”))
If we look carefully, residuals in the lower part, they are strictly negative… And since, on average, we should have zero, then we are very close to 0.
> mean(residuals(reg,”pearson”))
[1] -0.003381202
Actually, the true value is either
> u=seq(-10,2,by=.1)
> lines(u,(0-exp(u))/sqrt(exp(u)),lwd=.5,col=”red”)
or
> lines(u,(1-exp(u))/sqrt(exp(u)),lwd=.5,col=”blue”)
So what happens ? Here, the value of the parameter (the expected value of our Poisson distribution) is very small
> mean(Y)
[1] 0.026
A crude estimator for the glm would be to consider a regression only on the intercept. In that case, the error would be either 0-0.026, or 1-0.026, i.e. for 97.5%, we have -0.026 (which is small, close to 0, but negative), while for 2.5% we have 0.974 (which is huge compared to the previous value). Well, on that graph, we plot Pearson’s residuals, i..e. we divide the difference by the expected value… so we have those two curves….
Actually, there are only two clouds since the probability to have 1 (when lambda is close to 2.5%) is small, and extremely small to exceed 2…
On the graph below, the average value of Y increases (and then decreases): initially, almost all the points are on the red curve (i.e. we observe 0, and predict something extremely small), then we have more and more points on the blue curve (for more and more individuals we have 1), then we start to see some on the green curve (i.e. 2), etc. Note that areas where we observe clouds remain unchanged (i.e. the colored curves).
> P=seq(-5,-2,by=.05)
> P=c(P,rev(P))
> for(i in 1:length(P)){
+ set.seed(1)
+ XB=P[i]+2*X+E
+ lambda=exp(XB)
+ Y=rpois(n,lambda)
+ reg=glm(Y~X,family=poisson)
+ u=seq(-10,2,by=.1)
+ plot(predict(reg),residuals(reg,”pearson”),ylim=c(-1,5),xlim=c(-6,0))
+ abline(h=0)
+ lines(u,(0-exp(u))/sqrt(exp(u)),lwd=.5,col=”red”)
+ lines(u,(1-exp(u))/sqrt(exp(u)),lwd=.5,col=”blue”)
+ lines(u,(2-exp(u))/sqrt(exp(u)),lwd=.5,col=”green”)
+ rlines(u,(3-exp(u))/sqrt(exp(u)),lwd=.5,col=”orange”)
+ lines(u,(3-exp(u))/sqrt(exp(u)),lwd=.5,col=”orange”)
+ lines(u,(4-exp(u))/sqrt(exp(u)),lwd=.5,col=”purple”)
+ points(predict(reg),residuals(reg,”pearson”))
+ }
So the graph is perfectly normal since we have a discrete distribution (while we predict something continuous).
To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics - Tag - R-english.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.