Some heuristics about local regression and kernel smoothing
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In a standard linear model, we assume that . Alternatives can be considered, when the linear assumption is too strong.
- Polynomial regression
A natural extension might be to assume some polynomial function,
Again, in the standard linear model approach (with a conditional normal distribution using the GLM terminology), parameters can be obtained using least squares, where a regression of on is considered.
Even if this polynomial model is not the real one, it might still be a good approximation for . Actually, from Stone-Weierstrass theorem, if is continuous on some interval, then there is a uniform approximation of by polynomial functions.
Just to illustrate, consider the following (simulated) dataset
set.seed(1) xr = seq(0,n,by=.1) yr = sin(xr/2)+rnorm(length(xr))/2 db = data.frame(x=xr,y=yr) plot(db)
with the standard regression line
reg = lm(y ~ x,data=db) abline(reg,col="red")
Consider some polynomial regression. If the degree of the polynomial function is large enough, any kind of pattern can be obtained,
reg=lm(y~poly(x,5),data=db)
But if the degree is too large, then too many ‘oscillations’ are obtained,
reg=lm(y~poly(x,25),data=db)
and the estimation might be be seen as no longer robust: if we change one point, there might be important (local) changes
plot(db) attach(db) lines(xr,predict(reg),col="red",lty=2) yrm=yr;yrm[31]=yr[31]-2 regm=lm(yrm~poly(xr,25)) lines(xr,predict(regm),col="red")
- Local regression
Actually, if our interest is to have locally a good approximation of , why not use a local regression?
This can be done easily using a weighted regression, where, in the least square formulation, we consider
(it is possible to consider weights in the GLM framework, but let’s keep that for another post). Two comments here:
- here I consider a linear model, but any polynomial model can be considered. Even a constant one. In that case, the optimization problem is
which can be solve explicitly, since
- so far, nothing was mentioned about the weights. The idea is simple, here: if you can a good prediction at point , then should be proportional to some distance between and : if is too far from , then it should not have to much influence on the prediction.
For instance, if we want to have a prediction at some point , consider . With this model, we remove observations too far away,
Actually, here, it is the same as
reg=lm(yr~xr,subset=which(abs(xr-x0)<1)
A more general idea is to consider some kernel function that gives the shape of the weight function, and some bandwidth (usually denoted h) that gives the length of the neighborhood, so that
This is actually the so-called Nadaraya-Watson estimator of function .
In the previous case, we did consider a uniform kernel , with bandwith ,
But using this weight function, with a strong discontinuity may not be the best idea… Why not a Gaussian kernel,
This can be done using
fitloc0 = function(x0){ w=dnorm((xr-x0)) reg=lm(y~1,data=db,weights=w) return(predict(reg,newdata=data.frame(x=x0)))}
On our dataset, we can plot
ul=seq(0,10,by=.01) vl0=Vectorize(fitloc0)(ul) u0=seq(-2,7,by=.01) linearlocalconst=function(x0){ w=dnorm((xr-x0)) plot(db,cex=abs(w)*4) lines(ul,vl0,col="red") axis(3) axis(2) reg=lm(y~1,data=db,weights=w) u=seq(0,10,by=.02) v=predict(reg,newdata=data.frame(x=u)) lines(u,v,col="red",lwd=2) abline(v=c(0,x0,10),lty=2) } linearlocalconst(2)
Here, we want a local regression at point 2. The horizonal line below is the regression (the size of the point is proportional to the wieght). The curve, in red, is the evolution of the local regression
Let us use an animation to visualize the construction of the curve. One can use
library(animate)
but for some reasons, I cannot install the package easily on Linux. And it is not a big deal. We can still use a loop to generate some graphs
vx0=seq(1,9,by=.1) vx0=c(vx0,rev(vx0)) graphloc=function(i){ name=paste("local-reg-",100+i,".png",sep="") png(name,600,400) linearlocalconst(vx0[i]) dev.off()} for(i in 1:length(vx0)) graphloc(i)
and then, in a terminal, I simply use
convert -delay 25 /home/freak/local-reg-1*.png /home/freak/local-reg.gif
Of course, it is possible to consider a linear model, locally,
fitloc1 = function(x0){ w=dnorm((xr-x0)) reg=lm(y~poly(x,degree=1),data=db,weights=w) return(predict(reg,newdata=data.frame(x=x0)))}
or even a quadratic (local) regression,
fitloc2 = function(x0){ w=dnorm((xr-x0)) reg=lm(y~poly(x,degree=2),data=db,weights=w) return(predict(reg,newdata=data.frame(x=x0)))}
Of course, we can change the bandwidth
To conclude the technical part this post, observe that, in practise, we have to choose the shape of the weight function (the so-called kernel). But there are (simple) technique to select the “optimal” bandwidth h. The idea of cross validation is to consider
where is the prediction obtained using a local regression technique, with bandwidth . And to get a more accurate (and optimal) bandwith is obtained using a model estimated on a sample where the ith observation was removed. But again, that is not the main point in this post, so let’s keep that for another one…
Perhaps we can try on some real data? Inspired from a great post on http://f.briatte.org/teaching/ida/092_smoothing.html, by François Briatte, consider the Global Episode Opinion Survey, from some TV show, http://geos.tv/index.php/index?sid=189 , like Dexter.
library(XML) library(downloader) file = "geos-tww.csv" html = htmlParse("http://www.geos.tv/index.php/list?sid=189&collection=all") html = xpathApply(html, "//table[@id='collectionTable']")[[1]] data = readHTMLTable(html) data = data[,-3] names(data)=c("no",names(data)[-1]) data=data[-(61:64),]
Let us reshape the dataset,
data$no = 1:96 data$mu = as.numeric(substr(as.character(data$Mean), 0, 4)) data$se = sd(data$mu,na.rm=TRUE)/sqrt(as.numeric(as.character(data$Count))) data$season = 1 + (data$no - 1)%/%12 data$season = factor(data$season) plot(data$no,data$mu,ylim=c(6,10)) segments(data$no,data$mu-1.96*data$se, data$no,data$mu+1.96*data$se,col="light blue")
As done by François, we compute some kind of standard error, just to reflect uncertainty. But we won’t really use it.
plot(data$no,data$mu,ylim=c(6,10)) abline(v=12*(0:8)+.5,lty=2) for(s in 1:8){reg=lm(mu~no,data=db,subset=season==s) lines((s-1)*12+1:12,predict(reg)[1:12],col="red") }
Henre, we assume that all seasons should be considered as completely independent… which might not be a great assumption.
db = data NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=5) plot(data$no,data$mu) lines(NW,col="red")
We can try to look the curve with a larger bandwidth. The problem is that there is a missing value, at the end. If we (arbitrarily) fill it, we can run a kernel regression,
db$mu[95]=7 NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=12) plot(data$no,data$mu,ylim=c(6,10)) lines(NW,col="red")
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.