Border bias and weighted kernels
With Ewen (aka @3wen), not only have we been playing on Twitter this month, we have also been working on kernel estimation for densities of spatial processes. Actually, it is only one part of what he has been working on, but that part on kernel estimation was the opportunity to write a short paper, which can now be downloaded on hal.
The problem with kernels is that kernel density estimators suffer a strong bias on borders. And with geographic data, it is not uncommon to have observations very close to the border (a frontier, or the ocean). With standard kernels, some weight is allocated outside the area: the density does not sum to one. And we should not look for a global correction, but for a local one. So we should use weighted kernel estimators (see the paper on hal for more details). The problem is that the weights can be difficult to derive when the support is a strangely shaped polygon. The idea is to use a property of product Gaussian kernels (with identical bandwidths): with the interpretation of a noisy observation, the isodensity curves are circles. This can be related to the circumferential correction of Ripley (1977). And the good point is that, with R, it is extremely simple to compute the area of the intersection of two polygons. But we need to load some R packages first,
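To see the problem on a toy example (this is not from the paper, just an illustration with a one-dimensional support, the unit interval): for an observation close to the border, part of the Gaussian kernel mass leaks outside the support, and the local correction is precisely to divide the kernel by the mass that remains inside,

```r
h  <- .1    # bandwidth (illustrative value)
x0 <- .05   # an observation close to the left border of [0,1]
# mass that the Gaussian kernel centred at x0 keeps inside the support
inside <- pnorm(1,mean=x0,sd=h)-pnorm(0,mean=x0,sd=h)
round(inside,3)
# about .691 : almost 31% of the weight is allocated outside [0,1],
# so the locally corrected estimator divides this kernel by .691
```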
require(maps)
require(sp)
require(snow)
require(ellipse)
require(ks)
require(gpclib)
require(rgeos)
require(fields)
To be clearer, let us illustrate the technique on a nice example. For instance, consider some bodily injury car accidents in France, in 2008 (I cannot upload the full dataset, but I can upload a random sample),
base_cara=read.table(
"http://freakonometrics.blog.free.fr/public/base_fin_morb.txt",
sep=";",header=TRUE)
The border of the support of our distribution of car accidents will be the contour of the Finistère département, which can be found in standard packages,
geoloc=read.csv(
"http://freakonometrics.free.fr/popfr19752010.csv",
header=TRUE,sep=",",comment.char="",check.names=FALSE,
colClasses=c(rep("character",5),rep("numeric",38)))
geoloc=geoloc[,c("dep","com","com_nom",
"long","lat","pop_2008")]
geoloc$id=paste(sprintf("%02s",geoloc$dep),
sprintf("%03s",geoloc$com),sep="")
geoloc=geoloc[,c("com_nom","long","lat","pop_2008")]
head(geoloc)
france=map('france',namesonly=TRUE,plot=FALSE)
francemap=map('france',fill=TRUE,col="transparent",
plot=FALSE)
detpartement_bzh=france[which(france%in%
c("Finistere","Morbihan","Ille-et-Vilaine",
"Cotes-Darmor"))]
bretagne=map('france',regions=detpartement_bzh,
fill=TRUE,col="transparent",plot=FALSE,exact=TRUE)
finistere=cbind(bretagne$x[321:678],bretagne$y[321:678])
FINISTERE=map('france',regions="Finistere",
fill=TRUE,col="transparent",plot=FALSE,exact=TRUE)
monFINISTERE=cbind(FINISTERE$x[c(8:414)],FINISTERE$y[c(8:414)])
Now we need simple functions,
cercle=function(n=200,centre=c(0,0),rayon)
{theta=seq(0,2*pi,length=n)
m=cbind(cos(theta),sin(theta))*rayon
m[,1]=m[,1]+centre[1]
m[,2]=m[,2]+centre[2]
colnames(m)=c("x","y")
return(m)}
poids=function(x,h,POL)
{leCercle=cercle(centre=x,rayon=5/pi*h)
POLcercle=as(leCercle,"gpc.poly")
return(area.poly(intersect(POL,POLcercle))/
area.poly(POLcercle))}
lissage=function(U,polygone,optimal=TRUE,h=.1)
{n=nrow(U)
IND=which(is.na(U[,1])==FALSE)
U=U[IND,]
if(optimal==TRUE){H=Hpi(U,binned=FALSE)
H=matrix(c(sqrt(H[1,1]*H[2,2]),0,0,
sqrt(H[1,1]*H[2,2])),2,2)}
if(optimal==FALSE){H=matrix(c(h,0,0,h),2,2)}
before defining our weights (the parallel version below uses a cluster cl, whose setup is shown afterwards),
poidsU=function(i,U,h,POL)
{x=U[i,]
poids(x,h,POL)}
OMEGA=parLapply(cl,1:n,poidsU,U=U,h=sqrt(H[1,1]),
POL=as(polygone,"gpc.poly"))
OMEGA=do.call("c",OMEGA)
stopCluster(cl)
}else
{OMEGA=lapply(1:n,poidsU,U=U,h=sqrt(H[1,1]),
POL=as(polygone,"gpc.poly"))
OMEGA=do.call("c",OMEGA)}
Note that it is possible to parallelize the computation if there are a lot of observations; inside the function, this cluster setup comes just before the weights are computed,
if(n>=500)
{cl <- makeCluster(4,type="SOCK")
worker.init <- function(packages)
{for(p in packages)
{library(p, character.only=TRUE)}
NULL}
clusterCall(cl, worker.init, c("gpclib","sp"))
clusterExport(cl,c("cercle","poids"))
Then, we can use standard bivariate kernel smoothing functions, but with the weights we just computed, using a simple technique related to the one suggested in Ripley (1977),
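To make those weights concrete, here is a small stand-alone sketch (a Monte Carlo version of the analytic computation done by poids() above, with the unit square playing the role of the Finistère polygon): the weight of an observation is simply the proportion of its circle that lies inside the support,

```r
set.seed(1)
# an observation close to the left border of the unit square
x <- c(.05,.5); r <- .2
# draw points uniformly in the disc of radius r centred at x
n <- 1e5
theta <- runif(n,0,2*pi)
rho   <- r*sqrt(runif(n))
px <- x[1]+rho*cos(theta)
py <- x[2]+rho*sin(theta)
# proportion of the disc inside the square = the local weight
omega <- mean(px>=0 & px<=1 & py>=0 & py<=1)
omega
# about .66 here: a third of the circle sticks out of the support,
# and the estimator divides the kernel at x by this weight
```

This is what enters the smoother through w=1/OMEGA, exactly as in the lissage() function.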
fhat=kde(U,H,w=1/OMEGA,xmin=c(min(polygone[,1]),
min(polygone[,2])),xmax=c(max(polygone[,1]),
max(polygone[,2])))
fhat$estimate=fhat$estimate*sum(1/OMEGA)/n
vx=unlist(fhat$eval.points[1])
vy=unlist(fhat$eval.points[2])
VX=cbind(rep(vx,each=length(vy)))
VY=cbind(rep(vy,length(vx)))
VXY=cbind(VX,VY)
Ind=matrix(point.in.polygon(VX,VY,polygone[,1],
polygone[,2]),length(vy),length(vx))
f0=fhat
f0$estimate[t(Ind)==0]=NA
return(list(
X=fhat$eval.points[[1]],
Y=fhat$eval.points[[2]],
Z=fhat$estimate,
ZNA=f0$estimate,
H=fhat$H,
W=fhat$W))}
lissage_without_c=function(U,polygone,optimal=TRUE,h=.1)
{n=nrow(U)
IND=which(is.na(U[,1])==FALSE)
U=U[IND,]
if(optimal==TRUE){H=Hpi(U,binned=FALSE)
H=matrix(c(sqrt(H[1,1]*H[2,2]),0,0,
sqrt(H[1,1]*H[2,2])),2,2)}
if(optimal==FALSE){H=matrix(c(h,0,0,h),2,2)}
fhat=kde(U,H,xmin=c(min(polygone[,1]),
min(polygone[,2])),xmax=c(max(polygone[,1]),
max(polygone[,2])))
vx=unlist(fhat$eval.points[1])
vy=unlist(fhat$eval.points[2])
VX=cbind(rep(vx,each=length(vy)))
VY=cbind(rep(vy,length(vx)))
VXY=cbind(VX,VY)
Ind=matrix(point.in.polygon(VX,VY,polygone[,1],
polygone[,2]),length(vy),length(vx))
f0=fhat
f0$estimate[t(Ind)==0]=NA
return(list(
X=fhat$eval.points[[1]],
Y=fhat$eval.points[[2]],
Z=fhat$estimate,
ZNA=f0$estimate,
H=fhat$H,
W=fhat$W))}
So, now we can play with those functions,
base_cara_FINISTERE=base_cara[which(point.in.polygon(
base_cara$long,base_cara$lat,monFINISTERE[,1],
monFINISTERE[,2])==1),]
coord=cbind(as.numeric(base_cara_FINISTERE$long),
as.numeric(base_cara_FINISTERE$lat))
nrow(coord)
map(francemap)
lissage_FIN_withoutc=lissage_without_c(coord,
monFINISTERE,optimal=TRUE)
lissage_FIN=lissage(coord,monFINISTERE,
optimal=TRUE)
lesBreaks_sans_pop=range(c(
range(lissage_FIN_withoutc$Z),
range(lissage_FIN$Z)))
lesBreaks_sans_pop=seq(min(lesBreaks_sans_pop)*.95,
max(lesBreaks_sans_pop)*1.05,length=21)
plot_article=function(lissage,breaks,
polygone,coord)
{par(mar=c(3,1,3,1))
image.plot(lissage$X,lissage$Y,(lissage$ZNA),
xlim=range(polygone[,1]),ylim=range(polygone[,2]),
breaks=breaks,col=rev(heat.colors(20)),xlab="",
ylab="",xaxt="n",yaxt="n",bty="n",zlim=range(breaks),
horizontal=TRUE)
contour(lissage$X,lissage$Y,lissage$ZNA,add=TRUE,
col="grey")
points(coord[,1],coord[,2],pch=19,cex=.1,
col="dodgerblue")
polygon(polygone,lwd=2)}
plot_article(lissage_FIN_withoutc,breaks=
lesBreaks_sans_pop,polygone=monFINISTERE,
coord=coord)
plot_article(lissage_FIN,breaks=
lesBreaks_sans_pop,polygone=monFINISTERE,
coord=coord)
If we look at the graphs, we get the following densities of car accidents, with a standard kernel on the left, and our proposal on the right (with a local weight adjustment when the estimation is done close to the border of the region of interest),
With those modified kernels, hot spots appear much more clearly. For more details, the paper is online on hal.