Quantile Regression (home made, part 2)

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A few months ago, I posted a note with some home made codes for quantile regression… there was something odd on the output, but it was because there was a (small) mathematical problem in my equation. So since I should teach those tomorrow, let me fix them.

Median

Consider a sample {y1,,yn}. To compute the median, solveminμ{ni=1|yiμ|}which can be solved using linear programming techniques. More precisely, this problem is equivalent tominμ,a,b{ni=1ai+bi}with ai,bi0 and yiμ=aibi, i=1,,n. Heuristically, the idea is to write yi=μ+εi, and then define ai‘s and bi‘s so that εi=aibi and |εi|=ai+bi, i.e. ai=(εi)+=max{0,εi}=|ε|1εi>0and\(b_i=(-\varepsilon_i)_+=\max\lbrace0,-\varepsilon_i\rbrace=|\varepsilon|\cdot\boldsymbol{1}_{\varepsilon_i<0}[/latex]denote respectively the positive and the negative parts.

Unfortunately (that was the error in my previous post), the expression of linear programs is[latex display="true"]\min_{\mathbf{z}}\left\lbrace\boldsymbol{c}^\top\mathbf{z}\right\rbrace\text{ s.t. }\boldsymbol{A}\mathbf{z}=\boldsymbol{b},\mathbf{z}\geq\boldsymbol{0}\)In the equation above, with the ai‘s and bi‘s, we’re not far away. Except that we have μR, while it should be positive. So similarly, set μ=μ+μ where μ+=(μ)+ and μ=(μ)+.

Thus, letz=(μ+;μ;a,b)R2n+2+and then write the constraint as Az=b with b=y and A=[1n;1n;In;In]And for the objective functionc=(0,1n,1n)R2n+2+

To illustrate, consider a sample from a lognormal distribution,

n = 101 
set.seed(1)
y = rlnorm(n)
median(y)
[1] 1.077415

For the optimization problem, use the matrix form, with 3n constraints, and 2n+1 parameters,

library(lpSolve) 
X = rep(1,n) 
A = cbind(X, -X, diag(n), -diag(n))
b = y
c = c(rep(0,2), rep(1,n),rep(1,n))
equal_type = rep("=", n) 
r = lp("min", c,A,equal_type,b)
head(r$solution,1)
[1] 1.077415

It looks like it’s working well…

Quantile

Of course, we can adapt our previous code for quantiles

tau = .3
quantile(y,tau)
      30% 
0.6741586

The linear program is nowminq+,q,a,b{ni=1τai+(1τ)bi}with ai,bi,q+,q0 and yi=q+q+aibi, i=1,,n. The R code is now

c = c(rep(0,2), tau*rep(1,n),(1-tau)*rep(1,n))
r = lp("min", c,A,equal_type,b)
head(r$solution,1)
[1] 0.6741586

So far so good…

Quantile Regression

Consider the following dataset, with rents of flat, in a major German city, as function of the surface, the year of construction, etc.

base=read.table("http://freakonometrics.free.fr/rent98_00.txt",header=TRUE)

The linear program for the quantile regression is nowminβ+,β,a,b{ni=1τai+(1τ)bi}with ai,bi0 and yi=x[β+β]+aibii=1,,n and β+j,βj0 j=0,,k. So use here

require(lpSolve) 
tau = .3
n=nrow(base)
X = cbind( 1, base$area)
y = base$rent_euro
K = ncol(X)
N = nrow(X)
A = cbind(X,-X,diag(N),-diag(N))
c = c(rep(0,2*ncol(X)),tau*rep(1,N),(1-tau)*rep(1,N))
b = base$rent_euro
const_type = rep("=",N)
r = lp("min",c,A,const_type,b)
beta = r$sol[1:K] -  r$sol[(1:K+K)]
beta
[1] 148.946864   3.289674

Of course, we can use R function to fit that model

library(quantreg)
rq(rent_euro~area, tau=tau, data=base)
Coefficients:
(Intercept)        area 
 148.946864    3.289674

Here again, it seems to work quite well. We can use a different probability level, of course, and get a plot

plot(base$area,base$rent_euro,xlab=expression(paste("surface (",m^2,")")),
     ylab="rent (euros/month)",col=rgb(0,0,1,.4),cex=.5)
sf=0:250
yr=r$solution[2*n+1]+r$solution[2*n+2]*sf
lines(sf,yr,lwd=2,col="blue")
tau = .9
r = lp("min",c,A,const_type,b)
tail(r$solution,2)
[1] 121.815505   7.865536
yr=r$solution[2*n+1]+r$solution[2*n+2]*sf
lines(sf,yr,lwd=2,col="blue")

And we can adapt the later to multiple regressions, of course,

X = cbind(1,base$area,base$yearc)
K = ncol(X)
N = nrow(X)
A = cbind(X,-X,diag(N),-diag(N))
c = c(rep(0,2*ncol(X)),tau*rep(1,N),(1-tau)*rep(1,N))
b = base$rent_euro
const_type = rep("=",N)
r = lp("min",c,A,const_type,b)
beta = r$sol[1:K] -  r$sol[(1:K+K)]
beta
[1] -5542.503252     3.978135     2.887234

to be compared with

library(quantreg)
rq(rent_euro~ area + yearc, tau=tau, data=base)

Coefficients:
 (Intercept)         area        yearc 
-5542.503252     3.978135     2.887234 

Degrees of freedom: 4571 total; 4568 residual

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)