Playing with quantiles, part 1
[This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers.]
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A standard idea in extreme value theory (see e.g. here, in French unfortunately) is that to estimate the 99.5% quantile (say), we just need to estimate a quantile of level 95% for observations exceeding the 90% quantile.
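The arithmetic behind that statement can be written out as follows. The notation is mine and only illustrative: $X$ is the variable of interest and $u$ is its 90% quantile.

```latex
\[
\mathbb{P}(X > x) = \mathbb{P}(X > u)\,\mathbb{P}(X > x \mid X > u), \qquad x > u,
\]
\[
\underbrace{1 - 0.995}_{0.005} \;=\; \underbrace{(1 - 0.90)}_{0.10} \times \underbrace{(1 - 0.95)}_{0.05}.
\]
```

So the point exceeded with probability 0.5% overall is the point exceeded with probability 5% among observations above $u$.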
In extreme value theory, we assume that the 90% quantile (of the initial distribution) can be obtained easily, e.g. as the empirical quantile. Then, for the observations exceeding it, we fit a Pareto distribution (a Generalized Pareto one, to be precise) and take its parametric 95% quantile. I.e.
[The equations and inline formulas here were images in the original post (quant01.gif to quant04b.gif, qqq04.gif to qqq08.gif) and are not available.]
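As a rough illustration of the two-step idea described above (empirical 90% quantile, then a Generalized Pareto fit on the exceedances), here is a minimal sketch in base R. Everything in it is an assumption made for the example: the data are simulated from a Pareto distribution with tail index 2, the likelihood is written by hand and maximised with optim() rather than with a dedicated extreme value package, and the names (gpd_nll, q_gpd, ...) are mine. With threshold u, scale sigma and shape xi, the 99.5% quantile is u + sigma/xi * ((1 - 0.95)^(-xi) - 1).

```r
set.seed(1)

# Simulated heavy-tailed data (assumption): P(X > x) = x^(-alpha) for x >= 1
n     <- 20000
alpha <- 2
x     <- runif(n)^(-1 / alpha)

# Step 1: the 90% quantile is taken as the empirical quantile
u <- quantile(x, 0.90, names = FALSE)
y <- x[x > u] - u                      # excesses over the threshold

# Step 2: fit a Generalized Pareto distribution to the excesses
# (negative log-likelihood, parametrised by log(sigma) and xi)
gpd_nll <- function(par, y) {
  sigma <- exp(par[1])
  xi    <- par[2]
  z     <- 1 + xi * y / sigma
  if (any(z <= 0) || abs(xi) < 1e-8) return(1e10)  # outside the support
  length(y) * log(sigma) + (1 / xi + 1) * sum(log(z))
}
fit       <- optim(c(log(mean(y)), 0.1), gpd_nll, y = y)
sigma_hat <- exp(fit$par[1])
xi_hat    <- fit$par[2]

# Step 3: the 99.5% quantile is the 95% quantile of the exceedances
p      <- 0.995
p_cond <- 1 - (1 - p) / (1 - 0.90)     # conditional level, 0.95
q_gpd  <- u + sigma_hat / xi_hat * ((1 - p_cond)^(-xi_hat) - 1)

q_emp  <- quantile(x, p, names = FALSE)  # purely empirical estimate
q_true <- (1 - p)^(-1 / alpha)           # known here only because data are simulated

print(c(gpd = q_gpd, empirical = q_emp, true = q_true))
```

On simulated data the parametric and empirical estimates can both be compared with the true quantile; on real data only the first two would be available.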
If I want to get the 90% and the 10% quantile regressions, the code is simply:
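    library(mnormt)    # rmnorm(): multivariate normal simulation
    library(quantreg)  # rq(): quantile regression
    library(splines)
    set.seed(1)
    mu=c(0,0)
    r=0                # correlation between X and Y
    Sigma <- matrix(c(1,r,r,1), 2, 2)
    Z=rmnorm(2500,mu,Sigma)
    X=Z[,1]
    Y=Z[,2]
    base=data.frame(X,Y)
    plot(X,Y,col="blue",cex=.7)
    # I flags the middle 50% of Y; baseI keeps only the two tails
    I=(Y>qnorm(.25))&(Y<qnorm(.75))
    baseI=base[I==FALSE,]
    points(X[I],Y[I],col="light blue",cex=.7)
    abline(h=qnorm(.25),lty=2,col="blue")
    abline(h=qnorm(.75),lty=2,col="blue")
    u=seq(-5,5,by=.02)
    # dashed line: 5% quantile regression on the full sample
    reg=rq(Y~X,data=base,tau=.05)
    lines(u,predict(reg,newdata=data.frame(X=u)),lty=2)
    # solid line: 10% quantile regression on the truncated sample
    # (half of the data was removed, so the level is doubled)
    reg=rq(Y~X,data=baseI,tau=.05*2)
    lines(u,predict(reg,newdata=data.frame(X=u)))

The graph is the following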
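The level tau=.05*2 used on the truncated sample comes from a simple rescaling: once the middle 50% of the observations is dropped, a level tau below 25% becomes tau/0.5, and a level above 75% becomes 1-(1-tau)/0.5. Here is a small sketch of that mapping in base R, on the unconditional quantiles only (no regression). The sample size and the variable names are my own choices for the illustration. Since r=0 in the code above, the fitted regression lines should be roughly flat around these values.

```r
set.seed(1)

# Same setting as above for Y alone: standard normal observations
Y <- rnorm(1e5)

# Drop the middle 50% (between the theoretical 25% and 75% quantiles)
middle  <- (Y > qnorm(.25)) & (Y < qnorm(.75))
Y_trunc <- Y[!middle]

# Map levels on the full sample to levels on the truncated sample
kept     <- 1 - 0.5                    # share of observations kept
tau_low  <- 0.05 / kept                # 5%  -> 10%
tau_high <- 1 - (1 - 0.95) / kept      # 95% -> 90%

q_full  <- quantile(Y,       c(.05, .95),          names = FALSE)
q_trunc <- quantile(Y_trunc, c(tau_low, tau_high), names = FALSE)

# The two rows should be close to each other
print(rbind(q_full, q_trunc))
```

The mapping only makes sense for levels that fall in the tails that were kept, i.e. below 25% or above 75% here.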
But what if observations [the rest of this sentence and the accompanying figures were images in the original post (qqqo5.gif, qqq06.gif, qqq09.gif, qqq10.gif) and are not available]
But why could that be interesting? Well, because I wanted to run a quantile regression on marathon results. But I could not get the overall dataset (since I had to import observations manually, and I have to admit that it was a bit boring). So I extracted the finish times of the first 10% of athletes and of the last 10%. And I was wondering if that was enough to look at the 5% and 95% quantiles, based on the age of the runner... To be continued.