In my December 22 blog post, I introduced the classic parametric quantile regression (QR) concept. I then showed how one could use the qeML package to perform quantile regression nonparametrically, using the package’s qeKNN function for a k-Nearest Neighbors (k-NN) approach. A reader then asked whether this could be applied to random forests (RFs). The answer is yes, and that is the topic of the current post.
My goals in this post, as in the previous one, are to introduce the capabilities of qeML and to point out some general ML issues. The key example of the latter here is the fact that leaves in an RF tree are very similar to neighborhoods in k-NN, which implies that in principle one should be able to do QR in an RF context, just as we did last time with k-NN.
However, as the saying goes, “Easier said than done.” What was key in the k-NN case last time was that the qeKNN argument smoothingFtn gives the user access to the neighborhoods: it lets the user specify a function that performs a requested operation in each neighborhood. For instance, smoothingFtn offers a local-linear option, and in the last post I showed how one could achieve QR via a user-written function.
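To make the neighborhood idea concrete, here is a minimal base-R sketch of computing quantiles of "Y" within a k-NN neighborhood. It does not use qeKNN or the smoothingFtn interface; the helper name knnQuantile and the simulated height/weight data are assumptions made purely for illustration.

```r
# Sketch only: base-R illustration, not the qeKNN/smoothingFtn interface.
# knnQuantile is a hypothetical helper name.
knnQuantile <- function(x, y, x0, k, probs) {
  d <- abs(x - x0)                  # distances from x0 (one-dimensional predictor)
  nbrs <- order(d)[seq_len(k)]      # indices of the k nearest data points
  quantile(y[nbrs], probs = probs, names = FALSE)  # quantiles of Y in the neighborhood
}

set.seed(1)
ht <- round(rnorm(500, mean = 73, sd = 2.5))      # simulated heights (inches)
wt <- 200 + 5 * (ht - 73) + rnorm(500, sd = 15)   # simulated weights (pounds)

# estimated 20th, 40th, 60th and 80th percentiles of weight at height 74
q <- knnQuantile(ht, wt, x0 = 74, k = 50, probs = c(0.2, 0.4, 0.6, 0.8))
print(q)
```

The only change from ordinary k-NN regression is the last line of the helper: quantile() is applied to the neighbors' "Y" values where mean() would otherwise be.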
The situation for RFs is not so simple. The problem is that typical RF software does not provide “hooks” directly analogous to smoothingFtn. Some implementations do provide useful hooks that could play a role, such as randomForest::getTree, but putting them together for the desired result may not be easy, especially given ambiguities in the documentation.
Fortunately, the grf package includes a QR facility. The qeML function qeRFgrf originally wrapped the “ordinary” and local linear options in grf, and I’ve now added QR in v.1.2.
The name ‘grf’ stands for “Generalized Random Forests,” with the main generalization being similar to smoothingFtn, i.e. allowing functions other than the mean to be applied to the data in the leaves. A second generalization is to tailor the node-splitting process to the type of smoothing done in the leaves.
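The first generalization can be illustrated with a single tree. The sketch below is an assumption-laden toy, not what grf does internally: it fits one ordinary CART tree with rpart (mean-based splitting, no forest, no quantile-tailored splits), then applies quantile() rather than mean() to the "Y" values in each leaf. The data are simulated.

```r
library(rpart)  # a recommended package, included with standard R installations

set.seed(1)
n <- 1000
x <- runif(n, 60, 84)
y <- 200 + 5 * (x - 72) + rnorm(n, sd = 15)
d <- data.frame(x = x, y = y)

tree <- rpart(y ~ x, data = d)     # ordinary CART tree, mean-based splits
# tree$where gives, for each data point, the leaf it falls into
byLeaf <- split(d$y, tree$where)   # Y values grouped by leaf

leafMeans <- sapply(byLeaf, mean)                         # the usual leaf summary
leafQ <- t(sapply(byLeaf, quantile, probs = c(0.2, 0.8))) # non-mean summary per leaf
print(cbind(mean = leafMeans, leafQ))
```

Each row of the printed matrix corresponds to one leaf, treated exactly like a k-NN neighborhood. A real quantile forest would also average over many trees and choose its splits with quantiles in mind.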
In particular, grf includes the function quantile_forest, providing just what our reader inquired about. One specifies the quantiles of interest in an argument quantiles, and later calls the paired predict function to obtain the estimated quantiles of “Y” at requested values of the “X” variables.
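For readers who want to see the fit-then-predict pattern without the qeML wrapper, here is a minimal sketch calling grf directly on simulated data. The handling of the return value is an assumption: recent grf versions return a list with a predictions component, while older ones returned the matrix itself, so the code accepts either form.

```r
library(grf)

set.seed(1)
n <- 500
X <- matrix(runif(n, 60, 84), ncol = 1)            # simulated "X" (one feature)
Y <- 200 + 5 * (X[, 1] - 72) + rnorm(n, sd = 15)   # simulated "Y"

qs <- c(0.2, 0.4, 0.6, 0.8)
qf <- quantile_forest(X, Y, quantiles = qs)        # fit, stating the quantiles of interest

Xnew <- matrix(c(66, 72, 78), ncol = 1)            # points at which to estimate quantiles
p <- predict(qf, Xnew, quantiles = qs)

# Assumption: depending on the grf version, p is either a list with a
# 'predictions' matrix or the matrix itself; handle both.
qhat <- if (is.list(p)) p$predictions else p
print(qhat)   # one row per row of Xnew, one column per quantile
```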
The qeML package has an interface to grf, the function qeRFgrf. To access the QR option (qeML v.1.2), set the qeRFgrf argument quantls to a non-NULL value. Here is an example using the North American major league baseball players data (included in qeML with the permission of the UCLA Stat Dept.). We find the 20th, 40th, 60th and 80th percentiles of weight, for each height.
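library(qeML)
data(mlb1)
# fit quantile forest: predict Weight from height (column 2 of mlb1)
z <- qeRFgrf(mlb1[,2:3],'Weight',quantls=c(0.2,0.4,0.6,0.8),holdout=NULL)
# estimated quantiles at each player's height, one column per quantile
w <- predict(z,mlb1[,2,drop=FALSE])
# one data frame per quantile curve, labeled by quantile level
df1 <- data.frame(x=mlb1[,2,drop=FALSE],y=w[,1],z='0.20')
df2 <- data.frame(x=mlb1[,2,drop=FALSE],y=w[,2],z='0.40')
df3 <- data.frame(x=mlb1[,2,drop=FALSE],y=w[,3],z='0.60')
df4 <- data.frame(x=mlb1[,2,drop=FALSE],y=w[,4],z='0.80')
dfall <- rbind(df1,df2,df3,df4)
qeML:::qePlotCurves(dfall,xlab='ht',ylab='wt')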
The convenience function qePlotCurves is essentially the code I used in the previous post, now added to v.1.2.
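If the number of quantiles varies, the stacking of per-quantile data frames can be done in a loop. The following base-R sketch is one way to do so; stackCurves is a hypothetical helper, not part of qeML, and qmat is a simulated stand-in for the matrix of predicted quantiles. The plot uses base graphics rather than qePlotCurves.

```r
# Sketch with simulated stand-ins; stackCurves is a hypothetical helper,
# not part of qeML.
stackCurves <- function(x, qmat, labels) {
  # build one (x, y, z) data frame per quantile column, then stack them
  do.call(rbind, lapply(seq_along(labels), function(j)
    data.frame(x = x, y = qmat[, j], z = labels[j])))
}

x <- 67:80
probs <- c(0.2, 0.4, 0.6, 0.8)
# stand-in for the predicted quantiles: one column per quantile level
qmat <- sapply(probs, function(p) 200 + 5 * (x - 73) + qnorm(p, sd = 15))
labs <- sprintf('%.2f', probs)
dfall <- stackCurves(x, qmat, labs)

# one curve per quantile level
matplot(x, qmat, type = 'l', lty = 1, col = 1:4, xlab = 'ht', ylab = 'wt')
legend('topleft', legend = labs, col = 1:4, lty = 1)
```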
I highly recommend the grf package. My attention was immediately drawn to it when it first came out, as I was pleased to see that I could now do analysis in RFs using non-mean smoothing, as I had been doing with qeKNN. It was written by some top researchers, who also developed the supporting theory.