qeML Example: Issues of Overfitting, Dimension Reduction Etc.
What about variable selection? Which predictor variables/features should we use? No matter what anyone tells you, this is an unsolved problem. But there are lots of useful methods. See the qeML vignettes on feature selection and overfitting for detailed background on the issues involved.
We note at the outset what our concluding statement will be: Even a very simple, very clean-looking dataset like this one may be much more nuanced than it looks. Real life is not like those simplistic textbooks, eh?
Here I’ll discuss qeML::qeLeaveOut1Var. (I usually omit parentheses in referring to function names; see https://tinyurl.com/4hwr2vf.) The idea is simple: For each variable, find prediction accuracy with and without that variable.
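To make the idea concrete, here is a rough sketch of the scheme — not qeLeaveOut1Var's actual code, just an illustration of the principle, and it omits the averaging over repetitions that we'll use below:

# rough sketch of the leave-out-1-variable idea (NOT the package's
# actual implementation): refit once per predictor, omitting that
# predictor, and record the test accuracy each time
accWithout <- function(data,yName,omitted)
   qeLin(data[,setdiff(names(data),omitted)],yName)$testAcc
preds <- setdiff(names(nyctaxi),'tripTime')
sapply(preds,function(p) accWithout(nyctaxi,'tripTime',p))

Since each call draws its own random holdout set, a single run like this is noisy; the real function lets us average over repetitions.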
Let’s try it on the famous NYC taxi trip data, included (with modification) in qeML. First, note that qeML prediction calls automatically split the data into training and test sets, and compute test accuracy (mean absolute prediction error or overall misclassification error) on the latter.
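For instance, a minimal call (testAcc and baseAcc are the standard qeML output components):

library(qeML)
z <- qeLin(nyctaxi,'tripTime')   # holdout test set chosen randomly
z$testAcc   # mean absolute prediction error on the holdout set
z$baseAcc   # for comparison, error from predicting with no features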
The call qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10) predicts trip time using qeML's linear model. (The latter wraps lm, but adds some things and uses the standard qeML call form.) Since the test set is random (as is our data), we'll do 10 repetitions and average the results. Instead of qeLin, we could have used any other qeML prediction function, e.g. qeKNN for k-Nearest Neighbors.
> qeLeaveOut1Var(nyctaxi,'tripTime','qeLin',10)
        full trip_distance PULocationID DOLocationID    DayOfWeek
    238.4611      353.2409     253.2761     246.3186     239.2277
There were 50 or more warnings (use warnings() to see the first 50)
We'll discuss the warnings shortly. Each entry is the mean absolute prediction error after removing the given variable ("full" uses all of them), so the larger the value, the more we lose by dropping that variable. Not surprisingly, trip distance is the most important; the pickup and dropoff locations also seem to have predictive value, though day of the week may not.
But let's take a closer look. There are 224 pickup locations (run length(levels(nyctaxi$PULocationID)) to see this). That's 223 dummy ("one-hot") variables; are some more predictive than others? To explore that with qeLeaveOut1Var, we could make the dummies explicit, so that each dummy is removed one at a time:
nyct <- factorsToDummies(nyctaxi,omitLast=TRUE)
This function is actually from the regtools package, included in qeML. Then we could try, say,
nyct <- as.data.frame(nyct)
qeLeaveOut1Var(nyct,'tripTime','qeLin',10)
But with so many dummies, this would take a long time to run. We could directly look at mean trip times for each pickup location to get at least some idea of their individual predictive power,
tapply(nyctaxi$tripTime,nyctaxi$PULocationID,mean)
tapply(nyctaxi$tripTime,nyctaxi$PULocationID,length)
Many locations have very little data, so we'd have to deal with that; a rough sketch of one approach follows. Note too the possibility of overfitting, which we return to just below.
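A hedged sketch of handling the sparse locations (the 25-trip cutoff here is arbitrary, purely for illustration):

# per-location counts and mean trip times
cts <- tapply(nyctaxi$tripTime,nyctaxi$PULocationID,length)
mns <- tapply(nyctaxi$tripTime,nyctaxi$PULocationID,mean)
# keep only locations with a minimal amount of data
bigEnough <- names(cts)[which(cts >= 25)]
sort(mns[bigEnough],decreasing=TRUE)[1:10]   # slowest pickup locations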
> dim(nyct)
[1] 10000   479
An old rule of thumb is to use fewer than sqrt(n) predictors, here sqrt(10000) = 100. That's just a guide, but 479 is far beyond it. (Note: even our analysis using the original factors still converts to dummies internally; nyctaxi has just 4 predictor columns, but lm will expand the factors as in nyct.)
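One can check that internal expansion directly; a quick look at lm's design matrix (the column count should come out near the 479 seen above, differing slightly due to the intercept and dummy-coding conventions):

# lm builds its design matrix by expanding each factor into dummies
dim(model.matrix(tripTime ~ ., data=nyctaxi))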
We may wish to delete pickup location entirely. Or, possibly use PCA for dimension reduction,
z <- qePCA(nyctaxi,'tripTime','qeLin',pcaProp=0.75)
This qeML call says, “Compute PCA on the predictors, retaining enough of them for 0.75 of the total variance, and then run qeLin on the resulting PCs.”
But…remember those warning messages? Running warnings() we see messages like “6 rows removed from test set, due to new factor levels.” The problem is that, in dividing the data into training and test sets, some pickup or dropoff locations appeared only in the latter, thus impossible to predict. So, many of the columns in the training set are all 0s, thus 0 variance, thus problems with PCA. We then might run qeML::constCols to find out which columns have 0 variance, then delete those, and try qePCA again.
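A hedged sketch of that cleanup, assuming constCols returns the indices of the constant columns:

# remove 0-variance columns before retrying PCA
ccs <- constCols(nyct)             # indices of constant columns, if any
if (length(ccs) > 0) nyct <- nyct[,-ccs]
# caveat: qePCA re-splits the data randomly, so rare levels could
# still land only in the new test set
z <- qePCA(nyct,'tripTime','qeLin',pcaProp=0.75)

If this runs cleanly, z then predicts like any other qeML fit, via predict().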
And we haven’t even mentioned using, say, qeLASSO or qeXGBoost instead of qeLin, etc. But the point is clear: Even a very simple, very clean-looking application like this one may be much more nuanced than it looks.