Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
R’s model formula infrastructure was discussed in my previous post. Despite the elegance and convenience of the formula method, there are some aspects that are limiting.
Limitations to Extensibility
The model formula interface does have some limitations:
- It can be kludgy with many operations on many variables (e.g., log transforming 50 variables via a formula without using
paste
) - The
predvars
aspect (discussed in my previous post) limits the utility of the operations. Suppose a formula had:knn_impute(x1) + knn_impute(x2)
. Do we embed the training set twice inpredvars
? - Operations are constrained to single columns or features (excluding interaction specifications). For example, you cannot do
I’ll use PCA feature extraction a few times here since it is probably familiar to many readers.
Everything Happens at Once
Some of our data operations might be sequential. For example, it is not unreasonable to have predictors that require:
- imputation of a missing value
- centering and scale
- conversion to PCA scores
Given that the formula method operations happen (in effect) at once, this workflow requires some sort of custom solution. While caret::preProcess
was designed for this sequence of operations, it does so in a single call, as opposed to a progression of steps exemplified by ggplot2
, dplyr
, or magrittr
.
Allowing a series of steps to be defined in order is more consistent with how data analysis is conducted. However, it does raise the complexity of the underlying implementation. For example, caret::preProcess
dictates the possible sequence of tasks to be: filters, single-variable transformations, normalizations, imputation, signal extraction, and spatial sign. This avoids nonsensical sequences that center the data before applying a Box-Cox calculation (which requires positive data).
No Recycling
As a corollary to the point above, there is no way to recycle the terms
between models that share the same formula and data/environment. For example, if I fit a CART model to a data set with many predictors, the random forest model (theoretically) shouldn’t need to recreate the same terms
information about the design matrix. If the model function has the non-formula interface (e.g., mod_func(x, y)
), this can make it easier. However, many do not.
Also, suppose that one of the pre-processing steps is computationally expensive. We’d like to be able to store the state of the results and then add another layer of computations (perhaps as a separate object).
Formulas and Wide Datasets
The terms
object saves a matrix with as many rows as formula variables and at least as many columns (depending on interactions, etc). Most of this data is zero and a non–sparse representation is used. The current framework was built in a time where there was more focus on interactions, nesting and other operations on a small scale.
It is unlikely that models would have hundreds of interaction terms, but now it is not uncommon to have hundreds or thousands of main effects. As the number of predictors increases, this takes up an inordinate amount of execution time. For simple randomForest
or rpart
calls, the formula/terms
work can account for most of the execution time. For example, we can calculate how much time functions spend generating the model matrix relative to the total execution time. For rpart
and randomForest
, we used the default arguments and did the calculations with a simulated data set of 200 data points and varying numbers of predictors:
This is especially problematic for ensemble models. For example, ipred:::ipredbagg
creates an ensemble of rpart
trees. Since rpart
only has a formula method, the footprint of the bagged model object can become very large if X trees are contained in the ensemble. Alternatively, randomForest.formula
takes the approach of generating the terms
once and feeding the model frame to randomForest.default
. This does not work for rpart
since there is no non-formula method exposed. Some functions (e.g., lm
, survival::coxph
) have arguments that can be used to prevent the terms
and similar objects from being returned. This saves space but prevents new samples from being predicted. A little more detail can be found here.
One issue is the "factors"
attribute of the terms
object (discussed in the previous post). This is a non-sparse matrix that has a row for each predictor in the formula and a column for each model term (e.g. main effects, interactions, etc.). The purpose of this object is to know which predictors are involved in which terms.
The issue is that this matrix can get very large and usually has a high proportion of zeros. For example:
As the number of predictors increases, the rate of ones is likely to approach a value close to zero very quickly. For example:
Again, it is doubtful that a model with a large number of predictors will have a correspondingly large number of high-level interactions (see the Pareto principle applied to modeling).
Variable Roles
Some packages have implemented extensions of the basic formula. There are cases when formula are needed for specific sub-models. For example, a random coefficient model can be fit with the lmer
function. In this case, a model is specified for a particular clustering variable (e.g., a subject in a clinical trial). The code is an example of how lmer
syntax works:
Here Subject
is important to the model-fitting routine, but not as a predictor. Similarly, the Bradley-Terry model can be used to model competitions and contests. A model on a set of boxers in a series of contests can include terms for their reach:
Another extension of basic formulas comes from the modeltools
and mboost
packages. The function mboost::mob
fits a tree-based model with regression models in the terminal nodes. For this model, a separate list of predictors are used as splitting variables (to define the tree structure) and another set of regression variables that are modeled in the terminal nodes. An example of this call is:
The commonality between these three examples is that there are variables that are critical to the model but do not play the role of standard regression terms. For lmer
, Subject
is the independent experimental unit. For mob
, we have variables to be used for splitting, etc.
There are similar issues on the left-hand side of the formula. When there are multivariate outcomes, different packages have different approaches:
The overall point here is that, for the most part, the formula method assumes that there is one variable on the left-hand side of the tilde and that the variables on the right-hand side are predictors (exceptions are discussed below). One can envision other roles that columns could play in the analysis of data. Besides the examples given above, variables could be used for
- outcomes
- predictors
- stratification
- data for assessing model performance (e.g., loan amount to compute expected loss)
- conditioning or faceting variables (e.g.,
lattice
orggplot2
) - random effects or hierarchical model ID variables
- case weights
- offsets
- error terms (limited to
Error
in theaov
function)
The last three items on this list are currently handled in formulas as “specials” or have existing functions. For example, when the model function has a weights
argument, the current formula/terms
frame work uses a function (model.weights
) to extract the weights, and also makes sure that the weights are not included as covariates. The same is true for offsets.
Summary
Some limitations of the current formula interface can be mitigated by writing your own or utilizing the Formula
package.
However, there are a number of conceptual aspects (e.g., roles, sequential processing) that would require a completely different approach to defining a design matrix, and this will be the focus of an upcoming tidyverse package.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.