Very Non-Standard Calling in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Our group has done a lot of work with non-standard calling conventions in R.
Our tools work hard to eliminate non-standard calling (as is the purpose of wrapr::let()), or at least make it cleaner and more controllable (as is done in the wrapr dot pipe).  And even so, we still get surprised by some of the side-effects and mal-consequences of the over-use of non-standard calling conventions in R.
Please read on for a recent example.
Consider the following calls to stats::lm().  And notice the third example fails (throws an error).
# works
lm("y ~ x", 
   data = data.frame(
     x=1:5, 
     y = c(1, 1, 2, 2, 2)), 
   weights = numeric(5)+1)
#> 
#> Call:
#> lm(formula = "y ~ x", data = data.frame(x = 1:5, y = c(1, 1, 
#>     2, 2, 2)), weights = numeric(5) + 1)
#> 
#> Coefficients:
#> (Intercept)            x  
#>         0.7          0.3
# works
f1 <- function(w = NULL) {
  lm(as.formula("y ~ x"), 
     data = data.frame(
       x=1:5, 
       y = c(1, 1, 2, 2, 2)), 
     weights = w)
}
f1(numeric(5)+1)
#> 
#> Call:
#> lm(formula = as.formula("y ~ x"), data = data.frame(x = 1:5, 
#>     y = c(1, 1, 2, 2, 2)), weights = w)
#> 
#> Coefficients:
#> (Intercept)            x  
#>         0.7          0.3
# fails
f2 <- function(w = NULL) {
  lm("y ~ x", 
     data = data.frame(
       x=1:5, 
       y = c(1, 1, 2, 2, 2)), 
     weights = w)
}
f2(numeric(5)+1)
#> Error in eval(extras, data, env): object 'w' not found
According the stats::lm() documentation (help(lm)) the first argument must be:
an object of class “formula” (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.
A string appears to be coerce-able into a formula, so all three examples should work.  However, typing “print(lm)” reveals the issue: stats::lm() doesn’t take the “weights” argument in a standard way (as the value of a function parameter).  It instead grabs it through a sequence of match.call() and eval() steps.  It is a complicated way to get the value, which works until it does not work.  Somehow passing in the formula as a string interferes with how the value of weights is found. I think we can now see the benefits of isolation and independence of concerns in code.
This over-use of direct environment copying and manipulation is what leads to a great many data-leaks in stats::lm() and stats::glm().  This is in addition to their weird habit of keeping a copy of all of the training data (which loses quite a few of the merits of these methods). Our group dealt with these issues a long time ago, so we are somewhat familiar with stats::lm() and stats::glm().
Of course, one could (as the stats::lm() documentation mentions) call stats::lm.fit().  However, stats::lm.fit() does not seem to accept weights and its own documentation (help(lm.fit)) starts ominously:
These are the basic computing engines called by lm used to fit linear models. These should usually not be used directly unless by experienced users.
Having just finished teaching a four day intensive course covering data science in Python, I can’t help but remark that users of sklearn.linear_model.LinearRegression() don’t need to worry about issues such as the above.  Some of the notational flair of R comes at the cost of significant opportunities for user confusion.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
