The future of R syntax?
Following Romain François's example,
I spent last week playing with the definition of the R grammar. I
focused on four changes that I think would improve existing R idioms:
creating lists with bare square brackets; a compact lambda notation;
labelled blocks of code; and, of course, a native pipe operator. While
none of these changes is strictly necessary, they make the language
more comfortable to use and nicer to look at. I provide working
implementations for all of them in the brackets, brackets-lambda,
labelled and pipe branches at https://github.com/lionel-/r-source.
Bare Square Brackets
Advanced treatments of R programming stress that R is a functional
language. This essentially means that functions are first-class
citizens and that you can pass them as arguments to other
functions. This makes it possible to have the apply family of
functions in base R or the map family in purrr. By the same token,
this makes lists extremely useful in R. They can contain any kind of
objects and you can use functional programming techniques to
manipulate them with expressive idioms. In addition, since list
elements can be associated with names, they can map directly to the
arguments of a function call via do.call() or purrr::invoke(),
another key idiom of functional programming in R.
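As a small illustration of that last idiom in stock R (no syntax changes needed), a named list can be spliced into a call with do.call(). The values below are made up for the example:

```r
# A named list whose names match the arguments of mean()
args <- list(x = c(1, 5, NA), na.rm = TRUE)

# do.call() maps each list element to the argument of the same name,
# so this is the same call as mean(x = c(1, 5, NA), na.rm = TRUE)
result <- do.call(mean, args)
result
#> [1] 3
```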
Despite their importance in the R language, lists do not benefit from as much syntactic sugar as in other languages. Hence my first change to the R syntax, creating lists with bare square brackets:
    [3, 4, letters]
    #> [[1]]
    #> [1] 3
    #>
    #> [[2]]
    #> [1] 4
    #>
    #> [[3]]
    #>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
    #> [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

    [3, 4, letters] %>% map_lgl(is.double)
    #> [1]  TRUE  TRUE FALSE
This can greatly improve code clarity. Compare dense nested list constructs such as
    list(
      list(1, 2),
      list(3, 4)
    )
to the much lighter and cleaner
    [[1, 2], [3, 4]]
An important use case that would also benefit from this syntax is when
a function needs some additional arguments in the form of a
list. Think of the contrasts argument of lm() or the args argument of
ggplot2's stat_function(). They both involve passing a list of
arguments, which bloats the calls and makes scripts heavier to read.
The bare brackets notation is a bit lighter:
    mtcars$cyl <- as.factor(mtcars$cyl)

    # Specific contrast for the predictor `cyl`
    lm(disp ~ cyl + am, data = mtcars, contrasts = [cyl = contr.sum])
I also have a feeling that bare brackets may be useful for coming up
with clean, creative syntax in DSLs. Like any syntax construct in R,
square brackets are backed by an ordinary function that can be called
by name. For example, instead of mtcars[["cyl"]], you can write
`[[`(mtcars, "cyl"). The function name for bare brackets is `[]`,
which you can redefine as follows:
    `[]` <- function(...) "hello"
    [3, 4]
    #> [1] "hello"
By the same token, DSLs could capture bare brackets and give them some specific meaning.
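The prefix form of existing constructs can already be checked in stock R. A minimal sketch using only base functions and the built-in mtcars data set (the variable names are just for illustration):

```r
# Subsetting with [[ is an ordinary function call in disguise
infix  <- mtcars[["cyl"]]
prefix <- `[[`(mtcars, "cyl")
same <- identical(infix, prefix)
same
#> [1] TRUE

# The same holds for other syntax constructs, such as arithmetic operators
three <- `+`(1, 2)
three
#> [1] 3
```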
Finally, an additional syntax rule could allow for list
comprehensions by looking for the for keyword inside bare
brackets. This would enable this kind of Python-style code:
    # List comprehension:
    [sum(x)^2 for x in mtcars]

    # Equivalent to the following map:
    mtcars %>% map(function(x) sum(x)^2)
However, I think that's going a step too far, as the functional version is much more R-like.
Lambda Notation
In R, functions can be created, given names and passed around. But a
common idiom involves creating anonymous functions (lambda functions)
on the fly. As the full syntax for defining a function can be
cumbersome in those situations, many languages such as Scala,
Haskell, F-Sharp, Python, and even C++ support a compact notation for
creating lambdas. Given the importance of lambda functions in R (as in
the apply family of functions), it would be particularly nice to
provide an elegant notation for creating them. The second syntax
update relies on the bare square brackets notation for that purpose.
The notation is based on the rightward assignment ->, an operator
that is barely used in practice because it's a bit confusing. Bare
square brackets followed by -> followed by any R expression create a
function in place:
    [x] -> 3 * x
    #> [x] -> 3 * x

    ([x] -> 3 * x)(5)
    #> [1] 15

    lapply(cars, [col] -> max(col / sum(col)))
    #> $speed
    #> [1] 0.03246753
    #>
    #> $dist
    #> [1] 0.05583993
This notation supports variadic lambdas by supplying dots:
    variadic <- [...] -> {
      sum <- ..1 + ..2
      sum * 3
    }
    variadic(3, 4)
    #> [1] 21

    variadic2 <- [x, y, ...] -> length(list(...))
    variadic2("a", "b", 1, 2, 3)
    #> [1] 3
Thanks to operator precedence and the left associativity of ->, the
usual R rules for assignment apply. The following snippet assigns the
lambda first to byproduct, then to fun.
    fun <- [x] -> x -> byproduct
Labelled Blocks
In R, code is data. When a function is called, its arguments are
usually evaluated and assigned to the parameter. But functions can
also request to see the code used to compute that value in the form of
a quoted expression. This capacity to capture code is invaluable for
creating intuitive sublanguages like dplyr or ggplot2. The third
change that I introduce to R's syntax focuses on the subset of DSLs
that manipulate blocks of code, such as the great testthat package.
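Capturing code rather than its value already works in stock R through substitute(). A minimal sketch, where capture() is a hypothetical helper written for this example:

```r
# A function that returns the code of its argument instead of its value
capture <- function(x) substitute(x)

code <- capture(a + b * 2)
code
#> a + b * 2
class(code)
#> [1] "call"

# The captured code can be evaluated later, with data of our choosing
value <- eval(code, list(a = 1, b = 10))
value
#> [1] 21
```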
Currently, blocks of code are passed to a function via curly brackets:
    test_that("my code works", {
      ...
    })
Wouldn't it be nicer to have the same syntax as function definitions, for loops and if-else branches? That's the purpose of this third syntax change. It allows you to write:
    test_that("my code works") {
      ...
    }
That's a fairly cosmetic change and admittedly not earth-shattering.
However, it makes the language a bit nicer and more aesthetically
pleasing. This syntax would be particularly nice for alternative ways
of defining functions. For example, the type-checked functions of the
ensurer package would look a bit more natural:
    type_checked <- function_(a ~ integer, b ~ character) {
      some_call(a)
      other_call(b)
    }
To make this work in the most R-like way possible, I decided to let
the function call be any expression. This mirrors the syntax of
regular function calls, which may be embedded in arbitrary ways. In
the following snippet, russian_dolls() returns a list whose first
element is a function that returns a function that returns 3:
    russian_dolls()[[1]]()()
    #> 3
These kinds of constructs are also possible with labelled blocks:
    my_block[[1]]()() {
      code
    }
The only requirement is that the end result of the expression be a
function that accepts at least one argument (the block of code). This
means that test_that() would be implemented in this way:
    test_that <- function(desc) {
      force(desc)
      function(code) {
        test_code(desc, substitute(code), env = parent.frame())
        invisible()
      }
    }
Then,
    test_that("my code works") {
      check_equal(A, B)
      check_identical(C, D)
    }

    # Is actually equivalent to
    (function(code) {
      test_code(desc, substitute(code), env = parent.frame())
      invisible()
    })({
      check_equal(A, B)
      check_identical(C, D)
    })
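The curried shape of such a function can be tried in stock R today, only without the new call syntax: the block is passed in a second pair of parentheses. The sketch below uses a hypothetical with_label() helper rather than testthat's internals, and simply evaluates the block:

```r
# Hypothetical helper: takes a description, returns a function taking a block
with_label <- function(desc) {
  force(desc)
  function(code) {
    expr <- substitute(code)             # capture the block unevaluated
    value <- eval(expr, parent.frame())  # run it in the caller's environment
    list(desc = desc, value = value)
  }
}

# Roughly what `with_label("adds numbers") { ... }` would stand for
res <- with_label("adds numbers")({
  x <- 1
  x + 1
})
res$desc
#> [1] "adds numbers"
res$value
#> [1] 2
```

Because the block is evaluated in the caller's environment, the assignment to x is visible there afterwards, just as it would be for a plain block of code.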
In addition to expressions, simple labels are of course allowed:
    label {
      line1
      line2
    }
For instance, this would fit well with the Nimble DSL for specifying
Bugs models. Simple labels work a bit differently than expressions,
though. Here, instead of looking for a function named label, the
parser will look for label{}. This makes it possible to use the same
identifier for a regular function call and a labelled block:
    label <- function() 3
    `label{}` <- function(code) 4

    label()
    #> [1] 3

    label {
      anything
    }
    #> [1] 4
Finally, note that contrary to other labelled blocks such as function definitions, the opening curly bracket must be on the same line as its identifier. Otherwise it would be ambiguous whether we have a labelled block or two expressions separated by a newline:
    label
    {code}
This slight inconsistency is the price to pay for that syntax extension.
Piping Operator
This is of course the syntax update that many people in the R
community are waiting for: a native piping operator. Some of the most
popular R packages are based on piped interfaces: dplyr, which
popularised magrittr, but also ggplot2. The latter uses a custom
non-functional pipeline by overloading the + operator, but its sequel
ggvis does rely on functional piping. For testing purposes, I provide
two versions of a native pipe operator, |> and >>.
Given the popularity of the pipe, having native support for it in R's
syntax would be huge progress. Besides the obvious aesthetic concern
(though you do get accustomed to %>% with time), native handling of
the pipe would improve error recovery. Here is what a traceback
currently looks like with magrittr's pipe:
    fail <- function(...) stop("fail")
    mtcars %>% lapply(fail) %>% unlist()
    #> Error in FUN(X[[i]], ...) (from #1) : fail

    traceback()
    #> 15: stop("fail") at #1
    #> 14: FUN(X[[i]], ...)
    #> 13: lapply(., fail)
    #> 12: function_list[[1L]](value)
    #> 11: unlist(.)
    #> 10: function_list[[1L]](value)
    #> 9: withVisible(function_list[[1L]](value))
    #> 8: freduce(value, `_function_list`)
    #> 7: Recall(function_list[[1L]](value), function_list[-1L])
    #> 6: freduce(value, `_function_list`)
    #> 5: `_fseq`(`_lhs`)
    #> 4: eval(expr, envir, enclos)
    #> 3: eval(quote(`_fseq`(`_lhs`)), env, env)
    #> 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
    #> 1: mtcars %>% lapply(fail) %>% unlist()
This ugly traceback includes all the steps where magrittr manipulates the unevaluated code. Here is the same traceback with native support:
    mtcars |> lapply(fail) |> unlist()
    #> Error in FUN(X[[i]], ...) : fail

    traceback()
    #> 4: stop("fail") at #1
    #> 3: FUN(X[[i]], ...)
    #> 2: lapply(mtcars, fail)
    #> 1: unlist(mtcars |> lapply(fail))
The _ character is also legalised so it can become the placeholder
in pipelines. The same rules as with magrittr's placeholder apply:
    mtcars |> list(_, _) |> identical(list(mtcars, mtcars))
    #> [1] TRUE

    mtcars |> list(list(_, _)) |> identical(list(mtcars, list(mtcars, mtcars)))
    #> [1] TRUE
I actually provide two implementations of the pipe. The first creates a classic binary operator that calls a special primitive function. Special primitives are a class of core R functions that do not evaluate their arguments, which allows them to manipulate quoted code before evaluation.
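The distinction between primitives that do and do not evaluate their arguments is visible in stock R through typeof(). A small sketch using base functions only; the |> primitive itself exists only in the patched branch, so quote() stands in for it here:

```r
# quote() is a special primitive: it receives its argument unevaluated
typeof(quote)
#> [1] "special"

# so the call to stop() below is captured as code and never run
captured <- quote(stop("never evaluated"))
class(captured)
#> [1] "call"

# sum() is a builtin primitive: its arguments are evaluated first
typeof(sum)
#> [1] "builtin"
```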
The second implementation, called by the >> operator, directly
manipulates the parse tree. This means that you cannot redefine >>. R
will always transform the expression object >> call() to call(object)
and you'll never get a chance to call the operator manually with
prefix notation. Such syntax transformation applies to a few
operators in R, like the rightward assignment operator -> or the
double-starred exponentiation **. By contrast, the first operator |>
can be redefined and called with prefix notation.
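The parse-time rewriting of -> and ** can be observed in stock R by quoting an expression and inspecting the result. This only illustrates the existing behaviour that >> would mimic, not the patched parser itself:

```r
# -> never reaches the evaluator: the parser emits a call to <-
rightward <- identical(quote(1 -> x), quote(x <- 1))
rightward
#> [1] TRUE

# Likewise ** is rewritten to ^ while parsing
starred <- identical(quote(2 ** 3), quote(2 ^ 3))
starred
#> [1] TRUE

# By contrast, an ordinary operator survives in the parse tree
# and can be called with prefix notation
op <- as.character(quote(1 + 2)[[1]])
op
#> [1] "+"
`+`(1, 2)
#> [1] 3
```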
I think the first implementation is more natural in the R language and
consistent with most operators. On the other hand, manipulating the
parse tree ensures that the placeholder _ will always act
consistently as a shortcut for the LHS. This would avoid the conflicts
that arise with the . placeholder, which is currently used for
different conflicting purposes in dplyr, magrittr and purrr. Thus
there are pros and cons for both approaches.
Could this get into R Core?
R Core has gotten the reputation of being a bit conservative, which is only fair considering the responsibility that weighs on their shoulders.
I think that, contrary to proposals for integrating optional type checking in the syntax, all four of these syntax changes clearly fit the spirit of R as a dynamic, functional language. When it makes sense, they can be manipulated like first-class citizens through prefix notation, like other language constructs. They shouldn't disturb any existing R code and they improve currently used R idioms rather than invent new ones. So I think there is a chance that R Core could consider some of them.
More testing is needed to assess the consequences in terms of performance and backward compatibility, though I didn't find any problems in my limited testing. One point of contention might be that the bare brackets and labelled blocks increase the number of shift-reduce conflicts during parser generation. I guess many of those could be fixed by refactoring the grammar a bit, or by adding precedence and associativity directives to some production rules. But core members will probably feel a bit nervous about applying non-trivial changes to that fundamental part of the R code, which has basically not changed since the first available revision in 1997. It's probably OK to ignore these conflicts, however. There are currently 81 of them, and Bison, the parser generator, seems to be doing a very good job of automatically resolving the ambiguities.
My plan is to get community feedback on Twitter before proposing the changes to R Core. In case they are interested in some of them, I'll run a comprehensive test on CRAN packages to make sure that the new syntax doesn't break anything.
So, could R 4.0 look like this?
    test_that("new syntax works") {
      data <- list(mtcars, 1, 2, list(3, mtcars, 4))
      expected <- lapply(data, function(x) is.list(x) || is.double(x))

      mtcars |>
        [1, 2, [3, _, 4]] |>
        map([x] -> is.list(x) || is.double(x)) |>
        check_equal(expected)
    }