R Tip: Break up Function Nesting for Legibility
There are a number of easy ways to avoid illegible code nesting problems in R.
In this R tip we will expand upon the above statement with a simple example.
At some point it becomes illegible and undesirable to compose operations by nesting them, such as in the following code.
head(mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")])
#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25
One popular way to break up nesting is to use magrittr's "%>%" in combination with dplyr transform verbs, as we show below.
library("dplyr")
mtcars %>%
  filter(cyl == 8) %>%
  select(mpg, cyl, wt) %>%
  head
#    mpg cyl   wt
# 1 18.7   8 3.44
# 2 14.3   8 3.57
# 3 16.4   8 4.07
# 4 17.3   8 3.73
# 5 15.2   8 3.78
# 6 10.4   8 5.25
Note: the above code lost (without warning) the row names that are part of mtcars. We also pass over the details of how pipe notation works. It is sufficient to say the notational convention is: each stage is approximately treated as an altered function call with a new inserted first argument set to the value of the pipeline up to the current point.
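The "inserted first argument" convention can be seen by writing the same steps both ways. The following is a minimal sketch, assuming the magrittr package is installed; it uses base R's subset() and head() as the stages, and the names piped and nested are only illustrative.

```r
library("magrittr")  # provides the %>% operator

# Piped form: each stage receives the pipeline value as its first argument.
piped <- mtcars %>%
  subset(cyl == 8) %>%
  head(3)

# Hand-written equivalent: the same calls with the first argument made explicit.
nested <- head(subset(mtcars, cyl == 8), 3)

# The two forms are expected to agree.
stopifnot(isTRUE(all.equal(piped, nested)))
```

This is only the approximate picture described above; magrittr has further rules (for example, explicit use of "." as a placeholder) that this sketch does not cover.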
Many R users already routinely avoid nested notation problems through a convention I call "name re-use." Such code looks like the following.
result <- mtcars
result <- filter(result, cyl == 8)
result <- select(result, mpg, cyl, wt)
head(result)
The above convention is enough to get around all problems of nesting. It also has the great advantage that it is step-debuggable.
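Because every step is its own statement, a check or breakpoint can sit between any two steps. The following is a small sketch of that idea, assuming dplyr is installed; the stopifnot() checks are illustrative additions and not part of the convention itself.

```r
library("dplyr")

result <- mtcars
result <- filter(result, cyl == 8)
# Intermediate check: every remaining row should be an 8-cylinder car.
stopifnot(all(result$cyl == 8))

result <- select(result, mpg, cyl, wt)
# Intermediate check: only the requested columns should remain.
stopifnot(identical(colnames(result), c("mpg", "cyl", "wt")))

head(result)
```

When stepping through in a debugger, the value of result can be inspected after each line in the same way.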
I like a variation I call "dot intermediates", which looks like the code below (notice we are switching back from dplyr verbs to base R operators).
. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
result <- .
head(result)
#                     mpg cyl   wt
# Hornet Sportabout  18.7   8 3.44
# Duster 360         14.3   8 3.57
# Merc 450SE         16.4   8 4.07
# Merc 450SL         17.3   8 3.73
# Merc 450SLC        15.2   8 3.78
# Cadillac Fleetwood 10.4   8 5.25
The dot intermediate convention is very succinct, and we can use it with base R transforms to get a correct (and performant) result. Like all conventions, it is just a matter of teaching, learning, and repetition to make this seem natural, familiar, and legible.
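One way to check the "correct" part of that claim is to compare the dot-intermediate result against the original nested expression. This is a sketch using only base R; the name nested is introduced here for illustration.

```r
# Dot-intermediate form, as in the example above.
. <- mtcars
. <- subset(., cyl == 8)
. <- .[, c("mpg", "cyl", "wt")]
result <- .

# Original nested form from the start of this tip.
nested <- mtcars[with(mtcars, cyl == 8), c("mpg", "cyl", "wt")]

# Both should hold the same rows and columns.
stopifnot(isTRUE(all.equal(result, nested)))

# The car names should still be present as row names.
stopifnot(all(rownames(result) %in% rownames(mtcars)))
```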
library("dplyr")
library("microbenchmark")
library("ggplot2")
timings <- microbenchmark(
  base = {
    . <- mtcars
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  dplyr = {
    mtcars %>%
      filter(cyl == 8) %>%
      select(mpg, cyl, wt) %>%
      nrow
  })
print(timings)
## Unit: microseconds
##   expr      min       lq      mean   median       uq       max neval
##   base  122.948  136.948  167.2253  159.688  179.924   349.328   100
##  dplyr 1570.188 1654.700 2537.2912 1699.744 1785.611 50759.770   100
autoplot(timings)
Durations for related tasks, smaller is better.
Contrary to what many repeat, base R is often faster than the dplyr alternative. In this case the base R version is about 15 times faster by mean time and about 11 times faster by median time (possibly due to magrittr overhead and the small size of this example). We also see that, with some care, base R can be quite legible. dplyr is a useful tool and convention; however, it is not the only allowed tool or the only allowed convention.
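The guess about magrittr overhead could be probed by adding a third case that calls the dplyr verbs without the pipe. The following is a hypothetical follow-up benchmark, not one that was run for this tip; it assumes dplyr and microbenchmark are installed, and the names timings2, dplyr_piped, and dplyr_no_pipe are made up for illustration. No results are shown because they will vary by machine and package version.

```r
library("dplyr")
library("microbenchmark")

timings2 <- microbenchmark(
  # Base R with dot intermediates, as in the timing above.
  base = {
    . <- mtcars
    . <- subset(., cyl == 8)
    . <- .[, c("mpg", "cyl", "wt")]
    nrow(.)
  },
  # dplyr verbs composed with the magrittr pipe.
  dplyr_piped = {
    mtcars %>%
      filter(cyl == 8) %>%
      select(mpg, cyl, wt) %>%
      nrow
  },
  # The same dplyr verbs with dot intermediates, so no pipe is involved.
  dplyr_no_pipe = {
    . <- mtcars
    . <- filter(., cyl == 8)
    . <- select(., mpg, cyl, wt)
    nrow(.)
  })

print(timings2)
```

If dplyr_no_pipe lands close to dplyr_piped, the pipe is probably not the main cost on data this small; if it lands close to base, the pipe is the likelier explanation.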