seplyr is an R package that makes it easy to program over dplyr 0.7.*.
To illustrate this we will work an example.
Suppose you had worked out a dplyr pipeline that performed an analysis you were interested in. For an example we could take something similar to one of the examples from the dplyr 0.7.0 announcement.
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.2'
cat(colnames(starwars), sep='\n')
## name
## height
## mass
## hair_color
## skin_color
## eye_color
## birth_year
## gender
## homeworld
## species
## films
## vehicles
## starships
starwars %>%
  group_by(homeworld) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            mean_mass = mean(mass, na.rm = TRUE),
            count = n())
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows
The above is colloquially called "an interactive script." The name comes from the fact that we use names of variables (such as "homeworld") that would only be known from looking at the data directly in the analysis code. Only somebody interacting with the data could write such a script (hence the name).
It has long been considered a point of discomfort to convert such an interactive dplyr pipeline into a re-usable script or function, that is, a script or function that specifies column names in some parametric or re-usable fashion. Roughly, this means the names of the data columns are not yet known when we are writing the code (and this is what makes the code re-usable).
This inessential (or conquerable) difficulty is largely due to the preference for non-standard evaluation interfaces (that is, interfaces that capture and inspect un-evaluated expressions from their calling interface) in the design of dplyr.
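To make the distinction concrete, here is a minimal base R sketch (not from the original post, and not how dplyr or seplyr are implemented). The helper names column_nse and column_se are hypothetical, and the built-in mtcars data is used only so the example needs no packages.

```r
# Non-standard evaluation: the function captures the un-evaluated
# argument expression with substitute() and inspects its text.
# (column_nse and column_se are hypothetical helpers for illustration.)
column_nse <- function(data, column) {
  column_name <- deparse(substitute(column))
  data[[column_name]]
}

# Standard evaluation: the function simply uses the value it is given.
column_se <- function(data, column_name) {
  data[[column_name]]
}

my_var <- "mpg"

# Standard interface: the value of my_var ("mpg") selects the column.
se_result <- column_se(mtcars, my_var)

# Non-standard interface: the captured text "my_var" is used as the
# column name, and mtcars has no such column.
nse_result <- column_nse(mtcars, my_var)

print(head(se_result))
print(is.null(nse_result))
```

The non-standard version is pleasant to type interactively (column_nse(mtcars, mpg) works), but it does not accept a column name held in a variable, and that is the situation a re-usable function is in.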
seplyr is a dplyr adapter layer that prefers "slightly clunkier" standard interfaces (or referentially transparent interfaces), which are actually very powerful and can be used to some advantage.
The above description and comparisons can come off as needlessly broad and painfully abstract. Things are much clearer if we move away from theory and return to our practical example.
Let’s translate the above example into a re-usable function in small (easy) stages. First, translate the interactive script from dplyr notation into seplyr notation. This step is a pure re-factoring: we are changing the code without changing its observable external behavior.
The translation is mechanical in that it is mostly using seplyr documentation as a lookup table. What you have to do is:
- Change dplyr verbs to their matching seplyr "*_se()" adapters.
- Add quote marks around names and expressions.
- Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the "c()" notation.
- Replace the "=" that assigns each result name with ":=" (an "=" inside a quoted expression, such as "na.rm = TRUE", stays as it is).
Our converted code looks like the following.
# devtools::install_github("WinVector/seplyr")
library("seplyr")

starwars %>%
  group_by_se("homeworld") %>%
  summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                 "mean_mass" := "mean(mass, na.rm = TRUE)",
                 "count" := "n()"))
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows
This code works the same as the original dplyr code. Obviously, at this point all we have done is make the code a bit less pleasant looking. We have yet to see any benefit from this conversion (though we can turn this on its head and say that all the original dplyr notation saves us is having to write a few quote marks).
The benefit is: this new code can very easily be parameterized and wrapped in a re-usable function. In fact it is now simpler to do than to describe.
For example, suppose (as in the original example) we want to create a function that lets us choose the grouping variable. This is now easy: we copy the code into a function and replace the explicit value "homeworld" with a variable:
starwars_mean <- function(my_var) {
  starwars %>%
    group_by_se(my_var) %>%
    summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                   "mean_mass" := "mean(mass, na.rm = TRUE)",
                   "count" := "n()"))
}

starwars_mean("hair_color")
## # A tibble: 13 x 4
##       hair_color mean_height mean_mass count
##            <chr>       <dbl>     <dbl> <int>
##  1        auburn    150.0000       NaN     1
##  2  auburn, grey    180.0000       NaN     1
##  3 auburn, white    182.0000  77.00000     1
##  4         black    174.3333  73.05714    13
##  5         blond    176.6667  80.50000     3
##  6        blonde    168.0000  55.00000     1
##  7         brown    175.2667  79.27273    18
##  8   brown, grey    178.0000 120.00000     1
##  9          grey    170.0000  75.00000     1
## 10          none    180.8889  78.51852    37
## 11       unknown         NaN       NaN     1
## 12         white    156.0000  59.66667     4
## 13          <NA>    141.6000 314.20000     5
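Because the grouping column is now an ordinary string value, a function of this shape can be driven by ordinary R code, such as a loop over column names. The following is only a base R sketch of that idea: group_means is a hypothetical stand-in built on tapply(), not part of seplyr, and it uses the built-in mtcars data so that it runs without any packages.

```r
# Hypothetical stand-in for a string-parameterized grouped summary.
# Both the grouping column and the value column are passed as strings.
group_means <- function(data, group_column, value_column) {
  tapply(data[[value_column]], data[[group_column]], mean, na.rm = TRUE)
}

# The column names are plain values, so we can loop over them.
grouping_columns <- c("cyl", "gear")
results <- setNames(
  lapply(grouping_columns, function(g) group_means(mtcars, g, "mpg")),
  grouping_columns
)

print(results)
```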
In seplyr, programming is easy (just replace values with variables). For example, we can make a completely generic re-usable "grouped mean" function using R’s paste0() function (a variant of paste()) to build up expressions.
grouped_mean <- function(data, grouping_variables, value_variables) {
  result_names <- paste0("mean_", value_variables)
  expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)")
  calculation <- result_names := expressions
  print(as.list(calculation)) # print for demonstration
  data %>%
    group_by_se(grouping_variables) %>%
    summarize_se(c(calculation, "count" := "n()"))
}

starwars %>%
  grouped_mean(grouping_variables = "eye_color",
               value_variables = c("mass", "birth_year"))
## $mean_mass
## [1] "mean(mass, na.rm = TRUE)"
##
## $mean_birth_year
## [1] "mean(birth_year, na.rm = TRUE)"
##
## # A tibble: 15 x 4
##        eye_color mean_mass mean_birth_year count
##            <chr>     <dbl>           <dbl> <int>
##  1         black  76.28571        33.00000    10
##  2          blue  86.51667        67.06923    19
##  3     blue-gray  77.00000        57.00000     1
##  4         brown  66.09231       108.96429    21
##  5          dark       NaN             NaN     1
##  6          gold       NaN             NaN     1
##  7 green, yellow 159.00000             NaN     1
##  8         hazel  66.00000        34.50000     3
##  9        orange 282.33333       231.00000     8
## 10          pink       NaN             NaN     1
## 11           red  81.40000        33.66667     5
## 12     red, blue       NaN             NaN     1
## 13       unknown  31.50000             NaN     3
## 14         white  48.00000             NaN     1
## 15        yellow  81.11111        76.38000    11
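The expression-building step inside grouped_mean() can be tried on its own, without dplyr or seplyr loaded. In this sketch base R's setNames() is used as a stand-in for the ":=" binding; that substitution is an assumption made for illustration, chosen because it produces a named character vector that prints the same way as the list shown above.

```r
value_variables <- c("mass", "birth_year")

# paste0() is vectorized: each variable name yields one string.
result_names <- paste0("mean_", value_variables)
expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)")

# Base R stand-in for binding result names to expressions
# (an assumption for illustration, not seplyr's ":=" operator itself).
calculation <- setNames(expressions, result_names)

print(as.list(calculation))
```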
The only part that requires more study and practice is building the expressions with paste0() (for more details on the string manipulation please try "help(paste)"). Notice also that we used the ":=" operator to bind the list of desired result names to the matching calculations (please see "help(named_map_builder)" for more details).
The point is: we did not have to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval or lazyeval to complete our programming task. Once you are in seplyr notation, changes are very easy. You can separate translating into seplyr notation from the work of designing your wrapper function (breaking your programming work into smaller, easier to understand steps).
The seplyr method is simple, easy to teach, and powerful. The package contains a number of worked examples both in help() and vignette(package='seplyr') documentation.