Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It is also not likely ever the analyst’s end goal, but a sub-step needed to transform data on the way to the prediction, modeling, analysis, or presentation they actually wish to get back to.
In our previous article in this series we discussed the general ideas of “row-ID independent data manipulation” and “Split-Apply-Combine”. Here, continuing with our example, we will specialize to a data analysis pattern I call: “Grouped-Ordered-Apply”.
The example
Let’s start (as always) with our data. We are going to look at the iris data set in R. You can view the data by typing the following in your R console:
data(iris)
View(iris)
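For reference, base R can already compute such per-group ranks; a minimal sketch using ave() (note rank() averages ties, so tied lengths get fractional ranks, unlike the cumulative-sum approach used below):
# per-Species rank of Sepal.Length, largest first, on a copy of the data
irisCopy <- iris
irisCopy$rank <- ave(-irisCopy$Sepal.Length, irisCopy$Species, FUN=rank)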
The package dplyr makes the grouped calculation quite easy. We define our “window function” (the function we want applied to sub-groups of data in a given order) and then use dplyr to apply the function to grouped and arranged data:
library('dplyr')
# define our windowed operation, in this case ranking:
# the cumulative sum of a constant column of 1s yields each row's in-order rank
rank_in_group <- . %>% mutate(constcol=1) %>%
  mutate(rank=cumsum(constcol)) %>% select(-constcol)
# calculate
res <-
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>%
rank_in_group
# display first few results
res %>% filter(rank <= 2) %>% arrange(Species, rank)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species rank
# <dbl> <dbl> <dbl> <dbl> <fctr> <dbl>
# 1 5.8 4.0 1.2 0.2 setosa 1
# 2 5.7 4.4 1.5 0.4 setosa 2
# 3 7.0 3.2 4.7 1.4 versicolor 1
# 4 6.9 3.1 4.9 1.5 versicolor 2
# 5 7.9 3.8 6.4 2.0 virginica 1
# 6 7.7 3.8 6.7 2.2 virginica 2
Some possible confusion
The above works well because all the operators we used were “grouping aware.” I think all dplyr operations are “grouping aware”, but some of the “in the street” tactics for working with (or around) dplyr may not be. For example slice is part of dplyr and grouping aware:
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% slice(1)
# Source: local data frame [3 x 5]
# Groups: Species [3]
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 5.8 4.0 1.2 0.2 setosa
# 2 7.0 3.2 4.7 1.4 versicolor
# 3 7.9 3.8 6.4 2.0 virginica
But head is not part of dplyr and is not grouping aware:
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% head(n=1)
# Source: local data frame [1 x 5]
# Groups: Species [1]
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 7.9 3.8 6.4 2 virginica
Thus head “is not part of dplyr” even when it is likely a dplyr adapter supplying the actual implementation.
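If you want head-like behavior that does respect groups, dplyr’s own window functions can supply it; a minimal sketch:
# grouping-aware alternative to head(): row_number() is computed per group
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>%
  filter(row_number() == 1)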
This can be very confusing to new analysts. We are seeing changes in the semantics of downstream operators based on a data annotation (the “Groups:”). To the analyst, grouping and ordering probably have equal stature. In dplyr grouping comes first: it has a visible annotation, is durable, and changes the semantics of downstream operators. In dplyr ordering has no annotation, is not durable (it is quietly lost by many operators such as dplyr::compute and dplyr::collapse, though this is possibly changing), and can’t be stored (as it isn’t a concept in many back-ends such as relational databases).
It is hard for new analysts to trust that in dplyr the data iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) is both grouped and ordered. As we see below, order is in the presentation (and not annotated) while grouping is annotated (but not in the presentation):
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% print(n=50)
# Source: local data frame [150 x 5]
# Groups: Species [3]
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fctr>
# 1 7.9 3.8 6.4 2.0 virginica
# ...
# 12 7.1 3.0 5.9 2.1 virginica
# 13 7.0 3.2 4.7 1.4 versicolor
# 14 6.9 3.1 4.9 1.5 versicolor
# 15 6.9 3.2 5.7 2.3 virginica
# ...
Notice the apparent mixing of the groups in the presentation. That is part of why there is a visible Groups: annotation: the grouping cannot be inferred from the data presentation.
My opinion
Frankly, it can be hard to document and verify which dplyr pipelines are maintaining the semantics you intended. We have every reason to believe the following is both grouped and ordered:
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length))
It is ordered because dplyr::arrange is the last step, and we can verify the grouping is present with dplyr::groups().
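For example, a quick check of the grouping annotation (I believe the printed form looks roughly as shown):
# confirm the grouping annotation survived the pipeline
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% groups()
# [[1]]
# Species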
We have less reason to trust that the following is also grouped and ordered (especially on remote databases or Spark):
iris %>% arrange(desc(Sepal.Length)) %>% group_by(Species)
The above may be simultaneously grouped and ordered (i.e., may not have lost the order), but for reasons of “trust, but verify” it would be nice to have a user-visible annotation certifying that. Remember, explicitly verifying the order requires the use of a window function (such as lag), so verifying order by hand isn’t always a convenient option.
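For completeness, here is a sketch of such a by-hand order check using lag (which dplyr applies per group on grouped data):
# check that Sepal.Length is non-increasing within each Species group
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>%
  mutate(orderOK = is.na(lag(Sepal.Length)) | lag(Sepal.Length) >= Sepal.Length) %>%
  summarize(allOrdered = all(orderOK))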
We need to put some of the dplyr machinery in a housing to keep our fingers from getting into the gears.
Essentially this is saying: wrap Hadley Wickham’s “The Split-Apply-Combine Strategy for Data Analysis” (link) concept into a single atomic operation with semantics (written as pseudocode):
data %>% split(column1) %>% lapply(arrange(column2) %>% f()) %>% bind_rows()
which we call “Grouped-Ordered-Apply.”
By wrapping the pipeline into a single “Grouped-Ordered-Apply” operation we deliberately make intermediate results invisible. This is exactly what is needed to stop depending on distinctions in how partitioning is enforced (be it by a grouping annotation or by an actual split) and to stop worrying about the order of the internal operations. An executable sketch of the idea follows.
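To make the intended semantics concrete, here is a base-R sketch of such a wrapper (grouped_ordered_apply is a hypothetical name used only for illustration, not part of any package):
# split by group, order each piece, apply f, and re-combine
grouped_ordered_apply <- function(d, gcolumn, ocolumn, f, decreasing = FALSE) {
  pieces <- split(d, d[[gcolumn]])                          # Split
  pieces <- lapply(pieces, function(di) {                   # Apply (in order)
    di <- di[order(di[[ocolumn]], decreasing = decreasing), , drop = FALSE]
    f(di)
  })
  dplyr::bind_rows(pieces)                                  # Combine
}
# usage: iris %>% grouped_ordered_apply('Species', 'Sepal.Length',
#                                       rank_in_group, decreasing = TRUE)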
A Suggested Solution
Our new package replyr supplies the “Grouped-Ordered-Apply” operation as replyr::gapply (itself built on top of dplyr). It performs the above grouped/ordered calculation as follows:
library('dplyr')
# install.packages('replyr') # Run this if you don't already have replyr
library('replyr')
data(iris)
# define our operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
mutate(rank=cumsum(constcol)) %>% select(-constcol)
# apply our operation to data that is simultaneously grouped and ordered
res <-
iris %>% gapply(gcolumn='Species',
f=rank_in_group,
ocolumn='Sepal.Length',decreasing=TRUE)
# present results
res %>% filter(rank <= 2) %>% arrange(Species, rank)
replyr::gapply can use either a split-based strategy or a dplyr::group_by_-based strategy for the calculation. Notice replyr::gapply’s preference for “parametric treatment of variables”: replyr::gapply anticipates that the analyst may not know the names of columns or variables when writing their code, but may in fact need to take names as values stored in other variables. Essentially we are making the dplyr::*_ forms preferred. The rank_in_group function uses dplyr’s preferred non-standard evaluation, which assumes the analyst knows the names of the columns they are manipulating; that is appropriate for transient user code.
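For example, the column names can live in ordinary variables (the variable names here are hypothetical; the result is the same as before):
groupVar <- 'Species'       # group column name held as a value
orderVar <- 'Sepal.Length'  # order column name held as a value
res <- iris %>% gapply(gcolumn=groupVar,
                       f=rank_in_group,
                       ocolumn=orderVar, decreasing=TRUE)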
Now that we have the rank annotations present we can try to confirm they are in fact correct (i.e., that the implementation maintained grouping and ranking throughout). The calculation is detailed (checking that ranks are unique per group, are integers in the range 1 to group-size, and are order-compatible with the value column Sepal.Length), so we have wrapped it in replyr:
replyr_check_ranks(res,
GroupColumnName='Species',
ValueColumnName='Sepal.Length',
RankColumnName='rank',
decreasing=TRUE)
# goodRankedGroup groupID nRows nGroups nBadRanks nUniqueRanks nBadOrders
# 1 TRUE setosa 50 1 0 50 0
# 2 TRUE versicolor 50 1 0 50 0
# 3 TRUE virginica 50 1 0 50 0
For simplicity we wrote the primary checking function in terms of operations that happen to be correct only when a single group is present (i.e., the function needs formal splitting and isolation, not just dplyr::group_by). This isn’t a problem, as we can then use replyr::gapply(partitionMethod='split') to correctly apply such code to all groups in turn.
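A sketch of forcing the split strategy (argument name as given above):
# force formal splitting so single-group-only code is safely applied per group
res <- iris %>% gapply(gcolumn='Species',
                       f=rank_in_group,
                       ocolumn='Sepal.Length', decreasing=TRUE,
                       partitionMethod='split')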
Notice the Split-Apply-Combine steps are all wrapped together and supplied as part of the service; the user only supplies (column1, column2, f()). The transient lifetime and limited visibility of the sub-stages of the wrapped calculation are the appropriate abstractions given the fragility of row order in modern data stores. The user doesn’t care whether the data is actually split and ordered, as long as it is presented to their function as if it were so structured. We are using the Split-Apply-Combine pattern, but abstracting away whether it is actually implemented by formal splitting (ameliorating the differences between base::split, tidyr::nest, and SQL GROUP BY ... ORDER BY ...). There are benefits in isolating the user-visible semantics from the details of realization.
Much can be written in terms of this pattern, including grouped ranking problems, dplyr::summarize, and more. And this is precisely the semantics of gapply (grouped ordered apply) found in replyr.
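For instance, a per-group summary fits the same pattern; a sketch (assuming, per replyr’s interface, that the ordering column may be omitted):
# per-Species mean Sepal.Length expressed as a grouped apply
mean_sepal <- function(d) { summarize(d, meanSepalLength=mean(Sepal.Length)) }
iris %>% gapply(gcolumn='Species', f=mean_sepal)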
Additional examples
An advantage of using the general notation as above is that dplyr has implementations that work on large remote data services such as databases and Spark.
For example, here is the “rank within group” calculation performed in PostgreSQL (assuming you have such a database up, and using your own user/password). For these additional examples we continue to suppose our goal is to compute the rank of Sepal.Length for irises grouped by Species.
# install.packages('replyr') # Run this if you don't already have replyr
library('dplyr')
library('replyr')
data(iris)
# define our windowed operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
mutate(rank=cumsum(constcol)) %>% select(-constcol)
# get a database handle and copy the data into the database
my_db <- dplyr::src_postgres(host = 'localhost', port = 5432,
user = 'postgres', password = 'pg')
irisD <- replyr_copy_to(my_db,iris)
# run the ranking in the database
res <-
irisD %>% gapply(gcolumn='Species',
f=rank_in_group,
ocolumn='Sepal.Length',decreasing=TRUE)
# present results
res %>% filter(rank <= 2) %>% arrange(Species, rank)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species rank
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 5.8 4.0 1.2 0.2 setosa 1
# 2 5.7 3.8 1.7 0.3 setosa 2
# 3 7.0 3.2 4.7 1.4 versicolor 1
# 4 6.9 3.1 4.9 1.5 versicolor 2
# 5 7.9 3.8 6.4 2.0 virginica 1
# 6 7.7 3.0 6.1 2.3 virginica 2
dplyr::group_by can also perform the grouped ordered calculation in PostgreSQL using the code below:
irisD %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% rank_in_group %>%
filter(rank<=2) %>% arrange(Species,rank)
We can even perform the same calculation using Spark and sparklyr.
# get a Spark handle and copy the data in
my_s <- sparklyr::spark_connect(master = "local")
irisS <- replyr_copy_to(my_s,iris)
# re-run the ranking in Spark
irisS %>% gapply(gcolumn='Species',
f=rank_in_group,
ocolumn='Sepal_Length',decreasing=TRUE) %>%
  filter(rank <= 2) %>% arrange(Species, rank)
# Sepal_Length Sepal_Width Petal_Length Petal_Width Species rank
# <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
# 1 5.8 4.0 1.2 0.2 setosa 1
# 2 5.7 4.4 1.5 0.4 setosa 2
# 3 7.0 3.2 4.7 1.4 versicolor 1
# 4 6.9 3.1 4.9 1.5 versicolor 2
# 5 7.9 3.8 6.4 2.0 virginica 1
# 6 7.7 3.8 6.7 2.2 virginica 2
Notice the sparklyr adapter changed column names by replacing “.” with “_”, so we had to change our ordering column specification to “ocolumn='Sepal_Length'” to match. This is the only accommodation we had to make to switch to a Spark service. Outside of R (and Lisp) dots in identifiers are considered a bad idea and should be avoided. For instance, most SQL databases reserve dots to indicate relations between schemas, tables, and columns (so it is only through sophisticated quoting mechanisms that the PostgreSQL example was able to use dots in column names).
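One defensive habit is to remove the dots before copying data to such services; a minimal sketch:
# replace dots in column names with underscores before copying to Spark/SQL
iris2 <- iris
names(iris2) <- gsub('.', '_', names(iris2), fixed=TRUE)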
dplyr directly completes the same calculation with:
irisS %>% group_by(Species) %>% arrange(desc(Sepal_Length)) %>% rank_in_group %>%
filter(rank<=2) %>% arrange(Species,rank)
And that is the Grouped-Ordered-Apply pattern and our dplyr-based reference implementation. (Exercise for the reader: implement SQL’s useful “UPDATE WHERE” operation in terms of replyr::gapply.)