Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting data into this “ready to model” format (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This arrangement is how R data.frames describe themselves (try “str(data.frame(x=1:2))” in an R console to see this) and is part of the tidy data manifesto.
Tools like SQL
(structured query language) and dplyr
can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below.
SQL
SQL is actually a rich modeling environment with interesting semantics due to its relation to E. F. Codd’s relational model. However, working with SQL databases presents a number of practical difficulties:
- Analysts are forced to think in relational terms (normalized forms, key columns, constraints, and so on).
- Not being dedicated to transient analysis, SQL databases are often configured to prohibit altering data in place (which makes sense: analysis should not change the initial facts).
- “Full” databases discourage copying tables or making a large number of intermediate structures.
- Ongoing transactions cause the data to change over time.
Working around these difficulties, perhaps by setting up a staging database, or even containerizing the database, still leaves two problems:
- The difficulty of the SQL syntax itself.
- The idea that you are not supposed to identify rows or depend on implicit row order.
These two problems are side-effects of being forced to “think in relational terms.” The first problem can be mitigated with adapters, such as dplyr. The second problem is one to take to heart: stop thinking about row numbers and implicit row order. Learn to re-phrase your analysis in a row-ID independent manner. Most predictive analytic goals are row-ID independent (example: fitting a linear model doesn’t depend on row order), so it makes sense from a point of purity to insist that the intermediate analysis steps also not depend on row-IDs.
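We can check the linear model claim directly. Here is a minimal sketch (the formula and the random permutation are arbitrary illustrations):

data(iris)
m1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
set.seed(2017)
perm <- sample.int(nrow(iris))       # a random re-ordering of the rows
m2 <- lm(Sepal.Length ~ Sepal.Width, data = iris[perm, ])
all.equal(coef(m1), coef(m2))        # TRUE: the fit ignores row order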
We are not going to directly demonstrate any SQL queries in this article, but be aware that SQL concepts and relational model concepts are the foundation of a lot of the user-visible semantics of substantial data stores (such as PostgreSQL and Spark).
Split-Apply-Combine and dplyr
Most transformations that appear to need row-IDs (such as subsetting and computing cumulative sums) can be re-phrased in terms of grouping and ordering. Many ideas for accomplishing this reformulation can be found in Hadley Wickham’s article “The Split-Apply-Combine Strategy for Data Analysis” (link).
Three obvious ways to implement the Split-Apply-Combine strategy are:
- base::split followed by lapply(., f) and dplyr::bind_rows
- tidyr::nest followed by lapply(., f) and tidyr::unnest
- dplyr::group_by followed by f and dplyr::ungroup
Of the three, only the dplyr::group_by realization is commonly practical on large remote data sources. However, in our own replyr::gapply we are experimenting with including an actual split-based approach.
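To make the three variants concrete, here is a minimal sketch of each performing the same task (centering Sepal.Length within each Species; the task and column choices are just illustrations, and a reasonably recent dplyr/tidyr is assumed):

library('dplyr')
library('tidyr')
data(iris)
# the per-group transform: center Sepal.Length within the piece
f <- function(d) { d$Sepal.Length <- d$Sepal.Length - mean(d$Sepal.Length); d }

# 1: base::split followed by lapply(., f) and dplyr::bind_rows
r1 <- iris %>% split(iris$Species) %>% lapply(f) %>% bind_rows()

# 2: tidyr::nest followed by lapply(., f) and tidyr::unnest
r2 <- iris %>% group_by(Species) %>% nest() %>%
  mutate(data = lapply(data, f)) %>% unnest(data) %>% ungroup()

# 3: dplyr::group_by followed by an in-group mutate and dplyr::ungroup
r3 <- iris %>% group_by(Species) %>%
  mutate(Sepal.Length = Sepal.Length - mean(Sepal.Length)) %>% ungroup()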
A proper modern SQL implementation (such as PostgreSQL or Spark SQL) supports all of the operations of Split-Apply-Combine (though pivoting/un-pivoting, or gather/spread, has fairly shaky database support, despite database documentation claiming otherwise). SQL just expresses grouping and ordering in an ugly way: the grouping and ordering are both aspects of the query and not annotations on the data. This makes sense, as one of the cases SQL considers is read-only data with no temporary tables, so there is nowhere for the user to maintain the annotations other than in the query.
SQL comes from the age of data centers, when the computer’s time was more valuable than the analysts’ time. Thus, burdens were placed on the analysts (i.e., in the SQL). However, as mentioned above, this difficulty can be overcome with adapters. SQL queries can hold all the annotations and constraints needed for sophisticated analysis, and this process can be automated; the analyst does not have to be the one to put them in the query.
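For example (a sketch, assuming the dbplyr and RSQLite packages are available; the in-memory table copy is only for demonstration), an adapter can write the grouped SQL for us:

library('dplyr')
con <- DBI::dbConnect(RSQLite::SQLite(), ':memory:')
irisdb <- copy_to(con, iris, 'iris')
irisdb %>%
  group_by(Species) %>%
  summarize(mean_len = mean(Sepal.Length)) %>%
  show_query()   # the grouping annotation becomes SQL; we never write it
DBI::dbDisconnect(con)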
dplyr seeks to mitigate creeping SQL query complexity through adapters and many notation improvements, and includes one major semantic improvement: grouping is a visible annotation on the data. For example, we move from writing seemingly convenient index-driven code that looks like:
data(iris)
irism <- iris
# find the rows of one group by explicit row indices
setosaIndexes <- which(irism$Species=='setosa')
irism[setosaIndexes,'Sepal.Length'] <- 1 + irism[setosaIndexes,'Sepal.Length']
To writing purer transforms such as:
library('dplyr')
data(iris)
irism <- iris
# split into per-group pieces, operate on the named piece, then recombine
irism %>% split(irism$Species) -> itable
itable$setosa$Sepal.Length <- 1 + itable$setosa$Sepal.Length
irism <- dplyr::bind_rows(itable)
At first this seems like a lot of work, but the row-ID free notation eventually pays off. The Split-Apply-Combine approach is much more composable and usable on many remote data sources (many of which deliberately do not have concepts of durable row-order or row-numbers).
Split-Apply-Combine is interesting because it lets us perform arbitrary simultaneous grouped and ordered operations, such as in-group min/max, in-group ranking, in-group cumulative sums, in-group replacement of values, and so on. Split-Apply-Combine achieves this by maintaining groups through splitting, and imposing order by later sub-queries or per-group operations (SQL tables are not guaranteed to maintain row-orders, but sub-queries can). However, I have come to think of splitting as an implementation detail whose lifetime and visibility we want to limit.
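For example, here is a minimal dplyr sketch of simultaneous in-group ordering, ranking, and cumulative summation (assuming a dplyr recent enough to supply arrange(.by_group = TRUE); the column choices are arbitrary):

library('dplyr')
data(iris)
iris %>%
  group_by(Species) %>%                        # grouping as a data annotation
  arrange(Sepal.Length, .by_group = TRUE) %>%  # impose order within groups
  mutate(rank_in_group = row_number(),         # in-group ranking
         cum_len = cumsum(Sepal.Length)) %>%   # in-group cumulative sum
  ungroup()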
I am coming to think the useful Split-Apply-Combine primitive pattern looks like a wrapped function supplying the semantics:
data %>% split(column1) %>% lapply(arrange(column2) %>% f()) %>% bind_rows()
where “f()” is a user supplied function or magrittr pipeline.
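As a concrete, purely local illustration of those semantics, here is a hypothetical helper; gapply_local and its arguments are inventions for this sketch, not the replyr::gapply implementation:

library('dplyr')
gapply_local <- function(data, gcolumn, ocolumn, f) {
  pieces <- split(data, data[[gcolumn]])          # split: group the rows
  pieces <- lapply(pieces, function(d) {
    d <- d[order(d[[ocolumn]]), , drop = FALSE]   # arrange within the group
    f(d)                                          # apply the user transform
  })
  bind_rows(pieces)                               # combine the results
}

# Example: per-Species cumulative Sepal.Length, ordered by Sepal.Width.
data(iris)
res <- gapply_local(iris, 'Species', 'Sepal.Width',
                    function(d) { d$cum <- cumsum(d$Sepal.Length); d })
head(res)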
In my next article in this series I will further discuss the specialization of Split-Apply-Combine to this Grouped-Ordered-Apply pattern.