Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Bob Horton
Sr. Data Scientist at Microsoft
Common table expressions (CTEs, or “WITH clauses”) are a syntactic feature in SQL that makes it easier to write and use subqueries. They act as views or temporary tables that are only available during the lifetime of a single query. A more sophisticated feature is the “recursive CTE”, which is a common table expression that can call itself, providing a convenient syntax for recursive queries. This is very useful, for example, in following paths of links from record to record, as in graph traversal.
This capability is supported in Postgres, and Microsoft SQL Server (Oracle has similar capabilities with a different syntax), but not in MySql. Perhaps surprisingly, it is supported in SQLite, and since SQLite is the default backend for sqldf
, this gives R users a convenient way to experiment with recursive CTEs.
Factorials
This is the example from the Wikipedia article on hierarchical and recursive queries in SQL; you just pass it to sqldf
and it works.
library('sqldf') sqldf("WITH RECURSIVE temp (n, fact) AS (SELECT 0, 1 -- Initial Subquery UNION ALL SELECT n+1, (n+1)*fact FROM temp -- Recursive Subquery WHERE n < 9) SELECT * FROM temp;")
## n fact ## 1 0 1 ## 2 1 1 ## 3 2 2 ## 4 3 6 ## 5 4 24 ## 6 5 120 ## 7 6 720 ## 8 7 5040 ## 9 8 40320 ## 10 9 362880
Other databases may use slightly different syntax (for example, if you want to run this query in Microsoft SQL Server, you need to leave out the word RECURSIVE
), but the concept is pretty general. Here the recursive CTE named temp
is defined in a with
clause. As usual with recursion, you need a base case (here labeled “Initial Subquery”), and a recursive case (“Recursive Subquery”) that performs a select operation on itself. These two cases are put together using a UNION
statement (basically the SQL equivalent of rbind
). The last line in the query kicks off the computation by running a SELECT
statement from this CTE.
Family Tree
Let’s make a toy family tree, so we can use recursion to find all the ancestors of a given person.
family <- data.frame( person = c("Alice", "Brian", "Cathy", "Danny", "Edgar", "Fiona", "Gregg", "Heidi", "Irene", "Jerry", "Karla"), mom = c(rep(NA, 4), c('Alice', 'Alice', 'Cathy', 'Cathy', 'Cathy', 'Fiona', 'Fiona')), dad = c(rep(NA, 4), c('Brian', 'Brian', 'Danny', 'Danny', 'Danny', 'Gregg', 'Gregg')), stringsAsFactors=FALSE)
We can visualize this family tree as a graph:
library(graph) nodes <- family$person edges <- apply(family, 1, function(r) { r <- r[c("mom", "dad")] r <- r[!is.na(r)] list(edges=r) # c(r['mom'], r['dad']) }) names(edges) <- names(nodes) <- nodes g <- graphNEL(nodes=nodes, edgeL=edges, edgemode='directed') library(Rgraphviz) # from Bioconductor g <- layoutGraph(g, layoutType="dot", attrs=list(graph=list(rankdir="BT"))) renderGraph(g)
Pointing from child to parents is backwards from how family trees are normally drawn, but this reflects how the table is laid out. I built the table this way because everybody always has exactly two biological parents, regardless of family structure.
SQLite only supports a single recursive call, so we can’t recurse on both the mom
and dad
columns. To be able to trace back through both parents, I put the table in “long form”; now each parent is entered in a separate row, with ‘mom’ and ‘dad’ being values in a new column called ‘parent’.
library(tidyr) long_family <- gather(family, parent, parent_name, -person) knitr::kable(head(long_family))
person | parent | parent_name |
---|---|---|
Alice | mom | NA |
Brian | mom | NA |
Cathy | mom | NA |
Danny | mom | NA |
Edgar | mom | Alice |
Fiona | mom | Alice |
Now we can use a recursive CTE to find all the ancestors in the database for a given person:
ancestors_sql <- " WITH ancestors (name, parent, parent_name, level) AS ( SELECT person, parent, parent_name, 1 FROM long_family WHERE person = '%s' UNION ALL SELECT A.person, A.parent, A.parent_name, P.level + 1 FROM ancestors P JOIN long_family A ON P.parent_name = A.person) SELECT * FROM ancestors ORDER BY level, name, parent" sqldf(sprintf(ancestors_sql, 'Jerry'))
## name parent parent_name level ## 1 Jerry dad Gregg 1 ## 2 Jerry mom Fiona 1 ## 3 Fiona dad Brian 2 ## 4 Fiona mom Alice 2 ## 5 Gregg dad Danny 2 ## 6 Gregg mom Cathy 2 ## 7 Alice dad <NA> 3 ## 8 Alice mom <NA> 3 ## 9 Brian dad <NA> 3 ## 10 Brian mom <NA> 3 ## 11 Cathy dad <NA> 3 ## 12 Cathy mom <NA> 3 ## 13 Danny dad <NA> 3 ## 14 Danny mom <NA> 3
sqldf(sprintf(ancestors_sql, 'Heidi'))
## name parent parent_name level ## 1 Heidi dad Danny 1 ## 2 Heidi mom Cathy 1 ## 3 Cathy dad <NA> 2 ## 4 Cathy mom <NA> 2 ## 5 Danny dad <NA> 2 ## 6 Danny mom <NA> 2
sqldf(sprintf(ancestors_sql, 'Cathy'))
## name parent parent_name level ## 1 Cathy dad <NA> 1 ## 2 Cathy mom <NA> 1
We can go the other way as well, and find all of a person’s descendants:
descendants_sql <- " WITH RECURSIVE descendants (name, parent, parent_name, level) AS ( SELECT person, parent, parent_name, 1 FROM long_family WHERE person = '%s' AND parent = '%s' UNION ALL SELECT F.person, F.parent, F.parent_name, D.level + 1 FROM descendants D JOIN long_family F ON F.parent_name = D.name) SELECT * FROM descendants ORDER BY level, name " sqldf(sprintf(descendants_sql, 'Cathy', 'mom'))
## name parent parent_name level ## 1 Cathy mom <NA> 1 ## 2 Gregg mom Cathy 2 ## 3 Heidi mom Cathy 2 ## 4 Irene mom Cathy 2 ## 5 Jerry dad Gregg 3 ## 6 Karla dad Gregg 3
If you work with tree- or graph-structured data in a database, recursive CTEs can make your life much easier. Having them on hand in SQLite and usable through sqldf
makes it very easy to get started learning to use them.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.