Database Reflection using dplyr
At work I write a ton of SQL, and I do most of my querying using R. The workflow goes:
- Create a string with the SQL in R
- Plug the string into fetchQuery (see my previous post), as in the sketch below
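A typical query in that workflow looks something like this (a minimal sketch: the table and column names are hypothetical, and fetchQuery takes the SQL string plus a connection config, as it does in the reflection code further down):

```r
# Sketch of the string-based workflow. The table and column names are
# hypothetical; fetchQuery() is from the previous post and takes the SQL
# string plus a connection config (config.gp).
query <- "
  SELECT user_id, COUNT(*) AS n_orders
  FROM orders
  GROUP BY 1
"
res <- fetchQuery(query, config.gp)
```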
This solution works relatively well, but I’m a bit unhappy writing strings rather than using function calls.
I began working on my own ORM implementation, but it was slow going and would have taken a lot of time to get anywhere. Luckily, I was pleasantly surprised to find that Hadley Wickham’s new dplyr package implements much of the ORM I was hoping for.
One thing I want out of an ORM is the ability to see every table in our databases all at once. That functionality, while implementable using the dplyr package, would likely take quite a while on tens of thousands of tables. So I decided to implement the reflection myself.
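For comparison, the stock dplyr route wraps each table one at a time with tbl(); roughly like this (connection arguments and table names are placeholders):

```r
library(dplyr)

# The per-table dplyr route: open a source, then wrap each table you need
# with tbl(). Connection arguments and table names are placeholders.
src    <- src_postgres(dbname = "analytics", host = "localhost",
                       user = "me", password = "xxxx")
users  <- tbl(src, "users")
orders <- tbl(src, "orders")
```

Doing that for tens of thousands of tables is exactly what the reflection below avoids.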
```r
#' Get the table information for a postgres database
#' @param config the configuration list
#' @return the table names, columns, and column types of all columns in the database
getTableInformation <- function(config = config.gp) {
  tables <- fetchQuery(
    "SELECT table_name, column_name, data_type
       FROM information_schema.columns
      WHERE table_name NOT LIKE '%prt%'
        AND table_name NOT LIKE '%ext%'
        AND table_name NOT LIKE '%tmp%'
      ORDER BY 1, 2",
    config
  )
}

#' Replacement of the normal update function, you don't need to call this.
update <- function(object, ...) {
  args <- list(...)
  for (nm in names(args)) {
    object[[nm]] <- args[[nm]]
  }

  # Fill in the column list if one was not supplied
  if (is.null(object$select)) {
    if (is.ident(object$from)) {
      var_names <- object$select
    } else {
      var_names <- qry_fields(object$src$con, object$from)
    }
    vars <- lapply(var_names, as.name)
    object$select <- vars
  }

  object$query <- dplyr:::build_query(object)
  object
}

#' Function to reflect a database, generalizable to others beyond postgres
#' by simply changing getTableInformation appropriately
reflectDatabase <- function(config, envir.name = "tables",
                            subclass = "postgres") {
  # Reuse (and re-attach) the environment if it is already on the search path
  if (!(envir.name %in% search())) {
    envir <- new.env(parent = .GlobalEnv)
  } else {
    envir <- as.environment(envir.name)
    detach(envir.name, character.only = TRUE)
  }

  src    <- do.call(src_postgres, config)
  tables <- getTableInformation(config)
  tables <- split(tables, tables$table_name)

  # Build one tbl per table and drop it into the environment
  lapply(tables, function(i) {
    nm   <- ident(i$table_name[1])
    vars <- lapply(i$column_name, as.name)
    tbl  <- dplyr::make_tbl(c(subclass, "sql"),
                            src = src,
                            from = nm,
                            select = vars,
                            summarise = FALSE,
                            mutate = FALSE,
                            where = NULL,
                            group_by = NULL,
                            order_by = NULL)
    tbl <- update(tbl)
    assign(nm, tbl, envir = envir)
  })

  attach(envir, name = envir.name)
}

searchTables <- function(str, env = "tables") {
  all.tbls <- ls(env)
  all.tbls[grep(str, all.tbls)]
}
```
To use this function, you can simply call
reflectDatabase(config)
and if you’re using a Postgres database, that should be it!
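Since the config is handed straight to src_postgres() via do.call(), it is just a named list of connection arguments; something along these lines (all values are placeholders):

```r
# config is passed to src_postgres() via do.call(), so it is simply a named
# list of connection arguments. All values below are placeholders.
config.gp <- list(dbname = "analytics", host = "db.example.com",
                  port = 5432, user = "me", password = "xxxx")

reflectDatabase(config.gp)  # attaches an environment named "tables"
```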
The fun part now is that I can do things like
res <- inner_join(my_table_1, my_table_2)
where my_table_1 and my_table_2 are simply names of tables in my database. This gives me auto-completion of table names, searchable table names and columns, and so on.
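Because the reflected objects are ordinary dplyr tbls, the usual verbs compose lazily on top of them and the SQL only runs when you collect the result. A hypothetical example (column names are made up):

```r
# The reflected tables are regular dplyr tbls, so verbs compose lazily and
# the generated SQL only runs on collect(). Column names are hypothetical.
res <- inner_join(my_table_1, my_table_2, by = "user_id")
res <- filter(res, status == "active")
dat <- collect(res)  # executes the query and pulls the rows into R
```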
For example:
searchTables('user')
returns all tables in our database with the string “user” in them.
These are some things I hope to see or find in dplyr, and may try to build myself if they don’t already exist:
1. Case statements in mutate
2. Creating table indexes
3. Type checking of columns, and more informative error messages when nonsensical joins and filters are performed.
Overall this package seems like a lot of fun, and I’m excited to try to work it into my coding!