Database Reflection using dplyr
At work I write a ton of SQL, and I do most of my querying using R. The workflow goes:
- Create a string with the SQL in R
- Plug the string into fetchQuery (see my previous post)
This works relatively well, but I'm a bit unhappy writing strings rather than using function calls.
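Just to illustrate, here is roughly what that string-based workflow looks like. fetchQuery() and the config.gp connection settings come from the previous post, so the exact signature is an assumption, and the table and column names are made up:
# Build the SQL as a plain string, then hand it off along with the connection
# config (fetchQuery() and config.gp are from the previous post; signature assumed)
qry <- paste(
  "SELECT user_id, created_at",
  "FROM users",
  "WHERE created_at >= '2014-01-01'"
)
res <- fetchQuery(qry, config.gp)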
I began working on my own ORM implementation, but it was slow going and would have taken a long time to get anywhere. Luckily, Hadley Wickham's new dplyr package implements much of the ORM I was hoping for.
One thing I want out of an ORM is the ability to see every table in our databases all at once. That functionality, while implementable with dplyr, would likely take quite a while on tens of thousands of tables, so I decided to implement the reflection myself.
#' Get the table information for a Postgres database
#' @param config the configuration list
#' @return the table names, columns, and column types of all columns in the database
getTableInformation <- function(config = config.gp) {
  fetchQuery(
    "SELECT table_name, column_name, data_type
     FROM information_schema.columns
     WHERE table_name NOT LIKE '%prt%'
       AND table_name NOT LIKE '%ext%'
       AND table_name NOT LIKE '%tmp%'
     ORDER BY 1, 2",
    config
  )
}
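Called on its own, this returns one row per database column, which is the raw material for the reflection below. A quick sketch (the "users" table name is hypothetical):
# Column metadata for the whole database: table_name, column_name, data_type
cols <- getTableInformation(config.gp)
# Look at the columns of a single (hypothetical) table
subset(cols, table_name == "users")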
#' Replacement for the normal update function; you don't need to call this directly.
update <- function(object, ...) {
  args <- list(...)
  for (nm in names(args)) {
    object[[nm]] <- args[[nm]]
  }
  # Reflected tables arrive with their column list already set, so the
  # field lookup below is skipped for them.
  if (is.null(object$select)) {
    if (is.ident(object$from)) {
      var_names <- object$select
    } else {
      var_names <- qry_fields(object$src$con, object$from)
    }
    vars <- lapply(var_names, as.name)
    object$select <- vars
  }
  # Rebuild the SQL query from the updated pieces
  object$query <- dplyr:::build_query(object)
  object
}
#' Reflect a database; generalizable to databases beyond Postgres by
#' changing getTableInformation appropriately.
reflectDatabase <- function(config, envir.name = "tables",
                            subclass = "postgres") {
  # Create (or refresh) the environment that will hold the reflected tables
  if (!(envir.name %in% search())) {
    envir <- new.env(parent = .GlobalEnv)
  } else {
    envir <- as.environment(envir.name)
    detach(envir.name, character.only = TRUE)
  }
  src <- do.call(src_postgres, config)
  tables <- getTableInformation(config)
  tables <- split(tables, tables$table_name)
  # Build one tbl per table from the reflected column list, so dplyr never
  # has to query each table for its fields
  lapply(tables, function(i) {
    nm <- ident(i$table_name[1])
    vars <- lapply(i$column_name, as.name)
    tbl <- dplyr::make_tbl(c(subclass, "sql"), src = src, from = nm,
                           select = vars, summarise = FALSE, mutate = FALSE,
                           where = NULL, group_by = NULL, order_by = NULL)
    tbl <- update(tbl)
    assign(nm, tbl, envir = envir)
  })
  attach(envir, name = envir.name)
}
searchTables <- function(str, env = "tables") {
all.tbls <- ls(env)
all.tbls[grep(str, all.tbls)]
}
To use this function, you can simply call
reflectDatabase(config)
with your connection settings, and if you're using a Postgres database, that should be it!
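The config list is handed straight to dplyr::src_postgres() via do.call(), so its names just need to match that function's arguments. A minimal sketch with placeholder values:
# Placeholder connection settings; do.call(src_postgres, config) turns these
# into arguments of dplyr::src_postgres()
config <- list(
  dbname   = "analytics",
  host     = "localhost",
  port     = 5432,
  user     = "me",
  password = "secret"
)
reflectDatabase(config)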
The fun part is that I can now do things like
res <- inner_join(my_table_1, my_table_2)
where my_table_1 and my_table_2 are simply the names of tables in my database. This gives me auto-completion of table names, searchable table names and columns, and so on.
For example:
searchTables('user')
returns all tables in our database with the string “user” in them.
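And since the reflected objects are ordinary dplyr tbls, the usual verbs translate to SQL and run in the database, with nothing pulled into R until you ask for it. A sketch with made-up table and column names:
# filter() and select() are translated to SQL; collect() actually runs the
# query and brings the rows into R
active <- filter(my_table_1, status == 'active')
res <- collect(select(active, user_id, created_at))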
These are some things I hope to see or find in dplyr, and may try to build myself if they don’t already exist:
1. Case statements in mutate
2. Creating table indexes
3. Type checking of columns, and more informative error messages when nonsensical joins and filters are attempted.
Overall this package seems like a lot of fun, and I'm excited to work it into my coding!