Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have a strong distaste for database connection management. All I want to do when I want to query one of our many databases at work is to simply supply the query, and package the result into an R data.frame or data.table.
R has many great database connection tools, including but not limited to RPostgreSQL, RMySQL, RDJBC, ROracle, and RODCBC. I set out to consolidate all of my database querying into a simple function, and I have succeeded for my purposes. I can connect to my company’s MySQL, Postgres, and Oracle databases with ease with my fetchQuery function, defined (in conjunction with it’s dependencies) as follows.
#' Make a connection to a database #' #' This function abstracts the idea of a database connection, allowing variable parameters #' depending on the type of database you're connecting to #'@param config a named list of the configuration options for the database connection #'@return a connection to the database defined in the config #'@author Erik Gregory makeCxn <- function(config) { if (class(config[['drv']]) == "character") { config[['drv']] <- dbDriver(config[['drv']]) } do.call(dbConnect, config) } #' This function runs a query on a database, fetching the result if desired #' #' The purpose of this function is to remove connection management from the querying process #' @param query the query you want to make to the SQL connection you've specified #' @param config a named list of the configuration options for the connection #' @param n the number of rows to return, or -1 for all rows #' @param verbose Should the queries be printed as they're made? #' @param split Should the queries be split on semicolons, or run as a block? #' @return A list of results if multiple queries, or a single result if one query. #' @author Erik Gregory fetchQuery <- function(query, config = config.gp, split = FALSE, verbose = TRUE, n = -1) { res <- list() cxn <- makeCxn(config) t1 <- Sys.time() queries <- query if (split == TRUE) { queries <- strsplit(query, ";", fixed = TRUE)[[1]] # Split the query into components } for (item in queries) { if(verbose) { cat(paste(item, '\n')) } tmp <- try(dbSendQuery(cxn, item)) # send the query if ('try-error' %in% class(tmp)) { res[[item]] <- dbGetException(cxn) next } type <- tolower(substring(gsub(" ", "", item), 0, 6)) # identify if select, insert, delete if (type == "select" | grepl("with..", type) | grepl('EXPLAI|explai', type) | !split) { res[[item]] <- try(fetch(tmp, n)) } else { res[[item]] <- dbGetRowsAffected(tmp) cat(res[[item]]) } if (verbose) { print(Sys.time() - t1) if (!is.null(dim(res))) { print(dim(res)) } } dbClearResult(tmp) } dbDisconnect(cxn) if (length(res) == 1) { res <- res[[1]] } res }
I set my default config parameter to fetchQuery to be my most commonly used connection. I define my connections in my file at ~/.Rprofile file. The effect of this is that I always have the configuration information in memory whenever I need them. An example is as follows:
config.gp <- list( user = "username", password = "password", dbname = "MY_DBNAME", host = "url_of_host", port = port_of_host, drv = "PostgreSQL" ) config.gp.admin <- list( user = "username_of_admin", password = "password_of_admin", dbname = "MY_DBNAME", host = "url_of_host", port = port_of_host, drv = "PostgreSQL" ) config.mysql <- list( user = "username", password = "password", dbname = "MY_DBNAME", host = "MY_HOST_IP_ADDRESS", drv = "MySQL" ) config.whse <- list( drv = JDBC("oracle.jdbc.OracleDriver", "/usr/lib/oracle/instantclient_11_2/ojdbc5.jar"), user = "username", password = "password", url = "url" )
Notice that the drv (driver) argument can be either an actual driver, or a character string of the driver type. My reason for this is that some driver initializations require multiple parameters, while some only require a single one. This could be made elegant by using a do.call argument, as defined in makeCxn above.
It is important that the connection lists defined in the ~/.Rprofile file
- Have the arguments you want to pass to dbConnect, named according to the value they correspond to in dbConnect
- Have no extra arguments that you don’t want to pass to dbConnect
Some explanations behind the reasoning of the arguments and methods of fetchQuery:
- Sometimes I want to run a bunch of queries all contained in one character string to my databases. This function will either split those queries by semicolons, or run them all in one batch depending on what you ask it to do. The advantage of the former is you will have diagnostics for each of the intermediary queries (temporary table creations, table deletions or inserts, …).
- I usually want to see how my query is doing as it’s running, so I provide a verbosity option.
- fetchQuery attempts to auto-detect whether you’re doing an insert or deletion or selection, and returns a result appropriate to the operation. This algorithm is simple, crude string-matching at this point, and i’d be happy to see an improvement. It hasn’t been a problem for me yet since I am very consistent in my sql syntax.
So, whenever I want to run an oracle query i’ll run something like:
res <- fetchQuery("SELECT * FROM table_name", config.whse)
or if I want to run a query as an admin against our Postgres database
res <- fetchQuery("SELECT * FROM table_name limit 10", config.gp.admin)
The connection fortunately gets closed no matter if the query is in error or not, which is really nice for me and my company’s DBAs (R will limit the number of active database connections you can have open, so it is important to close them).
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.