Metadata : from PostgreSQL comments to R labels
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In PostgreSQL metadata can be stored in comments with the statements COMMENT ON COLUMN...
or COMMENT ON TABLE...
. So I hope your tables have this nice comments, and you can see them in PgAdmin for example. But what about R ?
In R metadata can be assigned as attributes of any object and mainly as « labels ». We will use the Hmisc
package which provides functions to manage labels. Another interesting package is sjlabelled
. You may also have seen labels when importing labelled data from SPSS for example.
library(Hmisc) library(tidyverse) library(RPostgreSQL) library(glue) cnx <- dbConnect(dbDriver("PostgreSQL"), user = "***", password = "***", host = "***", dbname = "***", port = 5432 )
Usually we got some data this way :
cog <- dbGetQuery(cnx, "SELECT * FROM ref_cog.france2018 LIMIT 10")
We can create a function that will query the metadata of the table in information_schema.columns
and add it to the data frame ; the function expects a data frame, the name of the schema.table from which we get the comments and a connection handler. It will return the data frame with labels :
#' Add attributes to a dataframe from metadata read in the PostgreSQL database #' #' @param df dataframe #' @param schema_table "schema.table" from which to read the comments #' @param cnx a database connexion from RPostgreSQL::dbConnect() #' #' @return a dataframe with attributes #' #' @examples \dontrun{add_metadata(iris, "public.iris", cnx)} add_metadata <- function(df, schema_table, cnx) { meta <- str_match(schema_table, "^(.*)\\.(.*)$") %>% glue_sql( "SELECT column_name, pg_catalog.col_description( format('%s.%s', isc.table_schema, isc.table_name)::regclass::oid, isc.ordinal_position) AS column_description FROM information_schema.columns AS isc WHERE isc.table_schema = {s[2]} AND isc.table_name = {s[3]};", s = ., .con = cnx) %>% dbGetQuery(cnx, .) label(df, self = FALSE) <- colnames(df) %>% enframe() %>% left_join(meta, by = c("value" = "column_name")) %>% pull(column_description) df }
Now we would do :
cog <- dbGetQuery(cnx, "SELECT * FROM ref_cog.france2018 LIMIT 10") %>% add_metadata("ref_cog.france2018", cnx)
You can see the metadata in the column headings of the RStudio viewer with View(cog)
or with contents(cog)
. Or with :
cog %>% label() %>% enframe()
Or lastly, for one column :
label(cog$cheflieu)
We can also search for information in the variable names or in the labels with another function :
search_var <- function(df, keyword) { df %>% label() %>% enframe() %>% rename(variable = name, metadata = value) %>% filter_all(any_vars(str_detect(., regex(keyword, ignore_case = TRUE)))) } search_var(cog, "canton")
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.