Using PostgreSQL in R: A quick how-to
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.
We call this work pattern “SQL Screwdriver”: delegating data manipulation to a lightweight infrastructure with the power of SQL for data manipulation.
We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.
For the rest of this post, we give a quick how-to on using the RpostgreSQL
package to interact with Postgres databases in R.
You have your PostgresSQL database up and running. Now you want to work with the data in that database in R. First, let’s create a data frame that we want to insert into the database.
# An example data frame to play with iris = as.data.frame(iris) summary(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## Species ## setosa :50 ## versicolor:50 ## virginica :50
The column names of this data frame are problematic for databases (and especially PostgreSQL) for a few reasons: the “.”s in the names can be an issue, and PostgreSQL expects column names to be all lowercase. Here’s a function to make the column names db safe:
# make names db safe: no '.' or other illegal characters, # all lower case and unique dbSafeNames = function(names) { names = gsub('[^a-z0-9]+','_',tolower(names)) names = make.names(names, unique=TRUE, allow_=TRUE) names = gsub('.','_',names, fixed=TRUE) names } colnames(iris) = dbSafeNames(colnames(iris)) summary(iris) ## sepal_length sepal_width petal_length petal_width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## species ## setosa :50 ## versicolor:50 ## virginica :50
Now let’s open up a database connection and insert the table.
# Create a connection to the database library('RPostgreSQL') ## Loading required package: DBI pg = dbDriver("PostgreSQL") # Local Postgres.app database; no password by default # Of course, you fill in your own database information here. con = dbConnect(pg, user="ninazumel", password="", host="localhost", port=5432, dbname="ninazumel") # write the table into the database. # use row.names=FALSE to prevent the query # from adding the column 'row.names' to the table # in the db dbWriteTable(con,'iris',iris, row.names=FALSE) ## [1] TRUE
The function dbWriteTable()
returns TRUE
if the table was successfully written. Note this call will fail if iris
already exists in the database. Use overwrite=TRUE
to force overwriting of an existing table, and append=TRUE
to append to an existing table.
Now you can read the table back out.
# read back the full table: method 1 dtab = dbGetQuery(con, "select * from iris") summary(dtab) ## sepal_length sepal_width petal_length petal_width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## species ## Length:150 ## Class :character ## Mode :character ## # read back the full table: method 2 rm(dtab) dtab = dbReadTable(con, "iris") summary(dtab) ## sepal_length sepal_width petal_length petal_width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## species ## Length:150 ## Class :character ## Mode :character
Of course, the point of using a database is to extract subsets or transformations of your data, using SQL.
# get part of the table rm(dtab) dtab = dbGetQuery(con, "select sepal_length, species from iris") summary(dtab) ## sepal_length species ## Min. :4.300 Length:150 ## 1st Qu.:5.100 Class :character ## Median :5.800 Mode :character ## Mean :5.843 ## 3rd Qu.:6.400 ## Max. :7.900
You can use dbSendQuery
for sending queries that don’t return a data-frame-like result.
# remove table from database dbSendQuery(con, "drop table iris") # commit the change dbCommit(con)
When you are done, disconnect.
# disconnect from the database dbDisconnect(con)
And that’s it!
Extra: RPostgreSQL
and sqldf
If you are accustomed to manipulating data with SQL, you may prefer SQL notation to the sometimes convoluted calling conventions of the analogous R operations like aggregate()
, or the functions in the dplyr
package. In this case you have probably already discovered the sqldf
package, which allows you to manipulate data frames using SQL. If you are a sqldf
user, there is an additional subtlety if you are also using RPostgreSQL
or other R packages for talking to databases: sqldf
uses its own internal (and ephemeral) database to perform its operations, but if RPostgreSQL
is loaded, sqldf
will pick up your PostgreSQL driver by default. This is probably not what you want.
options(gsubfn.engine = "R") library(sqldf) ## Loading required package: gsubfn ## Loading required package: proto ## Loading required package: RSQLite ## sqldf will default to using PostgreSQL
To use sqldf
on local data frames, you must specify the driver and dbname explicitly.
sqldf(query, drv="SQLite", dbname=":memory:") ## avg_sepal_length species ## 1 5.006 setosa ## 2 5.936 versicolor ## 3 6.588 virginica
Extra: PostgreSQL and dplyr
If you do use dplyr
, the good news is that you can connect to a PostgreSQL database directly through the dplyr
function src_postgres()
.
library('dplyr') # Connect to local PostgreSQL via dplyr localdb <- src_postgres(dbname = '', host = 'localhost', port = 5432, user = 'ninazumel', password = '') # cheat and access the db connection directly # assume we have made the colnames db safe dbWriteTable(localdb$con,'iris',iris, row.names=FALSE) ## [1] TRUE
The tbl()
command lets you access tables in the database remotely, and sql()
lets you send queries.
# this is not a data frame; it's a dplyr PostgreSQL handle into the database d = tbl(localdb, "iris") d ## Source: postgres 9.5.0 [ninazumel@localhost:5432/ninazumel] ## From: iris [150 x 5] ## ## sepal_length sepal_width petal_length petal_width species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5.0 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## .. ... ... ... ... ... # this is a data frame dtab = as.data.frame(d) # send a query through dplyr query = "select avg(sepal_length) avg_sepal_length, species from iris group by species" dsub = tbl(localdb, sql(query)) dsub ## Source: postgres 9.5.0 [ninazumel@localhost:5432/ninazumel] ## From: [?? x 2] ## ## avg_sepal_length species ## 1 5.936 versicolor ## 2 6.588 virginica ## 3 5.006 setosa ## .. ... ... # make it local dsub = as.data.frame(dsub) summary(dsub) ## avg_sepal_length species ## Min. :5.006 Length:3 ## 1st Qu.:5.471 Class :character ## Median :5.936 Mode :character ## Mean :5.843 ## 3rd Qu.:6.262 ## Max. :6.588 # shuts down database rm(list=c('d','localdb')); gc()
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.