
A Crash Course on PostgreSQL for R Users


Updated on 2020-09-19: I changed the “Creating tables” section. My original script used copy_to(), which creates (unique) indexes, but when I adapted the code for this post I wrote dbWriteTable(), which doesn’t create indexes and only adds records.

Updated on 2020-08-11: @jbkunst suggested a more concise approach to writing large tables with the Tidyverse. I added an explanation of how to do that and avoid obscure syntax. Thanks Joshua!

Motivation

If, like me, you keep well-organized workflows, even with tidy GitHub repositories, and you still create a lot of CSV or RDS files to share periodic data (i.e. weekly or monthly sales), then you can benefit from using PostgreSQL or any other SQL database engine such as MariaDB.

PostgreSQL and MariaDB are the database engines that I like, but this still applies to MySQL and even to proprietary databases such as Oracle Database or SQL Server.

Nycflights13

Let’s start with the flights table from the quite popular nycflights13 package:

library(nycflights13)
flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Assume that you work at a NYC airport and are in charge of updating the flights table every day. Instead of creating one CSV file per day with flight records, you can organize the process better by appending new records to an existing SQL table.

nycflights13 provides several tables besides flights. You can explore the details in the superb book R4DS. For this tutorial, the relations between the tables are summarised in the next diagram:

SQL indexes

Skipping a lot of details, an SQL index is a copy of a table that contains just a few of its columns and helps to identify observations, so that filtering those observations is much faster than without an index.

For example, in the flights table the month column is really useful to create an index that boosts the speed of SQL operations similar to this dplyr statement:

flights %>% filter(month == 3)
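
For reference, this is roughly what creating such an index looks like in raw SQL, sketched here with DBI’s dbExecute(). It assumes an open connection created as in the “SQL connections” section below; the index name flights_month_idx and the single-column choice are only illustrative, since later in the tutorial copy_to() creates the indexes for us.

conn <- fun_connect()
# create a non-unique index on the month column of the flights table
dbExecute(conn, "CREATE INDEX flights_month_idx ON flights (month)")
dbDisconnect(conn)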

To create indexes, let’s start by differentiating between unique and non-unique indexes:

unique_index <- list(
    airlines = list("carrier"),
    planes = list("tailnum")
  )

index <- list(
    airports = list("faa"),
    flights = list(
      c("year", "month", "day"), "carrier", "tailnum", "origin", "dest"
    ),
    weather = list(c("year", "month", "day"), "origin")
  )

SQL connections

Let’s assume that the NYC airport gave you access to a PostgreSQL database named postgres, hosted on a server with the IP 142.93.236.14, where the user is postgres and the password is S4cr3tP4ssw0rd007. Of course, this is an example with slightly different settings than the defaults.

Database information such as passwords should be stored in your ~/.Renviron and never in your scripts! You can type usethis::edit_r_environ() to edit your .Renviron file, add the credentials, and then restart your R session to apply the changes.

To access and start creating tables to better organize your data, you need to open and close connections, and the RPostgres package is really useful to do that and more:

library(RPostgres)
library(dplyr)

fun_connect <- function() {
  dbConnect(
    Postgres(),
    dbname = Sys.getenv("tutorial_db"),
    user = Sys.getenv("tutorial_user"),
    password = Sys.getenv("tutorial_pass"),
    host = Sys.getenv("tutorial_host")
  )
}
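
Given the example credentials above and the variable names read with Sys.getenv() in fun_connect(), the corresponding entries in ~/.Renviron would look like this:

tutorial_db=postgres
tutorial_user=postgres
tutorial_pass=S4cr3tP4ssw0rd007
tutorial_host=142.93.236.14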

By default, PostgreSQL uses the public schema in a database, which is the schema you’ll write to. You can explore the existing tables in the public schema of the postgres database by running:

schema <- "public"

conn <- fun_connect()
remote_tables <- dbGetQuery(
  conn,
  sprintf("SELECT table_name 
          FROM information_schema.tables 
          WHERE table_schema='%s'", schema)
)
dbDisconnect(conn)

remote_tables <- as.character(remote_tables$table_name)
remote_tables
character(0)

Because this is an example with a freshly created server just for this tutorial, the result shows that there are no tables in the database yet.
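
A more concise alternative is DBI’s dbListTables(), although it lists every table visible to the connection instead of filtering by schema:

conn <- fun_connect()
dbListTables(conn)
dbDisconnect(conn)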

Creating tables

Now you should be ready to start writing the tables from nycflights13 to the database:

local_tables <- utils::data(package = "nycflights13")$results[, "Item"]
tables <- setdiff(local_tables, remote_tables)

for (table in tables) {
  df <- getExportedValue("nycflights13", table)

  message("Creating table: ", table)

  table_name <- table

  conn <- fun_connect()
  copy_to(
    conn,
    df,
    table_name,
    unique_indexes = unique_index[[table]],
    indexes = index[[table]],
    temporary = FALSE
  )
  dbDisconnect(conn)
}

This approach prevents errors in case a table already exists. It also prevents a bigger mistake: overwriting an existing table instead of appending new observations.
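
If you want an explicit check before writing, DBI’s dbExistsTable() tells you whether a table is already present:

conn <- fun_connect()
dbExistsTable(conn, "flights") # returns TRUE once the table has been created
dbDisconnect(conn)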

Accessing tables

You can use dplyr in a very similar way to what you would do with a local data.frame. Let’s say you need to count the observations per month and save the result locally:

library(dplyr)

conn <- fun_connect()
obs_per_month <- tbl(conn, "flights") %>% 
  group_by(year, month) %>% 
  count() %>% 
  collect()
dbDisconnect(conn)

obs_per_month
# A tibble: 12 x 3
# Groups:   year [1]
    year month n      
   <int> <int> <int64>
 1  2013     1 27004  
 2  2013     2 24951  
 3  2013     3 28834  
 4  2013     4 28330  
 5  2013     5 28796  
 6  2013     6 28243  
 7  2013     7 29425  
 8  2013     8 29327  
 9  2013     9 27574  
10  2013    10 28889  
11  2013    11 27268  
12  2013    12 28135  

The previous chunk uses the collect() function, which stores the result of the query as a local data.frame that can then be used with ggplot2, saved as a CSV with readr, etc.
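
For example, a minimal sketch that saves the collected result to disk with readr (the file name is just an example):

library(readr)
write_csv(obs_per_month, "obs_per_month.csv")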

Appending new records

Let’s say you have a table flights14 with records for the year 2014. To append those records to the existing flights table in the database, you need to run:

conn <- fun_connect()
dbWriteTable(conn, "flights", flights14, append = TRUE, overwrite = FALSE, row.names = FALSE)
dbDisconnect(conn)
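
A quick way to check that the new records were appended is to count the rows per year on the remote table. This sketch assumes that flights14 has the same columns as flights, including year:

conn <- fun_connect()
tbl(conn, "flights") %>% 
  count(year) %>% 
  collect()
dbDisconnect(conn)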

Appending large tables

Let’s say your flights14 table contains 12,345,678 rows. A better approach than writing the full table at once is to divide it into chunks of 2,500,000 rows (or some other smaller number); with these figures, that means four full chunks plus a final chunk of 2,345,678 rows.

Using base R

N <- 2500000
divisions <- seq(N, N * ((nrow(flights14) %/% N) + 1), by = N)
# cap the last cut point at the actual number of rows to avoid NA rows
divisions <- unique(pmin(divisions, nrow(flights14)))

flights14_2 <- list()

for (j in seq_along(divisions)) {
  if (j == 1) {
    flights14_2[[j]] <- flights14[1:divisions[j], ]
  } else {
    flights14_2[[j]] <- flights14[(divisions[j - 1] + 1):divisions[j], ]
  }
}

rm(flights14)
    
for (j in seq_along(divisions)) {
  message(sprintf("Writing fragment %s of %s", j, length(divisions)))
  conn <- fun_connect()
  dbWriteTable(conn, "flights", flights14_2[[j]],
    append = TRUE, overwrite = FALSE, row.names = FALSE)
  dbDisconnect(conn)
}

Using the Tidyverse

library(tidyverse)

N <- 2500000

flights14 <- flights14 %>%
  mutate(p = ceiling(row_number() / N)) %>% 
  group_by(p) %>% 
  nest() %>% 
  ungroup() %>% 
  select(data) %>% 
  pull()

# flights14 is now a list of tibbles, one per fragment
map(
  seq_along(flights14),
  function(x) {
    message(sprintf("Writing fragment %s of %s", x, length(flights14)))
    conn <- fun_connect()
    dbWriteTable(conn, "flights", flights14[[x]],
      append = TRUE, overwrite = FALSE, row.names = FALSE)
    dbDisconnect(conn)
  }
)

Resources used for this tutorial

I created a 1GB RAM + 25GB disk VPS by using the Supabase Postgres Image from Digital Ocean Marketplace. You can even explore the code on GitHub.

Support more open source projects

You can Buy me a coffee. If you want to try DigitalOcean, please use my referral link, which gives you $100 in credits for 60 days and helps me cover the hosting costs for my open source projects.
