Updated on 2020-09-19: I changed the “Creating tables” section. In my original script I used copy_to(), which creates (unique) indexes, but when I first adapted the code for this post I wrote dbWriteTable(), which doesn’t create indexes and only appends records.
Updated on 2020-08-11: @jbkunst suggested a more concise approach to writing large tables using the Tidyverse. I added an explanation of how to do that and avoid obscure syntax. Thanks Joshua!
Motivation
If, like me, you keep your workflows well organized, even with tidy GitHub repositories, and yet you still create a lot of CSV or RDS files to share periodic data (e.g. weekly or monthly sales), then you can benefit from using PostgreSQL or any other SQL database engine such as MariaDB.
PostgreSQL and MariaDB are the database engines that I like, but this still applies to MySQL and even to proprietary databases such as Oracle Database or SQL Server.
Nycflights13
Let’s start with the flights table from the quite popular nycflights13 package:
library(nycflights13)

flights

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Assume that you work at a NYC airport and you are in charge of updating the flights table every day. Instead of creating one CSV file per day with flight records, the process can be better organized by appending new records to an existing SQL table.
nycflights13 provides several tables besides flights. You can explore the details in the superb book R4DS. For this tutorial, the relationships between the tables are summarised in the next diagram:
SQL indexes
Skipping a lot of details, an SQL index stores a copy of just a few columns from a table in a structure that lets the database identify and filter observations much faster than scanning the whole table.
For example, in the flights table, an index on the year, month and day columns boosts the speed of SQL operations similar to this dplyr statement:
flights %>% filter(month == 3)
To create indexes, let’s start differentiating between unique and non-unique indexes:
unique_index <- list(
  airlines = list("carrier"),
  planes = list("tailnum")
)

index <- list(
  airports = list("faa"),
  flights = list(
    c("year", "month", "day"),
    "carrier",
    "tailnum",
    "origin",
    "dest"
  ),
  weather = list(c("year", "month", "day"), "origin")
)
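These are plain R lists for now; they will be passed to copy_to() when creating the tables. To give an idea of what they translate to, an entry such as c("year", "month", "day") for flights roughly corresponds to a SQL statement like the one below, which you could also run yourself with DBI::dbExecute() once a connection is open (see the next sections). The index name used here is just an illustrative choice, not something the package mandates:

# Roughly what gets run for indexes = list(c("year", "month", "day")) on flights;
# "flights_year_month_day" is an illustrative index name.
dbExecute(conn, "CREATE INDEX flights_year_month_day ON flights (year, month, day)")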
SQL connections
Let’s assume that the NYC airport gave you access to a PostgreSQL database named postgres, which is hosted on a server with IP 142.93.236.14, where the user is postgres and the password is S4cr3tP4ssw0rd007. Of course, this is an example with slightly different settings than the defaults.
Database information such as passwords should be stored in your ~/.Renviron file and never in your scripts! You can run usethis::edit_r_environ() to edit your .Renviron, add the credentials, and then restart your R session to apply the changes.
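As a reference, here is a minimal ~/.Renviron sketch using the variable names expected by fun_connect() below; the values are the example credentials from this tutorial, so adapt them to your own server:

tutorial_db=postgres
tutorial_user=postgres
tutorial_pass=S4cr3tP4ssw0rd007
tutorial_host=142.93.236.14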
To access and start creating tables to better organize your data, you need to open and close connections, and the RPostgres package is really useful to do that and more:
library(RPostgres)
library(dplyr)

fun_connect <- function() {
  dbConnect(
    Postgres(),
    dbname = Sys.getenv("tutorial_db"),
    user = Sys.getenv("tutorial_user"),
    password = Sys.getenv("tutorial_pass"),
    host = Sys.getenv("tutorial_host")
  )
}
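Before writing anything, it doesn’t hurt to verify that the credentials work. This quick check is just a suggestion and assumes the .Renviron entries shown above:

# Open a connection, ask PostgreSQL for its version, and close the connection
conn <- fun_connect()
dbGetQuery(conn, "SELECT version()")
dbDisconnect(conn)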
By default, PostgreSQL uses the public schema in a database, and that is the schema you’ll write to. You can explore the existing tables in the public schema of the postgres database by running:
schema <- "public" conn <- fun_connect() remote_tables <- dbGetQuery( conn, sprintf("SELECT table_name FROM information_schema.tables WHERE table_schema='%s'", schema) ) dbDisconnect(conn) remote_tables <- as.character(remote_tables$table_name) remote_tables character(0)
Because this example uses a freshly created server just for this tutorial, the result shows that there are no existing tables in the database.
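As an aside, if you only need the table names visible on the connection’s default search path, DBI also provides dbListTables() as a shorter alternative to the information_schema query:

# Shorter alternative: list the tables visible on the default search path
conn <- fun_connect()
dbListTables(conn)
dbDisconnect(conn)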
Creating tables
Now you should be ready to start writing the tables from nycflights13 to the database:
local_tables <- utils::data(package = "nycflights13")$results[, "Item"]

tables <- setdiff(local_tables, remote_tables)

for (table in tables) {
  df <- getExportedValue("nycflights13", table)

  message("Creating table: ", table)

  table_name <- table

  conn <- fun_connect()
  copy_to(
    conn,
    df,
    table_name,
    unique_indexes = unique_index[[table]],
    indexes = index[[table]],
    temporary = FALSE
  )
  dbDisconnect(conn)
}
This approach prevents errors when a table already exists, and it also avoids a bigger mistake: overwriting an existing table instead of appending new observations.
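If you want to double-check that the indexes were actually created, one option (not in the original script, just a sanity check) is to query the pg_indexes view:

# Sanity check: list the indexes that copy_to() created on the flights table
conn <- fun_connect()
dbGetQuery(conn, "SELECT indexname FROM pg_indexes WHERE tablename = 'flights'")
dbDisconnect(conn)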
Accessing tables
You can use dplyr in a very similar way to what you would do with a local data.frame. Let’s say you need to count the observations per month and save the result locally:
library(dplyr)

conn <- fun_connect()
obs_per_month <- tbl(conn, "flights") %>%
  group_by(year, month) %>%
  count() %>%
  collect()
dbDisconnect(conn)

obs_per_month

# A tibble: 12 x 3
# Groups:   year [1]
    year month n
   <int> <int> <int64>
 1  2013     1 27004
 2  2013     2 24951
 3  2013     3 28834
 4  2013     4 28330
 5  2013     5 28796
 6  2013     6 28243
 7  2013     7 29425
 8  2013     8 29327
 9  2013     9 27574
10  2013    10 28889
11  2013    11 27268
12  2013    12 28135
The previous chunk uses the collect() function, which stores the result of the query as a local data.frame that can then be used with ggplot2, saved as a CSV with readr, etc.
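If you are curious about the SQL that dplyr generates behind the scenes, you can replace collect() with show_query(); this is optional and only prints the translated query without fetching any data:

# Print the SQL translation of the dplyr pipeline instead of collecting it
conn <- fun_connect()
tbl(conn, "flights") %>%
  group_by(year, month) %>%
  count() %>%
  show_query()
dbDisconnect(conn)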
Appending new records
Let’s say you have a table flights14 with records for the year 2014. To append those records to the existing flights table in the database, you need to run:
conn <- fun_connect()
dbWriteTable(conn, "flights", flights14, append = TRUE, overwrite = FALSE, row.names = FALSE)
dbDisconnect(conn)
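Note that flights14 is not included in nycflights13; it stands for whatever new data you receive. If you just want to test the appending step without real 2014 data, a purely illustrative stand-in could be built by re-labelling the 2013 flights:

# Hypothetical stand-in for flights14, for testing only:
# re-label the 2013 flights as 2014 so there is something to append.
flights14 <- dplyr::mutate(nycflights13::flights, year = 2014L)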
Appending large tables
Let’s say your flights14 table contains 12,345,678 rows. Rather than writing the full table at once, a better approach is to divide it into chunks of 2,500,000 rows (or any other smaller number).
Using base R
N <- 2500000

# Breakpoints every N rows, capped at the actual number of rows
divisions <- unique(pmin(seq(N, N * ((nrow(flights14) %/% N) + 1), by = N), nrow(flights14)))

flights14_2 <- list()

for (j in seq_along(divisions)) {
  if (j == 1) {
    flights14_2[[j]] <- flights14[1:divisions[j], ]
  } else {
    flights14_2[[j]] <- flights14[(divisions[j - 1] + 1):divisions[j], ]
  }
}

rm(flights14)

for (j in seq_along(divisions)) {
  message(sprintf("Writing fragment %s of %s", j, length(divisions)))
  conn <- fun_connect()
  dbWriteTable(conn, "flights", flights14_2[[j]], append = TRUE, overwrite = FALSE, row.names = FALSE)
  dbDisconnect(conn)
}
Using the Tidyverse
library(tidyverse)

N <- 2500000

# Split the table into a list of data frames with at most N rows each
flights14 <- flights14 %>%
  mutate(p = ceiling(row_number() / N)) %>%
  group_by(p) %>%
  nest() %>%
  ungroup() %>%
  select(data) %>%
  pull()

map(
  seq_along(flights14),
  function(x) {
    message(sprintf("Writing fragment %s of %s", x, length(flights14)))
    conn <- fun_connect()
    dbWriteTable(conn, "flights", flights14[[x]], append = TRUE, overwrite = FALSE, row.names = FALSE)
    dbDisconnect(conn)
  }
)
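After appending, a quick way to verify that all the fragments arrived is to count the rows of the remote table; this is just a suggestion and reuses the same functions shown earlier:

# Count the rows of the remote flights table after appending
conn <- fun_connect()
tbl(conn, "flights") %>%
  count() %>%
  collect()
dbDisconnect(conn)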
Resources used for this tutorial
I created a 1GB RAM + 25GB disk VPS by using the Supabase Postgres Image from Digital Ocean Marketplace. You can even explore the code on GitHub.
Support more open source projects
You can Buy me a coffee. If you want to try DigitalOcean, please use my referral link, which gives you $100 in credits for 60 days and helps me cover the hosting costs for my open source projects.