
How to bulk upload your data from R into Redshift

[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers.]

Amazon Redshift, Amazon’s columnar database, is a great companion for a lot of data science tasks: it allows fast processing of very big datasets with a familiar query language (SQL).

There are two ways to load data into Redshift. The classic one, the INSERT statement, works, but it is highly inefficient when loading big datasets. The other, recommended in Redshift’s docs, is the COPY statement.

One of the easiest ways to accomplish this, since we are already using Amazon’s infrastructure, is to do a load from S3: you upload your data files to S3 and then run a COPY statement telling Redshift where to find them.

Also, because Redshift is a distributed database, the docs recommend splitting your data into a number of files that is a multiple of the number of slices in your cluster, so the slices can load it in parallel; they also let you gzip your files for a faster upload.

Wait a second: so to upload a big dataset fast, we have to:

- Split the data into multiple files
- Gzip each of them
- Upload them to S3
- Run the COPY statement
- Delete the temporary files from S3

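For reference, here is a minimal sketch of those manual steps. It is not the package's internals, just an illustration: the data frame my_data, table mytable, bucket mybucket and the slice count are all placeholder assumptions, credentials are read from the usual AWS environment variables, and con is a DBI connection like the one created further below.

    # A minimal sketch of the manual S3 load, assuming placeholder names:
    # a data frame my_data, a target table mytable, a bucket mybucket,
    # a DBI connection con, and AWS credentials in the environment.
    library(aws.s3)
    library(DBI)

    slices <- 4  # slice count of your cluster; query stv_slices to find yours
    parts <- split(my_data, rep_len(seq_len(slices), nrow(my_data)))

    for (i in seq_along(parts)) {
      part <- sprintf("part%02d.csv.gz", i)
      # writing through gzfile() gzips each piece
      write.csv(parts[[i]], gzfile(part), row.names = FALSE)
      put_object(part, object = part, bucket = "mybucket")
    }

    # COPY loads every object under the 'part' prefix in parallel
    dbExecute(con, paste0(
      "COPY mytable FROM 's3://mybucket/part' ",
      "CREDENTIALS 'aws_access_key_id=", Sys.getenv("AWS_ACCESS_KEY_ID"),
      ";aws_secret_access_key=", Sys.getenv("AWS_SECRET_ACCESS_KEY"), "' ",
      "CSV GZIP IGNOREHEADER 1"
    ))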
That does seem like a lot of work, but don’t worry, I’ve got your back! I’ve created an R package which does exactly this: redshiftTools! The code is available at github.com/sicarul/redshiftTools.

To install the package, you’ll need to do:

    # install devtools, then the development versions of Rcpp, DBI and RPostgres
    install.packages('devtools')
    devtools::install_github("RcppCore/Rcpp")
    devtools::install_github("rstats-db/DBI")
    devtools::install_github("rstats-db/RPostgres")
    # aws.s3 comes from the cloudyr drat repository
    install.packages("aws.s3", repos = c(getOption("repos"), "http://cloudyr.github.io/drat"))
    devtools::install_github("sicarul/redshiftTools")

Afterwards, you’ll be able to use it like this:

    library("aws.s3")
    library(RPostgres)
    library(redshiftTools)

    con <- dbConnect(RPostgres::Postgres(), dbname="dbname",
    host='my-redshift-url.amazon.com', port='5439',
    user='myuser', password='mypassword',sslmode='require')

    rs_replace_table(my_data, dbcon=con, tableName='mytable', bucket="mybucket")
    rs_upsert_table(my_other_data, dbcon=con, tableName = 'mytable', bucket="mybucket", keys=c('id', 'date'))

rs_replace_table truncates the target table and then loads it entirely from the data frame; only do this if you don’t care about the data the table currently holds. rs_upsert_table, on the other hand, updates rows whose keys match and inserts those that do not exist in the table.
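As a hypothetical illustration (the table contents, columns and values below are made up), suppose mytable currently holds rows with ids 1 through 5:

    # hypothetical data: 'mytable' already holds rows with id 1..5
    new_rows <- data.frame(id = c(5, 6), date = Sys.Date(), value = c(10, 20))

    # rs_replace_table: after this call the table holds ONLY ids 5 and 6
    rs_replace_table(new_rows, dbcon = con, tableName = 'mytable', bucket = "mybucket")

    # rs_upsert_table: the row with id 5 is replaced, id 6 is inserted,
    # and ids 1..4 are left untouched
    rs_upsert_table(new_rows, dbcon = con, tableName = 'mytable', bucket = "mybucket", keys = c('id'))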

Please open an issue on GitHub if you find any problems. Have fun!

