Site icon R-bloggers

R⁶ — Using R With Amazon Athena & AWS Temporary Security Credentials

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Most of the examples of working with most of the AWS services show basic username & password authentication. That’s all well-and-good, but many shops use the AWS Security Token Service to provide temporary credentials and session tokens to limit exposure and provide more uniform multi-factor authentication. At my workplace, Frank Mitchell created a nice electron app to make it super easy to create and re-up these credentials. The downside of this is that all AWS service usage for work requires using these credentials and I was having the darndest time trying to get Athena’s JDBC driver working with it (but I wasn’t spending alot of time on it as I tend to mirror research data to a local, beefy Apache Drill server).

I finally noticed the

com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider

class and decided to give the following a go (you will need to point fil to wherever you have the Athena jar file):

library(RJDBC)
library(tidyverse)

fil <- "~/Drivers/AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

aws <- ini::read.ini("~/.aws/credentials")

Sys.setenv(AWS_ACCESS_KEY_ID = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_access_key_id)
Sys.setenv(AWS_SECRET_ACCESS_KEY = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_secret_access_key)
Sys.setenv(AWS_SESSION_TOKEN = aws[Sys.getenv("AWS_PROFILE")][[1]]$aws_session_token)

provider <- "com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider"

con <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                 s3_staging_dir=Sys.getenv("AWS_S3_STAGING_DIR"),
                 schema_name="DEFAULT_DB_SCHEMA_NAME",
                 aws_credentials_provider_class=provider)

dbListTables(con)

dbListFields(con, "SOME_TABLE_IN_THE_DEFAULT_DB")

dbGetQuery(con, "SELECT * FROM DEFAULT_DB_SCHEMA_NAME.SOME_TABLE_IN_THE_DEFAULT_DB limit 10;")

YMMV on Windows (comments about what does and does not work on Windows are welcome).

The provider line was the key element I was missing prior to last night.

The Awsaml utility monitors/maintains entries under it’s purview credentials file and keeps consistent profile ids, so I keep that AWS_PROFILE setting in my ~/.Renviron.

I also keep the default S3 Athena data staging bucket in an environment variable as well.

If you provide a default schema_name then you can list tables and fields but queries need fully qualified database (Amazon calls them “schemas”) dot table name.

Initial attempts to have this setup “just work” with dplyr 0.6.0 (the forthcoming EPIC release) were unsuccessful but I’ll poke at all this when I get time and likely write a small Athena package to help smooth over rougher areas.

Y’all likely figured all this out way before I did, but in the event someone else is looking for the information, it should be google-able now.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.