Conquering Caffeinated Amazon Athena with the metis Trio of Packages

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I must preface this post with the posit that if you’re doing anything interactive() with Amazon Athena you should seriously consider just using their free ODBC drivers as it’s the easiest way to wire them up to R DBI- and tidyverse-wise. I’ve said as much in previous posts. Drop a note in the comments if you don’t know the incantations for repackaging the provided Linux ODBC drivers to work on your flavor of Linux.

However

There are times—say, when you’re trying to stand up an R service in your kubernetes cluster which bridges data in Athena to analyses & visualizations in R—when ODBC drivers can be more of a hindrance than help and JDBC is the path of least resistance.

Sure, there’s the in-CRAN AWR.Athena package but it’s a fairly constrained and low-feature RJDBC shim which gets the basic job done but not much more.

Enter:

a trio of packages which aims to make it super-straightforward to wire up R to Amazon Athena when ODBC is not available.

Why Three Packages?

For starters, there are CRAN hopes for the metis-trio and one key component of that is separating out the JARs into one package (metis.jars) and actual functionality into others (metis and metis.tidy). We’ll see how the CRAN attempt goes since the JAR package weighs in at sufficient weight to warrant a NOTE. The packaging of the driver reduces the need for you to pre-load the JAR (locally or into, say, a Docker image) or perform a package-initiated download-dance like AWR.Athena does (which I still don’t understand why that hasn’t kicked it out of CRAN the way it does it but ¯\_(ツ)_/¯).

metis.jars also has three helper functions which do some (basic) fun things:

library(metis.jars)

simba_driver_version()
## [1] "02.00.06.1008"

athena_supported_types()
##  [1] "BOOLEAN"   "TINYINT"   "SMALLINT"  "INT"       "INTEGER"  
##  [6] "BIGINT"    "REAL"      "FLOAT"     "DOUBLE"    "DECIMAL"  
## [11] "DATE"      "TIMESTAMP" "BINARY"    "VARBINARY" "CHAR"     
## [16] "VARCHAR"   "STRING"    "ARRAY"     "MAP"       "ROW"      
## [21] "STRUCT"   

metis_jar_path()
## [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/metis.jars/java/AthenaJDBC42_2.0.6.jar"

The first uses the rJava interface to directly query the version (since Amazon seems to update the Simba JAR twice a year). By separating out the JAR into a separate package, updates can be made to the other two sibling packages more frequently without crushing CRAN’s disk space. metis.jars is also versioned to the included JAR so configuration management will be easier for folks.

The reason for the second type-lister function is that there’s hope Amazon will add support for all Presto data types, especially IPADDRESS. It, again, performs JDBC driver introspection to collect the supported types.

Finally, the third function abstracts the JAR location from the metis package or even your own interface package should you choose to depend on it.

OK, But Why Not Just Two?

The metis package is a more functional RJDBC superclass of a DBI wrapper than AWR.Athena. One thing it does that its CRAN cousin cannot is handle BIGINTs properly:

library(metis)

dbConnect(
  metis::Athena(),
  Schema = "sampledb",
  AwsCredentialsProviderClass = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider",
  AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props")
) -> con

dbGetQuery(con, "
SELECT
  CAST('chr' AS CHAR(4)) achar,
  CAST('varchr' AS VARCHAR) avarchr,
  CAST(SUBSTR(timestamp, 1, 10) AS DATE) AS tsday,
  CAST(100.1 AS DOUBLE) AS justadbl,
  CAST(127 AS TINYINT) AS asmallint,
  CAST(100 AS INTEGER) AS justanint,
  CAST(100000000000000000 AS BIGINT) AS abigint,
  CAST(('GET' = 'GET') AS BOOLEAN) AS is_get,
  ARRAY[1, 2, 3] AS arr1,
  ARRAY['1', '2, 3', '4'] AS arr2,
  MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]) AS mp,
  CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)) AS rw,
  CAST('{\"a\":1}' AS JSON) js
FROM elb_logs
LIMIT 1
") %>% 
  dplyr::glimpse()
## Observations: 1
## Variables: 13
## $ achar     <chr> "chr "
## $ avarchr   <chr> "varchr"
## $ tsday     <date> 2014-09-29
## $ justadbl  <dbl> 100.1
## $ asmallint <int> 127
## $ justanint <int> 100
## $ abigint   <S3: integer64> 100000000000000000
## $ is_get    <lgl> TRUE
## $ arr1      <chr> "1, 2, 3"
## $ arr2      <chr> "1, 2, 3, 4"
## $ mp        <chr> "{bar=2, foo=1}"
## $ rw        <chr> "{x=1, y=2.0}"
## $ js        <chr> "\"{\\\"a\\\":1}\""

PrestoAthena arrays and maps and rows and JSON come across as characters from the Athena driver and they’re formatted so badly that there’s little hope of full R support for list columns for them. But, you do get real, big integers with metis along with full support for all other current Athena types.

R folk who may be users of the old, standalone metis package need to be aware of some things.

First, dbConnect() has breaking changes. The snake_case names that still exist in the higher-level athena_jdbc() function are gone. In exchange for this pain, you now have full naming-parity with all the Athena JDBC connection properties and can more easily use alternate credential providers which metis‘ cousin totally cannot do for you which is illustrated in the example above and in the package README.

The metis package also makes it easier to see documentation for all available Athena connection properties since it has a vignette with a descriptive table of all of them (rendered here).

There is also nascent support for the “streaming API” (TLDR: faster result set downloads) but that won’t be fully tested until some AWS policy tweaks happen this week.

Gotcha. But, Why Not Just Two?

As awesome as it is (including base Docker image support) the tidyverse is not without overhead in terms of compilation time and dependencies, both of which are especially painful on Linux systems and some Docker environments. You can absolutely get by with some well-crafted SQL and JDBC and the thinner the image the easier it is to deploy and scale.

But! The tidyverse is so helpful that ensuring smooth support for Athena is critical. On its own, metis wires up to dplyr/dbplyr fine, but by providing (in metis.tidy) some enhanced db_data_type() support (primarily for BIGINT) and some extra ???? in sql_translate_env() )for those of us who continue to mindlessly use R-only verbs like grep() or as.POSIXct() in non-R contexts) we can level-up interactive() use and tidyverse-infused service use:

library(metis.tidy)
library(dbplyr)
library(dplyr)

metis::dbConnect(
  metis::Athena(),
  Schema = "sampledb",
  AwsCredentialsProviderClass = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider",
  AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props")
) -> con

elb_logs <- tbl(con, "elb_logs")

filter(elb_logs, grepl("20", elbresponsecode)) %>%
  mutate(
    tsday = as.Date(substring(timestamp, 1L, 10L)),
    host = url_extract_host(url),
    proto_version = regexp_extract(protocol, "([[:digit:]\\.]+)"),
  ) %>%
  select(tsday, host, receivedbytes, requestprocessingtime, proto_version) %>%
  head(1) %>%
  glimpse()
## Observations: ??
## Variables: 5
## Database: AthenaConnection
## $ tsday                 <date> 2014-09-29
## $ host                  <chr> "www.abcxyz.com"
## $ receivedbytes         <S3: integer64> 0
## $ requestprocessingtime <dbl> 9.5e-05
## $ proto_version         <chr> "1.1"

FIN

A fairly big impetus for this radical refactoring was the need to use the Athena JDBC interface in R at $DAYJOB in a serverless context. So, if I/we needed it, others may as well. All three packages have tests (that work with my personal Athena setup which is easily replicated since it’s just the default schema & table you get when you enable Athena), pass CRAN checks and will be live in a real production environment by the time you read this.

Note that I do have CRAN plans for these three amigos, but all three packages will need to go in at the same time and I need to get tests into- and prove tests are live in Travis before submitting. Now’s the time for feature requests, problem reports or issues. Until SourceHut’s (sr.ht) API is finished, said contributions are best left to GitLab (preferably) or GitHub (if you must continue to fill the coffers of giant multional companies that undermine your freedom).

POSTSCRIPT

One other reason for re-visiting metis was this R-crashing rJava issue that is really a Simba Athena implementation issue (OS signals in a JDBC driver, rly?)

This Rprofile entry:

options(
  "java.parameters" = c(getOption("java.parameters", default = NULL), "-Xrs")
)

has been a solid workaround until rJava is updated. Note that metis.jars warns about this on load if it detects your setup is at risk.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)