Conquering Caffeinated Amazon Athena with the metis Trio of Packages
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I must preface this post with the posit that if you’re doing anything interactive()
with Amazon Athena you should seriously consider just using their free ODBC drivers as it’s the easiest way to wire them up to R DBI- and tidyverse-wise. I’ve said as much in previous posts. Drop a note in the comments if you don’t know the incantations for repackaging the provided Linux ODBC drivers to work on your flavor of Linux.
However…
There are times—say, when you’re trying to stand up an R service in your kubernetes cluster which bridges data in Athena to analyses & visualizations in R—when ODBC drivers can be more of a hindrance than help and JDBC is the path of least resistance.
Sure, there’s the in-CRAN AWR.Athena
package but it’s a fairly constrained and low-feature RJDBC shim which gets the basic job done but not much more.
Enter:
metis.jars
| GL | GH;metis
| GL | GH; and,metis.tidy
| GL | GH
a trio of packages which aims to make it super-straightforward to wire up R to Amazon Athena when ODBC is not available.
Why Three Packages?
For starters, there are CRAN hopes for the metis
-trio and one key component of that is separating out the JARs into one package (metis.jars
) and actual functionality into others (metis
and metis.tidy
). We’ll see how the CRAN attempt goes since the JAR package weighs in at sufficient weight to warrant a NOTE
. The packaging of the driver reduces the need for you to pre-load the JAR (locally or into, say, a Docker image) or perform a package-initiated download-dance like AWR.Athena
does (which I still don’t understand why that hasn’t kicked it out of CRAN the way it does it but ¯\_(ツ)_/¯).
metis.jars
also has three helper functions which do some (basic) fun things:
library(metis.jars) simba_driver_version() ## [1] "02.00.06.1008" athena_supported_types() ## [1] "BOOLEAN" "TINYINT" "SMALLINT" "INT" "INTEGER" ## [6] "BIGINT" "REAL" "FLOAT" "DOUBLE" "DECIMAL" ## [11] "DATE" "TIMESTAMP" "BINARY" "VARBINARY" "CHAR" ## [16] "VARCHAR" "STRING" "ARRAY" "MAP" "ROW" ## [21] "STRUCT" metis_jar_path() ## [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/metis.jars/java/AthenaJDBC42_2.0.6.jar"
The first uses the rJava
interface to directly query the version (since Amazon seems to update the Simba JAR twice a year). By separating out the JAR into a separate package, updates can be made to the other two sibling packages more frequently without crushing CRAN’s disk space. metis.jars
is also versioned to the included JAR so configuration management will be easier for folks.
The reason for the second type-lister function is that there’s hope Amazon will add support for all Presto data types, especially IPADDRESS
. It, again, performs JDBC driver introspection to collect the supported types.
Finally, the third function abstracts the JAR location from the metis
package or even your own interface package should you choose to depend on it.
OK, But Why Not Just Two?
The metis
package is a more functional RJDBC superclass of a DBI wrapper than AWR.Athena
. One thing it does that its CRAN cousin cannot is handle BIGINT
s properly:
library(metis) dbConnect( metis::Athena(), Schema = "sampledb", AwsCredentialsProviderClass = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider", AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props") ) -> con dbGetQuery(con, " SELECT CAST('chr' AS CHAR(4)) achar, CAST('varchr' AS VARCHAR) avarchr, CAST(SUBSTR(timestamp, 1, 10) AS DATE) AS tsday, CAST(100.1 AS DOUBLE) AS justadbl, CAST(127 AS TINYINT) AS asmallint, CAST(100 AS INTEGER) AS justanint, CAST(100000000000000000 AS BIGINT) AS abigint, CAST(('GET' = 'GET') AS BOOLEAN) AS is_get, ARRAY[1, 2, 3] AS arr1, ARRAY['1', '2, 3', '4'] AS arr2, MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]) AS mp, CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)) AS rw, CAST('{\"a\":1}' AS JSON) js FROM elb_logs LIMIT 1 ") %>% dplyr::glimpse() ## Observations: 1 ## Variables: 13 ## $ achar <chr> "chr " ## $ avarchr <chr> "varchr" ## $ tsday <date> 2014-09-29 ## $ justadbl <dbl> 100.1 ## $ asmallint <int> 127 ## $ justanint <int> 100 ## $ abigint <S3: integer64> 100000000000000000 ## $ is_get <lgl> TRUE ## $ arr1 <chr> "1, 2, 3" ## $ arr2 <chr> "1, 2, 3, 4" ## $ mp <chr> "{bar=2, foo=1}" ## $ rw <chr> "{x=1, y=2.0}" ## $ js <chr> "\"{\\\"a\\\":1}\""
PrestoAthena arrays and maps and rows and JSON come across as characters from the Athena driver and they’re formatted so badly that there’s little hope of full R support for list
columns for them. But, you do get real, big integers with metis
along with full support for all other current Athena types.
R folk who may be users of the old, standalone metis
package need to be aware of some things.
First, dbConnect()
has breaking changes. The snake_case names that still exist in the higher-level athena_jdbc()
function are gone. In exchange for this pain, you now have full naming-parity with all the Athena JDBC connection properties and can more easily use alternate credential providers which metis
‘ cousin totally cannot do for you which is illustrated in the example above and in the package README.
The metis
package also makes it easier to see documentation for all available Athena connection properties since it has a vignette with a descriptive table of all of them (rendered here).
There is also nascent support for the “streaming API” (TLDR: faster result set downloads) but that won’t be fully tested until some AWS policy tweaks happen this week.
Gotcha. But, Why Not Just Two?
As awesome as it is (including base Docker image support) the tidyverse
is not without overhead in terms of compilation time and dependencies, both of which are especially painful on Linux systems and some Docker environments. You can absolutely get by with some well-crafted SQL and JDBC and the thinner the image the easier it is to deploy and scale.
But! The tidyverse is so helpful that ensuring smooth support for Athena is critical. On its own, metis
wires up to dplyr
/dbplyr
fine, but by providing (in metis.tidy
) some enhanced db_data_type()
support (primarily for BIGINT
) and some extra in sql_translate_env()
)for those of us who continue to mindlessly use R-only verbs like grep()
or as.POSIXct()
in non-R contexts) we can level-up interactive()
use and tidyverse-infused service use:
library(metis.tidy) library(dbplyr) library(dplyr) metis::dbConnect( metis::Athena(), Schema = "sampledb", AwsCredentialsProviderClass = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider", AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props") ) -> con elb_logs <- tbl(con, "elb_logs") filter(elb_logs, grepl("20", elbresponsecode)) %>% mutate( tsday = as.Date(substring(timestamp, 1L, 10L)), host = url_extract_host(url), proto_version = regexp_extract(protocol, "([[:digit:]\\.]+)"), ) %>% select(tsday, host, receivedbytes, requestprocessingtime, proto_version) %>% head(1) %>% glimpse() ## Observations: ?? ## Variables: 5 ## Database: AthenaConnection ## $ tsday <date> 2014-09-29 ## $ host <chr> "www.abcxyz.com" ## $ receivedbytes <S3: integer64> 0 ## $ requestprocessingtime <dbl> 9.5e-05 ## $ proto_version <chr> "1.1"
FIN
A fairly big impetus for this radical refactoring was the need to use the Athena JDBC interface in R at $DAYJOB in a serverless context. So, if I/we needed it, others may as well. All three packages have tests (that work with my personal Athena setup which is easily replicated since it’s just the default schema & table you get when you enable Athena), pass CRAN checks and will be live in a real production environment by the time you read this.
Note that I do have CRAN plans for these three amigos, but all three packages will need to go in at the same time and I need to get tests into- and prove tests are live in Travis before submitting. Now’s the time for feature requests, problem reports or issues. Until SourceHut’s (sr.ht) API is finished, said contributions are best left to GitLab (preferably) or GitHub (if you must continue to fill the coffers of giant multional companies that undermine your freedom).
POSTSCRIPT
One other reason for re-visiting metis
was this R-crashing rJava
issue that is really a Simba Athena implementation issue (OS signals in a JDBC driver, rly?)
This Rprofile
entry:
options( "java.parameters" = c(getOption("java.parameters", default = NULL), "-Xrs") )
has been a solid workaround until rJava
is updated. Note that metis.jars
warns about this on load if it detects your setup is at risk.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.