Interacting With Amazon Athena from R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This is a short post for those looking to test out Amazon Athena with R.
Amazon makes Athena available via JDBC, so you can use RJDBC
to query data. All you need is their JAR file and some setup information. Here’s how to get the JAR file to the current working directory:
URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.0.jar' fil <- basename(URL) if (!file.exists(fil)) download.file(URL, fil)
To avoid putting credentials in code, you can store the AWS key and secret you’re using for the queries in ATHENA_USER
and ATHENA_PASSWORD
environment variables via ~/.Renviron
. You’ll also need an S3 bucket writable by those credentials for the Athena staging directory. With that info in hand, it’s easy to connect:
library(RJDBC) library(dplyr) drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'") con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/', s3_staging_dir="s3://accessible-bucket", user=Sys.getenv("ATHENA_USER"), password=Sys.getenv("ATHENA_PASSWORD"))
Even if you have no data configured in Athena, you can check out the test data available to all:
dbListTables(con) ## [1] "elb_logs"
If that worked, then you should be able to query data (using the fully qualified table name in this case):
dbGetQuery(con, "SELECT * FROM sampledb.elb_logs LIMIT 10") %>% dplyr::glimpse() ## Observations: 10 ## Variables: 16 ## $ timestamp <chr> "2014-09-27T00:00:25.424956Z", "2014-09-27T00:00:56.439218Z", "2014-09-27T00:01:27.441734Z", "2014-09-27T00:01:58.366715Z", "2014-09-27T00:02:29.446363Z", "2014-09-2... ## $ elbname <chr> "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo", "lb-demo" ## $ requestip <chr> "241.230.198.83", "252.26.60.51", "250.244.20.109", "247.59.58.167", "254.64.224.54", "245.195.140.77", "245.195.140.77", "243.71.49.173", "240.139.5.14", "251.192.4... ## $ requestport <dbl> 27026, 27026, 27026, 27026, 27026, 27026, 27026, 27026, 27026, 27026 ## $ backendip <chr> "251.192.40.76", "249.89.116.3", "251.111.156.171", "251.139.91.156", "251.111.156.171", "254.64.224.54", "254.64.224.54", "250.244.20.109", "247.65.176.249", "250.2... ## $ backendport <dbl> 443, 8888, 8888, 8888, 8000, 8888, 8888, 8888, 8888, 8888 ## $ requestprocessingtime <dbl> 9.1e-05, 9.4e-05, 8.4e-05, 9.7e-05, 9.1e-05, 9.3e-05, 9.4e-05, 8.3e-05, 9.0e-05, 9.0e-05 ## $ backendprocessingtime <dbl> 0.046598, 0.038973, 0.047054, 0.039845, 0.061461, 0.037791, 0.047035, 0.048792, 0.045724, 0.029918 ## $ clientresponsetime <dbl> 4.9e-05, 4.7e-05, 4.9e-05, 4.9e-05, 4.0e-05, 7.7e-05, 7.5e-05, 7.3e-05, 4.0e-05, 6.7e-05 ## $ elbresponsecode <chr> "200", "200", "200", "200", "200", "200", "200", "200", "200", "200" ## $ backendresponsecode <chr> "200", "200", "200", "200", "200", "400", "400", "200", "200", "200" ## $ receivedbytes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ## $ sentbytes <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 ## $ requestverb <chr> "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GET" ## $ url <chr> "http://www.abcxyz.com:80/jobbrowser/?format=json&state=running&user=20g578y", "http://www.abcxyz.com:80/jobbrowser/?format=json&state=running&user=20g578y", "http:/... ## $ protocol <chr> "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"
And, you can disconnect when done:
dbDisconnect(con)
You should probably store the JAR file in a central location and refer to it that way in “production” scripts.
Now, you can go crazy querying data and racking up AWS charges ?.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.