R Owl of Athena
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
RBloggers|RBloggers-feedburner
Intro:
After developing the package RAthena
, I stumbled quite accidentally into the R SDK for AWS paws
. As RAthena
utilises Python’s SDK boto3
I thought the development of another AWS Athena package couldn’t hurt. As mentioned in my previous blog the paws
syntax is very similar to boto3
so alot of my RAthena
code was very portable and this gave me my final excuse to develop my next R package.
paws
and AWS Athena:
Before getting into the next package, lets first look at how the SDK’s interact with AWS Athena.
For example: return all databases in AWS Athena
# create an AWS Athena object athena <- paws::athena() # Submit query to AWS Athena res <- athena$start_query_execution( QueryString = "show Databases", ResultConfiguration = list(OutputLocation = "s3://mybucket/queries/")) # Get Status of query result <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId) # Return results if query is successful if(result$QueryExecution$Status$State == "FAILED") { stop(result$QueryExecution$Status$StateChangeReason, call. = FALSE) } else {output <- athena$get_query_results( QueryExecutionId = res$QueryExecutionId, MaxResults = 10)}
This isn’t the prettiest code when wanting to query AWS Athena with the SQL, in the above example: SHOW DATABASES
. This example only returns the top 10 results. It is even more “interesting” if you wish to return the entire data frame from AWS Athena. This is where noctua
comes in.
noctua
To start off with I will go through the same 3 questions I went through in my Athena and R … there is another way!? blog.
- What is noctua?
noctua
is a R package that creates aDBI
(Database Interface) for R, using the R packageDBI
and the R SDKpaws
as the backend (so basically the same asRAthena
)
- Why was
noctua
created when there are already methods for connecting to Athena?
noctua
was created to provide an extra method to connect to Athena for R users. Plus it seemed natural to createnoctua
due to the nature in how it connects to AWS Athena (through a SDK), which is the methodRAthena
connects to AWS Athena.
- Why is
noctua
callednoctua
?
This is a tricky one as
RAthena
was already taken. So I looked for a historic reference to link the new package to AWS Athena. I settled onnoctua
due to: Athena/Minerva is the Greek/Roman god of wisdom, handicraft, and warfare. One of the main symbols for Athena is the Owl. Noctua is the latin word for Owl.
How to install:
noctua
is currently on the CRAN and Github:
CRAN version:
install.packages("noctua")
Github development version:
remotes::install_github("dyfanjones/noctua")
Usage:
As with all DBI
interface packages the key functions are exactly the same. Which means that there is little to no upskilling required. The only difference between each method is how they connect and send data back to the database. So we will focus mainly on those two aspects.
Connecting to AWS Athena:
noctua
offers a wide range of connection methods from hard coding to using Amazon Resource Name Roles (ARN roles). Which is very similar to the RAthena
package.
Hard-Coding Method:
This method isn’t recommended as your credentials are hard-coded.
library(DBI) con <- dbConnect(noctua::athena(), aws_access_key_id = "YOUR AWS KEY ID", aws_secret_access_key = "YOUR SECRET ACCESS KEY", s3_staging_dir = "s3://path/to/query/bucket/")
Note: s3_staging_dir
requires to be in the format of s3 uri
for example “s3://path/to/query/bucket/”
If you do not wish to create AWS Profiles then setting environmental variables would be the recommended method.
Environment Variable Method:
noctua
supports AWS credentials when set into the environment variables to avoid hard-coding. From what I have found out, an easy way to set up environment variables (that persists) in R is to use the file.edit
function like so:
file.edit("~/.Renviron")
And now you can simply add in your environment variables in the file you are editing for example:
AWS_ACCESS_KEY_ID = YOUR AWS KEY ID
Once you have set your environment variables you can connect to Athena in the following method:
library(DBI) con <- dbConnect(noctua::athena(), s3_staging_dir = "s3://path/to/query/bucket/")
You can set the s3_staging_dir
parameter as an environmental variable, to do this you need to set the following environmental variable:
AWS_ATHENA_S3_STAGING_DIR = s3://path/to/query/bucket/
This allows for the following connection:
library(DBI) con <- dbConnect(noctua::athena())
AWS Profile Names:
Another method is to use AWS Profile Names. AWS profile names can be setup either manually in the ~/.aws
directory or by using the AWS Command Line Interface (AWS CLI). Once you have setup your profile name you can connect to AWS Athena:
Using Default Profile Name:
library(DBI) con <- dbConnect(noctua::athena())
Using Non-Default Profile Name:
library(DBI) con <- dbConnect(noctua::athena(), profile_name = "rathena")
ARN Roles:
ARN roles are fairly useful if you need to assume a role that can connect to another AWS account and use the AWS Athena in that account. Or whether you want to create a temporary connection with different permissions than your current role (AWS ARN Documentation).
Assuming ARN role credentials before connecting to AWS Athena:
library(noctua) library(DBI) assume_role(profile_name = "YOUR_PROFILE_NAME", role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name", set_env = TRUE) # Connect to Athena using ARN Role con <- dbConnect(athena(), s3_staging_dir = "s3://path/to/query/bucket/")
Connect to AWS Athena directly using ARN role:
library(DBI) con <- dbConnect(noctua::athena(), profile_name = "YOUR_PROFILE_NAME", role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name", s3_staging_dir = 's3://path/to/query/bucket/')
Note: ARN Roles have a duration timer before they will expire. To change the default you can increase the duration_seconds
parameter from the default 3600 seconds (1 hour).
Temporary Sessions:
Finally you can create temporary credentials before connecting to AWS Athena:
library(noctua) library(DBI) # Create Temporary Credentials duration 1 hour get_session_token("YOUR_PROFILE_NAME", serial_number='arn:aws:iam::123456789012:mfa/user', token_code = "531602", set_env = TRUE) # Connect to Athena using temporary credentials con <- dbConnect(athena(), s3_staging_dir = "s3://path/to/query/bucket/")
Note: This method will work for users who have set up Multi-Factor Authentication (MFA).
Querying:
To query AWS Athena using the noctua
it is very similar to querying any other DBI
database method:
library(DBI) con <- dbConnect(noctua::athena()) dbGetQuery(con, "show databases")
That is it! So if we look back at the initial paws
code when working with AWS Athena. The code was very intimidating when wanting to do basic AWS Athena queries. noctua
packages all that up and makes it super easy to work with.
Uploading Data:
It is all very well querying data from AWS Athena but what is more useful is to upload data as well. noctua
has addressed this and implemented a method in dbWriteTable
.
dbWriteTable(con, "mtcars", mtcars, partition=c("TIMESTAMP" = format(Sys.Date(), "%Y%m%d")), s3.location = "s3://mybucket/data/")
Once you have uploaded data into AWS Athena
you can query it in the following:
dbGetQuery(con, "select * from mtcars")
Here are all variable parameters for the dbWriteTable
method:
conn: An AthenaConnection object, produced by dbConnect()
name: A character string specifying a table name. Names will be automatically quoted so you can use any sequence of characters, not just any valid bare table name.
value: A data.frame to write to the database.
overwrite: Allow overwriting the destination table. Cannot be ‘TRUE’ if ‘append’ is also ‘TRUE’.
append: Allow appending to the destination table. Cannot be ‘TRUE’ if ‘overwrite’ is also ‘TRUE’.
row.names: Either TRUE, FALSE, NA or a string. If TRUE, always translate row names to a column called “row_names”. If FALSE, never translate row names. If NA, translate rownames only if they’re a character vector. A string is equivalent to TRUE, but allows you to override the default name. For backward compatibility, NULL is equivalent to FALSE.
field.types: Additional field types used to override derived types.
partition: Partition Athena table (needs to be a named list or vector) for example: c(var1 = “2019-20-13”)
s3.location s3 bucket to store Athena table, must be set as a s3 uri for example (“s3://mybucket/data/“)
file.type: What file type to store data.frame on s3, RAthena currently supports [“csv”, “tsv”, “parquet”]. Note: file.type “parquet” is supported by R package
arrow
and will need to be installed separately if you wish to upload data.frames in “parquet” format.…: Other arguments used by individual methods.
Conclusion:
noctua
is a package that gives R users the access to AWS Athena using the R AWS SDK paws
. Thus no external software is required and it can all be installed from the CRAN. If you are interested in how to connect R to AWS Athena please check out RAthena
as well (my other AWS Athena connectivity R package). All feature requests/ suggestions/issues are welcome please add them to: Github Issues.
Finally please star the github repositories if you like the work that has been done with R and AWS Athena noctua
, RAthena
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.