Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
RBloggers|RBloggers-feedburner
Intro:
Currently there are two key ways in connecting to Amazon Athena from R, using the ODBC and JDBC drivers. To access the ODBC driver R users can use the excellent odbc package supported by Rstudio. To access the JDBC driver R users can either use the RJDBC R package or the helpful wrapper package AWR.Athena which wraps the RJDBC
package to make the connection to Amazon Athena through the JDBC driver simpler. These methods are an excellent way for R to connect to Amazon Athena, however is there another way?
Well glad you asked…yes there is! Ever since the reticulate package was developed (by Rstudio) the interface into Python from R has never been simpler. This makes another route into Athena possible! Amazon has developed a Python software development kit (SDK) called Boto3. By using boto3
in combination with the R package reticulate
a new method into accessing Athena can be made possible. Introducing the R package RAthena.
RAthena:
- What is RAthena?
Rathena
is a R package that creates a DBI (Database Interface) for R, using the R package DBI and the Python packageBoto3
as the backend.
- Why was RAthena created when there are already methods for connecting to Athena?
RAthena
was created to provide an extra method to connect to Athena for R users. Nothing more, nothing less.
- Why is RAthena call RAthena?
Isn’t it obvious? Most R packages that interface with a database are called
"R<database>"
for exampleRSQLite
,RPostgreSQL
, etc… Plus this package is “roughly” the R equivalent to the superbPython
package PyAthena. So calling this packageRAthena
seems like the best choice.
Getting Started:
Now lets get into how to actually use RAthena
. I am going to skip over the part were you have to set up an Amazon Web Services Account (AWS Acount) and get straight into the good stuff.
Before working with RAthena
, Python 3+ is require. Please install it directly or use the Anaconda Distribution.
Now we have Python 3+
we now need to install Boto3
. If you installed Python 3+
with the Anaconda Distribution I believe Boto3
comes as standard (you can skip boto3
installation step), but for everyone else you can install Boto3
either by the pip
command or the inbuilt installation function in RAthena
.
pip command:
pip install boto3
Install RAthena
:
remotes::install_github("dyfanjones/rathena")
RAthena
boto3 installation function:
RAthena::install_boto()
Usage
Connecting to Athena
Now we have everything that is required we are now ready to connect to Athena
. RAthena
provides several method to connect to Athena
ranging from hard-coding credentials to using Amazon Resource Name Roles (ARN roles).
Hard-Coding Method:
This method isn’t recommended as your credentials are hard-coded.
library(DBI) con <- dbConnect(RAthena::athena(), aws_access_key_id = "<YOUR AWS KEY ID>", aws_secret_access_key = "<YOUR SECRET ACCESS KEY>", s3_staging_dir = "<LOCATION FOR ATHENA QUERY OUTPUT>")
Note: s3_staging_dir
requires to be in the format of s3 uri
for example “s3://path/to/query/bucket/”
Environment Variable Method:
RAthena
supports AWS credentials set into the environment variables to avoid hard-coding. From what I have found out an easy way to set up environment variables (that persists) in R is to use the file.edit
function like so:
file.edit("~/.Renviron")
And now you can simply add in your environment variables in the file you are editing for example:
AWS_ACCESS_KEY_ID = <YOUR AWS KEY ID>
Once you have set your environment variables you can connect to Athena in the following method:
library(DBI) con <- dbConnect(RAthena::athena(), s3_staging_dir = "<LOCATION FOR ATHENA QUERY OUTPUT>")
AWS Profile Names:
Another method is to use AWS Profile Names. AWS profile names can be setup either manually in the ~/.aws
directory or by using the AWS Command Line Interface (AWS CLI). Once you have setup your profile name you can connect to Athena
:
Using Default Profile Name:
library(DBI) con <- dbConnect(RAthena::athena(), s3_staging_dir = "<LOCATION FOR ATHENA QUERY OUTPUT>")
Using Non-Default Profile Name:
library(DBI) con <- dbConnect(RAthena::athena(), profile_name = "rathena", s3_staging_dir = "<LOCATION FOR ATHENA QUERY OUTPUT>")
ARN Roles:
ARN roles are fairly useful if you need to assume a role that can connect to another AWS account and use the Athena
in that account. Or whether you want to create a temporary connection with different permissions than your current role (AWS ARN Documentation).
Assuming ARN role credentials before connecting to Athena:
library(RAthena) library(DBI) assume_role(profile_name = "YOUR_PROFILE_NAME", role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name", set_env = TRUE) # Connect to Athena using ARN Role con <- dbConnect(athena(), s3_staging_dir = "s3://path/to/query/bucket/")
Connect to Athena directly using ARN role:
library(DBI) con <- dbConnect(athena(), profile_name = "YOUR_PROFILE_NAME", role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name", s3_staging_dir = 's3://path/to/query/bucket/')
Note: ARN Roles have a duration timer before they will expire. To change the default you can increase the duration_seconds
parameter from the default 3600 seconds (1 hour).
Temporary Sessions:
Finally you can create temporary credentials before connecting to Athena
.
library(RAthena) library(DBI) # Create Temporary Credentials duration 1 hour get_session_token("YOUR_PROFILE_NAME", serial_number='arn:aws:iam::123456789012:mfa/user', token_code = "531602", set_env = TRUE) # Connect to Athena using temporary credentials con <- dbConnect(athena(), s3_staging_dir = "s3://path/to/query/bucket/")
Note: This method will work for users who have set up Multi-Factor Authentication (MFA).
Basic Usage
Now we have created a connection to Athena
we can ulitise DBI
methods to query Athena
for example:
All available tables in Athena:
dbListTables(con)
Send Query to Athena
res <- dbSendQuery(con, "SELECT * FROM INFORMATION_SCHEMA.COLUMNS") dbFetch(res) dbClearResult(res)
Or …
res <- dbExecute(con, "SELECT * FROM INFORMATION_SCHEMA.COLUMNS") dbFetch(res) dbClearResult(res)
Or …
dbGetQuery
wraps sending, fetching and clearing results in one easy step.
dbGetQuery(con, "SELECT * FROM INFORMATION_SCHEMA.COLUMNS")
Note: You might of noticed that if you have data.table
installed, RAthena
will attempt to return the data as a data.table
. This is to improve speed when larger queries are returned from Athena
. If you don’t have data.table
, RAthena
will return the output as a data.frame
.
Get Column information
res <- dbSendQuery(con, "SELECT * FROM INFORMATION_SCHEMA.COLUMNS") dbColumnInfo(res) dbClearResult(res)
To learn about what DBI
methods have been implemented in RAthena
please refer to: link.
Intermediate Usage:
It is all very well querying data from Athena
but what is more useful is to upload data as well. RAthena
has addressed this and implemented a method in dbWriteTable
dbWriteTable(con, "mtcars", mtcars, partition=c("TIMESTAMP" = format(Sys.Date(), "%Y%m%d")), s3.location = "s3://mybucket/data/")
Once you have uploaded data into Athena
you can query it in the following:
dbGetQuery(con, "select * from mtcars")
Here are all variable parameters for the dbWriteTable
method:
conn: An AthenaConnection object, produced by dbConnect()
name: A character string specifying a table name. Names will be automatically quoted so you can use any sequence of characters, not just any valid bare table name.
value: A data.frame to write to the database.
overwrite: Allow overwriting the destination table. Cannot be ‘TRUE’ if ‘append’ is also ‘TRUE’.
append: Allow appending to the destination table. Cannot be ‘TRUE’ if ‘overwrite’ is also ‘TRUE’.
row.names: Either TRUE, FALSE, NA or a string. If TRUE, always translate row names to a column called “row_names”. If FALSE, never translate row names. If NA, translate rownames only if they’re a character vector. A string is equivalent to TRUE, but allows you to override the default name. For backward compatibility, NULL is equivalent to FALSE.
field.types: Additional field types used to override derived types.
partition: Partition Athena table (needs to be a named list or vector) for example: c(var1 = “2019-20-13”)
s3.location s3 bucket to store Athena table, must be set as a s3 uri for example (“s3://mybucket/data/“)
file.type: What file type to store data.frame on s3, RAthena currently supports [“csv”, “tsv”, “parquet”]. Note: file.type “parquet” is supported by R package
arrow
and will need to be installed separately if you wish to upload data.frames in “parquet” format.…: Other arguments used by individual methods.
Tidyverse Usage:
RAthena
can integrate with the famous R package dplyr.
library(DBI) library(dplyr) con <- dbConnect(RAthena::athena(), profile_name = "rathena", s3_staging_dir = "<LOCATION FOR ATHENA QUERY OUTPUT>") tbl(con, sql("SELECT * FROM INFORMATION_SCHEMA.COLUMNS"))
Or if you have already uploaded a table into Athena
tbl(con, "mtcars")
Conclusion:
So hopefully this has given you an insight into the up coming package RAthena
and it’s usefulness. This package is not meant to replace any of the other packages that connect into Athena
but give another route into Athena
for R users.
Final note: RAthena
offers a lot more functionality please check it out at Github. If you have any suggestions please raise an issue at Github Issues.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.