Site icon R-bloggers

The next package release into AWS Athena

[This article was first published on Dyfan Jones Brain Dump HQ, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

RBloggers|RBloggers-feedburner

RAthena 1.7.1 and noctua 1.5.1 package versions have now been released to the CRAN. They both bring along several improvements with the connection to AWS Athena, noticeably the performance speed and several creature comforts.

These packages have both been designed to reflect one another,even down to how they connect to AWS Athena. This means that all features going forward will exist in both packages. I will refer to these packages as one, as they basically work in the same way.

Performance improvements:

Initially the packages utilised AWS Athena SQL queries. This was to achieve all the functional requirements of the DBI package framework. However the package would always send a SQL query to AWS Athena which in turn would have to lift a flat file from AWS S3, before returning the final result to R. This means the performance of the packages would be limited and fairly slow compared to other data base backends.

The biggest change is the adoption of more functionality of the SDKs (software development kit) into AWS. The key component that has been adopted is AWS Glue. AWS Glue contains all of AWS Athena table DDL’s. This means instead of going to AWS Athena for this information AWS Glue can be used instead.

By utilising AWS Glue, the table data (column names, column types, schema hierarchy etc…) can easily be retrieved at a fraction of the time it would of taken to query AWS Athena. Previously the DBI function dbListTables would send a query to AWS Athena, this would retrieve all the tables listed in all schemas. This would take over 3 seconds. Now using AWS Glue to retrieve the same data, it takes less than 0.5 of a second.

dplyr

When AWS Glue is used to collect metadata around a table in AWS Athena, a performance in dplyr::tbl can be done. I would like to say thanks to @OssiLehtinen for developing the initial implementation as this improvement would have been overlooked.

dplyr::tbl has two key methods when creating the initial object. The first is called SQL identifiers and this is the method that benefits from the new AWS Glue functionality. To use SQL identifiers is fairly straight forward.

library(DBI)
library(dplyr)
library(RAthena) #Or library(noctua)

con = dbConnect(athena())

dbWriteTable(con, "iris", iris)

ident_iris = tbl(con, "iris")

dplyr can identify the iris table within the connected schema. When a user uses the SQL identifier method in dplyr::tbl, AWS Glue is called to retrieve all the data for dplyr. This increases the performance from 3.66 to 0.29 seconds. The second method is called SQL sub query. This unfortunately won’t benefit from the new feature and will run in slower at 3.66 seconds.

subquery_iris = tbl(con, sql("select * from iris"))

Therefore I recommend the use of SQL identifier method when using dplyr's interface.

Creature Comforts

AWS Athena Metadata

Due to user feature requests the packages now return more data around each query sent to AWS Athena. Thus the basic level of data returned, is the amount of data scanned by AWS Athena. This is formatted into a readable format depending on the amount of data scanned.

library(DBI)
library(RAthena) #Or library(noctua)

con = dbConnect(athena())

dbWriteTable(con, "iris", iris)

dbGetQuery(con, "select * from iris")

Info: (Data scanned: 860 Bytes)
     sepal_length sepal_width petal_length petal_width   species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---                                                            
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

However if you set the new parameter statistics to TRUE then all the metadata around that query is printed out like so:

dbGetQuery(con, "select * from iris", statistics = TRUE)

$EngineExecutionTimeInMillis
[1] 1568

$DataScannedInBytes
[1] 860

$DataManifestLocation
character(0)

$TotalExecutionTimeInMillis
[1] 1794

$QueryQueueTimeInMillis
[1] 209

$QueryPlanningTimeInMillis
[1] 877

$ServiceProcessingTimeInMillis
[1] 17

Info: (Data scanned: 860 Bytes)
     sepal_length sepal_width petal_length petal_width   species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
  4:          4.6         3.1          1.5         0.2    setosa
  5:          5.0         3.6          1.4         0.2    setosa
 ---                                                            
146:          6.7         3.0          5.2         2.3 virginica
147:          6.3         2.5          5.0         1.9 virginica
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

This can also be retrieved by using dbStatistics:

res = dbExecute(con, "select * from iris")

# return query statistic
query_stats = dbStatistics(res)

# return query results
dbFetch(res)

# Free all resources
dbClearResult(res)

RJDBC inspired function

I have to give full credit to the package RJDBC for inspiring me to create this function. DBI has got a good function called dbListTables that will list all the tables that are in AWS Athena. However it won’t return to which schema each individual table is related to. To over come this RJDBC has a excellent function called dbGetTables. This function returns all the tables from AWS Athena as a data.frame. This has the advantage of detailing schema, table and table type. With the new integration into AWS Glue this can be returned quickly.

dbGetTables(con)

      Schema             TableName      TableType
 1:  default             df_bigint EXTERNAL_TABLE
 2:  default                  iris EXTERNAL_TABLE
 3:  default               mtcars2 EXTERNAL_TABLE
 4:  default         nyc_taxi_2018 EXTERNAL_TABLE

This just makes it a little bit easier when working in different IDE’s for example Jupyter.

Backend option changes

This is not really a creature comfort but it still interesting and useful. Both packages are dependent on data.table to read data into R. This is down to the amazing speed data.table offers when reading files into R. However a new package, with equally impressive read speeds, has come onto the scene called vroom. As vroom has been designed to only read data into R similarly to readr, data.table is still used for all of the heavy lifting. However if a user wishes to use vroom as the file parser an *_options function has been created to enable this:

nocuta_options(file_parser = c("data.table", "vroom"))

# Or 

RAthena__options(file_parser = c("data.table", "vroom"))

By setting the file_parser to vroom then the backend will change to allow vroom's file parser to be used instead of data.table.

If you aren’t sure whether to use vroom over data.table, I draw your attention to vroom boasting a whopping 1.40GB/sec throughput.

Statistics taken from vroom’s github readme

package version time (sec) speed-up throughput
vroom 1.1.0 1.14 58.44 1.40 GB/sec
data.table 1.12.8 11.88 5.62 134.13 MB/sec
readr 1.3.1 29.02 2.30 54.92 MB/sec
read.delim 3.6.2 66.74 1.00 23.88 MB/sec

RStudio Interface!

Due to the ability of AWS Glue to retrieve metadata for AWS Athena at speed, it has now been possible to add the interface into RStudio’s connection tab. When a connection is established:

library(DBI)
library(RAthena) #Or library(noctua)

con = dbConnect(athena())

The connection icon will as follows:

The AWS region you are connecting to will be reflected in the connection (highlighted above in the red square). This is to help users that are able to connect to multiple different AWS Athena over different regions.

Once you have connected AWS Athena, schema hierarchy will be displayed. In my example you can see some of the tables I have created when testing these packages.

For more information around RStudio’s connection tab please check out RStudio preview connections.

Summary

To sum up, the Rathena and noctua latest versions have been released to cran with all the new goodies they bring. As these packages are based on AWS SDK’s they are highly customisable. Features can easily be added to improve the packages when connecting to AWS Athena. So please raise any feature requests / bug issues to: https://github.com/DyfanJones/RAthena and https://github.com/DyfanJones/noctua

To leave a comment for the author, please follow the link and comment on their blog: Dyfan Jones Brain Dump HQ.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.