AWS Athena dplyr unload
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
RBloggers|RBloggers-feedburner
Intro:
This is a quick update on the latest features for RAthena
and noctua
2.4.0.
Latest features:
dbplyr
:
RAthena
and noctua
now fully supports dbplyr
backend api 2+. dplyr
database generics will be deprecated in later versions of the dbplyr
package development. This is to future proof RAthena
and noctua
, while keeping the same functionality developed for dbplyr
backend api version 1.
dplyr
and unload:
RAthena
and noctua
can now set AWS Athena Unload
on a session level. This enables dplyr syntax to leverage AWS Athena unload
without any extra code.
Note: Set up AWS Athena example table using AWS Athena awswrangler example:
# Python import awswrangler as wr import getpass bucket = getpass.getpass() path = f"s3://{bucket}/data/" if "awswrangler_test" not in wr.catalog.databases().values: wr.catalog.create_database("awswrangler_test") cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"] df = wr.s3.read_csv( path="s3://noaa-ghcn-pds/csv/189", names=cols, parse_dates=["dt", "obs_time"]) # Read 10 files from the 1890 decade (~1GB) wr.s3.to_parquet( df=df, path=path, dataset=True, mode="overwrite", database="awswrangler_test", table="noaa" ); wr.catalog.table(database="awswrangler_test", table="noaa")
Set up connection to AWS Athena
.
library(DBI) library(RAthena) # or library(noctua) library(dplyr, warn.conflicts = F) con <- dbConnect(athena())
Query AWS Athena
table without using AWS Athena unload
method.
noaa_tbl = tbl(con, dbplyr::in_schema("awswrangler_test","noaa")) system.time({noaa = collect(noaa_tbl)}) #> Info: (Data scanned: 80.86 MB) #> user system elapsed #> 67.144 7.353 126.914 dim(noaa) #> 29554220 8
Query AWS Athena
table using AWS Athena unload
method.
RAthena_options(unload = TRUE) # or noctua_options(unload = TRUE) for noctua system.time({noaa = collect(noaa_tbl)}) #> Info: (Data scanned: 80.86 MB) #> user system elapsed #> 11.062 2.317 32.078 dim(noaa) #> 29554220 8
Query AWS Athena
table using AWS Athena unload
method, while caching.
RAthena_options(cache_size = 10, unload = TRUE) # or noctua_options(cache_size = 10, unload = TRUE) for noctua system.time({noaa = collect(noaa_tbl)}) #> Info: (Data scanned: 80.86 MB) #> user system elapsed #> 10.768 2.541 12.305 dim(noaa) #> 29554220 8
Benchmark ran on AWS Sagemaker
NOTE: Cache speeds will only benefit repeat queries!
AWS Athena Unload
method doesn’t work on every sql
query type. As a result, several dplyr
methods won’t work when unload is set to True
, for example:
noaa_tbl %>% filter(element == "TMAX") %>% compute() #> Error: Unable to create table when `RAthena_options(unload = TRUE)`. Please run `RAthena_options(unload = FALSE)` and try again.
In general AWS Athena unload
method is a nice way to speed up some queries when working with RAthena
and noctua
when the data returning is large.
Finally
If there is any new features or bug fixes please raise them at RAthena
issues or noctua
issues.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.