AWS Athena dplyr unload
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
RBloggers|RBloggers-feedburner
Intro:
This is a quick update on the latest features for RAthena and noctua 2.4.0.
Latest features:
dbplyr:
RAthena and noctua now fully supports dbplyr backend api 2+. dplyr database generics will be deprecated in later versions of the dbplyr package development. This is to future proof RAthena and noctua, while keeping the same functionality developed for dbplyr backend api version 1.
dplyr and unload:
RAthena and noctua can now set AWS Athena Unload on a session level. This enables dplyr syntax to leverage AWS Athena unload without any extra code.
Note: Set up AWS Athena example table using AWS Athena awswrangler example:
# Python
import awswrangler as wr
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/data/"
if "awswrangler_test" not in wr.catalog.databases().values:
wr.catalog.create_database("awswrangler_test")
cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
df = wr.s3.read_csv(
path="s3://noaa-ghcn-pds/csv/189",
names=cols,
parse_dates=["dt", "obs_time"]) # Read 10 files from the 1890 decade (~1GB)
wr.s3.to_parquet(
df=df,
path=path,
dataset=True,
mode="overwrite",
database="awswrangler_test",
table="noaa"
);
wr.catalog.table(database="awswrangler_test", table="noaa")
Set up connection to AWS Athena.
library(DBI) library(RAthena) # or library(noctua) library(dplyr, warn.conflicts = F) con <- dbConnect(athena())
Query AWS Athena table without using AWS Athena unload method.
noaa_tbl = tbl(con, dbplyr::in_schema("awswrangler_test","noaa"))
system.time({noaa = collect(noaa_tbl)})
#> Info: (Data scanned: 80.86 MB)
#> user system elapsed
#> 67.144 7.353 126.914
dim(noaa)
#> 29554220 8
Query AWS Athena table using AWS Athena unload method.
RAthena_options(unload = TRUE) # or noctua_options(unload = TRUE) for noctua
system.time({noaa = collect(noaa_tbl)})
#> Info: (Data scanned: 80.86 MB)
#> user system elapsed
#> 11.062 2.317 32.078
dim(noaa)
#> 29554220 8
Query AWS Athena table using AWS Athena unload method, while caching.
RAthena_options(cache_size = 10, unload = TRUE) # or noctua_options(cache_size = 10, unload = TRUE) for noctua
system.time({noaa = collect(noaa_tbl)})
#> Info: (Data scanned: 80.86 MB)
#> user system elapsed
#> 10.768 2.541 12.305
dim(noaa)
#> 29554220 8
Benchmark ran on AWS Sagemaker
NOTE: Cache speeds will only benefit repeat queries!
AWS Athena Unload method doesn’t work on every sql query type. As a result, several dplyr methods won’t work when unload is set to True, for example:
noaa_tbl %>% filter(element == "TMAX") %>% compute() #> Error: Unable to create table when `RAthena_options(unload = TRUE)`. Please run `RAthena_options(unload = FALSE)` and try again.
In general AWS Athena unload method is a nice way to speed up some queries when working with RAthena and noctua when the data returning is large.
Finally
If there is any new features or bug fixes please raise them at RAthena issues or noctua issues.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.