R⁶ Series — Random Sampling From Apache Drill Tables With R & sergeant
[This article was first published on R – rud.is, and kindly contributed to R-bloggers].
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
(For first-timers, R⁶ tagged posts are short & sweet with minimal exposition.)
At work-work I mostly deal with medium-to-large-ish data. I often want to poke at new or existing data sets w/o working across billions of rows. I also use Apache Drill for much of my exploratory work.
Here’s how to uniformly sample data from Apache Drill using the sergeant package:
    library(sergeant)

    db <- src_drill("sonar")
    tbl <- tbl(db, "dfs.dns.`aaaa.parquet`")

    summarise(tbl, n=n())
    ## # Source:   lazy query [?? x 1]
    ## # Database: DrillConnection
    ##          n
    ##      <int>
    ## 1 19977415

    mutate(tbl, r=rand()) %>%
      filter(r <= 0.01) %>%
      summarise(n=n())
    ## # Source:   lazy query [?? x 1]
    ## # Database: DrillConnection
    ##        n
    ##    <int>
    ## 1 199808

    mutate(tbl, r=rand()) %>%
      filter(r <= 0.50) %>%
      summarise(n=n())
    ## # Source:   lazy query [?? x 1]
    ## # Database: DrillConnection
    ##         n
    ##     <int>
    ## 1 9988797
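A quick sanity check on those counts: assuming Drill's `rand()` yields an independent uniform draw in [0, 1) for every row, each row survives `filter(r <= p)` with probability `p`, so the sample size should follow a Binomial(n, p) distribution. The base R sketch below (no Drill connection needed; the column names are just illustrative) compares the reported counts with that expectation.

```r
# Row count and sample sizes exactly as reported in the output above.
n_rows   <- 19977415
observed <- data.frame(
  p         = c(0.01, 0.50),
  n_sampled = c(199808, 9988797)
)

# Assumption: each row is kept independently with probability p,
# so the sample size is Binomial(n_rows, p).
observed$expected <- n_rows * observed$p
observed$sd       <- sqrt(n_rows * observed$p * (1 - observed$p))

# How many standard deviations each reported count sits from its expectation.
observed$z <- (observed$n_sampled - observed$expected) / observed$sd

print(observed)
```

Both reported counts land well within one standard deviation of `n_rows * p`, which is what you'd hope for from a uniform sample. It also means the sample size is only approximately `p` times the table size, never exact.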
And, for groups (using a different/larger “database”):
    fdns <- tbl(db, "dfs.fdns.`201708`")

    summarise(fdns, n=n())
    ## # Source:   lazy query [?? x 1]
    ## # Database: DrillConnection
    ##            n
    ##        <int>
    ## 1 1895133100

    filter(fdns, type %in% c("cname", "txt")) %>%
      count(type)
    ## # Source:   lazy query [?? x 2]
    ## # Database: DrillConnection
    ##    type        n
    ##   <chr>    <int>
    ## 1 cname 15389064
    ## 2   txt 67576750

    filter(fdns, type %in% c("cname", "txt")) %>%
      group_by(type) %>%
      mutate(r=rand()) %>%
      ungroup() %>%
      filter(r <= 0.15) %>%
      count(type)
    ## # Source:   lazy query [?? x 2]
    ## # Database: DrillConnection
    ##    type        n
    ##   <chr>    <int>
    ## 1 cname  2307604
    ## 2   txt 10132672
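To poke at the grouped version w/o a Drill instance, here is a minimal local sketch in base R. The `records` data frame, the 20/80 type mix and the seed are all made up for illustration, and `runif()` stands in for Drill's `rand()` (assumed to behave like an independent uniform draw per row). It is not the query sergeant sends to Drill, just the same idea on a toy table.

```r
set.seed(2018)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the Drill table: 100,000 rows, two record types.
records <- data.frame(
  type = factor(
    sample(c("cname", "txt"), size = 100000, replace = TRUE, prob = c(0.2, 0.8)),
    levels = c("cname", "txt")
  )
)

# One independent uniform draw per row, mirroring mutate(r = rand()).
records$r <- runif(nrow(records))

# Keep rows whose draw falls at or below the sampling fraction.
p <- 0.15
sampled <- records[records$r <= p, ]

# Fraction of each type that survived the filter.
kept_fraction <- table(sampled$type) / table(records$type)
print(round(kept_fraction, 3))
```

Because every row gets its own draw, each group should keep roughly the same fraction `p` regardless of its size. That matches the Drill output above, where both `cname` and `txt` come out at about 15% of their original counts.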
I will (hopefully) be better at cranking out these bite-sized posts more frequently in 2018.
To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.