On the Road to 0.8.0 — Some Additional New Features Coming in the sergeant Package
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It was probably not difficult to discern from my previous Drill-themed post that I’m fairly excited about the Apache Drill 1.15.0 release. I’ve rounded out most of the existing corners for it in preparation for a long-overdue CRAN update and have been concentrating on two helper features: configuring & launching Drill embedded Docker containers and auto-generation of Drill CTAS queries.
Drill Docker Goodness
Starting with version 1.14.0, Apache provides Drill Docker images for use in experimenting/testing/building-off-of. They run Drill in single node standalone mode so you’re not going to be running this in “production” (unless you have light or just personal workloads). Docker is a great way to get to know Drill if you haven’t already played with it since you don’t have do do much except run the Docker image.
I’ve simplified this even more thanks to @rgfitzjohn’s most excellent stevedore
package which adds a robust R wrapper to the Docker client without relying on any heavy external dependencies such as reticulate
. The new drill_up()
function will auto-fetch the latest Drill image and launch a container so you can have a running Drill instance with virtually no effort on your part.
Just running the vanilla image isn’t enough since your goal is likely to do more than work with the built-in cp
data source. The default container launch scenario also doesn’t hook up any local filesystem paths to the container so you really can’t do much other than cp
-oriented queries. Rather than make you do all the work of figuring out how to machinate Docker command line arguments and manually configure a workspace that points to a local filesystem area in the Drill web admin GUI the drill_up()
function provides a data_dir
argument (that defaults to the getwd()
where you are in your R session) which will then auto-wire up that path into the container and create a dfs.d
workspace which auto-points to it for you. Here’s a sample execution:
library(sergeant) library(tidyverse) dr <- drill_up(data_dir = "~/Data") ## Drill container started. Waiting for the service to become active (this may take up to 30s). ## Drill container ID: f02a11b50e1647e44c4e233799180da3e907c8aa27900f192b5fd72acfa67ec0
You can use dc$stop()
to stop the container or use the printed container id to do it from the command line.
We’ll use this containerized Drill instance with the next feature but I need to thank @cboettig for the suggestion to make an auto-downloader-runner-thingy before doing that. (Thank you @cboettig!)
Taking the Tedium out of CTAS
@dseverski, an intrepid R, Drill & sergeant
user noticed some new package behavior with Drill 1.15.0 that ended up spawning a new feature: automatic generation of Drill CTAS statements.
Prior to 1.14.0 sergeant
had no way to accurately, precisely tell data types of the columns coming back since the REST API didn’t provide them (as noted in the previous Drill post). Now, it did rely on the JSON types to create the initial data frames but id also did something **kinda horribad*: it ran readr::type_convert()
on the result sets . Said operation had the singular benefit of auto-converting CSV/CSVH/TSV/PSV/etc data to something sane without having to worry about writing lengthy CTAS queries (at the expense of potentially confusing everyone, though that didn’t seem to happen).
With 1.15.0, the readr::type_convert()
crutch is gone, which results in less-than-helpful things like this when you have delimiter-separated values data:
# using the Drill container we just started above write_csv(nycflights13::flights, "~/Data/flights.csvh") con <- src_drill("localhost") tbl(con, "dfs.d.`flights.csvh`") %>% glimpse() ## Observations: ?? ## Variables: 19 ## Database: DrillConnection ## $ year <chr> "2013", "2013", "2013", "2013", "2013", "2013", "2013", "2013… ## $ month <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "… ## $ day <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "… ## $ dep_time <chr> "517", "533", "542", "544", "554", "554", "555", "557", "557"… ## $ sched_dep_time <chr> "515", "529", "540", "545", "600", "558", "600", "600", "600"… ## $ dep_delay <chr> "2", "4", "2", "-1", "-6", "-4", "-5", "-3", "-3", "-2", "-2"… ## $ arr_time <chr> "830", "850", "923", "1004", "812", "740", "913", "709", "838… ## $ sched_arr_time <chr> "819", "830", "850", "1022", "837", "728", "854", "723", "846… ## $ arr_delay <chr> "11", "20", "33", "-18", "-25", "12", "19", "-14", "-8", "8",… ## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "… ## $ flight <chr> "1545", "1714", "1141", "725", "461", "1696", "507", "5708", … ## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "… ## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"… ## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"… ## $ air_time <chr> "227", "227", "160", "183", "116", "150", "158", "53", "140",… ## $ distance <chr> "1400", "1416", "1089", "1576", "762", "719", "1065", "229", … ## $ hour <chr> "5", "5", "5", "5", "6", "5", "6", "6", "6", "6", "6", "6", "… ## $ minute <chr> "15", "29", "40", "45", "0", "58", "0", "0", "0", "0", "0", "… ## $ time_hour <chr> "2013-01-01T10:00:00Z", "2013-01-01T10:00:00Z", "2013-01-01T1…
So the package does what it finally should have been doing all along. But, as noted, that’s not great if you just wanted to quickly work with a directory of CSV files. In theory, you’re supposed to use Drill’s CREATE TABLE AS
then do a bunch of CASTS
and TO_
s to get proper data types. But who has time for that?
David had a stellar idea, might sergeant
be able to automagically create CTAS statements from a query?. Yes. Yes it just might be able to do that with the new ctas_profile()
function.
Let’s pipe the previous tbl()
into ctas_profile()
and see what we get:
tbl(con, "dfs.d.`flights.csvh`") %>% ctas_profile() %>% cat() -- ** Created by ctas_profile() in the R sergeant package, version 0.8.0 ** CREATE TABLE CHANGE____ME AS SELECT CAST(`year` AS DOUBLE) AS `year`, CAST(`month` AS DOUBLE) AS `month`, CAST(`day` AS DOUBLE) AS `day`, CAST(`dep_time` AS DOUBLE) AS `dep_time`, CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`, CAST(`dep_delay` AS DOUBLE) AS `dep_delay`, CAST(`arr_time` AS DOUBLE) AS `arr_time`, CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`, CAST(`arr_delay` AS DOUBLE) AS `arr_delay`, CAST(`carrier` AS VARCHAR) AS `carrier`, CAST(`flight` AS DOUBLE) AS `flight`, CAST(`tailnum` AS VARCHAR) AS `tailnum`, CAST(`origin` AS VARCHAR) AS `origin`, CAST(`dest` AS VARCHAR) AS `dest`, CAST(`air_time` AS DOUBLE) AS `air_time`, CAST(`distance` AS DOUBLE) AS `distance`, CAST(`hour` AS DOUBLE) AS `hour`, CAST(`minute` AS DOUBLE) AS `minute`, TO_TIMESTAMP(`time_hour`, 'FORMATSTRING') AS `time_hour` -- *NOTE* You need to specify the format string. Sample character data is: [2013-01-01T10:00:00Z]. FROM (SELECT * FROM dfs.d.`flights.csvh`) -- TIMESTAMP and/or DATE columns were detected. Drill's date/time format string reference can be found at: -- -- <http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html>
There’s a parameter for the new table name which will cause the CHANGE____ME
to go away and when the function finds TIMESTAMP
or DATE
fields it knows to switch to their TO_
cousins and gives sample data with a reminder that you need to make a format string (I’ll eventually auto-generate them unless someone PRs it first). And, since nodoby but Java programmers remember Joda format strings (they’re different than what you’re used to) it provides a handy link to them if it detects the presence of those column types.
Now, we don’t need to actually create a new table (though converting a bunch of CSVs to Parquet is likely a good idea for performance reasons) to use that output. We can pass most of that new query right to tbl()
:
tbl(con, sql(" SELECT CAST(`year` AS DOUBLE) AS `year`, CAST(`month` AS DOUBLE) AS `month`, CAST(`day` AS DOUBLE) AS `day`, CAST(`dep_time` AS DOUBLE) AS `dep_time`, CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`, CAST(`dep_delay` AS DOUBLE) AS `dep_delay`, CAST(`arr_time` AS DOUBLE) AS `arr_time`, CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`, CAST(`arr_delay` AS DOUBLE) AS `arr_delay`, CAST(`carrier` AS VARCHAR) AS `carrier`, CAST(`flight` AS DOUBLE) AS `flight`, CAST(`tailnum` AS VARCHAR) AS `tailnum`, CAST(`origin` AS VARCHAR) AS `origin`, CAST(`dest` AS VARCHAR) AS `dest`, CAST(`air_time` AS DOUBLE) AS `air_time`, CAST(`distance` AS DOUBLE) AS `distance`, CAST(`hour` AS DOUBLE) AS `hour`, CAST(`minute` AS DOUBLE) AS `minute`, TO_TIMESTAMP(`time_hour`, 'yyyy-MM-dd''T''HH:mm:ssZ') AS `time_hour` -- [2013-01-01T10:00:00Z]. FROM (SELECT * FROM dfs.d.`flights.csvh`) ")) %>% glimpse() ## Observations: ?? ## Variables: 19 ## Database: DrillConnection ## $ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2… ## $ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ day <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ dep_time <dbl> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 5… ## $ sched_dep_time <dbl> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 6… ## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -… ## $ arr_time <dbl> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, … ## $ sched_arr_time <dbl> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, … ## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31,… ## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "… ## $ flight <dbl> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71,… ## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "… ## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"… ## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"… ## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 34… ## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, … ## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6… ## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0… ## $ time_hour <dttm> 2013-01-01 10:00:00, 2013-01-01 10:00:00, 2013-01-01 10:00:0…
Ahhhh… Useful data types. (And, see what I mean about that daft format string? Also, WP is mangling the format string so add a comment if you need the actual string.)
FIN
As you can see questions, suggestions (and PRs!) are welcome and heeded on your social-coding platform of choice (though y’all still seem to be stuck on GH ).
NOTE: I’ll be subbing out most install_github()
links in READMEs and future blog posts for install_git()
counterparts pointing to my sr.ht repos (as I co-locate/migrate them there).
You can play with the new 0.8.0 features via devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant", ref="0.8.0")
.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.