Apache Drill 1.15.0 + sergeant 0.8.0 = pcapng Support, Proper Column Types & Mounds of New Metadata

hrbrmstr

3 years ago

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Apache Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores […] without having to create and manage schemas. […] It has a schema-free JSON document model similar to MongoDB and Elasticsearch; [a plethora of APIs, including] ANSI SQL, ODBC/JDBC, and HTTP[S] REST; [is] extremely user and developer friendly; [and, has a] pluggable architecture enables connectivity to multiple datastores.

To ring in the new year the Drill team knocked out a new 1.15.0 release with a cadre of new functionality including:

CROSS JOIN support
New metadata tables for files, system functiosn, and system functions
pcapng support
enhanced datetime functions
better security
and, web UI improvements

One super-helpful new feature of the REST API is that it now returns query results metadata along with the query results themselves. This means REST API endpoints finally know both column order and column type. This gave me cause to re-visit the sergeant package [GL|GH] and make some accommodations for some of these new features.

Ushering In A New Order

Drill REST API queries return a "columns" field and "metadata" field with the data itself. We can use that to force an order to the columns as well as mostly use proper types (vs JSON-parsed/guessed types). I say mostly since the package still uses jsonlite to parse the results and there’s no support for 64-bit integers in jsonlite (more on this later).

We’ll use the example from DRILL-6847 and use the example provided by Charles Givre in his Jira issue since it will let me demonstrate more of that “mostly” comment and show off another new feature:

library(sergeant) # 0.8.0 branch of sergeant on gitlab or github
library(tidyverse)

con <- src_drill("localhost")

x <- tbl(con, "cp.`employee.json`")

mutate(x, employee_id = as.integer64(employee_id)) %>% 
  mutate(position_id = as.integer64(position_id)) %>% 
  select(
    employee_id, full_name, first_name, last_name, 
    position_id, position_title
  ) -> bigint_result

The above is (logically):

SELECT 
  CAST (employee_id AS INT) AS employee_id,
  full_name,
  first_name, 
  last_name, 
  CAST (position_id AS BIGINT) AS position_id, 
  position_title 
FROM cp.`employee.json`

What do we get when we take a preview of the result?

bigint_result
## # Source: lazy query [?? x 6]
## # Database: DrillConnection
## employee_id full_name first_name last_name position_id position_title
## <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1 Sheri Now… Sheri Nowmer 1 President
## 2 2 Derrick W… Derrick Whelply 2 VP Country Man…
## 3 4 Michael S… Michael Spence 2 VP Country Man…
## 4 5 Maya Guti… Maya Gutierrez 2 VP Country Man…
## 5 6 Roberta D… Roberta Damstra 3 VP Information…
## 6 7 Rebecca K… Rebecca Kanagaki 4 VP Human Resou…
## 7 8 Kim Brunn… Kim Brunner 11 Store Manager
## 8 9 Brenda Bl… Brenda Blumberg 11 Store Manager
## 9 10 Darren St… Darren Stanz 5 VP Finance
## 10 11 Jonathan … Jonathan Murraiin 11 Store Manager
## # ... with more rows

Warning message:
One or more columns are of type BIGINT. The sergeant package currently uses jsonlite::fromJSON()
to process Drill REST API result sets. Since jsonlite does not support 64-bit integers BIGINT
columns are initially converted to numeric since that's how jsonlite::fromJSON() works. This is
problematic for many reasons, including trying to use 'dplyr' idioms with said converted
BIGINT-to-numeric columns. It is recommended that you 'CAST' BIGINT columns to 'VARCHAR' prior to
working with them from R/'dplyr'.

If you really need BIGINT/integer64 support, consider using the R ODBC interface to Apache Drill
with the MapR ODBC drivers.

This informational warning will only be shown once per R session and you can disable them from
appearing by setting the 'sergeant.bigint.warnonce' option to 'FALSE'
(i.e. options(sergeant.bigint.warnonce = FALSE)).

The first thing sergeant users will notice is proper column order (before it just returned the columns in the order they came back in the JSON rows[] structure). The second thing is that we didn’t get integer64s back. Instead, we got doubles plus an information warning about why and what you can do about it. Said warning only displays once per-session and can be silenced with the option sergeant.bigint.warnonce. i.e. just put:

options(sergeant.bigint.warnonce = FALSE)

in your script or ~/.Rprofile and you won’t hear from it again.

The as.integer64() we used is not from the bit64 package but an internal sergeant package function that knows how to translate said operation to, e.g. CAST( employee_id AS BIGINT ).

You can use the ODBC drivers to gain BIGINT support and there are plans for the 0.8.0 branch to eventually use rapidjsonr at the C++-level to provide direct in-package support for BIGINTs as well.

Better Error Messages

Drill query errors that the sergeant package bubbled up through its various interfaces have not been pretty or all that useful. This has changed with the 0.8.0 branch. Let’s take a look:

tbl(con, "cp.employees.json")
## # Source:   table<cp.employees.json> [?? x 4]
## # Database: DrillConnection

Warning message:
VALIDATION ERROR: From line 2, column 6 to line 2, column 24: Object 'cp.employees.json' not found

Original Query:

  1: SELECT *
  2: FROM `cp.employees.json`
  3: LIMIT 10

Query Profile Error Link:
http://localhost:8047/profiles/079fc8cf-19c6-4c78-95a9-0b949a3ecf4c

As you can see in the above output, you now get a highly-formatted return value with the original SQL query broken into lines (with line numbers) and a full link to the Drill query profile so you can dig in to the gnarly details of complex query issues. As you work with this and find edge cases I missed for messages, drop an issue on your social-coding site of choice.

SUPPORT ALL THE PCAPs!

Drill has had packet capture (PCAP) file support for a while now and 1.15.0 adds support for the more modern/rich pcapng format. To enable support for this you need to add "pcapng": {"type": "pcapng", "extensions": ["pcapng"] }, to the "formats" section of your storage plugins and also configure a workspace directory to use that as the default (the principle of which is covered here).

We’ll use one of the Wireshark example captures to demonstrate:

pcaps <- tbl(con, "dfs.caps.`*.pcapng`")

glimpse(pcaps)
## Observations: ??
## Variables: 25
## $ tcp_flags_ece_ecn_capable            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_ece_congestion_experienced <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_psh                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ type                                 <chr> "TCP", "TCP", "TCP", "TCP...
## $ tcp_flags_cwr                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ dst_ip                               <chr> "74.125.28.139", "10.254....
## $ src_ip                               <chr> "10.254.157.208", "74.125...
## $ tcp_flags_fin                        <int> 1, 1, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_ece                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags                            <int> 17, 17, 16, 16, 16, 0, 0,...
## $ tcp_flags_ack                        <int> 1, 1, 1, 1, 1, 0, 0, 0, 0...
## $ src_mac_address                      <chr> "00:05:9A:3C:7A:00", "00:...
## $ tcp_flags_syn                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_rst                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ timestamp                            <dttm> 2015-04-14 07:19:25, 201...
## $ tcp_session                          <dbl> 8.353837e+17, 8.353837e+1...
## $ packet_data                          <chr> "\"3DU...<z...E..(J.@.......
## $ tcp_parsed_flags                     <chr> "ACK|FIN", "ACK|FIN", "AC...
## $ tcp_flags_ns                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ src_port                             <int> 60268, 443, 60268, 58382,...
## $ packet_length                        <int> 54, 54, 54, 55, 66, 78, 7...
## $ tcp_flags_urg                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_ack                              <int> 662445631, 1496589825, 66...
## $ dst_port                             <int> 443, 60268, 443, 29216, 5...
## $ dst_mac_address                      <chr> "00:11:22:33:44:55", "00:...

count(pcaps, src_ip, dst_ip, sort=TRUE)
## # Source:     lazy query [?? x 3]
## # Database:   DrillConnection
## # Groups:     src_ip
## # Ordered by: desc(n)
##    src_ip         dst_ip             n
##    <chr>          <chr>          <dbl>
##  1 10.254.157.208 10.254.158.25    298
##  2 10.254.158.25  10.254.157.208   204
##  3 174.137.42.81  10.254.157.208    76
##  4 10.254.157.208 10.254.158.8      54
##  5 10.254.158.8   10.254.157.208    49
##  6 74.125.28.102  10.254.157.208    49
##  7 10.254.157.208 74.125.28.102     44
##  8 10.254.157.208 174.137.42.81     41
##  9 54.84.98.25    10.254.157.208    25
## 10 157.55.56.168  10.254.157.208    25
## # ... with more rows

More work appears to be planned by the Drill team to enable digging into the packet (binary) contents.

Drill Metadata As Data

Drill has provided ways to lookup Drill operational information as actual tables but the Drill team has added support for even more metadata-as-data queries.

First up is finally having better access to filesystem information. Prior to 1.15.0 one could get file and path attributes as part of other queries, but now we can treat filesystems as actual data. Let’s list all the PCAPs in the above workspace:

tbl(con, "information_schema.`schemata`") %>% 
  filter(SCHEMA_NAME == "dfs.caps") %>% 
  print() %>% 
  pull(SCHEMA_NAME) -> pcap_schema
## # Source:   lazy query [?? x 9]
## # Database: DrillConnection
##   CATALOG_NAME SCHEMA_NAME SCHEMA_OWNER TYPE  IS_MUTABLE
##   <chr>        <chr>       <chr>        <chr> <chr>     
## 1 DRILL        dfs.caps    <owner>      file  NO

tbl(con, "information_schema.`files`") %>% 
  filter(schema_name == pcap_schema) %>% 
  glimpse()
## Observations: ??
## Variables: 13
## $ SCHEMA_NAME       <chr> "dfs.caps"
## $ ROOT_SCHEMA_NAME  <chr> "dfs"
## $ WORKSPACE_NAME    <chr> "caps"
## $ FILE_NAME         <chr> "dof-short-capture.pcapng"
## $ RELATIVE_PATH     <chr> "dof-short-capture.pcapng"
## $ IS_DIRECTORY      <lgl> FALSE
## $ IS_FILE           <lgl> TRUE
## $ LENGTH            <dbl> 634280
## $ OWNER             <chr> "hrbrmstr"
## $ GROUP             <chr> "staff"
## $ PERMISSION        <chr> "rw-r--r--"
## $ ACCESS_TIME       <dttm> 1969-12-31 19:00:00
## $ MODIFICATION_TIME <dttm> 2019-01-01 19:12:17

The Drill system options table now has full descriptions for the options and also provides a new table that knows about all of Drills functions and all your custom UDFs. drill_opts() and drill_functions() return a data frame of all this info and have an optional browse parameter which, if set to TRUE, will show a DT interactive data table for them. I find this especially handy when I forget something like regexp_like syntax (I use alot of back-ends and many are wildly different) and can now do this:

FIN

Keep on the lookout for the rapidjsonr/BIGINT integration and more new features of the sergeant package. NOTE: The better error messages have been ported over to the sergeant.caffeinated package (the RJDBC interface) and the other niceties will make their way into that package soon as well.

So, make sure you’re using the 0.8.0 GL / GH, kick the tyres, file issues where you’re most comfortable working.

May your queries all be optimized and results sets complete in the new year!

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.