Reading PCAP Files with Apache Drill and the sergeant R Package
It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes access to diverse data sources down to ANSI SQL calls, which means that I can pull data from Parquet, Hive, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also means that I get access to all those platforms in R centrally through the sergeant package that rests atop d[b]plyr. It further means that when support for a new file type is added, I get that same functionality without any extra effort.
Why am I calling this out?
Well, the intrepid Drill developers are in the process of finalizing the release candidate for version 1.11.0, and one feature they’ve added is the ability to query individual PCAP files, and entire directories full of them, from within Drill. While I provided a link to the Wikipedia article on PCAP files, the TL;DR is that PCAP is an optimized binary file format for recording network activity. If you’re on macOS or a Linux-ish system go do something like this:
sudo tcpdump -ni en0 -s0 -w capture01.pcap
And, wait a bit.
NOTE: Some of you may have to change the en0 to your main network interface name (a quick Google search for your platform should get you to the right one to use).
That command will passively record all network activity on your system until you ctrl-c it. The longer it runs, the larger the capture file gets.
When you’ve recorded a minute or two of packets, ctrl-c the program and then try to look at the PCAP file. It’s a binary mess. You can re-read it with tcpdump or Wireshark, and there are many C[++] libraries and other utilities that can read these files. You can even convert them to CSV or XML, but PCAP files themselves require custom tools to work with effectively. I had started creating crafter to work with these files, but my use case/project dried up and I haven’t gone back to it.
Adding the capability into Drill means I don’t really have to work any further on that specialized package as I can do this:
library(sergeant)
library(iptools)
library(tidyverse)
library(cymruservices)

db <- src_drill("localhost")

my_pcaps <- tbl(db, "dfs.caps.`/capture02.pcap`")

glimpse(my_pcaps)
## Observations: 25
## Variables: 12
## $ src_ip          <chr> "192.168.10.100", "54.159.166.81", "192.168.10...
## $ src_port        <int> 60025, 443, 60025, 443, 60025, 58976, 443, 535...
## $ tcp_session     <dbl> -2.082796e+17, -2.082796e+17, -2.082796e+17, -...
## $ packet_length   <int> 129, 129, 66, 703, 66, 65, 75, 364, 65, 65, 75...
## $ data            <chr> "...g9B..c.<..O..@=,0R.`........K..EzYd=.........
## $ src_mac_address <chr> "78:4F:43:77:02:00", "D4:8C:B5:C9:6C:1B", "78:...
## $ dst_port        <int> 443, 60025, 443, 60025, 443, 443, 58976, 5353,...
## $ type            <chr> "TCP", "TCP", "TCP", "TCP", "TCP", "UDP", "UDP...
## $ dst_ip          <chr> "54.159.166.81", "192.168.10.100", "54.159.166...
## $ dst_mac_address <chr> "D4:8C:B5:C9:6C:1B", "78:4F:43:77:02:00", "D4:...
## $ network         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ timestamp       <dttm> 2017-07-27 23:54:58, 2017-07-27 23:54:59, 201...

summarise(my_pcaps, max = max(timestamp), min = min(timestamp)) %>%
  collect() %>%
  summarise(max - min)
## # A tibble: 1 x 1
##     `max - min`
##          <time>
## 1 1.924583 mins

count(my_pcaps, type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type     n
##   <chr> <int>
## 1   TCP  4974
## 2   UDP   774

filter(my_pcaps, type=="TCP") %>%
  count(dst_port, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   DrillConnection
## # Ordered by: desc(n)
##    dst_port     n
##       <int> <int>
##  1      443  2580
##  2    56202   476
##  3    56229   226
##  4    56147   169
##  5    56215   103
##  6    56143    94
##  7    56085    89
##  8    56203    56
##  9    56205    39
## 10    56209    39
## # ... with more rows

filter(my_pcaps, type=="TCP") %>%
  count(dst_ip, sort=TRUE) %>%
  collect() -> dst_ips

filter(dst_ips, !is.na(dst_ip)) %>%
  left_join(ips_in_cidrs(.$dst_ip, c("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")),
            by = c("dst_ip"="ips")) %>%
  filter(!in_cidr) %>%
  left_join(distinct(bulk_origin(.$dst_ip), ip, .keep_all=TRUE), c("dst_ip" = "ip")) %>%
  select(dst_ip, n, as_name)
## # A tibble: 37 x 3
##            dst_ip     n                              as_name
##             <chr> <int>                                <chr>
##  1   104.244.42.2   862           TWITTER - Twitter Inc., US
##  2 104.244.46.103   556           TWITTER - Twitter Inc., US
##  3  104.20.60.241   183 CLOUDFLARENET - CloudFlare, Inc., US
##  4     31.13.80.8   160         FACEBOOK - Facebook, Inc., US
##  5  52.218.160.76   100      AMAZON-02 - Amazon.com, Inc., US
##  6  104.20.59.241    79 CLOUDFLARENET - CloudFlare, Inc., US
##  7  52.218.160.92    66      AMAZON-02 - Amazon.com, Inc., US
##  8  199.16.156.81    58           TWITTER - Twitter Inc., US
##  9 104.244.42.193    47           TWITTER - Twitter Inc., US
## 10  52.86.113.212    42     AMAZON-AES - Amazon.com, Inc., US
## # ... with 27 more rows
No custom R code. No modification to the sergeant package. Just query it like any other data source.
One really cool part of this is that, while similar functionality has been available in various Hadoop contexts for a few years, we’re doing this query from a local file system outside of a Hadoop context.
I had to add "pcap": { "type": "pcap" } to the formats section of the dfs storage configuration (#ty to the Drill community for helping me figure that out), and I set up a directory that defaults to the pcap type. But after that, it just works.
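For reference, here is a minimal sketch of what the relevant parts of a dfs storage configuration could look like. The caps workspace name matches the one used in the query above; the location path is a placeholder, and the other workspaces and formats of a stock dfs configuration are left out, so treat this as an illustrative fragment and not necessarily the exact configuration used for this post.

```json
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "caps": {
      "location": "/path/to/pcaps",
      "writable": false,
      "defaultInputFormat": "pcap"
    }
  },
  "formats": {
    "pcap": {
      "type": "pcap"
    }
  }
}
```

The defaultInputFormat entry is what lets files in that directory be read as PCAP without any per-query hints.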
Well, kinda.
The Java code that the plugin is based on doesn’t like busted PCAP files (which we get quite a bit of in infosec- & honeypot-lands) and it seems to bork on IPv6 packets a bit. And, my sergeant package (for now) can’t do much with the data component (nor can Drill proper). But, it’s a great start and I can use it to do bulk parquet file creation of basic protocols & connection information or take a quick look at some honeypot captures whenever I need to, right from R, without converting them first.
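As a rough sketch of the bulk parquet idea: Drill can write query results to a new table with CREATE TABLE ... AS, and parquet is its default output format unless store.format has been changed. The helper below only builds that SQL string. The column names come from the glimpse() output above; the helper name, the dfs.tmp target workspace and the table name are assumptions made for illustration.

```r
# Hypothetical helper: build a Drill CTAS statement that copies the basic
# connection columns out of a PCAP file into a new table.
# Column names are backtick-quoted because some (e.g. timestamp) are
# reserved words in Drill SQL.
pcap_ctas_sql <- function(pcap_path, target,
                          cols = c("timestamp", "type", "src_ip", "src_port",
                                   "dst_ip", "dst_port", "packet_length")) {
  sprintf(
    "CREATE TABLE %s AS SELECT %s FROM dfs.caps.`%s`",
    target,
    paste(sprintf("`%s`", cols), collapse = ", "),
    pcap_path
  )
}

# Target workspace and table name are placeholders.
sql <- pcap_ctas_sql("/capture02.pcap", "dfs.tmp.`capture02_conns`")
cat(sql, "\n")

# Running the statement needs a live Drill instance and a writable target
# workspace, so it is left commented out here:
# library(sergeant)
# dc <- drill_connection("localhost")
# drill_query(dc, sql)
```

Leaving the data column out of the selection keeps the resulting table to the protocol and connection fields that are already usable today.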
Drill 1.11.0 is only at RC0 right now, so some of these issues may be gone by the time the full release is baked. Some fixes may have to wait for 1.12.0. And, much work needs to be done on both the UDF side and the sergeant side to help make the data element more useful.
Even with the issues and limitations, this is an amazing new feature in an incredibly useful tool, and many thanks go out to the Drill dev team for sneaking it into 1.11.0.
If you have cause to work with PCAP files, give this a go and see if it helps speed up parts of your workflow.