Making World Tile Grid-Grids

[This article was first published on R – rud.is, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A colleague asked if I would blog about how I crafted the grid of world tile grids in this post and I accepted the challenge. The technique isn’t too hard as it just builds on the initial work by Jon Schwabish and a handy file made by Maarten Lambrechts.

The Premise

For this particular use-case, I sifted through our internet scan data and classified a series of device families from their telnet banners then paired that with our country-level attribution data for each IPv4 address. I’m not generally “a fan” of rolling things up at a country level, but since many (most) of these devices are residential or small/medium-business routers, country-level attribution has some merit.

But, I’m also not a fan of country-level choropleths when it comes to “cyber” nor am I wont to area-skewed cartograms since most folks still cannot interpret them. Both of those take up a ton of screen real estate, too, espeically if you have more than one of them. Yet, I wanted to show a map-like structure without resorting to Hilbert IPv4 heatmaps since they are neither very readable by a general audience and become skewed when you have to move up from a 1 pixel == 1 Class C network block.

I think the tile grid is a great compromise since it avoids the “area”and projection skewness confusion that regular global choropleths cause while still preserving geographic & positional proximity. Sure, they’ll take some getting used to by casual readers, but I felt it was the best of all the tradeoffs.

The Setup

Here’s the data:

library(here)
library(hrbrthemes)
library(tidyverse)

wtg <- read_csv("https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv")

glimpse(wtg)
 
## Observations: 192
## Variables: 11
## $ name            <chr> "Afghanistan", "Albania", "Algeria", "Angola",...
## $ alpha.2         <chr> "AF", "AL", "DZ", "AO", "AQ", "AG", "AR", "AM"...
## $ alpha.3         <chr> "AFG", "ALB", "DZA", "AGO", "ATA", "ATG", "ARG...
## $ country.code    <chr> "004", "008", "012", "024", "010", "028", "032...
## $ iso_3166.2      <chr> "ISO 3166-2:AF", "ISO 3166-2:AL", "ISO 3166-2:...
## $ region          <chr> "Asia", "Europe", "Africa", "Africa", "Antarct...
## $ sub.region      <chr> "Southern Asia", "Southern Europe", "Northern ...
## $ region.code     <chr> "142", "150", "002", "002", NA, "019", "019", ...
## $ sub.region.code <chr> "034", "039", "015", "017", NA, "029", "005", ...
## $ x               <int> 22, 15, 13, 13, 15, 7, 6, 20, 24, 15, 21, 4, 2...
## $ y               <int> 8, 9, 11, 17, 23, 4, 14, 6, 19, 6, 7, 2, 9, 8,...

routers <- read_csv(here::here("data", "routers.csv"))

routers
 
## # A tibble: 453,027 x 3
##    type     country_name           country_code
##    <chr>    <chr>                  <chr>       
##  1 mikrotik Slovak Republic        SK          
##  2 mikrotik Czechia                CZ          
##  3 mikrotik Colombia               CO          
##  4 mikrotik Bosnia and Herzegovina BA          
##  5 mikrotik Czechia                CZ          
##  6 mikrotik Brazil                 BR          
##  7 mikrotik Vietnam                VN          
##  8 mikrotik Brazil                 BR          
##  9 mikrotik India                  IN          
## 10 mikrotik Brazil                 BR          
## # ... with 453,017 more rows

distinct(routers, type) %>% 
  arrange(type) %>% 
  print(n=11)
 
## # A tibble: 11 x 1
##    type    
##    <chr>   
##  1 asus    
##  2 dlink   
##  3 huawei  
##  4 linksys 
##  5 mikrotik
##  6 netgear 
##  7 qnap    
##  8 tplink  
##  9 ubiquiti
## 10 upvel   
## 11 zte

So, we have 11 different device families under assault by “VPNFilter”
and I wanted to show the global distribution of them. Knowing the
compact world tile grid would facet well, I set off to make it happen.

Let’s get some decent names for facet labels:

real_names <- read_csv(here::here("data", "real_names.csv"))

real_names
 
## # A tibble: 11 x 2
##    type     lab             
##    <chr>    <chr>           
##  1 asus     Asus Device     
##  2 dlink    D-Link Devices  
##  3 huawei   Huawei Devices  
##  4 linksys  Linksys Devices 
##  5 mikrotik Mikrotik Devices
##  6 netgear  Netgear Devices 
##  7 qnap     QNAP Devices    
##  8 tplink   TP-Link Devices 
##  9 ubiquiti Ubiquiti Devices
## 10 upvel    Upvel Devices   
## 11 zte      ZTE Devices

Next, we need to summarise our scan results and pair it up the world
tile grid data and our real names:

count(routers, country_code, type) %>%  # summarise the data into # of device familes per country
  left_join(wtg, by = c("country_code" = "alpha.2")) %>% # join them up on the common field
  filter(!is.na(alpha.3)) %>% # we only want countries on the grid and maxmind attributes some things to meta-regions and anonymous proxies
  left_join(real_names) -> wtg_routers

glimpse(wtg_routers)

## Observations: 629
## Variables: 14
## $ country_code    <chr> "AE", "AE", "AE", "AF", "AF", "AF", "AG", "AL"...
## $ type            <chr> "asus", "huawei", "mikrotik", "huawei", "mikro...
## $ n               <int> 1, 12, 70, 12, 264, 27, 1, 941, 2081, 7, 2, 1,...
## $ name            <chr> "United Arab Emirates", "United Arab Emirates"...
## $ alpha.3         <chr> "ARE", "ARE", "ARE", "AFG", "AFG", "AFG", "ATG...
## $ country.code    <chr> "784", "784", "784", "004", "004", "004", "028...
## $ iso_3166.2      <chr> "ISO 3166-2:AE", "ISO 3166-2:AE", "ISO 3166-2:...
## $ region          <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia"...
## $ sub.region      <chr> "Western Asia", "Western Asia", "Western Asia"...
## $ region.code     <chr> "142", "142", "142", "142", "142", "142", "019...
## $ sub.region.code <chr> "145", "145", "145", "034", "034", "034", "029...
## $ x               <int> 20, 20, 20, 22, 22, 22, 7, 15, 15, 15, 20, 20,...
## $ y               <int> 10, 10, 10, 8, 8, 8, 4, 9, 9, 9, 6, 6, 6, 6, 1...
## $ lab             <chr> "Asus Device", "Huawei Devices", "Mikrotik Dev...

Then, plot it:

ggplot(wtg_routers, aes(x, y, fill=n, group=lab)) +
  geom_tile(color="#b2b2b2", size=0.125) +
  scale_y_reverse() +
  viridis::scale_fill_viridis(name="# Devices", trans="log10", na.value="white", label=scales::comma) +
  facet_wrap(~lab, ncol=3) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "World Tile Grid Per-country Concentration of\nSeriously Poorly Configured Network Devices",
    subtitle = "Device discovery based on in-scope 'VPNFilter' vendor device banner strings",
    caption = "Source: Rapid7 Project Sonar & Censys"
  ) +
  theme_ipsum_rc(grid="") +
  theme(panel.background = element_rect(fill="#969696", color="#969696")) +
  theme(axis.text=element_blank()) +
  theme(legend.direction="horizontal") +
  theme(legend.key.width = unit(2, "lines")) +
  theme(legend.position=c(0.85, 0.1))

 

Doh! We forgot to ensure we had data for every country. Let’s try that
again:

count(routers, country_code, type) %>%
  complete(country_code, type) %>%
  filter(!is.na(country_code)) %>%
  left_join(wtg, c("country_code" = "alpha.2")) %>%
  filter(!is.na(alpha.3)) %>%
  left_join(real_names) %>%
  complete(country_code, type, x=unique(wtg$x), y=unique(wtg$y)) %>%
  filter(!is.na(lab)) %>%
  ggplot(aes(x, y, fill=n, group=lab)) +
  geom_tile(color="#b2b2b2", size=0.125) +
  scale_y_reverse() +
  viridis::scale_fill_viridis(name="# Devices", trans="log10", na.value="white", label=scales::comma) +
  facet_wrap(~lab, ncol=3) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "World Tile Grid Per-country Concentration of\nSeriously Poorly Configured Network Devices",
    subtitle = "Device discovery based on in-scope 'VPNFilter' vendor device banner strings",
    caption = "Source: Rapid7 Project Sonar & Censys"
  ) +
  theme_ipsum_rc(grid="") +
  theme(panel.background = element_rect(fill="#969696", color="#969696")) +
  theme(axis.text=element_blank()) +
  theme(legend.direction="horizontal") +
  theme(legend.key.width = unit(2, "lines")) +
  theme(legend.position=c(0.85, 0.1))

 

That’s better.

We take advantage of ggplot2’s ability to facet and just ensure we have
complete (even if NA) tiles for each panel.

Once consumers start seeing these used more they’ll be able to pick up
key markers (or one of us will come up with a notation that makes key
markers more visible) and be able to get specific information from the
chart. I just wanted to show regional and global differences between
vendors (and really give MikroTik users a swift kick in the patootie for
being so bad with their kit).

FIN

You can find the RStudio project (code + data) here: (http://rud.is/dl/tile-grid-grid.zip)

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)