ggplot “Doodling” with HIBP Breaches
[This article was first published on R – rud.is, and kindly contributed to R-bloggers].
After reading this interesting analysis of “How Often Are Americans’ Accounts Breached?” by Gaurav Sood (which we need more of in cyber-land), I gave in to the impulse to do some gg-doodling with the “Have I Been Pwned” (HIBP) JSON data he used.
It’s just some basic data manipulation with some heavy ggplot2 styling customization, so no real need for exposition beyond noting that there are many other ways to view the data. I just settled on centered segments early on and went from there. If you do a bit of gg-doodling yourself, drop a note in the comments with a link.
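For anyone who wants the centered-segment idea in isolation, here is a minimal sketch. It assumes only ggplot2 and uses a made-up `toy` data frame (the names `toy`, `when`, `size`, and `half` are hypothetical and not from the HIBP data): each value is halved so that its segment runs from +half to -half around zero.

```r
library(ggplot2)

# Hypothetical toy data (NOT from HIBP): three made-up events, each with a size
toy <- data.frame(
  when = as.Date(c("2015-03-01", "2016-06-15", "2017-09-30")),
  size = c(10, 40, 25)
)

# Half of each size, so every segment spans +size/2 down to -size/2
toy$half <- toy$size / 2

# One vertical segment per event, centered on y = 0
p <- ggplot(toy) +
  geom_segment(aes(x = when, y = half, xend = when, yend = -half)) +
  scale_x_date(name = NULL, date_breaks = "1 year", date_labels = "%Y") +
  scale_y_continuous(name = NULL)

p
```

The total segment length still equals the original value; only the baseline moves from the bottom of the panel to its middle.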
You can see a full-size version of the image via this link.
library(hrbrthemes) # use github or gitlab version
library(tidyverse)

# get the data
dat_url <- "https://raw.githubusercontent.com/themains/pwned/master/data/breaches.json"

jsonlite::fromJSON(dat_url) %>%
  mutate(BreachDate = as.Date(BreachDate)) %>%
  as_tibble() -> breaches # as_tibble() replaces the deprecated tbl_df()

# selected breach labels df
group_by(breaches, year = lubridate::year(BreachDate)) %>%
  top_n(1, wt=PwnCount) %>%
  ungroup() %>%
  filter(year %in% c(2008, 2015, 2016, 2017)) %>% # pick years where labels will fit nicely
  mutate(
    lab = sprintf("%s\n%sM accounts", Name, as.integer(PwnCount/1000000))
  ) %>%
  arrange(year) -> labs

# num of known breaches in that year for labels
count(breaches, year = lubridate::year(BreachDate)) %>%
  mutate(nlab = sprintf("n=%s", n)) %>%
  mutate(lab_x = as.Date(sprintf("%s-07-02", year))) -> year_cts

mutate(breaches, p_half = PwnCount/2) %>% # for centered segments
  ggplot() +
  geom_segment( # centered segments
    aes(BreachDate, p_half, xend=BreachDate, yend=-p_half),
    color = ft_cols$yellow, size = 0.3
  ) +
  geom_text( # selected breach labels
    data = labs, aes(BreachDate, PwnCount/2, label = lab),
    lineheight = 0.875, size = 3.25, family = font_rc,
    hjust = c(0, 1, 1, 0), vjust = 1, nudge_x = c(25, -25, -25, 25),
    nudge_y = 0, color = ft_cols$slate
  ) +
  geom_text( # top year labels
    data = year_cts, aes(lab_x, Inf, label = year), family = font_rc,
    size = 4, vjust = 1, lineheight = 0.875, color = ft_cols$gray
  ) +
  geom_text( # bottom known breach count totals
    data = year_cts, aes(lab_x, -Inf, label = nlab, size = n),
    vjust = 0, lineheight = 0.875, color = ft_cols$peach,
    family = font_rc, show.legend = FALSE
  ) +
  scale_x_date( # break on year
    name = NULL, date_breaks = "1 year", date_labels = "%Y"
  ) +
  scale_y_comma(name = NULL, limits = c(-450000000, 450000000)) + # make room for labels
  scale_size_continuous(range = c(3, 4.5)) + # tolerable font sizes
  labs(
    title = "HIBP (Known) Breach Frequency & Size",
    subtitle = "Segment length is number of accounts; n=# known account breaches that year",
    caption = "Source: HIBP via "
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_blank()) +
  theme(axis.text.x = element_blank())
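The label-selection step above (largest breach per year, then a two-line label) can be tried on its own without downloading anything. The sketch below is an illustration under assumptions, not the post's exact code: the `toy` rows are invented, the year is extracted with base R's `format()` instead of `lubridate::year()`, and `slice_max()` is swapped in as the newer dplyr counterpart of `top_n()`.

```r
library(dplyr)

# Hypothetical stand-in for the breaches data (made-up names and counts)
toy <- tibble(
  Name = c("a", "b", "c", "d"),
  BreachDate = as.Date(c("2016-01-10", "2016-08-20", "2017-02-05", "2017-11-30")),
  PwnCount = c(5e6, 120e6, 30e6, 2e6)
)

toy_labs <- toy %>%
  # derive the year and group on it in one step
  group_by(year = as.integer(format(BreachDate, "%Y"))) %>%
  # keep the single largest breach in each year
  slice_max(PwnCount, n = 1) %>%
  ungroup() %>%
  # two-line label: name, then size in whole millions of accounts
  mutate(lab = sprintf("%s\n%sM accounts", Name, as.integer(PwnCount / 1e6))) %>%
  arrange(year)

toy_labs
```

With this toy input, the rows kept are "b" for 2016 and "c" for 2017. The year filter in the post is purely cosmetic (it keeps the years whose labels fit), so it is left out here.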