Handling Semantic Version Strings Like a Boss with the semver Package

[This article was first published on R – rud.is, and kindly contributed to R-bloggers.]

I work with internet-scale data and do my fair share of macro-analyses on vulnerabilities. I use the R semver package for most of my work and wanted to blather on a bit about it since it’s super-helpful for this work and doesn’t get the attention it deserves. semver makes it possible to create charts like this:

which are very helpful when conducting exposure analytics.

We’ll need a few packages to help us along the way:

library(here) # file mgmt
library(semver) # the whole purpose of the blog post
library(rvest) # we'll need this to get version->year mappings
library(stringi) # b/c I'm still too lazy to switch to ore
library(hrbrthemes) # pretty graphs
library(tidyverse) # sane data processing idioms

By issuing a stats command to a memcached instance you can get a full list of statistics for the server. The recent newsmaking DDoS used this feature in conjunction with address spoofing to create 30 minutes of chaos for GitHub.

I sent a stats command (followed by a newline) to a vanilla memcached installation and it returned 53 lines (1108 bytes) of STAT results that look something like this:

STAT pid 7646
STAT uptime 141
STAT time 1520447469
STAT version 1.4.25 Ubuntu
STAT libevent 2.0.21-stable
...

The version bit is what we’re after, but there are plenty of other variables you could just as easily focus on if you use memcached in any production capacity.
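As a rough illustration of how that output breaks down, here is a minimal base R sketch that splits the sample lines shown above into name/value pairs. The variable names (`stat_lines`, `stats`) are made up for this example, and it assumes every line has the form `STAT <name> <value>`:

```r
# Sample lines copied from the STAT output shown above
stat_lines <- c(
  "STAT pid 7646",
  "STAT uptime 141",
  "STAT time 1520447469",
  "STAT version 1.4.25 Ubuntu",
  "STAT libevent 2.0.21-stable"
)

# Capture the statistic name and everything after it (assumed line format)
parts <- regmatches(stat_lines, regexec("^STAT ([^ ]+) (.*)$", stat_lines))

# Named character vector: names are the statistics, values are the raw strings
stats <- setNames(
  vapply(parts, function(p) p[3], character(1)),
  vapply(parts, function(p) p[2], character(1))
)

stats[["version"]]
```

Anything after the statistic name is kept verbatim, which is why the version value still carries the trailing distribution tag at this stage.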

I extracted raw version response data from our most recent scan for open memcached servers on the internet. For ethical reasons, I cannot blindly share the entire raw data set but hit up [email protected] if you have a need or desire to work with this data.

Let’s read it in and take a look:

version_strings <- read_lines(here("data", "versions.txt"))

set.seed(2018-03-07) # NB: R evaluates this as 2018 - 3 - 7, so the seed is 2008

sample(version_strings, 50)

##  [1] "STAT version 1.4.5"             "STAT version 1.4.17"           
##  [3] "STAT version 1.4.25"            "STAT version 1.4.31"           
##  [5] "STAT version 1.4.25"            "STAT version 1.2.6"            
##  [7] "STAT version 1.2.6"             "STAT version 1.4.15"           
##  [9] "STAT version 1.4.17"            "STAT version 1.4.4"            
## [11] "STAT version 1.4.5"             "STAT version 1.2.6"            
## [13] "STAT version 1.4.2"             "STAT version 1.4.14 (Ubuntu)"  
## [15] "STAT version 1.4.7"             "STAT version 1.4.39"           
## [17] "STAT version 1.4.4-14-g9c660c0" "STAT version 1.2.6"            
## [19] "STAT version 1.2.6"             "STAT version 1.4.14"           
## [21] "STAT version 1.4.4-14-g9c660c0" "STAT version 1.4.37"           
## [23] "STAT version 1.4.13"            "STAT version 1.4.4"            
## [25] "STAT version 1.4.17"            "STAT version 1.2.6"            
## [27] "STAT version 1.4.37"            "STAT version 1.4.13"           
## [29] "STAT version 1.4.25"            "STAT version 1.4.15"           
## [31] "STAT version 1.4.25"            "STAT version 1.2.6"            
## [33] "STAT version 1.4.10"            "STAT version 1.4.25"           
## [35] "STAT version 1.4.25"            "STAT version 1.4.9"            
## [37] "STAT version 1.4.30"            "STAT version 1.4.21"           
## [39] "STAT version 1.4.15"            "STAT version 1.4.31"           
## [41] "STAT version 1.4.13"            "STAT version 1.2.6"            
## [43] "STAT version 1.4.13"            "STAT version 1.4.15"           
## [45] "STAT version 1.4.19"            "STAT version 1.4.25 Ubuntu"    
## [47] "STAT version 1.4.37"            "STAT version 1.4.4-14-g9c660c0"
## [49] "STAT version 1.2.6"             "STAT version 1.4.25 Ubuntu"

It’s in decent shape, but it needs some work if we’re going to do a version analysis with it. Let’s clean it up a bit:

tibble( # data_frame() is deprecated in current tibble releases
  string = stri_match_first_regex(version_strings, "STAT version (.*)$")[,2]
) -> versions

count(versions, string, sort = TRUE) %>%
  knitr::kable(format="markdown")
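
| string          |    n |
|:----------------|-----:|
| 1.4.15          | 1966 |
| 1.2.6           | 1764 |
| 1.4.17          | 1101 |
| 1.4.37          |  949 |
| 1.4.13          |  725 |
| 1.4.4           |  531 |
| 1.4.25          |  511 |
| 1.4.20          |  368 |
| 1.4.14 (Ubuntu) |  334 |
| 1.4.21          |  309 |
| 1.4.25 Ubuntu   |  290 |
| 1.4.24          |  259 |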

Much better! However, we really only need the major parts of the semantic version string for a macro view, so let’s remove non-version strings completely and extract just the major, minor and patch bits:

filter(versions, !stri_detect_fixed(string, "UNKNOWN")) %>% # get rid of things we can't use
  mutate(string = stri_match_first_regex(
    string, "([[:digit:]]+\\.[[:digit:]]+\\.[[:digit:]]+)")[,2] # for a macro-view, the discrete sub-versions aren't important
  ) -> versions

count(versions, string, sort = TRUE) %>%
  knitr::kable(format="markdown")

| string |    n |
|:-------|-----:|
| 1.4.15 | 1966 |
| 1.2.6  | 1764 |
| 1.4.17 | 1101 |
| 1.4.37 |  949 |
| 1.4.25 |  801 |
| 1.4.4  |  747 |
| 1.4.13 |  727 |
| 1.4.14 |  385 |
| 1.4.20 |  368 |
| 1.4.21 |  309 |
| 1.4.24 |  264 |
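To see what that regular expression does to individual strings, here is a small base R sketch run against a few of the decorated versions from the sample output earlier. It uses `regmatches()` and `regexpr()` rather than stringi purely to stay dependency-free; `raw` and `core` are names invented for the example:

```r
# A few raw version strings taken from the sample output above
raw <- c("1.4.14 (Ubuntu)", "1.4.4-14-g9c660c0", "1.4.25 Ubuntu", "1.2.6")

# Same idea as the stringi call: keep only the first major.minor.patch run
pattern <- "[[:digit:]]+\\.[[:digit:]]+\\.[[:digit:]]+"
core <- regmatches(raw, regexpr(pattern, raw))

core
```

Collapsing the decorated variants is what moves the counts between the two tables: the 511 plain "1.4.25" entries and the 290 "1.4.25 Ubuntu" entries become a single 1.4.25 row of 801.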

Much, much better! Now, let’s dig into the versions a bit. Using semver is dirt-simple. Just use parse_version() to get the usable bits out:

ex_ver <- semver::parse_version(versions$string[1])

ex_ver
## [1] Maj: 1 Min: 4 Pat: 25

str(ex_ver)
## List of 1
##  $ :Class 'svptr' <externalptr> 
##  - attr(*, "class")= chr "svlist"

It’s a special class, referencing an external pointer (the package relies on an underlying C++ library and wraps everything up in a bow for us).

These objects can be compared, ordered, sorted, etc., but I tend to just turn the parsed versions into a data frame that can be associated back with the main strings. That way we keep things pretty tidy and have tons of flexibility.

bind_cols(
  versions,
  pull(versions, string) %>%
    semver::parse_version() %>%
    as.data.frame()
) %>%
  arrange(major, minor, patch) %>%
  mutate(string = factor(string, levels = unique(string))) -> versions

versions
## # A tibble: 11,157 x 6
##    string major minor patch prerelease build
##    <fct>  <int> <int> <int> <chr>      <chr>
##  1 1.2.0      1     2     0 ""         ""   
##  2 1.2.0      1     2     0 ""         ""   
##  3 1.2.5      1     2     5 ""         ""   
##  4 1.2.5      1     2     5 ""         ""   
##  5 1.2.5      1     2     5 ""         ""   
##  6 1.2.5      1     2     5 ""         ""   
##  7 1.2.5      1     2     5 ""         ""   
##  8 1.2.5      1     2     5 ""         ""   
##  9 1.2.5      1     2     5 ""         ""   
## 10 1.2.5      1     2     5 ""         ""   
## # ... with 11,147 more rows
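A quick aside on why the versions get parsed before sorting: plain character sorting puts "1.4.10" ahead of "1.4.9". The sketch below shows the difference using base R's `numeric_version()` as a stand-in for semver (an assumption made so the snippet runs without extra packages; it is not what the code above uses), then builds a factor whose levels follow version order:

```r
v <- c("1.4.9", "1.4.10", "1.2.6", "1.4.4")

# Character sort compares digit by digit, so "1.4.10" lands before "1.4.4"
lexical <- sort(v, method = "radix")

# Version-aware ordering compares major, minor and patch as integers
by_version <- v[order(numeric_version(v))]

# Factor levels in version order, ready for an ordered x axis
f <- factor(v, levels = unique(by_version))

lexical
by_version
levels(f)
```

The `arrange(major, minor, patch)` step above achieves the same ordering from the parsed integer columns.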

Now we have a tidy data frame, and I did the extra step of turning the version strings into a factor whose levels follow version order, since they are ordinal values. With just this step, we have everything we need to do a basic plot showing the version counts in order:

count(versions, string) %>%
  ggplot() +
  geom_segment(
    aes(string, n, xend = string, yend = 0),
    size = 2, color = "lightslategray"
  ) +
  scale_y_comma() +
  labs(
    x = "memcached version", y = "# instances found",
    title = "Distribution of memcached versions"
  ) +
  theme_ipsum_ps(grid = "Y") +
  theme(axis.text.x = element_text(hjust = 1, vjust = 0.5, angle = 90))

[Figure: memcached versions (raw)]

That chart is informative on its own since we get the perspective that there are some really old versions exposed. But, how old are they? Projects like Chrome or Firefox churn through versions regularly/quickly (on purpose). To make more sense out of this we’ll need more info on releases.

This is where things can get ugly for folks who do not have commercial software management databases handy (or are analyzing a piece of software that hasn’t made it to one of those databases yet). The memcached project maintains a wiki page of version history that’s mostly complete, and definitely complete enough for this exercise. It will need some processing before we can associate a version with a year.

GitHub does not allow scraping of their site and, off the top of my head, I do not know if there is a “wiki” API endpoint. I do know that you can tack on .wiki.git to the end of a GitHub repo to clone the wiki pages, so we’ll use that knowledge and the git2r package to gain access to the ReleaseNotes.md file that has the data we need:

td <- tempfile("wiki", fileext="git") # temporary "directory"

dir.create(td)

git2r::clone(
  url = "git@github.com:memcached/memcached.wiki.git",
  local_path = td,
  credentials = git2r::cred_ssh_key() # need GH ssh keys setup!
) -> repo
## cloning into '/var/folders/1w/2d82v7ts3gs98tc6v772h8s40000gp/T//Rtmpb209Sk/wiki180eb3c6addcbgit'...
## Receiving objects:   1% (5/481),    8 kb
## Receiving objects:  11% (53/481),    8 kb
## Receiving objects:  21% (102/481),   49 kb
## Receiving objects:  31% (150/481),   81 kb
## Receiving objects:  41% (198/481),  113 kb
## Receiving objects:  51% (246/481),  177 kb
## Receiving objects:  61% (294/481),  177 kb
## Receiving objects:  71% (342/481),  192 kb
## Receiving objects:  81% (390/481),  192 kb
## Receiving objects:  91% (438/481),  192 kb
## Receiving objects: 100% (481/481),  192 kb, done.

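read_lines(file.path(td, "ReleaseNotes.md")) %>% # td is the clone's local path (repo@path fails on S3-based git2r releases)
  keep(stri_detect_fixed, "[[ReleaseNotes") %>%
  stri_replace_first_regex(" \\* \\[\\[.*]] ", "") %>%
  stri_split_fixed(" ", 2, simplify = TRUE) %>%
  as.data.frame(stringsAsFactors = FALSE) %>% # matrix -> data frame (as_data_frame() is deprecated)
  as_tibble() %>%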
  set_names(c("string", "release_year")) %>%
  mutate(string = stri_trim_both(string)) %>%
  mutate(release_year = stri_replace_first_fixed(release_year, "(", "")) %>% # remove leading parens
  mutate(release_year = stri_replace_all_regex(release_year, "\\-.*$", "")) %>% # we only want year so remove remaining date bits from easy ones
  mutate(release_year = stri_replace_all_regex(release_year, "^.*, ", "")) %>% # take care of most of the rest of the ugly ones
  mutate(release_year = stri_replace_all_regex(release_year, "^[[:alpha:]].* ", "")) %>% # take care of the straggler
  mutate(release_year = stri_replace_last_fixed(release_year, ")", "")) %>% # remove any trailing parens
  mutate(release_year = as.numeric(release_year)) -> memcached_releases # make it numeric

unlink(td, recursive = TRUE) # cleanup the git repo we downloaded

memcached_releases
## # A tibble: 49 x 2
##    string release_year
##    <chr>         <dbl>
##  1 1.5.6          2018
##  2 1.5.5          2018
##  3 1.5.4          2017
##  4 1.5.3          2017
##  5 1.5.2          2017
##  6 1.5.1          2017
##  7 1.5.0          2017
##  8 1.4.39         2017
##  9 1.4.38         2017
## 10 1.4.37         2017
## # ... with 39 more rows
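The chain of replacements above peels the date text away piece by piece. A more compact alternative, sketched here under the assumption that each release entry contains exactly one four-digit year, is to pull that year out directly. The date strings below are hypothetical stand-ins for the formats the comments describe, not lines copied from the wiki:

```r
# Hypothetical date fragments in the styles the cleanup above has to handle
dates <- c("(2018-2-27)", "(October 27, 2009)", "(2 Nov 2011)")

# Grab the first run that looks like a year in the 1900s or 2000s
years <- as.numeric(regmatches(dates, regexpr("(19|20)[0-9]{2}", dates)))

years
```

Note that `regmatches()` silently drops entries with no match, so a length check against the input would be a sensible guard on real data.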

We have more versions in our internet-scraped memcached versions data set than this wiki page has on it, so we need to restrict the official release history to what we have. Then, we only want a single instance of each year for the annotations, so we’ll have to do some further processing:

filter(memcached_releases, string %in% unique(versions$string)) %>%
  mutate(string = factor(string, levels = levels(versions$string))) %>%
  group_by(release_year) %>%
  arrange(desc(string)) %>%
  slice(1) %>%
  ungroup() -> annotation_df

knitr::kable(annotation_df, "markdown")

| string | release_year |
|:-------|-------------:|
| 1.4.4  |         2009 |
| 1.4.5  |         2010 |
| 1.4.10 |         2011 |
| 1.4.15 |         2012 |
| 1.4.17 |         2013 |
| 1.4.22 |         2014 |
| 1.4.25 |         2015 |
| 1.4.33 |         2016 |
| 1.5.4  |         2017 |
| 1.5.6  |         2018 |
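The "one row per year, keeping the newest release" logic can also be sketched in base R. This is only an illustration of the idea, using `numeric_version()` in place of the version-ordered factor and a tiny hand-made `releases` data frame whose rows are illustrative rather than the full wiki data:

```r
# Small illustrative release table (two releases share the year 2011)
releases <- data.frame(
  string = c("1.4.4", "1.4.5", "1.4.9", "1.4.10", "1.4.15"),
  release_year = c(2009, 2010, 2011, 2011, 2012),
  stringsAsFactors = FALSE
)

# Sort newest version first, using version-aware ordering
newest_first <- releases[order(numeric_version(releases$string), decreasing = TRUE), ]

# Keep the first (newest) row seen for each year, then restore year order
latest_per_year <- newest_first[!duplicated(newest_first$release_year), ]
latest_per_year <- latest_per_year[order(latest_per_year$release_year), ]

latest_per_year
```

With version-aware ordering, 1.4.10 wins the 2011 slot over 1.4.9, which a plain character sort would get backwards.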

Now, we’re ready to add the annotation layers! We’ll take a blind stab at it before adding in further aesthetic customization:

version_counts <- count(versions, string) # no piping this time

ggplot() +
  geom_blank(data = version_counts,aes(string, n)) + # prime the scales
  geom_vline(
    data = annotation_df, aes(xintercept = as.numeric(string)),
    size = 0.5, linetype = "dotted", color = "orange"
  ) +
  geom_segment(
    data = version_counts,
    aes(string, n, xend = string, yend = 0),
    size = 2, color = "lightslategray"
  ) +
  geom_label(
    data = annotation_df, aes(string, Inf, label=release_year),
    family = font_ps, size = 2.5, color = "lightslateblue",
    hjust = 0, vjust = 1, label.size = 0
  ) +
  scale_y_comma() +
  labs(
    x = "memcached version", y = "# instances found",
    title = "Distribution of memcached versions"
  ) +
  theme_ipsum_ps(grid = "Y") +
  theme(axis.text.x = element_text(hjust = 1, vjust = 0.5, angle = 90))
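The `as.numeric(string)` in `geom_vline()` works because a factor is stored as integer codes that index its levels, and a discrete x axis places each level at its integer position. A minimal sketch, with a made-up four-level set standing in for the real version levels:

```r
# Hypothetical subset of version levels, already in version order
version_levels <- c("1.2.6", "1.4.4", "1.4.5", "1.4.10")

# Two annotated versions, sharing the same levels as the plotted data
annotated <- factor(c("1.4.4", "1.4.10"), levels = version_levels)

# The integer codes are the x positions of those levels
as.numeric(annotated)
```

This is also why `annotation_df` was given `levels(versions$string)` earlier: the codes only line up with the bars if both data frames share one set of levels.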

Almost got it in ggpar 1! We need to tweak this so that the labels do not overlap each other and do not obstruct the segment bars. We can do most of this work in geom_label() itself, plus add a bit of a tweak to the Y axis scale:

ggplot() +
  geom_blank(data = version_counts,aes(string, n)) + # prime the scales
  geom_vline(
    data = annotation_df, aes(xintercept = as.numeric(string)),
    size = 0.5, linetype = "dotted", color = "orange"
  ) +
  geom_segment(
    data = version_counts,
    aes(string, n, xend = string, yend = 0),
    size = 2, color = "lightslategray"
  ) +
  geom_label(
    data = annotation_df, aes(string, Inf, label=release_year), vjust = 1,
    family = font_ps, size = 2.5, color = "lightslateblue", label.size = 0,
    hjust = c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0),
    nudge_x = c(-0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1, 0.1)
  ) +
  scale_y_comma(limits = c(0, 2050)) +
  labs(
    x = "memcached version", y = "# instances found",
    title = "Distribution of memcached versions"
  ) +
  theme_ipsum_ps(grid = "Y") +
  theme(axis.text.x = element_text(hjust = 1, vjust = 0.5, angle = 90))
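The two hand-written vectors in `geom_label()` move in lockstep: a label that is right-justified (`hjust = 1`) gets nudged left, and a left-justified one gets nudged right. One way to keep them in sync, sketched here with made-up variable names, is to derive the nudge from the justification:

```r
# Justification per annotation, as used in the plot above
hjust_vals <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)

# Right-justified labels shift left of the line, left-justified ones shift right
nudge_vals <- ifelse(hjust_vals == 1, -0.1, 0.1)

nudge_vals
```

Both vectors still have to be chosen by eye for a given data set; this only removes the need to edit two places when one label flips sides.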

Now we have version and year info, so we can get a better idea of the scope of exposure (and just how much technical debt many organizations have accrued).

With the ordinal version information we can perform other statistical operations as well. All due to the semver package.

You can find this R project over at GitHub.
