Measuring & Monitoring Internet Speed with R

hrbrmstr

4 years ago

[This article was first published on R – rud.is, and kindly contributed to R-bloggers].

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Working remotely has many benefits, but if you work remotely in an area like, say, rural Maine, one of those benefits is not massively speedy internet connections. Being able to go fast and furious on the internet is one of the many things I miss about our time in Seattle and it is unlikely that we’ll be seeing Google Fiber in my small town any time soon. One other issue is that residential plans from evil giants like Comcast come with things like “bandwidth caps”. I suspect many WFH-ers can live within those limits, but I work with internet-scale data and often shunt extracts or whole datasets to the DatCave™ server farm for local processing. As such, I pay an extra penalty as a Comcast “Business-class” user that has little benefit besides getting slightly higher QoS and some faster service response times when there are issues.

Why go into all that? Well, I recently decided to bump up the connection from 100 Mb/s to 150 Mb/s (and managed to do so w/o increasing the overall monthly bill at all) but wanted to have a way to regularly verify I am getting what I’m paying for without having to go to an interactive “speed test” site.

There’s a handy speedtest-cli, which is a Python-based module with a command-line interface that can perform speed tests against Ookla’s legacy server. (You’ll notice that link forwards you to their new socket-based test service; neither speedtest-cli nor the code in the package being shown here uses that yet.)

I run plenty of ruby, python, node and go (et al) programs on the command-line, but I wanted a way to measure this in R-proper (storing individual test results) on a regular basis as well as run something from the command-line whenever I wanted an interactive test. Thus begat speedtest.

Testing the Need for Speed

After installing the package with devtools::install_github("hrbrmstr/speedtest"), you can either type speedtest::spd_test() at an R console or run Rscript --quiet -e 'speedtest::spd_test()' on the command-line to see something like the following:

What you see there is a short-hand version of what’s available in the package:

spd_best_servers: Find “best” servers (latency-wise) from master server list
spd_closest_servers: Find “closest” servers (geography-wise) from master server list
spd_compute_bandwidth: Compute bandwidth from bytes transferred and time taken
spd_config: Retrieve client configuration information for the speedtest
spd_download_test: Perform a download speed/bandwidth test
spd_servers: Retrieve a list of SpeedTest servers
spd_upload_test: Perform an upload speed/bandwidth test
spd_test: Test your internet speed/bandwidth
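
As a rough illustration of the arithmetic behind a function like spd_compute_bandwidth, the sketch below converts bytes transferred and elapsed seconds into megabits per second. The helper name bandwidth_mbps and the decimal (1e6) megabit convention are assumptions made for illustration; this is not the package's actual implementation.

```r
# Hypothetical helper (not part of the speedtest package): convert a byte
# count and an elapsed time in seconds into megabits per second.
# Assumes decimal megabits (1 Mb = 1e6 bits).
bandwidth_mbps <- function(bytes, seconds) {
  stopifnot(all(seconds > 0))
  (bytes * 8) / seconds / 1e6
}

# 12.5 million bytes moved in one second works out to 100 Mb/s
bandwidth_mbps(12.5e6, 1)
```

The function is vectorised, so a whole column of byte counts and timings can be converted in one call.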

The general idiom is to grab the test configuration file, collect the master list of servers, figure out which servers you’re going to test against and perform upload/download tests + collect the resultant statistics:

library(speedtest)
library(stringi)
library(hrbrthemes)
library(ggbeeswarm)
library(tidyverse)

config <- spd_config()

servers <- spd_servers(config=config)
closest_servers <- spd_closest_servers(servers, config=config)
only_the_best_severs <- spd_best_servers(closest_servers, config)

glimpse(spd_download_test(closest_servers[1,], config=config))
## Observations: 1
## Variables: 15
## $ url     <chr> "http://speed0.xcelx.net/speedtest/upload.php"
## $ lat     <dbl> 42.3875
## $ lng     <dbl> -71.1
## $ name    <chr> "Somerville, MA"
## $ country <chr> "United States"
## $ cc      <chr> "US"
## $ sponsor <chr> "Axcelx Technologies LLC"
## $ id      <chr> "5960"
## $ host    <chr> "speed0.xcelx.net:8080"
## $ url2    <chr> "http://speed1.xcelx.net/speedtest/upload.php"
## $ min     <dbl> 14.40439
## $ mean    <dbl> 60.06834
## $ median  <dbl> 55.28457
## $ max     <dbl> 127.9436
## $ sd      <dbl> 34.20695

glimpse(spd_upload_test(only_the_best_severs[1,], config=config))
## Observations: 1
## Variables: 18
## $ ping_time      <dbl> 0.02712567
## $ total_time     <dbl> 0.059917
## $ retrieval_time <dbl> 2.3e-05
## $ url            <chr> "http://speed0.xcelx.net/speedtest/upload.php"
## $ lat            <dbl> 42.3875
## $ lng            <dbl> -71.1
## $ name           <chr> "Somerville, MA"
## $ country        <chr> "United States"
## $ cc             <chr> "US"
## $ sponsor        <chr> "Axcelx Technologies LLC"
## $ id             <chr> "5960"
## $ host           <chr> "speed0.xcelx.net:8080"
## $ url2           <chr> "http://speed1.xcelx.net/speedtest/upload.php"
## $ min            <dbl> 6.240858
## $ mean           <dbl> 9.527599
## $ median         <dbl> 9.303148
## $ max            <dbl> 12.56686
## $ sd             <dbl> 2.451778

Spinning More Wheels

The default operation for each test function is to return summary statistics. If you dig into the package source, or look at how the legacy measuring site performs the speed test tasks, you’ll see that the process involves uploading and downloading a series of files (or, more generally, random data) of increasing size and calculating how long it takes to perform those actions. Smaller size tests will have skewed results on fast connections with low latency, which is why the sites do the incremental tests (and why you see the “needles” move the way they do on interactive tests). We can disable summary statistics and retrieve all of the results of the individual tests (as we’ll do in a bit).
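
To make the payload-size effect concrete, here is a toy model, not anything taken from the package: it assumes every transfer pays a fixed overhead (latency, connection setup) before data flows at the full link rate. The function name apparent_mbps, the 150 Mb/s link rate and the 50 ms overhead are all assumptions chosen for illustration.

```r
# Simplified model (an assumption, not how the package measures anything):
# each transfer pays a fixed overhead before data moves at the link rate.
apparent_mbps <- function(size_bytes, link_mbps = 150, overhead_sec = 0.05) {
  bits <- size_bytes * 8
  transfer_sec <- bits / (link_mbps * 1e6)   # time spent actually moving data
  bits / (overhead_sec + transfer_sec) / 1e6 # what a stopwatch would report
}

# Payloads from 100 KB up to 100 MB
sizes <- c(1e5, 1e6, 1e7, 1e8)
data.frame(size_bytes = sizes,
           apparent_mbps = round(apparent_mbps(sizes), 1))
```

In this model the smallest payload reports only a fraction of the link rate while the largest gets close to it, which is one way to see why a single small transfer is a poor estimate on a fast connection.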

Speed tests are also highly dependent on the processing capability of the target server as well as the network location (hops away), “quality” and load. You’ll often see interactive sites choose the “best” server. There are many ways to do that. One is geography, which has some merit, but should not be the only factor used. Another is latency, which is comparing a small connection test against each server and measuring which one comes back the fastest.
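
The two selection strategies can be sketched in base R. This is only an illustration of the idea, not the logic inside spd_closest_servers or spd_best_servers: the haversine_km helper, the client location, the latency figures and all coordinates other than Somerville's (which appears in the output above) are made up or approximate.

```r
# Great-circle distance in kilometres via the haversine formula.
haversine_km <- function(lat1, lng1, lat2, lng2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical candidates: approximate coordinates, invented latencies.
candidates <- data.frame(
  name = c("Somerville, MA", "Norwood, MA", "Providence, RI"),
  lat = c(42.3875, 42.19, 41.82),
  lng = c(-71.1, -71.20, -71.41),
  latency_ms = c(27, 22, 30),
  stringsAsFactors = FALSE
)

# Hypothetical client location (roughly southern Maine)
client <- list(lat = 43.66, lng = -70.26)

candidates$dist_km <- haversine_km(client$lat, client$lng,
                                   candidates$lat, candidates$lng)

closest <- candidates[order(candidates$dist_km), ]   # geography-wise
best <- candidates[order(candidates$latency_ms), ]   # latency-wise

closest$name[1]
best$name[1]
```

With these invented numbers the nearest server and the lowest-latency server are different, which is the reason geography alone should not drive the choice.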

It’s straightforward to show the impact of each with a simple test harness. We’ll pick the 3 geographically “closest” servers, the 3 “best” latency servers (those may overlap with “closest”) and 3 randomly chosen ones:

set.seed(8675309)

bind_rows(

  closest_servers[1:3,] %>%
    mutate(type="closest"),

  only_the_best_severs[1:3,] %>%
    mutate(type="best"),

  filter(servers, !(id %in% c(closest_servers[1:3,]$id, only_the_best_severs[1:3,]$id))) %>%
    sample_n(3) %>%
    mutate(type="random")

) -> to_compare

select(to_compare, sponsor, name, country, host, type)
## # A tibble: 9 x 5
##                   sponsor            name       country
##                     <chr>           <chr>         <chr>
## 1 Axcelx Technologies LLC  Somerville, MA United States
## 2                 Comcast      Boston, MA United States
## 3            Starry, Inc.      Boston, MA United States
## 4 Axcelx Technologies LLC  Somerville, MA United States
## 5 Norwood Light Broadband     Norwood, MA United States
## 6       CCI - New England  Providence, RI United States
## 7                 PirxNet         Gliwice        Poland
## 8           Interoute VDC Los Angeles, CA United States
## 9                   UNPAD         Bandung     Indonesia
## # ... with 2 more variables: host <chr>, type <chr>

As predicted, there are some duplicates, but we’ll perform upload and download tests for each of them and compare the results. First, download:

map_df(1:nrow(to_compare), ~{
  spd_download_test(to_compare[.x,], config=config, summarise=FALSE, timeout=30)
}) -> dl_results_full

mutate(dl_results_full, type=stri_trans_totitle(type)) %>%
  ggplot(aes(type, bw, fill=type)) +
  geom_quasirandom(aes(size=size, color=type), width=0.15, shape=21, stroke=0.25) +
  scale_y_continuous(expand=c(0,5), breaks=seq(0,200,50),
                     labels=c(sprintf("%s", seq(0,150,50)), "200 Mb/s"), limits=c(0,200)) +
  scale_size(range=c(2,6)) +
  scale_color_manual(values=c(Random="#b2b2b2", Best="#2b2b2b", Closest="#2b2b2b")) +
  scale_fill_ipsum() +
  labs(x=NULL, y=NULL, title="Download bandwidth test by selected server type",
       subtitle="Circle size scaled by size of file used in that speed test") +
  theme_ipsum_rc(grid="Y") +
  theme(legend.position="none")

I’m going to avoid hammering really remote servers with the upload test (truth-be-told I just don’t want to wait as long as I know it’s going to take):

bind_rows(
  closest_servers[1:3,] %>% mutate(type="closest"),
  only_the_best_severs[1:3,] %>% mutate(type="best")
) %>%
  distinct(.keep_all=TRUE) -> to_compare

select(to_compare, sponsor, name, country, host, type)
## # A tibble: 6 x 5
##                   sponsor           name       country
##                     <chr>          <chr>         <chr>
## 1 Axcelx Technologies LLC Somerville, MA United States
## 2                 Comcast     Boston, MA United States
## 3            Starry, Inc.     Boston, MA United States
## 4 Axcelx Technologies LLC Somerville, MA United States
## 5 Norwood Light Broadband    Norwood, MA United States
## 6       CCI - New England Providence, RI United States
## # ... with 2 more variables: host <chr>, type <chr>

map_df(1:nrow(to_compare), ~{
  spd_upload_test(to_compare[.x,], config=config, summarise=FALSE, timeout=30)
}) -> ul_results_full

ggplot(ul_results_full, aes(x="Upload Test", y=bw)) +
  geom_quasirandom(aes(size=size, fill="col"), width=0.1, shape=21, stroke=0.25, color="#2b2b2b") +
  scale_y_continuous(expand=c(0,0.5), breaks=seq(0,16,4),
                     labels=c(sprintf("%s", seq(0,12,4)), "16 Mb/s"), limits=c(0,16)) +
  scale_size(range=c(2,6)) +
  scale_fill_ipsum() +
  labs(x=NULL, y=NULL, title="Upload bandwidth test by selected server type",
       subtitle="Circle size scaled by size of file used in that speed test") +
  theme_ipsum_rc(grid="Y") +
  theme(legend.position="none")

As indicated, data payload size definitely impacts the speed tests (as does “proximity”), but you’ll also see variance if you run the same tests again with the same target servers. Note, also, that the “command-line” ready function tries to make the quickest target selection and only chooses the max values.
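
Since results vary from run to run, regular monitoring comes down to logging each measurement and summarising the log later. The sketch below is one way to do that in base R; log_speed and summarise_log are hypothetical helpers (not part of the package), the CSV layout is an assumption, and the numbers in the usage lines are made up.

```r
# Hypothetical helper: append one timestamped measurement to a CSV log.
log_speed <- function(path, direction, mbps, when = Sys.time()) {
  row <- data.frame(
    timestamp = format(when, "%Y-%m-%d %H:%M:%S"),
    direction = direction,
    mbps = mbps,
    stringsAsFactors = FALSE
  )
  new_file <- !file.exists(path)
  # Write the header only when the file is first created
  write.table(row, path, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file)
  invisible(row)
}

# Summarise a log with the same statistics the test functions report.
summarise_log <- function(path) {
  log <- read.csv(path, stringsAsFactors = FALSE)
  do.call(rbind, lapply(split(log, log$direction), function(d) {
    data.frame(direction = d$direction[1], n = nrow(d),
               min = min(d$mbps), mean = mean(d$mbps),
               median = median(d$mbps), max = max(d$mbps), sd = sd(d$mbps))
  }))
}

# Usage with made-up numbers
log_file <- tempfile(fileext = ".csv")
log_speed(log_file, "download", 127.9)
log_speed(log_file, "download", 118.4)
log_speed(log_file, "upload", 12.6)
summarise_log(log_file)
```

In practice the mbps value would come from a test result (for instance one of the summary columns shown earlier), and the logging script could be run on a schedule with cron or a similar tool.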

FIN

The github site has a list of TODOs and any/all are welcome to pile on (full credit for any contributions, as usual). So kick the tyres, file issues and start measuring!

