Sow the seeds, know the seeds

[This article was first published on Maëlle, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When you do simulations, for instance in R, e.g. drawing samples from a distribution, it’s best to set a random seed via the function set.seed in order to have reproducible results. The function has no default value. I think I mostly use set.seed(1). Last week I received an R script from a colleague in which he used a weird number in set.seed (maybe a phone number? or maybe he let his fingers type randomly?), which made me curious about the usual seed values. As in my blog post about initial commit messages I used the Github API via the gh package to get a very rough answer (an answer seedling from the question seed?).

From Github API search endpoint you can get up to 1,000 results corresponding to a query which in the case of set.seed occurrences in R code isn’t the whole picture but hopefully a good sample. I wrote a function to treat the output of a query to the API where I take advantage of the stringr package. I just want the thing inside set.seed() from the text matches returned by the API.

get_seeds_from_matches <- function(item){
  url <- item$html_url
  matches <- item$text_matches
  matches <- unlist(lapply(matches, "[[", "fragment"))
  matches <- stringr::str_split(matches, "\\\n", simplify = TRUE)
  matches <- stringr::str_extract(matches, "set\\.seed\\(.*\\)")
  matches <- stringr::str_replace(matches, "set\\.seed\\(", "")
  seeds <- stringr::str_replace(matches, "\\).*", "")
  seeds <- seeds[!is.na(seeds)]
  tibble::tibble(seed = seeds,
             url = rep(url, length(seeds)))
}

After that I made the queries themselves, pausing every 30 pages because of the rate limiting, and adding a try around the call in order to stop as soon as I reached the 1,000 results. Not a very elegant solution but I wasn’t in a perfectionnist mood.

Note that the header "Accept" = 'application/vnd.github.v3.text-match+json' is very important, without it you wouldn’t get the text fragments in the results.

library("gh")
seeds <- NULL

ok <- TRUE
page <- 1
while(ok){
  matches <- try(gh("/search/code", q = "set.seed&language:r",
                    .token = Sys.getenv("GITHUB_PAT"),
                    .send_headers = c("Accept" = 'application/vnd.github.v3.text-match+json'),
                    page = page), silent = TRUE)
  ok <- !is(matches, "try-error")
 
  if(ok){
    seeds <- bind_rows(seeds, bind_rows(lapply(matches$items, 
                                               get_seeds_from_matches)))
  }

  page <- page + 1
  # wait 2 minutes every 30 pages
  if(page %% 30 == 1 & page > 1){
    Sys.sleep(120)
  }
  
}

save(seeds, file = "data/2017-04-12-seeds.RData.RData")
library("magrittr")
load("data/2017-04-12-seeds.RData")
head(seeds) %>%
  knitr::kable()
seed url
1 https://github.com/berndbischl/ParamHelpers/blob/9d374430701d94639cc78db84f91a0c595927189/tests/testthat/helper_zzz.R
1 https://github.com/TypeFox/R-Examples/blob/d0917dbaf698cb8bc0789db0c3ab07453016eab9/ParamHelpers/tests/testthat/helper_zzz.R
1 https://github.com/cran/ParamHelpers/blob/92a49db23e69d32c8ae52585303df2875d740706/tests/testthat/helper_zzz.R
4.0 https://github.com/ACP-KR/AsanAdvR/blob/0517e88efce94266997d680e8b5a7c2a97c9277d/R-Object-Oriented-Programming-master/chapter4/chapter_4_ex11.R
4.0 https://github.com/ACP-KR/AsanAdvR/blob/0517e88efce94266997d680e8b5a7c2a97c9277d/R-Object-Oriented-Programming-master/chapter4/chapter_4_ex11.R
4.0 https://github.com/KellyBlack/R-Object-Oriented-Programming/blob/efbb0b81063baa30dd9d56d5d74b3f73b12b4926/chapter4/chapter_4_ex11.R

I got 984 entries, not 1,000 so maybe I lost some seeds in the process or the results weren’t perfect. The reason why I also added the URL of the script to the results was to be able to go and look at the code around surprising seeds.

Let’s have a look at the most frequent seeds in the sample.

table(seeds$seed) %>%
  broom::tidy() %>%
  dplyr::arrange(desc(Freq)) %>%
  head(n = 12) %>%
  knitr::kable()
Var1 Freq
seed 312
1 134
123 60
iseed 48
10 47
13121098 28
ss 24
20 21
1234 18
42 18
123456 15
0 14

So the most prevalent seed is a mystery because I’m not motivated enough to go scrape the code to find if the seed gets assigned a value before, like in that tweet I saw today. I was happy that 1 was so popular, maybe it means I belong?

I was surprised by two values. First, 13121098.

dplyr::filter(seeds, seed == "13121098") %>%
  head(n = 10) %>% 
  knitr::kable()
seed url
13121098 https://github.com/DJRumble/Swirl-Course/blob/4e2771141e579904eb6dd32bce51ff6e0d840d44/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098 https://github.com/swirldev/swirl_courses/blob/b3d432bfdf480c865af1c409ee0ee927c1fdbda0/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098 https://github.com/1vbutkus/swirl/blob/310874100536e1e7c66861eced9ecb52939a3e0a/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098 https://github.com/gotitsingh13/swirldev/blob/b7369b974ba76716fbcf6101bcbdc2db2f774d18/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098 https://github.com/pauloramazza/swirl_courses/blob/4e2771141e579904eb6dd32bce51ff6e0d840d44/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098 https://github.com/ildesoft/Swirl_Courses/blob/3e7f43cecbeb41e92e4f5972658f9b293e0e4b84/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098 https://github.com/hrdg/Regression_Models/blob/22f47ecf2ae62f553aa132d3d948cc6b4e1599cc/Residuals_Diagnostics_and_Variation/initLesson.R
13121098 https://github.com/Rizwanabro/Swirl-Course/blob/3e7f43cecbeb41e92e4f5972658f9b293e0e4b84/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098 https://github.com/mkgiitr/Data-Analytics/blob/1d659db1e9137b1fe595a6ef3356887de431b1be/win-library/3.1/swirl/Courses/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R
13121098 https://github.com/Jutair/R-programming-Coursera/blob/4faeed6ca780ee7f14b224c293cae77293146f37/Swirl/Rsubversion/branches/Writing_swirl_Courses/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R

I went and had a look and it seems most repositories correspond to code learnt in a Coursera course. I have taken a few courses from that specialization and loved it but I don’t remember learning about the special seed, too bad. Well I guess everyone used it to reproduce results but what does this number mean in the first place? Who typed it? A cat walking on the keyboard?

The other number that surprised me was 42 but then I remembered it is the “Answer to the Ultimate Question of Life, the Universe, and Everything” . I’d therefore say that this might be the coolest random seed. Now I can’t tell you whether it produces better results. Maybe it helps when your code actually tries to answer the Ultimate Question of Life, the Universe, and Everything?

To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)