Sow the seeds, know the seeds
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
When you do simulations, for instance in R, e.g. drawing samples from a distribution, it’s best to set a random seed via the function set.seed
in order to have reproducible results. The function has no default value. I think I mostly use set.seed(1)
. Last week I received an R script from a colleague in which he used a weird number in set.seed
(maybe a phone number? or maybe he let his fingers type randomly?), which made me curious about the usual seed values. As in my blog post about initial commit messages I used the Github API via the gh
package to get a very rough answer (an answer seedling from the question seed?).
From Github API search endpoint you can get up to 1,000 results corresponding to a query which in the case of set.seed
occurrences in R code isn’t the whole picture but hopefully a good sample. I wrote a function to treat the output of a query to the API where I take advantage of the stringr
package. I just want the thing inside set.seed()
from the text matches returned by the API.
get_seeds_from_matches <- function(item){
url <- item$html_url
matches <- item$text_matches
matches <- unlist(lapply(matches, "[[", "fragment"))
matches <- stringr::str_split(matches, "\\\n", simplify = TRUE)
matches <- stringr::str_extract(matches, "set\\.seed\\(.*\\)")
matches <- stringr::str_replace(matches, "set\\.seed\\(", "")
seeds <- stringr::str_replace(matches, "\\).*", "")
seeds <- seeds[!is.na(seeds)]
tibble::tibble(seed = seeds,
url = rep(url, length(seeds)))
}
After that I made the queries themselves, pausing every 30 pages because of the rate limiting, and adding a try
around the call in order to stop as soon as I reached the 1,000 results. Not a very elegant solution but I wasn’t in a perfectionnist mood.
Note that the header "Accept" = 'application/vnd.github.v3.text-match+json'
is very important, without it you wouldn’t get the text fragments in the results.
library("gh")
seeds <- NULL
ok <- TRUE
page <- 1
while(ok){
matches <- try(gh("/search/code", q = "set.seed&language:r",
.token = Sys.getenv("GITHUB_PAT"),
.send_headers = c("Accept" = 'application/vnd.github.v3.text-match+json'),
page = page), silent = TRUE)
ok <- !is(matches, "try-error")
if(ok){
seeds <- bind_rows(seeds, bind_rows(lapply(matches$items,
get_seeds_from_matches)))
}
page <- page + 1
# wait 2 minutes every 30 pages
if(page %% 30 == 1 & page > 1){
Sys.sleep(120)
}
}
save(seeds, file = "data/2017-04-12-seeds.RData.RData")
library("magrittr")
load("data/2017-04-12-seeds.RData")
head(seeds) %>%
knitr::kable()
seed | url |
---|---|
1 | https://github.com/berndbischl/ParamHelpers/blob/9d374430701d94639cc78db84f91a0c595927189/tests/testthat/helper_zzz.R |
1 | https://github.com/TypeFox/R-Examples/blob/d0917dbaf698cb8bc0789db0c3ab07453016eab9/ParamHelpers/tests/testthat/helper_zzz.R |
1 | https://github.com/cran/ParamHelpers/blob/92a49db23e69d32c8ae52585303df2875d740706/tests/testthat/helper_zzz.R |
4.0 | https://github.com/ACP-KR/AsanAdvR/blob/0517e88efce94266997d680e8b5a7c2a97c9277d/R-Object-Oriented-Programming-master/chapter4/chapter_4_ex11.R |
4.0 | https://github.com/ACP-KR/AsanAdvR/blob/0517e88efce94266997d680e8b5a7c2a97c9277d/R-Object-Oriented-Programming-master/chapter4/chapter_4_ex11.R |
4.0 | https://github.com/KellyBlack/R-Object-Oriented-Programming/blob/efbb0b81063baa30dd9d56d5d74b3f73b12b4926/chapter4/chapter_4_ex11.R |
I got 984 entries, not 1,000 so maybe I lost some seeds in the process or the results weren’t perfect. The reason why I also added the URL of the script to the results was to be able to go and look at the code around surprising seeds.
Let’s have a look at the most frequent seeds in the sample.
table(seeds$seed) %>%
broom::tidy() %>%
dplyr::arrange(desc(Freq)) %>%
head(n = 12) %>%
knitr::kable()
Var1 | Freq |
---|---|
seed | 312 |
1 | 134 |
123 | 60 |
iseed | 48 |
10 | 47 |
13121098 | 28 |
ss | 24 |
20 | 21 |
1234 | 18 |
42 | 18 |
123456 | 15 |
0 | 14 |
So the most prevalent seed is a mystery because I’m not motivated enough to go scrape the code to find if the seed gets assigned a value before, like in that tweet I saw today. I was happy that 1 was so popular, maybe it means I belong?
I was surprised by two values. First, 13121098.
dplyr::filter(seeds, seed == "13121098") %>%
head(n = 10) %>%
knitr::kable()
seed | url |
---|---|
13121098 | https://github.com/DJRumble/Swirl-Course/blob/4e2771141e579904eb6dd32bce51ff6e0d840d44/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
13121098 | https://github.com/swirldev/swirl_courses/blob/b3d432bfdf480c865af1c409ee0ee927c1fdbda0/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
13121098 | https://github.com/1vbutkus/swirl/blob/310874100536e1e7c66861eced9ecb52939a3e0a/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
13121098 | https://github.com/gotitsingh13/swirldev/blob/b7369b974ba76716fbcf6101bcbdc2db2f774d18/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
13121098 | https://github.com/pauloramazza/swirl_courses/blob/4e2771141e579904eb6dd32bce51ff6e0d840d44/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
13121098 | https://github.com/ildesoft/Swirl_Courses/blob/3e7f43cecbeb41e92e4f5972658f9b293e0e4b84/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
13121098 | https://github.com/hrdg/Regression_Models/blob/22f47ecf2ae62f553aa132d3d948cc6b4e1599cc/Residuals_Diagnostics_and_Variation/initLesson.R |
13121098 | https://github.com/Rizwanabro/Swirl-Course/blob/3e7f43cecbeb41e92e4f5972658f9b293e0e4b84/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
13121098 | https://github.com/mkgiitr/Data-Analytics/blob/1d659db1e9137b1fe595a6ef3356887de431b1be/win-library/3.1/swirl/Courses/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
13121098 | https://github.com/Jutair/R-programming-Coursera/blob/4faeed6ca780ee7f14b224c293cae77293146f37/Swirl/Rsubversion/branches/Writing_swirl_Courses/Regression_Models/Residuals_Diagnostics_and_Variation/initLesson.R |
I went and had a look and it seems most repositories correspond to code learnt in a Coursera course. I have taken a few courses from that specialization and loved it but I don’t remember learning about the special seed, too bad. Well I guess everyone used it to reproduce results but what does this number mean in the first place? Who typed it? A cat walking on the keyboard?
The other number that surprised me was 42 but then I remembered it is the “Answer to the Ultimate Question of Life, the Universe, and Everything” . I’d therefore say that this might be the coolest random seed. Now I can’t tell you whether it produces better results. Maybe it helps when your code actually tries to answer the Ultimate Question of Life, the Universe, and Everything?
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.