How to select a seed for simulation or randomization
If you need to generate a randomization list for a clinical trial, run some simulations or perform a huge bootstrap analysis, you need a way to draw random numbers. Putting many pieces of paper in a hat and drawing them is possible in theory, but you will probably use a computer. The computer, however, does not generate random numbers. It generates pseudo-random numbers. They look and feel almost like real random numbers, but they are not random. Each number in the sequence is calculated from its predecessor, so the sequence has to begin somewhere: it begins with the seed, the first number in the sequence.
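To make the "each number is calculated from its predecessor" idea concrete, here is a minimal sketch of a toy generator in R. It is a linear congruential generator with the well-known Park-Miller constants, chosen only for illustration; it is not the generator R uses by default, and the function names `lcg_next` and `lcg_sequence` are made up for this example.

```r
# Toy linear congruential generator: x_next = (a * x) mod m.
# Illustration only; this is NOT R's default generator.
lcg_next <- function(x, a = 16807, m = 2^31 - 1) {
  (a * x) %% m
}

# Build a sequence of n numbers, each computed from the one before it.
lcg_sequence <- function(seed, n) {
  out <- numeric(n)
  x <- seed
  for (i in seq_len(n)) {
    x <- lcg_next(x)
    out[i] <- x
  }
  out
}

first_run  <- lcg_sequence(seed = 76418, n = 5)
second_run <- lcg_sequence(seed = 76418, n = 5)

# The same starting point always produces the same sequence.
print(identical(first_run, second_run))
```

Nothing in the loop is random: once the seed is fixed, every later value is determined.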
Knowing the seed is a good idea. It lets you reproduce the analysis, the simulation or the randomization list. If you run a clinical trial, reproducibility is crucial. You must know at the end of the trial which patient was randomized to which treatment; otherwise you will have to throw all your data in the garbage. During the years I worked at Teva Pharmaceuticals, we took every possible safety measure: we burned the randomization lists, the randomization SAS code and the randomization seed onto a CD and kept it in a fireproof safe. We also kept all this information on analog media. Yes, we printed the lists, the SAS code and the seed on paper, and these were also kept in the safe.
Using the same seed every time is not a good idea. If you use the same seed every time, you get the same sequence of pseudo-random numbers every time, so all your analyses share the same draws and the numbers no longer behave like random ones. Selecting a different seed every time is good practice.
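A short sketch of what reusing a seed means in practice. The seeds 123 and 456 and the variable names are arbitrary choices for this illustration.

```r
# Two supposedly separate simulations that reuse one seed
set.seed(123)
simulation_a <- rnorm(100)
set.seed(123)
simulation_b <- rnorm(100)

# A third simulation with a different seed
set.seed(456)
simulation_c <- rnorm(100)

# The first two are the very same numbers; the third is not
same_seed_identical      <- identical(simulation_a, simulation_b)
different_seed_identical <- identical(simulation_a, simulation_c)
print(same_seed_identical)
print(different_seed_identical)
```

With a shared seed, the second simulation adds no new information: it is a copy of the first.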
How do you select the seed? Picking a number off the top of your head is still not a good idea. Our heads are biased. As with passwords, people tend to use some seeds more often than others. Don't be surprised if you see code with seeds like 123, 999 or 31415.
The best practice is to choose a random seed, but this creates a vicious circle: you need a random number in order to generate random numbers. You can still turn to your old hat and some pieces of paper, but there is a simple alternative: generate the seed from the computer clock.
This is how you do it in R by using the Sys.time() function: get the system time, convert it to an integer, and you're done. In practice, I take only the last 5 digits of that integer. And of course, I keep the chosen seed on record. I also save the Sys.time() value and its integer value, just in case. Finally, I hardcode the seed I get. This is the R code:
> # get the system time
> initial_seed=Sys.time()
> # convert the time into a numeric variable
> initial_seed=as.integer(initial_seed)
> print (initial_seed)
[1] 1552576418
> # take the last five digits of the initial seed
> the_seed=initial_seed %% 100000
> print(the_seed)
[1] 76418
> set.seed(76418)
> # do your simulation
> print(rnorm(3))
[1] -0.1255811  1.1614262 -0.8534025
> # reproduce your simulation
> set.seed(76418)
> print(rnorm(3))
[1] -0.1255811  1.1614262 -0.8534025
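The steps above can also be wrapped in a small helper that returns everything worth keeping on record. This is only a sketch: the function name `make_clock_seed` and its arguments are hypothetical and not part of the original workflow.

```r
# Hypothetical helper: derive a seed from the clock and return the
# time, its integer value and the seed, so all three can be recorded.
make_clock_seed <- function(now = Sys.time(), digits = 5) {
  time_integer <- as.integer(now)                      # seconds since 1970-01-01
  seed <- as.integer(time_integer %% 10^digits)        # keep the last `digits` digits
  list(time = now, time_integer = time_integer, seed = seed)
}

# Run once, save the printed record, then hardcode the seed in your script.
seed_record <- make_clock_seed()
print(seed_record)

# Passing the time explicitly reproduces the seed from the transcript above.
example_time <- as.POSIXct(1552576418, origin = "1970-01-01", tz = "UTC")
example_record <- make_clock_seed(now = example_time)
print(example_record$seed)
```

Call the helper once when you set up the analysis and hardcode the resulting seed. Calling it on every run would give a new seed each time and defeat reproducibility.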