The R functions base::sample and base::sample.int include extra “conveniences” that seem to have no purpose beyond encouraging grave errors. In this note we will outline the problem and a suggested workaround. Obviously the R developers are highly skilled people with good intent, and likely have no choice in these matters (due to the need for backwards compatibility). However, that doesn’t mean we can’t take steps to write safer and easier-to-debug code.
“The Monkey’s Paw” William Wymark Jacobs, 1902.
Suppose we were given data in the following form:
set.seed(2562)
x <- 10*rnorm(5)
print(x)
# [1] -17.442331 7.361322 -10.537903 -4.208578 -1.560607
goodIndices <- which(x>0)
print(goodIndices)
# [1] 2
Further suppose our goal is to generate a sample of size 5 of the values of x from only the goodIndices positions. That is, a sample (with replacement) of the positive values from our vector x. I challenge a working R developer who has used base::sample or base::sample.int regularly to say they have never written at least one of the following errors at some time:
sample(x[goodIndices],size=5,replace=TRUE)
# [1] 5 6 1 3 2
x[sample(goodIndices,size=5,replace=TRUE)]
# [1] 7.361322 -17.442331 7.361322 -17.442331 7.361322
These samples are obviously wrong, but you will notice this only if you check. There is only one positive value in x (7.361322), so the only possible legitimate sample of 5 positive values under replacement is c(7.361322,7.361322,7.361322,7.361322,7.361322). Notice we never got this, and received no diagnostic. A bad sample like this can take a long time to find through its pernicious effects in downstream code.
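One cheap way to catch this kind of bad sample near where it happens is to assert that every drawn value is actually one of the candidates. The following is only a sketch: the names candidates, draw, and ok are ours, and the values of x are copied from the printed output above so that the example does not depend on the random number generator.

```r
# Values copied from the printed x above (an assumption made so the
# sketch is reproducible without relying on set.seed/rnorm).
x <- c(-17.442331, 7.361322, -10.537903, -4.208578, -1.560607)
goodIndices <- which(x > 0)
candidates <- x[goodIndices]   # a single value: 7.361322

# The erroneous call: with one numeric candidate, sample() draws integers
# instead of drawing from the candidate itself.
draw <- sample(candidates, size = 5, replace = TRUE)

# Every legitimate draw must be one of the candidates.
ok <- all(draw %in% candidates)
print(ok)
# [1] FALSE

# In real code one would fail fast instead of printing:
# stopifnot(all(draw %in% candidates))
```

The check costs one line and turns a silent downstream problem into an immediate error at the point of sampling.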
Notice the following code works (because it reliably avoids triggering the horrid special case):
as.numeric(sample(as.list(x[goodIndices]),size=5,replace=TRUE))
# [1] 7.361322 7.361322 7.361322 7.361322 7.361322
x[as.numeric(sample(as.list(goodIndices),size=5,replace=TRUE))]
# [1] 7.361322 7.361322 7.361322 7.361322 7.361322
x[goodIndices[sample.int(length(goodIndices),size=5,replace=TRUE)]]
# [1] 7.361322 7.361322 7.361322 7.361322 7.361322
As always: this is a deliberately trivial example so you can see the problem clearly.
So what is going on? The issue is given in help('sample'):
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes place from 1:x. Note that this convenience feature may lead to undesired behaviour when x is of varying length in calls such as sample(x).
This little gem is the first paragraph of the “Details” section of help('sample'). The authors rightly understand that, before you learn the intended purpose of base::sample, you first need to know there is a sneaky exception hardcoded into its behavior. When x is a single number that is at least 1, base::sample assumes you really meant to call base::sample.int and switches behavior.
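The switch can be seen directly by comparing the two functions under the same seed. This is a small sketch (the variable names viaSample, viaSampleInt, and pair are ours); it uses the whole number 10 so that no truncation question arises.

```r
# A single number >= 1: sample() hands the work to sample.int().
set.seed(42)
viaSample <- sample(10, size = 5, replace = TRUE)
set.seed(42)
viaSampleInt <- sample.int(10, size = 5, replace = TRUE)
print(identical(viaSample, viaSampleInt))
# [1] TRUE

# Two numbers: sample() draws from the elements themselves.
pair <- sample(c(10, 20), size = 5, replace = TRUE)
print(all(pair %in% c(10, 20)))
# [1] TRUE
```

So the meaning of the first argument depends on its length, which is exactly what bites when the length is decided by data (as with x[goodIndices]).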
Here is the code confirming this “convenience.”
> print(base::sample)
function (x, size, replace = FALSE, prob = NULL)
{
    if (length(x) == 1L && is.numeric(x) && x >= 1) {
        if (missing(size))
            size <- x
        sample.int(x, size, replace, prob)
    }
    else {
        if (missing(size))
            size <- length(x)
        x[sample.int(length(x), size, replace, prob)]
    }
}
<bytecode: 0x103102340>
<environment: namespace:base>
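Reading the condition in the printed source, the special branch is taken only when is.numeric(x) is TRUE. That appears to be why the earlier as.list trick works: a list is not numeric in the sense of is.numeric, so sample falls through to the ordinary branch and indexes into x. A small sketch (the names single and drawList are ours, for illustration only):

```r
single <- 7.361322

print(is.numeric(single))            # a bare number triggers the special case
# [1] TRUE
print(is.numeric(as.list(single)))   # a list does not
# [1] FALSE

# Sampling from the list takes the ordinary branch and returns a list,
# which we then convert back to a numeric vector.
drawList <- sample(as.list(single), size = 5, replace = TRUE)
print(is.list(drawList))
# [1] TRUE
print(as.numeric(drawList))
# [1] 7.361322 7.361322 7.361322 7.361322 7.361322
```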
If we meant to call base::sample.int we certainly could have. There aren’t even any of the traditional “don’t lose flag”s available (such as “drop=FALSE”, “stringsAsFactors=FALSE”, or “type='response'”). This “convenience” makes it impossible to reliably use base::sample without some trick (such as hiding our vector in a list). Our current advice is to use the following two replacement functions:
sampleint <- function(n,size,replace=FALSE,prob=NULL) {
  if((!is.numeric(n)) || (length(n)!=1)) {
    stop("sampleint: n must be a numeric of length exactly 1")
  }
  if(missing(size)) {
    size <- n
  }
  if((!is.numeric(size)) || (length(size)!=1)) {
    stop("sampleint: size must be a numeric of length exactly 1")
  }
  sample.int(n,size,replace,prob)
}
samplex <- function(x,size,replace=FALSE,prob=NULL) {
  if(missing(size)) {
    size <- length(x)
  }
  if((!is.numeric(size)) || (length(size)!=1)) {
    stop("samplex: size must be a numeric of length exactly 1")
  }
  x[sampleint(length(x), size, replace, prob)]
}
With these functions loaded you can write more natural code:
samplex(x[goodIndices],size=5,replace=TRUE)
# [1] 7.361322 7.361322 7.361322 7.361322 7.361322
As a bonus we included sampleint, which actually checks its arguments (a very good thing for library code to do), catching the case where the analyst accidentally writes “sample.int(1:10,size=10,replace=TRUE)” or “sample.int(seq_len(10),size=10,replace=TRUE)” (which return 10 copies of 1!) when they meant to write “sample.int(10,size=10,replace=TRUE)”.
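To see the argument check in action, the sketch below repeats the sampleint definition from above (so it runs on its own) and captures the error message with tryCatch. The names res and okDraw are ours, for illustration only.

```r
# Same definition as given earlier in this note, repeated for self-containment.
sampleint <- function(n, size, replace = FALSE, prob = NULL) {
  if ((!is.numeric(n)) || (length(n) != 1)) {
    stop("sampleint: n must be a numeric of length exactly 1")
  }
  if (missing(size)) {
    size <- n
  }
  if ((!is.numeric(size)) || (length(size) != 1)) {
    stop("sampleint: size must be a numeric of length exactly 1")
  }
  sample.int(n, size, replace, prob)
}

# The mistaken call (a vector where a count was intended) is rejected.
res <- tryCatch(
  sampleint(1:10, size = 10, replace = TRUE),
  error = function(e) conditionMessage(e)
)
print(res)
# [1] "sampleint: n must be a numeric of length exactly 1"

# The intended call goes through unchanged.
okDraw <- sampleint(10, size = 10, replace = TRUE)
print(length(okDraw))
# [1] 10
```

The error is reported at the call that caused it, rather than surfacing later as a suspicious result.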
Obviously it is the data scientist’s responsibility to know actual function semantics, to write correct code, and to check their intermediate results. However, it traditionally is the responsibility of library code to help in this direction by having regular behavior (see the principle of least astonishment) and to signal errors in the case of clearly invalid arguments (reporting errors near their occurrence makes debugging much easier). Nobody enjoys working with Monkey’s Paw style libraries (that are technically “correct” but not truly helpful).