Design Flaws in R #2 — Dropped Dimensions
In a comment on my first post on design flaws in the R language, Longhai remarked that he has encountered problems as a result of R’s default behaviour of dropping a dimension of a matrix when you select only one row/column from that dimension. This was indeed the design flaw that I was going to get to next! I think it also points to what is perhaps a deeper design flaw.
I’ll give a toy example of the problem, which shows up regularly in real (but more complicated) contexts. Consider the following function, whose arguments are a matrix X and a vector s of column indexes, and which returns a vector containing the Euclidean norm of each row of X, looking only at columns in s:
subset.norms <- function (X, s) { sqrt(apply(X[,s]^2,1,sum)) }
Here’s an example of its use:
> M
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> subset.norms(M,c(1,3))
[1] 7.071068 8.246211 9.486833
Perhaps it would be interesting to see how the norms get smaller as we drop leading dimensions:
> for (k in 1:4) print(subset.norms(M,k:4))
[1] 12.88410 14.62874 16.43168
[1] 12.84523 14.49138 16.15549
[1] 12.20656 13.60147 15.00000
Error in apply(X[, s]^2, 1, sum) : dim(X) must have a positive length
Oops… When the loop gets to where k has the value 4, the subset k:4 contains just one column. When R evaluates X[,s], it doesn't return a matrix with one column, but rather a vector. The apply function doesn't work on vectors, since they have no dim attribute, so we get an error message rather than the answer.
There's a fix. The "[" operator takes a "drop" argument, which defaults to TRUE, giving the behaviour above, but which can be explicitly set to FALSE to disable conversion from a matrix to a vector in these circumstances. If we modify the function as follows:
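The change in shape can be inspected directly. The following is only a small sketch: it assumes M was built with matrix(1:12, nrow = 3), which reproduces the matrix printed above, and the variable names are made up for illustration.

```r
# Assumption: this reproduces the 3x4 matrix M shown earlier.
M <- matrix(1:12, nrow = 3)

two_cols <- M[, 3:4]               # two columns selected: still a matrix
one_col  <- M[, 4]                 # one column selected: dim is dropped
kept     <- M[, 4, drop = FALSE]   # one column, but dim is retained

print(dim(two_cols))   # 3 2
print(dim(one_col))    # NULL
print(dim(kept))       # 3 1
```

Since apply looks at dim(X) to decide what to iterate over, the NULL in the second case is what triggers the error message.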
subset.norms2 <- function (X, s) { sqrt(apply(X[,s,drop=FALSE]^2,1,sum)) }
We get the right answers:
> for (k in 1:4) print(subset.norms2(M,k:4))
[1] 12.88410 14.62874 16.43168
[1] 12.84523 14.49138 16.15549
[1] 12.20656 13.60147 15.00000
[1] 10 11 12
There are two problems with this fix. One is that in some programs, one needs to put drop=FALSE in numerous places, making the code rather hard to read. The more serious problem, and what makes this a real design flaw, is that writing code without drop=FALSE is so much easier that it's often left out even when it is needed. Indeed, since the errors typically occur only for extreme cases, it's easy for the programmer to not realize there's a problem. (As an aside, this is another context where the reversing sequences problem arises — if n is 0, subset.norms2(M,1:n) should produce all zeros, but doesn't, not because of a bug in subset.norms2 but because 1:0 doesn't produce the empty sequence.)
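The aside about 1:n can be checked with a short sketch. It assumes the same M as in the example (built here with matrix(1:12, nrow = 3)) and uses seq_len, a base R function, as one possible way to get a truly empty index sequence; the variable names are illustrative only.

```r
# Assumption: M matches the matrix used in the examples above.
M <- matrix(1:12, nrow = 3)
subset.norms2 <- function (X, s) { sqrt(apply(X[,s,drop=FALSE]^2,1,sum)) }

n <- 0
backwards <- 1:n         # c(1, 0): counts down instead of being empty
empty     <- seq_len(n)  # integer(0): a genuinely empty sequence

# The 0 index is silently ignored, so this uses column 1 only.
wrong <- subset.norms2(M, backwards)
# With no columns selected, every row norm is zero.
right <- subset.norms2(M, empty)

print(wrong)
print(right)
```

Writing seq_len(n) rather than 1:n is a common defensive habit for exactly this reason, though like drop=FALSE it is easy to forget.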
Can we solve the dimension dropping problem by just changing the default for drop from TRUE to FALSE? Of course not, since this would break too many existing programs. But even if backward compatibility weren't a problem, this wouldn't be a good solution, since the times when we want drop to be TRUE are even more numerous than when we want it to be FALSE. Look at the following, for instance:
> M[2,3,drop=FALSE]
     [,1]
[1,]    8
If drop defaults to FALSE, ordinary subscripting with single indexes for both dimensions will give a 1x1 matrix! Now, R will treat matrices as vectors (and scalars) as needed in many contexts, but propagating a dim attribute (what marks something as a matrix) all over the place will sooner or later cause problems.
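One place where a stray dim attribute bites is elementwise arithmetic with another matrix. The sketch below again assumes M is matrix(1:12, nrow = 3); the names a, b, scaled and res are hypothetical, and tryCatch is used only so the script keeps running after the error.

```r
# Assumption: M matches the matrix used in the examples above.
M <- matrix(1:12, nrow = 3)

a <- M[2, 3]                 # ordinary length-one vector, no dim
b <- M[2, 3, drop = FALSE]   # 1x1 matrix

scaled <- a * M              # fine: the single value is recycled over M

# Multiplying a 1x1 matrix by a 3x4 matrix elementwise is an error,
# because R requires two arrays in arithmetic to have matching dims.
res <- tryCatch(b * M, error = function(e) conditionMessage(e))
print(res)
```

So the same value behaves as a scalar in one form and as a non-conformable matrix in the other.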
What can be done at this point? I'm not sure. The best idea I can come up with is to let subscripts be separated by semicolons as well as commas, with use of a semicolon signaling that drop will be FALSE. In other words, M[i;j] would be equivalent to M[i,j,drop=FALSE], but much more concise. People could start using the semicolon whenever they're not trying to extract single elements, without breaking old programs.
The root cause of the dropped dimension problem seems to be that R does not have scalars. Instead, what looks like a scalar is actually a vector of length one. To R, there is no difference between M[2,3] and M[2:2,3:3]. So there's no way for the first of these to drop dimensions and the second to retain them.
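This equivalence is easy to confirm. A minimal sketch, assuming M is built with matrix(1:12, nrow = 3) as in the earlier examples:

```r
# Assumption: M matches the matrix used in the examples above.
M <- matrix(1:12, nrow = 3)

# A "scalar" index and a length-one range are the same object to R.
same_index  <- identical(2:2, 2L)
# So the two subscripting expressions cannot be told apart either.
same_result <- identical(M[2, 3], M[2:2, 3:3])

print(same_index)
print(same_result)
print(is.vector(2))   # what looks like a scalar is a vector...
print(length(2))      # ...of length one
```

By the time "[" sees its arguments, any information about whether the user wrote 2 or 2:2 is already gone.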
It's probably hopeless to try to change now, but I think it would have been better for scalars to be treated as vectors or converted to vectors as necessary, without pretending that they're always vectors. Not having real scalars may seem like a unification or simplification, but it just creates problems. These start with the first thing a new user sees on trying R out, which is likely to be something like this:
> 2+2
[1] 4
Good. R knows how to add 2 and 2. But what's the "[1]" doing there?