Selection in R
The design of the statistical programming language R sits in a slightly uncomfortable place between the functional and object-oriented programming paradigms. The upside is that you get much of the expressive power of both paradigms. A downside is the not-always-useful variability of the language's list and object extraction operators.
Towards the end of our write-up Survive R we recommended using explicit environments with new.env(hash=TRUE,parent=emptyenv()), assign() and get() to simulate mutable string-keyed maps for storing results. This advice arose out of frustration with the apparent inconsistency of the user-facing R list operators. In this article we bite the bullet and discuss the R list operators a bit more clearly.

The R programming language takes some of its inspiration from functional programming. The modern view of purely functional data structures emphasizes persistent or immutable data structures. The rough idea is that you don't so much add a new item to a list as "cons up" a new list that has the new item in it. This formulation is incredibly powerful in that it can be used to support logging, transactions, parallelism, concurrency and lazy evaluation. It takes a bit of getting used to, but you potentially gain a lot.
The natural candidates for dealing with data in this way in R would be the c() operator (documented as the "generic function which combines its arguments") or the list() operator. At first glance these operators seem to work like the traditional Lisp cons operator. For example, c(1,2) builds a numeric vector with entries 1 and 2, and list(1,2) builds a list with entries 1 and 2. These operators are composable: c(c(1,2),3) produces a flat vector with entries 1, 2 and 3, while list(list(1,2),3) produces a nested list whose first entry is a list with entries 1 and 2, and whose second entry is the value 3 (more like Lisp's cons operator).
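As a minimal console sketch of this flattening-versus-nesting difference (results noted in the comments):

c(1,2)                     # numeric vector: 1 2
list(1,2)                  # list of two elements
c(c(1,2),3)                # flattened: numeric vector 1 2 3
str(list(list(1,2),3))     # nested: first element is itself a list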
However, here is where the functional design starts to interfere with the object-oriented dispatch features. The c() operator, in addition to flattening lists, wipes out type information. For example: build a simple data frame d <- data.frame(x=c(1,2),y=c(1,2)) and a linear model model <- lm(y~x,data=d) (models really are the meat of R; you use R for its built-in data handling and modeling capabilities, not the programming language). Now try to make an array or vector of models: c(model,model). The returned type is "list" (not a list of models or an array of lm). c(list(model),list(model)) works and c(list(model),model) does not (so c()'s flatten behavior does not depend on just the type of the first argument). The flattening definitions of c() are not sufficiently context aware and transform the models into damaged data. This loses the type-carrying portions of the data that type-specific dispatch depends on. Note that none of these workarounds give really convenient persistent data structures, so we will switch to the more common imperative model and try to work with mutable lists.
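A short console sketch of the damage, using the d and model just defined (output noted in comments):

class(model)                        # "lm"
length(model)                       # 12: an lm is a list of 12 components
merged <- c(model, model)
class(merged)                       # "list" -- the "lm" class is gone
length(merged)                      # 24: the unpacked slots of both models
class(c(list(model), list(model)))  # "list", but its 2 elements are intact lm objects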
The right way to return a usable list of models is to call the list() operator either with unnamed arguments as in list(model,model) or with named slots as in list(a=model,b=model). The second, named form is a bit unsatisfying as we are passing in the slot names "a" and "b" in a hard-coded or non-templated way. R's complicated argument evaluation rules defend us against the case where there is already a variable named "a" running around, and guarantee we use the name "a" and not the contents of some variable with the name "a". We can write list('a'=model,'b'=model) to make our intended meaning more clear. But how do we set a slot programmatically? That is, how do we use a variable instead of a hard-coded string to specify the slot?
We can get programmatic access to the list with the following code:
nameA <- 'a'
nameB <- 'b'
modelList <- list()           # start with an empty list
modelList[[nameA]] <- model   # assign into the slot named by the variable nameA
modelList[[nameB]] <- model   # likewise for nameB
This builds a map (a list with named slots) with slots "a" and "b" populated by linear models.
For a list with unnamed slots we would write:
modelList <- list()
modelList[[length(modelList)+1]] <- model   # append by indexing one past the end
modelList[[length(modelList)+1]] <- model
Notice we used an odd double square bracket notation. The main generic accessors in R are "[]", "$" and "[[]]". It is well worth the effort to type help('[[') into an R console and attempt to critically examine the help page. Not all advanced R users even know all three notations exist.
My unvarnished description of these operators is as follows.
"$" is the hard-coding way to get single values out of R objects. For example names(model)
tells us that linear models tend to have a slot named "coefficients". To get that data we can type in model$coefficients
. "$" has a few flaws. It accepts abbreviations; so model$co
returns the same coefficients (unless you try it on an object that also has a co
slot). So the abbreviation works until it fails quietly in ambiguous situations. Asking for slots that do not exist returns a non-signaling NULL (so you have no way of knowing if you got back NULL because that what is in the object, or if you got the field name wrong). You can check with something like 'coefficients' %in% names(model)
, but R passes around NULLs fairly quietly- so you will need a bit of luck to see the failure near your mistake. And finally "$" is hard-coded: you can't pass in a key value in a variable (model$coefficients
is essentially syntactic sugar for model$'coefficients'
).
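To make the hazards concrete, here is a sketch using the model from earlier (the misspelled slot name is a deliberately hypothetical mistake):

model$co                            # partial match: quietly returns model$coefficients
model$coefficientz                  # misspelled slot: quietly returns NULL, no error
'coefficients' %in% names(model)    # TRUE: one way to check a slot exists first
key <- 'coefficients'
model$key                           # NULL: "$" looks for a slot literally named "key"
model[[key]]                        # "[[]]" treats key as a variable, as we want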
"[[]]" is almost always the right programatic (or templated) way to get a named or column value (which really makes it unfortunate it is not a more common notation like "[]"). model[['coefficients']]
grabs the slot you want and the explicit string we showed here can be replaced by a passed in variable. You can add or override slots just by assignment (so to answer my question posed in Survive R, R does indeed have mutable name maps, they are called lists). Also "[[]]" does not accept abbreviations unless you set a flag declaring you want this sort of unsafe behavior ( model[['coef',exact=F]]
). "[[]]]" does quietly return null on wrong-slot, but that is pretty much how R rolls. The notation is clunky (this notation should be Rule 1, not Rule 42) but this is really the right notation to use. R does use its object oriented view of generic functions to add nice features to "[[]]", like the ability to extract columns form data frames (you can write d[['x']]
instead of d$x
or d[,'x']
).
"[]" is the most popular way to get a value in R. This is unfortunate as it has poor defaults and most users are using it because it is the correct notation in other languages. In R "[]" is a multi-row/column selector (or what I will call a slice selector) that has been extended to usually work correctly as an alias for the single row operator "[[]]". If "usually works" is good enough for you then vanilla "[]" is all you need (otherwise you will need to add some ugly control arguments).
The R documentation states:
The most important distinction between [, [[ and $ is that the [ can select more than one element whereas the other two select a single element.
It is safer to ignore the description of "[]" and use it as it is actually implemented: as a slice selector (and a broken one at that).
Here is the issue. Both d[1,] and d[c(1),] return the first row of our example data frame. And d[c(1,2),] and d[c(TRUE,TRUE),] both return two rows (exhibiting the multi-selector or slice effect). I advise thinking of "[]" as an operator where all of the arguments are lists, and of d[1,] as mere syntactic sugar for d[c(1),] (single elements get promoted to single-element lists). This kind of convenient alias means you don't have to precisely specify whether you are interested in individual rows or sets of rows.
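A console sketch of the aliasing (all four selections below return data frames):

d[1,]              # first row
d[c(1),]           # the same: the scalar 1 is promoted to c(1)
d[c(1,2),]         # two rows
d[c(TRUE,TRUE),]   # two rows again, selected by a logical mask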
From outside observation, this seems to be a design principle of R: lots of convenient synonyms and aliases, so the user isn't burdened with supplying detailed specifications of what they want (and a lot of R quirks stem from this emphasis). This seems helpful, and R data frames work very well with this level of notational merging. However, the behavior is very bad when applied to R matrices.
Consider the following matrix: m <- matrix(c(1,2,3,4),nrow=2,ncol=2). m[c(1,2),] again gives us both rows as a matrix, as we would expect. m[1,] gives us the first row as a numeric vector, which is a different type in R than matrix. But we used a slightly different input notation, so getting a different type or class back is acceptable. The killer is: m[c(TRUE,FALSE),] also gives us back a single row as a numeric vector. This is a huge problem (it can sporadically crash code that isn't explicitly defending against it). R's own type system does not consider numeric vectors and matrices substitutable. That is, m[c(TRUE,FALSE),][1,1] is an error (R numeric vectors do not support two-dimensional indexing) while m[c(TRUE,TRUE),][1,1] is okay. To add insult to injury, the type of m[c(FALSE,FALSE),] is again matrix (as an R numeric vector can't represent the zero-row situation, so this silliness isn't even monotone). The point is: if you are writing a script or program in R, the set of rows you are going to pull out might be passed in as a variable; if a single-row result is even a remote possibility, you have to write code to defend against this eventuality. The defense is to set the "do not lose type" flag by adding a drop=FALSE argument: m[c(TRUE,FALSE),,drop=FALSE]. You may also expect to have to guard against the zero-row situation (as an empty matrix is fundamentally different than a small matrix), but having to guard against the single-row case is defending yourself from the programming language in addition to defending yourself from the data (in our era you really have to assume user-facing data sources are hostile and prepare and defend against evil corner cases).
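Here is the whole hazard, and its defense, as a runnable sketch (using is.matrix() to make the type changes visible):

m <- matrix(c(1,2,3,4), nrow=2, ncol=2)
is.matrix(m[c(1,2),])            # TRUE: two rows survive as a matrix
is.matrix(m[c(TRUE,FALSE),])     # FALSE: the single row collapsed to a numeric vector
is.matrix(m[c(FALSE,FALSE),])    # TRUE again: zero rows stay a matrix
m[c(TRUE,TRUE),][1,1]            # okay: the result is still a matrix
# m[c(TRUE,FALSE),][1,1]         # error: numeric vectors have no 2-d indexing
m[c(TRUE,FALSE),,drop=FALSE]     # the defense: drop=FALSE always returns a matrix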
This sort of type inconsistency is one of the reasons I don't like dynamically typed languages. I admit I do not enjoy, when using statically typed languages, making type declarations or having to write essentially duplicate functions to perform the same operation on similar types (defining only one function like square <- function(x) {x*x} makes a lot of sense). I get that it may be nice to have the return type of a function change to track the runtime types of its arguments, but in our example we have the type changing as a function of runtime values (not types). Really, deep down, most imperative programmers believe that the return type of a function application is determined by the types of its arguments at runtime (not their detailed values). I agree function definition should be flexible and generic; it is just that function application should be at least as specific as your current runtime execution context and follow the principle of least astonishment (you should have a sporting chance of knowing the meaning of your code, even before execution).
The "R way" (in the deliberately pejorative sense) to fix this would be to extend numeric types to support two-dimensional indexing (or alter the "[]" operator to ignore one of the indices when working with one dimensional data structures). That is to paper over un-managed and un-signaled type differences by adding more notation aliases and conveniences. My objection is: these aliases then likely later conflate some other operations and types the user would have the right to usefully distinguish between. Eventually requiring the addition of even more aliases. Papering over essential difficulties and distinctions is where Postel's Law goes wrong (and creates larger, more complicated and less maintainable systems). This historic strategy in R is why so often R's defaults "feel exactly wrong." In R you have often been using the wrong operator because what you found first or most notationally concise is actually an operator defined for a different context (thus its default behavior is different than you would expect in your context). To repeat our example: the "[]" operator naturally operates over sets of rows- but is willing to accept scalar arguments (so you are not forced to type d[c language="(1)"][/c]
when you wanted d[1]
or d[[1]]
, all of which are column selections). Safe systems should hide unimportant distinctions (often implementation details) and help manage important semantic distinctions (data types with different meanings). There is indeed a limit on how many distinctions a user can put up with (which is why it is important to hide inessential distinctions). R's own type system (in the object oriented sense) decided that matrices and vectors are incompatible, but the functional/generic code does not respect that distinction (sometimes returns one or the other depending on runtime data details).
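For completeness, a sketch of those three column selections on our example data frame:

d[c(1)]    # a one-column data frame
d[1]       # the same one-column data frame
d[[1]]     # the column contents themselves, as a numeric vector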
To be sure, we do recommend learning and using R. We just recommend taking the extra care to use it right. Many see this as nit-picky, so we strongly emphasize the danger in order to sell the idea of caution. You would certainly make graver statistical errors trying to design an ad-hoc analysis in a general purpose programming language than you will experience using R's available libraries. R is really a data analysis tool, not a modern programming language. You should use R as much as practical, while writing as little R code as needed. But this is actually good advice for any real programming project in any programming language: you are trying to spend the minimum effort programming to achieve some other goal, not to write a lot of code. With a little care when scripting or programming, R remains one of the most powerful tools to reliably automate sophisticated statistical tasks.