subset vs array indexing: which will cause the least grief in R?
The comments on my post outlining recommended R usage for professional developers were universally scornful, with my proposal to use subset receiving the greatest wrath. The main argument against subset appeared to be that it goes against existing practice; one comment linked to Hadley Wickham suggesting it is useful in an interactive session (and, by implication, not useful elsewhere).
The commenters appeared to be knowledgeable R users, and I suspect they might have fallen into the trap of thinking that, having invested time in acquiring expertise in the language's intricacies, they ought to use those intricacies. Big mistake: the best way to make use of language expertise is to use it to avoid the intricacies, aiming to write simple, easy-to-understand code.
Let’s use Hadley’s example to discuss the pros and cons of subset
vs. array indexing (normally I have lots of data to help make my case, but usage data for R is thin on the ground).
Some data to work with, which would normally be read from a file.
sample_df = data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))
The following are two of the ways of extracting all rows for which a >= 4:

subset(sample_df, a >= 4)
# has the same external effect as:
sample_df[sample_df$a >= 4, ]
The subset approach has the advantages:
- The array name, sample_df, only appears once. If this code is cut-and-pasted or the array name changes, the person editing the code may omit changing the second occurrence.
- Omitting the comma in the array access is an easy mistake to make (and it won't get flagged).
- The person writing the code has to remember that in R data is indexed in row-column order (it is column-row order in many languages in common use). This might not be a problem for developers who only code in R, but my target audience is likely to be casual R users.
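To make the comma mistake concrete, here is a minimal sketch; the data frame d is my own hypothetical example, chosen to have more columns than rows so the mistake fails silently rather than raising an error:

```r
# hypothetical example: a data frame with more columns than rows,
# where the missing-comma mistake fails silently
d = data.frame(a = 1:2, b = 3:4, c = 5:6)

# Intended: the single row where a >= 2
rows = d[d$a >= 2, ]     # one row, all three columns

# Comma forgotten: the logical vector is recycled across the
# *columns*, silently returning column b for every row --
# no error, no warning
oops = d[d$a >= 2]
```

With a single (comma-free) index a data frame is treated like a list, so the row condition ends up selecting columns instead of rows.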
The case for subset is not all positive; there is a use case where it will produce the wrong answer. Let's say I want all the rows where b equals some computed value, and I have chosen to store this computed value in a variable called c.
c=3
subset(sample_df, b == c)
I get the surprising output:
  a b c
1 1 5 5
5 5 1 1
because the code I have written is actually equivalent to:
sample_df[sample_df$b == sample_df$c, ]
The problem is caused by the data containing a column with the same name as the variable holding the computed value: subset evaluates its condition inside the data frame first, so the column c shadows the variable c.
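The clash can be sidestepped by not reusing a column name for the comparison value; a sketch, where the variable name threshold is my own choice:

```r
sample_df = data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# threshold is not a column name, so subset falls back to the
# calling environment and returns the intended single row (b == 3)
threshold = 3
subset(sample_df, b == threshold)
```

Names in the condition are looked up in the data frame first and only then in the calling environment, so any non-clashing name behaves as expected.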
So both subset
and array indexing are a source of potential problems. Which of the two is likely to cause the most grief?
Unless the files being processed each potentially contain many columns having unknown (at time of writing the code) names, I think the subset
name clash problem is much less likely to occur than the array indexing problems listed earlier.
It's a shame that assignment via subset is not supported (something to consider for a future release), but reading is the common case and that is what we are interested in.
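For completeness, a sketch of what an update has to look like today; because subset cannot appear on the left of an assignment, the write goes through array indexing, with both of its pitfalls:

```r
sample_df = data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# subset(sample_df, a >= 4)$b = 0L   # not supported: subset only reads

# the write has to use indexing, repeating the array name and
# remembering the comma
sample_df[sample_df$a >= 4, "b"] = 0L
```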
Yes, subset is restricted to 2-dimensional objects, but most data is 2-dimensional (at least in my world). Again, concentrate recommendations on the common case.
When a choice is available, developers should pick the construct that is least likely to cause problems, and trivial mistakes are the most common cause of problems.
Does anybody have a convincing argument why array indexing is to be preferred over subset ("it is not common usage" being the argument of last resort for the desperate)?