Another R tip. Get in the habit of using drop = FALSE when indexing (using [ , ]) on data.frames.
In R, single-column data.frames are often converted to vectors when manipulated. For example:
d <- data.frame(x = seq_len(3))
print(d)
#>   x
#> 1 1
#> 2 2
#> 3 3

# not a data frame!
d[order(-d$x), ]
#> [1] 3 2 1
We were merely trying to re-order the rows, and the result was converted to a vector. This happened because the rules for [ , ] change if there is only one result column. This happens even if there was only one input column to begin with. Another example: d[,] is also a vector in this case.
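As a quick check of the behavior described above, the following sketch compares the one-column frame d with a two-column frame (d2 is a name introduced here only for illustration):

```r
d <- data.frame(x = seq_len(3))

# With a single column, default indexing drops the result to a plain vector.
is.data.frame(d[, ])               # FALSE
class(d[order(-d$x), ])            # "integer"

# d2 is a hypothetical two-column frame used only for comparison.
d2 <- data.frame(x = seq_len(3), y = c("a", "b", "c"))

# The same row re-ordering keeps the data.frame class when two columns remain.
is.data.frame(d2[order(-d2$x), ])  # TRUE
```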
The issue is: if we are writing re-usable code, we are often programming before we know the complete contents of a variable or argument. For a data.frame named "g" supplied as an argument: g[vec, ] can be a data.frame or a vector (or even possibly a list). However, we do know that if g is a data.frame then g[vec, , drop = FALSE] is also a data.frame (assuming vec is a vector of valid row indices or a logical vector; note: NA induces some special cases).
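A minimal sketch of what this looks like in re-usable code. The helper name select_rows and the example frames are hypothetical, introduced only to illustrate the point; they are not from any package:

```r
# Hypothetical helper: return the rows of data.frame `g` picked out by `vec`.
# drop = FALSE keeps the result a data.frame however many columns `g` has.
select_rows <- function(g, vec) {
  g[vec, , drop = FALSE]
}

one_col <- data.frame(x = c(10, 20, 30))
two_col <- data.frame(x = c(10, 20, 30), y = c("a", "b", "c"))

r1 <- select_rows(one_col, c(3, 1))               # integer row indices
r2 <- select_rows(two_col, c(TRUE, FALSE, TRUE))  # logical row selector

is.data.frame(r1)  # TRUE
is.data.frame(r2)  # TRUE
nrow(r1)           # 2
```

The caller does not need to know how many columns g has; the return type is the same in both cases.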
We care because vectors and data.frames have different semantics, so they are not fully substitutable in later code.
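One small illustration of the differing semantics, using the example frame d from earlier (v is a name introduced here): the same functions answer different questions depending on which type they receive.

```r
d <- data.frame(x = seq_len(3))
v <- d[order(-d$x), ]  # silently a plain integer vector

nrow(d)    # 3: number of rows
nrow(v)    # NULL: a plain vector has no dim attribute
length(d)  # 1: number of columns
length(v)  # 3: number of elements
```

Later code that calls nrow() or length() on the result will quietly compute something different if the value was dropped to a vector.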
The fix is to include drop = FALSE as a third argument to [ , ].
# is a data frame.
d[order(-d$x), , drop = FALSE]
#>   x
#> 3 3
#> 2 2
#> 1 1
To pull out a column I suggest using one of the many good extraction notations (all using the fact that a data.frame is officially a list of columns):
d[["x"]]
#> [1] 1 2 3

d$x
#> [1] 1 2 3

d[[1]]
#> [1] 1 2 3
My overall advice is: get in the habit of including drop = FALSE when working with [ , ] and data.frames. I say do this even when it is obvious that the result does in fact have more than one column.
For example, write "mtcars[, c("mpg", "cyl"), drop = FALSE]" instead of "mtcars[, c("mpg", "cyl")]". It is clear that for data.frames both forms should work the same (either selecting a data frame with two columns, or throwing an error if we have mentioned a non-existent column). But the longer drop = FALSE form is safer (it goes further towards ensuring type-stable code) and, more importantly, documents intent (that you wanted a data.frame result).
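The habit pays off when the column selection is computed rather than typed. In this sketch, cols is a hypothetical variable standing in for a selection made at run time that happens to contain a single name:

```r
# Hypothetical: a column selection decided at run time, here of length one.
cols <- c("mpg")

short_form <- mtcars[, cols]                # dropped to a numeric vector
long_form  <- mtcars[, cols, drop = FALSE]  # still a data.frame

is.data.frame(short_form)  # FALSE
is.data.frame(long_form)   # TRUE
dim(long_form)             # 32 1
```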
One can also try base::subset(), as it has non-dropping defaults.
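A short sketch of the subset() alternative on the one-column example frame (s is a name introduced here):

```r
d <- data.frame(x = seq_len(3))

# subset() on a data.frame defaults to drop = FALSE,
# so a single-column result stays a data.frame.
s <- subset(d, x >= 2)

is.data.frame(s)  # TRUE
nrow(s)           # 2
```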