1. Vectors Do Not Have Defined "Dimensions"
Strangely, the dim ("dimension") function does not work on vectors, though it does work on matrices and higher-dimensional arrays. A vector has no dim attribute, so dim simply returns NULL.
dim(1:10)
## NULL
If you want to figure out how long a vector is, you need to use the length function.
length(1:10)
## [1] 10
Which would be fine, except that length also works on matrices, where it counts the total number of elements rather than returning a "length," which is no longer well defined.
length(matrix(1, nrow = 10, ncol = 10))
## [1] 100
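A useful middle ground, not covered above, is base R's NROW and NCOL (note the capitals), which treat a vector as a one-column matrix and so behave consistently on both. A quick sketch:
NROW(1:10)
## [1] 10
NCOL(1:10)
## [1] 1
NROW(matrix(1, nrow = 10, ncol = 10))
## [1] 10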
So: length for vectors, dim for matrices. Easy, right? Wrong. Which brings me to my next quirk.
2. Class Dropping
Matrices that are reduced to a single row (or column) silently become vectors, and some matrix functions no longer work on them.
mymatrix <- matrix(rnorm(12), nrow = 3, ncol = 4)
mymatrix
##         [,1]    [,2]    [,3]    [,4]
## [1,]  1.3941 -0.7149 -1.7237 -1.6695
## [2,]  0.6882  1.4039 -2.2238 -0.3019
## [3,] -0.2032  1.3995 -0.3562 -0.3349
dim(mymatrix[1:2, ])
## [1] 2 4
dim(mymatrix[1, ])
## NULL
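To make the "functions no longer work" point concrete, here is a small sketch: rowSums demands at least two dimensions, so it fails once the row has been dropped to a vector (the exact error text may vary slightly by R version).
rowSums(mymatrix[1, ])
## Error: 'x' must be an array of at least two dimensions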
Fortunately, there is a kind of fix for this: specifying drop = F when taking a subset of a matrix will preserve its class as a matrix.
dim(mymatrix[1, , drop = F])
## [1] 1 4
You do not, however, need to worry about your matrix becoming a vector if you subset down to zero rows or columns.
dim(mymatrix[-1:-3, ])
## [1] 0 4
3. Negative Subscripts
These are nothing to be bothered by, since they are entirely optional. However, if you run into them they can certainly throw you for a loop. Negative subscripts subset a matrix or vector by removing the specified elements, rows, or columns.
For example:
myvect <- -4:5
myvect
##  [1] -4 -3 -2 -1  0  1  2  3  4  5
They can be a little tricky to work with. Say we want to remove every other number in our ten-element vector.
myvect[c(-1, -3, -5, -7, -9)]
## [1] -3 -1  1  3  5
Or more easily
myvect[-seq(1, 9, 2)]
## [1] -3 -1  1  3  5
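A related wrinkle worth knowing: positive and negative subscripts cannot be mixed in a single subset (only zeros may accompany negatives), so an attempt like the following fails.
myvect[c(-1, 2)]
## Error: only 0's may be mixed with negative subscripts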
4. NA Values Specifying Missing Data
These special values can be quite challenging to work with. The biggest challenge for me is detecting when one of them pops up and adjusting my code accordingly.
NA indicates a value that is "Not Available," or missing. Having an indicator for missingness is standard: Stata uses a ".", while other programs use -9999 or some other unlikely number. The problem with NAs is not that they exist, since almost every data set has missing values at some point, but the frequency with which they interrupt normal function behavior. For instance, with a long vector such as 1:100, if a single value is NA then most functions you attempt on it will simply return NA.
a <- c(1:20, NA)
sum(a)
## [1] NA
max(a)
## [1] NA
min(a)
## [1] NA
cor(a, a)
## [1] NA
Once again, this is not necessarily a problem, but it is annoying and can easily create unexpected behavior. There are several solutions. One is to remove any observations with missing values from your data. This can create its own issues, since NAs do not conform to logical operators (in contrast with Stata, SPSS, and every other statistical language I know of). Thus
a2 <- a[a != NA]
a2
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
fails, while
a3 <- a[!is.na(a)]
a3
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
does the job.
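For whole data sets, base R has helpers that do this row filtering for you. A minimal sketch with a made-up data frame:
df <- data.frame(x = c(1, 2, NA), y = c(NA, 5, 6))
na.omit(df)  # drops every row containing an NA
##   x y
## 2 2 5
complete.cases(df)  # one logical per row, usable as a subscript
## [1] FALSE  TRUE FALSE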
Alternatively, many commonly used functions have special arguments devoted expressly to handling missing data.
sum(a, na.rm = T)
## [1] 210
max(a, na.rm = T)
## [1] 20
min(a, na.rm = T)
## [1] 1
cor(a, a, use = "pairwise.complete.obs")
## [1] 1
5. "Empty Values": NULL, integer(0), numeric(0), logical(0)
These four values are several of R's empty-value indicators. I don't believe they exhaust the list, and I am sure each has its own unique properties, which I cannot say I fully understand. The challenge usually lies in detecting when they occur. As with NAs, they resist logical comparisons:
a == NULL
## logical(0)
logical(0)
## logical(0)
# Even:
(a == NULL) == logical(0)
## logical(0)
Instead, we are forced to check the length of these values.
length(NULL)
## [1] 0
length(integer(0))
## [1] 0
length(numeric(0))
## [1] 0
length(logical(0))
## [1] 0
This trick works for an empty matrix as well.
length(mymatrix[!1:3, !1:4])
## [1] 0
Though dim is not a bad choice either.
dim(mymatrix[!1:3, !1:4])
## [1] 0 0
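Since these zero-length values usually show up as the result of filtering or matching, a length check makes a dependable guard. A small sketch with made-up values:
x <- c(2, 4, 6)
hits <- which(x > 10)  # nothing qualifies
hits
## integer(0)
if (length(hits) == 0) message("nothing found")
## nothing found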
6. Attach Does Not Do What You Want It to Do
Attach and detach are functions that ostensibly promise to let a user rapidly reference, use, and modify an active data set, much as in Stata or SPSS.
However, this is a mistake. IT IS NOT the solution to making R work more like Stata. When a data frame is "attached" in R, its columns become directly accessible by name.
mydata <- data.frame(a1 = 1:30, b1 = 31:60)
attach(mydata)
b1
##  [1] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## [24] 54 55 56 57 58 59 60
However, say you want to create a new variable in the data set:
c1 <- a1 + b1
Then you are going to have a problem: sure, you just created c1, but c1 is not part of mydata.
names(mydata)
## [1] "a1" "b1"
You may add c1 to mydata:
mydata$c1 <- c1
names(mydata)
## [1] "a1" "b1" "c1"
But the c1 in mydata is a different object from the c1 in working memory.
mydata$c1 <- mydata$c1 * -1
head(mydata)
##   a1 b1  c1
## 1  1 31 -32
## 2  2 32 -34
## 3  3 33 -36
## 4  4 34 -38
## 5  5 35 -40
## 6  6 36 -42
head(c1)
## [1] 32 34 36 38 40 42
Which is not necessarily a problem, since we can drop the working-memory c1.
rm(c1)
But c1 is still not available, even though mydata is still attached.
try(c1)
Which means we "attached" our mydata object back when mydata did not include c1; attach works on a copy, so the attached version never sees later changes. To update the values we need to reattach mydata, but first we need to detach the old copy:
detach(mydata)
attach(mydata)
head(c1)
## [1] -32 -34 -36 -38 -40 -42
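If you want to see where that attached copy lives, search() lists it on R's search path; a hedged sketch (output abridged, and the exact packages shown will vary):
search()
## [1] ".GlobalEnv"   "mydata"   ...   "package:base"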
Thus we realize that attach is not a workable solution for handling data. Instead we are forced to use R's more elaborate subscripting to manipulate our data, such as:
mydata$c1 <- mydata$a1 + mydata$b1
This is not unworkable, though it can be annoying. There are shortcuts in R for specifying a data frame or list to work from (sketched below), but they end up facing challenges similar to attach and can ultimately produce unnecessarily complex code.
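For reference, here is a hedged sketch of the kind of shortcut meant here, using base R's with and transform and reusing the example data from above:
# evaluate an expression with mydata's columns in scope, without attach
with(mydata, a1 + b1)
# build (or rebuild) a column and return the modified data frame in one step
mydata <- transform(mydata, c1 = a1 + b1)
names(mydata)
## [1] "a1" "b1" "c1"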
7. R Function Help Is Grouped
For instance, ?gsub returns a single help file covering seven related functions: grep, grepl, sub, regexpr, gregexpr, and regexec, as well as gsub itself. All of them share some commonalities with gsub, yet each does something somewhat different. I can understand why the documentation might have been organized this way originally, since I imagine it was rather terse at first (as some help files persist in being). As the help files got fleshed out, the functions just never got broken out of their clusters.
It is worth noting that I have never seen this kind of function clumping in the help files of contributed R packages. For example, library(ggplot2); ?ggplot; ?geom_point all return help files specific to a single function apiece.