I’ve always wondered how exactly the missing value (NA) in R is represented under the hood. Last weekend I was working on a little project that gave me enough excuse to spend some time on finding this out. So, I descended into the catacombs of R and came back with some treasure. In short:
- A missing integer is represented by the most negative (4-byte) signed integer that can be represented by your computer.
- A missing double (real number) is represented by a special version of the default NaN (Not a Number) of the IEEE 754 floating-point standard. A special role is given to the number 1954 here, but why?
Read on if you want to dig a little deeper.
Missing integers
As you may know, a lot of R’s core is written in the C language. However, an int variable in C does not support the concept of a missing value. So, what happens in R is that a single value of the integer range is set aside to represent a missing value. In this case it is INT_MIN (a C macro from limits.h), which gives the most negative value that can be represented by an int variable in C. On most computers, an int variable will be 32 bits (four 8-bit bytes). To make things easier, we’ll assume that’s always the case here. Since 1 bit is reserved for the sign, the range of representable numbers is

$[-2^{31},\ 2^{31}-1] = [-2147483648,\ 2147483647]$

(The range is asymmetric because 0 occupies the place of one positive number.)
Now let’s compare this with R’s integer range. The maximum integer is easily found, since it is stored in the hidden .Machine variable.
    > .Machine$integer.max
    [1] 2147483647
So this corresponds with C’s INT_MAX. The largest negative integer is not present in .Machine, but we can do some tests:
    # store the one-but-least C integer. The L at the end forces the number
    # to be "integer", not "numeric"
    > x <- -2147483647L
    > typeof(x)
    [1] "integer"

    # adding an integer works fine, since we move further into the range:
    > typeof(x+1L)
    [1] "integer"

    # subtracting an integer gives a warning telling us that the result is out of range:
    > typeof(x-1L)
    [1] "integer"
    Warning message:
    In x - 1L : NAs produced by integer overflow

    # subtracting a non-integer 1 ("numeric") yields a non-integer:
    > typeof(x-1)
    [1] "double"
The result of x - 1L would be -2147483648, which is out of R’s integer range. So the integer range of R is C’s int range with INT_MIN left out: -2147483647 to 2147483647. By sacrificing only one of your four billion two hundred ninety-four million nine hundred sixty-seven thousand two hundred ninety-six integers, you get the truly awesome feature of computing with missing values.
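The claim that a missing integer is stored as INT_MIN can be checked from within R by looking at raw bytes. The snippet below is a minimal sketch, assuming a 32-bit int as in the text; it uses only writeBin and readBin from base R, and the variable names are made up for illustration.

```r
# Bytes of NA_integer_, most significant byte first.
# endian = "big" only fixes the order in which the bytes are listed.
na_bytes <- writeBin(NA_integer_, raw(), endian = "big")
print(na_bytes)

# Going the other way: read the bit pattern of INT_MIN (0x80000000)
# back as an integer and ask R whether it considers it missing.
int_min <- readBin(as.raw(c(0x80, 0x00, 0x00, 0x00)),
                   what = "integer", endian = "big")
print(is.na(int_min))

# For comparison: the smallest non-missing integer, -2147483647.
smallest_bytes <- writeBin(-.Machine$integer.max, raw(), endian = "big")
print(smallest_bytes)

# Arithmetic on the missing value stays missing.
print(NA_integer_ + 1L)
```

If the assumption holds, the first and third byte patterns should differ only in the last bit, which is exactly the one value R gives up.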
Missing doubles
To explain how real (double) NAs are represented, we first need to look at the double type. A double is short for double precision, and it is the variable type used to represent (approximations to) the real numbers in a computer.
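Basically, a double represents a rounded real number in the following notation (see also the Wikipedia article):

$(-1)^{s} \times 1.m \times 2^{e-1023}$

The sign $s$ is represented by 1 bit, the exponent $e$ by 11 bits, and the mantissa $m$ by 52 bits. NaN (and also Inf) is coded using values of $e$ that are not used for ordinary numbers. Specifically, NaN is represented by $e$ = 0x7ff (hexadecimal) and $m \neq 0$, while Inf has $e$ = 0x7ff and $m = 0$. This leaves developers with lots of room in the mantissa to give different meanings to NaN. In R the developers chose to put 1954 in the lower 32 bits of the mantissa to represent NA. A C-level function called R_IsNA detects the 1954 in NaN values.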
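The same byte-level trick works for doubles. The sketch below splits a double into sign, exponent and the lower 32 bits, and then mimics the check described above in plain R. Both double_fields and looks_like_na are hypothetical helpers written for this illustration; they are not part of R, and the real check lives in R's C code.

```r
# Split a single double into the pieces of the IEEE layout.
# Hypothetical helper for illustration; not part of R itself.
double_fields <- function(x) {
  bytes <- writeBin(x, raw(), endian = "big")   # 8 bytes, most significant first
  high  <- as.integer(bytes[1:4])               # each byte as a number 0..255
  list(
    sign     = high[1] %/% 128L,                # top bit of the first byte
    # next 11 bits: low 7 bits of byte 1, then the top 4 bits of byte 2
    exponent = (high[1] %% 128L) * 16L + high[2] %/% 16L,
    # lower 32 bits of the mantissa, decoded as a signed integer
    low_word = readBin(bytes[5:8], what = "integer", endian = "big")
  )
}

# Mimics the test described above: exponent all ones and 1954 in the low word.
# Assumption: x is a single double, not a longer vector.
looks_like_na <- function(x) {
  f <- double_fields(x)
  f$exponent == 2047L && identical(f$low_word, 1954L)
}

str(double_fields(NA_real_))
print(looks_like_na(NA_real_))
print(looks_like_na(NaN))
print(looks_like_na(1.5))

# Build two NaNs by hand: same exponent (0x7ff), low words 1954 and 1955.
nan_1954 <- readBin(as.raw(c(0x7f, 0xf0, 0, 0, 0, 0, 0x07, 0xa2)),
                    what = "double", endian = "big")
nan_1955 <- readBin(as.raw(c(0x7f, 0xf0, 0, 0, 0, 0, 0x07, 0xa3)),
                    what = "double", endian = "big")
print(c(is.na(nan_1954), is.nan(nan_1954)))
print(c(is.na(nan_1955), is.nan(nan_1955)))
```

Note that is.na answers TRUE for any NaN, while is.nan is the one that tells the 1954 payload apart from the rest, so the last two lines should differ only in their second element.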
A funny question is why did the R developers choose 1954? Any ol’ number would have been fine. Was it because
- It’s the year of birth of one of the developers? (I couldn’t find a match here)
- Alan Turing died in 1954? (macabre)
- President Eisenhower met with aliens in 1954? (ehm…)
- In 1954 Queen Elizabeth II became the reigning monarch of Australia? (well…)
Leave an answer in the comments if you have a better idea…