Representation of numerical NA’s in R and the 1954 enigma
I’ve always wondered how exactly the missing value (NA) in R is represented under the hood. Last weekend I was working on a little project that gave me enough excuse to spend some time on finding this out. So, I descended into the catacombs of R and came back with some treasure. In short:
- A missing integer is represented by the most negative signed 32-bit (4-byte) integer that can be represented by your computer.
- A missing double (real number) is represented by a special version of the default NaN (Not a Number) of the IEEE standard. A special role is given to the number 1954 here, but why?
Read on if you want to dig a little deeper.
Missing integers
As you may know, a lot of R’s core is written in the C language. However, an int variable in C does not support the concept of a missing value. So what happens in R is that a single value of the integer range is set aside to represent a missing value. In this case it is INT_MIN (a C macro from limits.h), which gives the most negative value that can be represented by an int variable in C. On most computers, an int variable is 32 bits (four 8-bit bytes) wide. To make things easier, we’ll assume that’s always the case here. Since 1 bit is reserved for the sign, the range of representable numbers is
![[-2^{31},\: 2^{31}-1] =[-2147483648,\: 2147483647].](https://i0.wp.com/www.markvanderloo.eu/wp-content/plugins/latex/cache/tex_f1dc2188d58d11a89198540b47dcc83e.gif?w=578)
(The range is asymmetric because 0 occupies the place of one positive number).
Now let’s compare this with R’s integer range. The maximum integer is easily found, since it is stored in the built-in .Machine variable.
```
> .Machine$integer.max
[1] 2147483647
```
So this corresponds with C’s INT_MAX. The most negative integer is not present in .Machine, but we can do some tests:
```
# Store the second-smallest C integer. The trailing L forces the number
# to be "integer", not "numeric".
> x <- -2147483647L
> typeof(x)
[1] "integer"
# Adding an integer works fine, since we move further into the range:
> typeof(x + 1L)
[1] "integer"
# Subtracting an integer gives a warning telling us that the result is out of range:
> typeof(x - 1L)
[1] "integer"
Warning message:
In x - 1L : NAs produced by integer overflow
# Subtracting a non-integer 1 ("numeric") yields a non-integer:
> typeof(x - 1)
[1] "double"
```
The result is out of R’s integer range. The integer range of R is $[-2^{31}+1,\: 2^{31}-1] = [-2147483647,\: 2147483647]$: one integer less than you get in C. So by sacrificing only one of your four billion two hundred ninety-four million nine hundred sixty-seven thousand two hundred ninety-six integers, you get the truly awesome feature of computing with missing values.
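If you want to peek at the bit pattern yourself, here is a minimal sketch in base R. It assumes the 32-bit int from above and uses writeBin() to dump the raw bytes of a value, most significant byte first; the variable names are mine and only there for illustration.

```r
# Raw bytes of the missing integer, most significant byte first.
na_bytes <- writeBin(NA_integer_, raw(), endian = "big")
print(na_bytes)   # [1] 80 00 00 00, the bit pattern of INT_MIN

# Bytes of the smallest non-missing integer, -(2^31 - 1), for comparison.
min_bytes <- writeBin(-.Machine$integer.max, raw(), endian = "big")
print(min_bytes)  # [1] 80 00 00 01

# Stepping one below that value overflows into NA (the warning is silenced here).
below_min <- suppressWarnings(-.Machine$integer.max - 1L)
is.na(below_min)  # TRUE
```

The two byte patterns differ only in the last bit, which is what you would expect if NA sits exactly at INT_MIN.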
Missing doubles
To explain how missing real values are represented, we first need to spend a few words on the double type. A double is short for double precision, and it is the variable type used to represent (approximations to) the real numbers in a computer.
Basically, a double represents a rounded real number in the following notation (see also the Wikipedia article):

$$x = (-1)^{s} \times 1.m \times 2^{e-1023}.$$
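To make the notation a bit more tangible, here is a rough sketch that takes a double apart in base R. It assumes the standard IEEE 754 layout (1 sign bit, 11 exponent bits with a bias of 1023, 52 mantissa bits) and a normalized number; decode_double is a hypothetical helper written for this post, not something R provides.

```r
# Hypothetical helper: split a double into sign, biased exponent and mantissa.
decode_double <- function(x) {
  # The 8 bytes of x as integers 0..255, most significant byte first.
  b <- as.integer(writeBin(x, raw(), endian = "big"))
  list(
    # 1 sign bit: the top bit of byte 1.
    sign = bitwShiftR(b[1], 7L),
    # 11 exponent bits: low 7 bits of byte 1, then high 4 bits of byte 2.
    exponent = bitwAnd(b[1], 127L) * 16L + bitwShiftR(b[2], 4L),
    # 52 mantissa bits: low 4 bits of byte 2 followed by bytes 3 to 8,
    # accumulated as a double because the value can exceed the integer range.
    mantissa = sum(c(bitwAnd(b[2], 15L), b[3:8]) * 2^(8 * (6:0)))
  )
}

d <- decode_double(-6.5)

# Put the pieces back together: -6.5 = (-1)^1 * 1.625 * 2^2,
# so we expect sign = 1 and exponent = 2 + 1023 = 1025.
value <- (-1)^d$sign * (1 + d$mantissa / 2^52) * 2^(d$exponent - 1023)
value  # -6.5
```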
The sign $s$ is represented by 1 bit, the exponent $e$ by 11 bits and the mantissa $m$ by 52 bits, so we have 64 bits in total. The special value NaN (and also Inf) is coded using values of $e$ that are not used to represent numbers. NaN is represented by $e =$ 0x7ff (hexadecimal) and $m \neq 0$. The important thing is that it does not matter which nonzero value $m$ takes when representing NaN. This leaves developers with lots of room in the mantissa to give different meanings to NaN. In R the developers chose to store the value 1954 in the mantissa to represent NA. A C-level function called R_IsNA detects the 1954 in NaN values.
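You can look for the 1954 from within R as well. The sketch below is a minimal illustration under a few assumptions: it relies on the 1954 being stored in the lower 32 bits of the mantissa (which is what R_IsNA checks), the helper names double_bytes, low_word and exponent_bits are hypothetical, and the exact bytes may differ on unusual platforms.

```r
# Hypothetical helper: the 8 bytes of a double, most significant byte first.
double_bytes <- function(x) writeBin(x, raw(), endian = "big")

# Hypothetical helper: the lower 32 bits of the mantissa, read as an integer.
low_word <- function(x) {
  readBin(double_bytes(x)[5:8], what = "integer", size = 4L, endian = "big")
}

# Hypothetical helper: the 11 exponent bits as a number (0x7ff is 2047).
exponent_bits <- function(x) {
  b <- as.integer(double_bytes(x)[1:2])
  bitwAnd(b[1], 127L) * 16L + bitwShiftR(b[2], 4L)
}

print(double_bytes(NA_real_))       # the last two bytes should read 07 a2 (hex for 1954)
na_low  <- low_word(NA_real_)       # expected: 1954
nan_low <- low_word(NaN)            # expected: 0, an ordinary NaN carries no 1954
na_exp  <- exponent_bits(NA_real_)  # expected: 2047, i.e. 0x7ff

# R itself tells the two apart: NA counts as missing, but not as NaN.
c(is.na(NA_real_), is.nan(NA_real_), is.na(NaN), is.nan(NaN))
```

So at the R level, is.nan() is FALSE for NA even though NA is a NaN as far as the IEEE standard is concerned.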
A funny question is why the R developers chose 1954. Any ol’ number would have been fine. Was it because:
- It’s the year of birth of one of the developers? (I couldn’t find a match here)
- Alan Turing died in 1954? (macabre)
- President Eisenhower met with aliens in 1954? (ehm…)
- Queen Elizabeth II became the first reigning monarch to visit Australia in 1954? (well…)
Leave an answer in the comments if you have a better idea…