R : NA vs. NULL
It is common for programming languages to have a NULL value. What often leads to confusion is that NULL can have two distinct meanings. In the first, NULL represents a missing or undefined value. This is well appreciated in SQL. In the second, NULL is the logical representation of a statement that is neither TRUE nor FALSE. This indeterminacy is the basis for ternary logic. While these meanings are distinct, they are very often related. When missing values (the first meaning) are evaluated, the desired result is often an indeterminate one (the second). That is, the former implies the latter. In programming, the distinction is often glossed over as unnecessary, and the two concepts become conflated.
The R language has two closely related NULL-like values, NA and NULL. Both are fully supported in the language by core functions (e.g., is.na, is.null, as.null). And, while only NA is used in the logical sense, both are used to represent missing or undefined values. This has led to much confusion. Here’s what the R documentation has to say:
NULL represents the null object in R: it is a reserved word.
NULL is often returned by expressions and functions whose values are
undefined.
NA is a logical constant of length 1 which contains a missing
value indicator. NA can be freely coerced to any other vector
type except raw. There are also constants NA_integer_,
NA_real_, NA_complex_ and NA_character_ of the other atomic
vector types which support missing values: all of these are
reserved words in the R language.
There is a lot of subtlety in the treatment of these values. A good way to understand the distinction between NA and NULL is through some examples:
NA:

> NA
[1] NA
> class(NA)
[1] "logical"
> NA > 1
[1] NA

NULL:

> NULL
NULL
> class(NULL)
[1] "NULL"
> NULL > 1
logical(0)
The important distinction is that NA is a ‘logical’ value that, when evaluated in an expression, yields NA. This is the expected behavior of a value that handles logical indeterminacy. NULL is its own thing: evaluated in an expression, it yields no value at all, only a zero-length result such as logical(0), which is not how we would want or expect NA to work.
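To make the ternary logic concrete, here is a minimal sketch using only base R operators. It is meant as an illustration of how NA propagates, not an exhaustive truth table:

```r
# NA propagates unless the other operand already decides the answer
NA & FALSE   # FALSE: false regardless of the unknown value
NA | TRUE    # TRUE: true regardless of the unknown value
NA & TRUE    # NA: the result depends on the unknown value
NA == NA     # NA: use is.na() to test for missingness instead
is.na(NA)    # TRUE

# NULL has length zero, so a comparison returns an empty logical vector
length(NULL > 1)   # 0
is.null(NULL)      # TRUE
```

Because NA == NA is itself NA, is.na() is the reliable test for a missing value, just as is.null() is the test for NULL.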
To delve deeper into the behavior, we must look at how R’s basic data structures behave: vectors (including matrices and arrays) and lists (including data.frames). Vectors and lists are similar structures; both allow for multiple values and have similar accessors. There are subtle differences in their treatment of NA and NULL. Let’s take a look at how they compare:
Vectors (incl. matrices and arrays):

> v <- c( 1, NA, NULL)
> v
[1] 1 NA

Lists (incl. data frames):

> list(1, NA, NULL)
[[1]]
[1] 1

[[2]]
[1] NA

[[3]]
NULL
What happened? NULL is not allowed in a vector. When you attempt to set it as a value in a vector, it is quietly ignored. This is because NULL is an object and type of its own. NULL does not have typed variants such as NULL_integer_. There is just NULL. By contrast, NA has NA_integer_, etc., and happily coexists with any of the basic vector types (except raw). So for any vector (matrix or array), NA represents a missing value. NULL does not.
Now, let’s look at the list example. This is interesting! Unlike the vector, the list can hold objects and values other than the basic types. This includes the NULL value/object. Perhaps a little inconsistent and not what we would expect. But from here, things get a little quirky. Let’s try value assignment:
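A short sketch of the typed NA constants and of how c() treats NULL, using only base R:

```r
# Each atomic type (except raw) has its own NA constant
typeof(NA)              # "logical"
typeof(NA_integer_)     # "integer"
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"

# Inside a vector, NA is coerced to the vector's type; NULL adds nothing
v <- c(1, NA, NULL)
typeof(v)   # "double"
length(v)   # 2: NULL contributed no element
is.na(v)    # FALSE TRUE
```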
Vectors (incl. matrices and arrays):

> v[[1]] <- NULL
Error in v[[1]] <- NULL :
  more elements supplied than there are to replace

Lists (incl. data frames):

> li <- list( 1, 2, 3 )
> li[[1]] <- NULL
> li
[[1]]
[1] 2

[[2]]
[1] 3
Sure enough, NULL cannot be assigned to a vector. So for all practical purposes, NA in a basic vector behaves like NULL in other languages. NULL is almost never what you want. On the list side, however, we see an idiom of NULL: assigning NULL to a list item removes it. This behavior is a bit unexpected, but it is the idiom.
There is one final idiom to know about NULL and lists: trying to access a list element by a non-existent name yields NULL.
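Since assigning NULL deletes a list item, the natural question is how to store a NULL on purpose. One way, sketched below in base R, is to assign list(NULL) with single brackets:

```r
li <- list(1, 2, 3)

# Double-bracket assignment of NULL deletes the element
li[[1]] <- NULL
length(li)         # 2

# Single-bracket assignment of list(NULL) keeps the slot and stores NULL
li[1] <- list(NULL)
length(li)         # 2
is.null(li[[1]])   # TRUE
```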
> li$aa
NULL
> li[['aa']]
NULL
(Note: the same is true for trying to access non-existent objects in an environment.)
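One consequence is that a NULL result alone cannot tell you whether a name is absent or present with a NULL value. A minimal sketch of how to tell the two apart; the list li and the environment e here are hypothetical examples, not objects from earlier in the post:

```r
li <- list(a = 1, b = NULL)   # hypothetical list; b is stored as NULL

is.null(li$b)    # TRUE: the name exists, but its value is NULL
is.null(li$aa)   # TRUE: the name does not exist at all

# names() distinguishes the two cases
"b"  %in% names(li)   # TRUE
"aa" %in% names(li)   # FALSE

# Environments return NULL for unknown names too; exists() checks presence
e <- new.env()
is.null(e$aa)                               # TRUE
exists("aa", envir = e, inherits = FALSE)   # FALSE
```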
R does not have a consistent or intuitive way of dealing with missing and logically ambiguous values, i.e., the two meanings from the beginning of this post. For vectors and basic variables, R mimics other languages and uses NA. For lists, however, the syntax is more idiomatic. It is this latter case that presents difficulty. R has other quirks too. But all languages have quirks, and given R’s strength for statistical analysis, I have found no better tool for this.