
Introduction to missing data (NAs) in R

[This article was first published on R on R (for ecology), and kindly contributed to R-bloggers.]

As many of us know, science is not a perfect process. Maybe you can’t get out in the field on a certain day. Maybe you can only sample a portion of what needs to get done. Or maybe you’re downloading public data sets and they aren’t lining up perfectly. All of these can result in missing data, which can be a real pain when it comes time for analysis.

Another common source of missing data, especially when recording species abundance data in community ecology, is when you forget to write a ‘0’ and instead leave the entry blank. In the moment you might know that blank entries mean zero, but give it just a few weeks and you’ll be scratching your head! In those cases it’s often best to label those entries as unknown or missing.

In this tutorial, I’m going to explain what exactly an NA value is, how you can find NAs in your data, and how you can remove them.

What does it mean to have NAs in my data?

NAs represent missing values in R. This is pretty common if you’re importing data from Excel and have some empty cells in the spreadsheet. When you load the data into R, the empty cells will be populated with NAs.

Note: missing data points, or those where you don’t actually know what the true value should be, are marked as NA (which stands for ‘Not Available’) in R. In fact, in editors such as RStudio you’ll notice the color change when you type NA in your code, since NA is a reserved word in R.
# Read in an example data set with NAs
ex <- read.csv("example_data.csv")
# View data
ex

##   example data set
## 1       1    2   4
## 2      NA    2   4
## 3      16    1   4
## 4       2   NA   5
## 5       3    1  NA
## 6       6    7   8

Click here to download the example_data.csv file if you want to follow along.
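
If you don't have the file handy, here is a minimal sketch of how blank cells turn into NAs on import. The inline CSV text and its column names (site, count, notes) are made up for illustration. One thing to watch for: by default, read.csv() turns blank cells into NA in numeric columns, but in text columns a blank cell is read as an empty string unless you ask for it to be treated as missing with the na.strings argument.

```r
# Hypothetical inline CSV standing in for a spreadsheet export with blank cells
csv_text <- "site,count,notes
A,3,clear
B,,windy
C,5,"

# Default behaviour: the blank numeric cell becomes NA,
# but the blank text cell is read as an empty string ""
default_read <- read.csv(text = csv_text)
is.na(default_read$count)
is.na(default_read$notes)

# Treat blank cells as NA in every column, including text columns
strict_read <- read.csv(text = csv_text, na.strings = c("", "NA"))
is.na(strict_read$notes)
```

If your blank cells really mean "missing", setting na.strings this way keeps text and numeric columns consistent.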

NAs cannot be treated like other types of data (e.g., strings, numeric values). For example, you can’t perform math with them or compare them to other values and expect a definite answer. If you try, all you’ll get is an NA. In the following examples, all positions in the vector with NA just return NA again, no matter what operation is performed. We also get NA if we use mathematical functions such as sum() on the vector, because the total depends on values that R doesn’t know.

# Create a vector with NAs
v <- c(1.2, 4.5, NA, 8.9, NA)
# Can we do math with NAs?
v + 1

## [1] 2.2 5.5 NA 9.9 NA

sum(v)

## [1] NA

# Can we perform logical comparisons?
v < 7

## [1] TRUE TRUE NA FALSE NA

v == 4.5

## [1] FALSE TRUE NA FALSE NA

And the reason, of course, is simple: what’s the answer to 5 + 'some unknown number'?

Have you figured it out yet?

The answer is 'some unknown number'! 😄

Thus: 5 + NA = NA

How can I detect NAs in my data?

So how can we see if we have NAs in our data? We normally use == to see if a value is equal to another one. Let’s see if that will work on our vector. We know that there are NAs in the 3rd and 5th positions of our vector.

# Create a vector with NAs
v <- c(1.2, 4.5, NA, 8.9, NA)

So theoretically, v == NA should return FALSE FALSE TRUE FALSE TRUE.

# Are there any NAs in our vector?
v == NA

## [1] NA NA NA NA NA

But this code just gives us NAs. Unfortunately, NAs don’t work with comparison operators such as == either.

Just as with math operations, NA is a placeholder for 'I don't know the real value'. Asking whether NA == NA is the same as asking whether 'some unknown number' == 'some unknown number', which clearly has no known answer.
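
A small sketch of the same idea, using only base R (the variable names are just for illustration). Comparisons with NA come back as NA, but R will give a definite answer in the few cases where the unknown value can't change the outcome:

```r
# Comparing with an unknown value gives an unknown answer
eq_result <- NA == NA
eq_result

# "unknown AND FALSE" is FALSE whatever the unknown value is
and_false <- NA & FALSE
and_false

# "unknown OR TRUE" is TRUE whatever the unknown value is
or_true <- NA | TRUE
or_true

# "unknown AND TRUE" depends on the unknown value, so it stays NA
and_true <- NA & TRUE
and_true
```

The takeaway is the same: whenever the result depends on the missing value, you get NA back.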

Luckily, R gives us a special function to detect NAs. This is the is.na() function. And actually, if you type my_vector == NA in an editor such as RStudio, its code diagnostics may flag the line and suggest is.na() instead.

is.na() will work on individual values, vectors, lists, and data frames. It returns TRUE wherever a value is NA and FALSE everywhere else.

# Which values in my vector are NA?
is.na(v)

## [1] FALSE FALSE TRUE FALSE TRUE

# Which values in my data frame are NA?
is.na(ex)

##      example  data   set
## [1,]   FALSE FALSE FALSE
## [2,]    TRUE FALSE FALSE
## [3,]   FALSE FALSE FALSE
## [4,]   FALSE  TRUE FALSE
## [5,]   FALSE FALSE  TRUE
## [6,]   FALSE FALSE FALSE

You can also combine is.na() with sum() and which() to figure out how many NAs you have and where they’re located.

# How many NAs in my data frame?
sum(is.na(ex))

## [1] 3

# Which row contains an NA in the 'data' column?
which(is.na(ex$data))

## [1] 4

# Which vector positions contain NAs?
which(is.na(v))

## [1] 3 5
Note: sum(is.na(ex)) works because is.na() returns TRUE or FALSE for every value, and math functions such as sum() automatically treat TRUE as 1 and FALSE as 0.
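
If you want a per-column breakdown rather than one overall count, the same trick extends naturally. This is a minimal sketch that rebuilds the example data frame by hand from the printed output (an assumption, since the original CSV isn't shown here) and uses the base R functions colSums(), which() with arr.ind = TRUE, and anyNA():

```r
# Rebuild the example data frame from the printed output (values assumed from it)
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Number of NAs in each column
na_per_column <- colSums(is.na(ex))
na_per_column

# Row and column position of every NA in the data frame
na_positions <- which(is.na(ex), arr.ind = TRUE)
na_positions

# Quick yes/no check: does this column contain any NA at all?
anyNA(ex$data)
```

colSums() is handy for spotting which variables are the patchiest before deciding what to do about them.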

How do I remove NAs from my data?

Now that we know we have NAs in our data… how do we get rid of them?

Some functions have an easy built-in argument, na.rm, which you can set to TRUE to ignore NAs in the data being evaluated (it defaults to FALSE). If you remember the example from earlier, just running sum(v) returned NA. Adding na.rm = TRUE fixes this:

# Sum across vector v
sum(v, na.rm = TRUE)

## [1] 14.6

# Take the mean of our vector v
mean(v, na.rm = TRUE)

## [1] 4.866667
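
To make it clearer what na.rm = TRUE is doing, here is a small sketch that computes the same mean by hand. The names n_known and mean_by_hand are just illustrative. The NAs are dropped only inside the calculation, so the mean is divided by the number of known values, and the vector itself is left untouched:

```r
v <- c(1.2, 4.5, NA, 8.9, NA)

# Count the values that are not missing
n_known <- sum(!is.na(v))

# Same result as mean(v, na.rm = TRUE): divide by the known values only
mean_by_hand <- sum(v, na.rm = TRUE) / n_known
mean_by_hand

# Other summary functions accept the same argument
max(v, na.rm = TRUE)
median(v, na.rm = TRUE)

# The original vector is unchanged and still contains its NAs
v
```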
Note that the decision to remove or replace missing values, rather than leaving them as-is, is both a technical and a philosophical one, and should be made on a case-by-case basis. There are statistical methods for replacing missing values without biasing the outcome of analyses (e.g., in multivariate ordination analyses). Many statistical tests in R will automatically remove NA values, but in other cases it makes more sense to remove them manually. The details are beyond the scope of this post, but it’s an important point to keep in mind.

If you want to remove all observations containing NAs, you can also use the na.omit() function. Keep in mind that removing an observation means removing the entire row of data.

# remove NAs from our data frame
na.omit(ex)

##   example data set
## 1       1    2   4
## 3      16    1   4
## 6       6    7   8
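
Sometimes dropping every row with any NA is more than you need. As a sketch (again rebuilding the example data frame by hand, which is an assumption based on the printed output), you can use complete.cases() to get the same rows as na.omit(), or combine is.na() with subsetting to drop rows only when one particular column is missing:

```r
# Rebuild the example data frame from the printed output (values assumed from it)
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# complete.cases() is TRUE for rows without any NA, so this keeps
# the same rows as na.omit(ex)
complete_rows <- ex[complete.cases(ex), ]
complete_rows

# Drop rows only where the 'data' column is missing;
# NAs in the other columns are kept
known_data <- ex[!is.na(ex$data), ]
known_data
```

The second approach keeps five of the six rows instead of three, which can matter a lot when observations are hard to come by.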

Something else you might want to do is replace those NAs with another value. Maybe you want to replace missing values with 0 (You’re 200% sure those missing values were supposed to be 0s?? 😄), or maybe you want to replace those missing values with the mean of your data to approximate what those values would be (that can be especially useful for multivariate analyses). You can subset your vector or data frame to the places where is.na() is true, and set those equal to a new value.

# Replace NAs in data frame with 0
ex[is.na(ex)] <- 0
# View data frame
ex

##   example data set
## 1       1    2   4
## 2       0    2   4
## 3      16    1   4
## 4       2    0   5
## 5       3    1   0
## 6       6    7   8

# Replace NAs in vector with the mean
v[is.na(v)] <- mean(v, na.rm = TRUE)
# View vector
v

## [1] 1.200000 4.500000 4.866667 8.900000 4.866667
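
The same mean-replacement idea can be applied to a data frame one column at a time. This is only a sketch: fill_with_mean() is a hypothetical helper written for this example, not a built-in function, and the data frame is rebuilt by hand from the printed output earlier in the post.

```r
# Rebuild the example data frame from the printed output (values assumed from it)
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Hypothetical helper: replace the NAs in one column with that column's mean
fill_with_mean <- function(column) {
  column[is.na(column)] <- mean(column, na.rm = TRUE)
  column
}

# Apply the helper to every column; ex_filled[] keeps the data frame structure
ex_filled <- ex
ex_filled[] <- lapply(ex, fill_with_mean)
ex_filled
```

Working on a copy (ex_filled) leaves the original ex untouched, so you can still see where the missing values were.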

Awesome! Now you know how to find NAs in your data, perform functions without letting NAs get in the way, and remove NAs from your data for further analysis. Soon these functions will come to you NAturally…haha. I hope you found this tutorial helpful. Happy coding!

P.S. I’d recommend listening to this song to put you in the NA-removing mood!


