Introduction to missing data (NAs) in R
As many of us know, science is not a perfect process. Maybe you can’t get out in the field on a certain day. Maybe you can only sample a portion of what needs to get done. Or maybe you’re downloading public data sets and they aren’t lining up perfectly. All of these can result in missing data, which can be a real pain when it comes time for analysis.
Another common source of missing data, especially when recording species abundance data in community ecology, is when you forget to write a ‘0’ and instead leave the entry blank. In the moment you might know that blank entries mean zero, but give it just a few weeks and you’ll be scratching your head! In those cases it’s often best to label those entries as unknown or missing.
In this tutorial, I’m going to explain what exactly an NA value is, how you can find NAs in your data, and how you can remove them.
What does it mean to have NAs in my data?
NAs represent missing values in R. This is pretty common if you’re importing data from Excel and have some empty cells in the spreadsheet. When you load the data into R, the empty cells in numeric columns will be populated with NAs (in text columns, blank cells are read as empty strings unless you tell R otherwise).
Missing values are represented by NA (which stands for ‘Not Available’) in R. In fact, you’ll notice the color change when you type NA in your code editor, since NA is a reserved word and R already knows what that means.
# Read in an example data set with NAs
ex <- read.csv("example_data.csv")

# View data
ex
##   example data set
## 1       1    2   4
## 2      NA    2   4
## 3      16    1   4
## 4       2   NA   5
## 5       3    1  NA
## 6       6    7   8
Click here to download the example_data.csv file if you want to follow along.
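If you don't have the file handy, here is a minimal sketch that rebuilds the same data frame from a string. The CSV contents are an assumption reconstructed from the printed output above, and the names ex_demo, coded, and the -999 code are made up for illustration.

```r
# Assumed file contents, reconstructed from the printed data frame above
csv_text <- "example,data,set
1,2,4
,2,4
16,1,4
2,,5
3,1,
6,7,8"

# text = lets read.csv() parse a string instead of a file on disk
ex_demo <- read.csv(text = csv_text)
ex_demo

# na.strings lists the codes that should be read as missing;
# -999 is a made-up placeholder used only for this illustration
coded <- read.csv(text = "count\n5\n-999\n7", na.strings = c("NA", "-999"))
coded$count
```

The blank cells in the numeric columns come back as NA, and so does the -999 entry once it is listed in na.strings.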
NAs cannot be treated like other types of data (e.g., strings, numeric values). For example, you can’t perform math with them or use them in logical comparisons. If you do so, all you’ll get is an NA. In the following examples, all positions in the vector with NA just return NA again, whichever operation is performed. We also get NA if we use mathematical functions such as sum() on the vector, because R can’t add NAs.
# Create a vector with NAs
v <- c(1.2, 4.5, NA, 8.9, NA)

# Can we do math with NAs?
v + 1
## [1] 2.2 5.5  NA 9.9  NA
sum(v)
## [1] NA

# Can we perform logical comparisons?
v < 7
## [1]  TRUE  TRUE    NA FALSE    NA
v == 4.5
## [1] FALSE  TRUE    NA FALSE    NA
And the reason of course is simple… What’s the answer to 5 + 'some unknown number'?
Have you figured it out yet?
The answer is 'some unknown number'! 😄
Thus: 5 + NA = NA
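The same 'unknown number' reasoning also explains a handful of cases where R does give a definite answer: when the result would be the same for every possible value, the NA doesn't matter. A small sketch (the variable names are made up for illustration):

```r
# The result is known whatever the missing value turns out to be
and_false <- NA & FALSE   # FALSE: anything AND FALSE is FALSE
or_true   <- NA | TRUE    # TRUE: anything OR TRUE is TRUE
pow_zero  <- NA^0         # 1: R defines y^0 as 1 for every y

# The result depends on the missing value, so it stays NA
and_true  <- NA & TRUE
times_two <- NA * 2

and_false
or_true
pow_zero
and_true
times_two
```

These are edge cases. For everyday arithmetic and comparisons, expect NA in and NA out.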
How can I detect NAs in my data?
So how can we see if we have NA
s in our data? We normally use ==
to see if a value is equal to another one. Let’s see if that will work on our vector. We know that there’s an NA
in the 3rd position of our vector.
# Create a vector with NAs
v <- c(1.2, 4.5, NA, 8.9, NA)
So theoretically, v == NA should return FALSE FALSE TRUE FALSE TRUE.
# Are there any NAs in our vector?
v == NA
## [1] NA NA NA NA NA
But this code just gives us NAs. Unfortunately, NAs don’t work with comparison operators like == either.
Same as with math operations, NA is just a placeholder for 'I don't know the real value', so asking whether NA == NA is the same as asking whether 'some unknown number' == 'some unknown number', which clearly has no known answer.
Luckily, R gives us a special function to detect NAs: is.na(). And actually, if you type my_vector == NA in an editor such as RStudio, its code diagnostics will suggest using is.na() instead.
is.na() will work on individual values, vectors, lists, and data frames. It returns TRUE where you have an NA and FALSE where you don’t.
# Which values in my vector are NA?
is.na(v)
## [1] FALSE FALSE  TRUE FALSE  TRUE

# Which values in my data frame are NA?
is.na(ex)
##      example  data   set
## [1,]   FALSE FALSE FALSE
## [2,]    TRUE FALSE FALSE
## [3,]   FALSE FALSE FALSE
## [4,]   FALSE  TRUE FALSE
## [5,]   FALSE FALSE  TRUE
## [6,]   FALSE FALSE FALSE
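Because comparisons return NA for missing values, filtering a vector with a comparison alone lets those NAs leak into the result. Here is a short sketch of two ways around that, using the same vector as above (the names leaky, clean, and clean2 are made up for illustration):

```r
v <- c(1.2, 4.5, NA, 8.9, NA)

# v > 4 is FALSE TRUE NA TRUE NA, and an NA index returns an NA element
leaky <- v[v > 4]

# Combine the comparison with !is.na() to keep only known values
clean <- v[!is.na(v) & v > 4]

# which() keeps only the positions that are TRUE, so NAs are dropped
clean2 <- v[which(v > 4)]

leaky    # 4.5  NA 8.9  NA
clean    # 4.5 8.9
clean2   # 4.5 8.9
```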
You can also combine is.na() with sum() and which() to figure out how many NAs you have and where they’re located.
# How many NAs in my data frame?
sum(is.na(ex))
## [1] 3

# Which row contains an NA in the 'data' column?
which(is.na(ex$data))
## [1] 4

# Which vector positions contain NAs?
which(is.na(v))
## [1] 3 5
The reason sum(is.na(ex)) works is that is.na() first converts your values to TRUE or FALSE, and applying math operations to TRUE/FALSE values automatically converts them to 1s and 0s.
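The same TRUE/FALSE-to-1/0 trick extends to per-column and per-row counts. The sketch below rebuilds the example data by hand (ex_demo is a stand-in name, assumed to match the printed data frame) so it runs without the CSV file.

```r
# Stand-in for the example data set, typed in from the printed output
ex_demo <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Is there at least one NA anywhere?
has_na <- anyNA(ex_demo)

# Number of NAs in each column and in each row
na_per_col <- colSums(is.na(ex_demo))
na_per_row <- rowSums(is.na(ex_demo))

# Proportion of missing values in one column (mean of 1s and 0s)
prop_missing <- mean(is.na(ex_demo$data))

has_na
na_per_col
na_per_row
prop_missing
```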
How do I remove NAs from my data?
Now that we know we have NAs in our data… how do we get rid of them?
Some functions have an easy built-in argument, na.rm, which you can set to TRUE or FALSE to remove NAs from the data to be evaluated. If you remember the example from earlier, just running sum(v) returned NA. Adding na.rm fixes this:
# Sum across vector v
sum(v, na.rm = TRUE)
## [1] 14.6

# Take the mean of our vector v
mean(v, na.rm = TRUE)
## [1] 4.866667
In some cases it makes sense to have a function ignore NA values, but in other cases it makes more sense to remove them manually. Choosing between the two goes beyond the scope of this post, but it is an important point to keep in mind.
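The na.rm argument also works when summarizing a whole data frame column by column. A minimal sketch, again using a hand-typed stand-in for the example data (ex_demo, col_means, and col_max are illustrative names):

```r
# Stand-in for the example data set, typed in from the printed output
ex_demo <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Column means, skipping the NA in each column
col_means <- colMeans(ex_demo, na.rm = TRUE)

# Extra arguments to sapply() are passed on to the function, here max()
col_max <- sapply(ex_demo, max, na.rm = TRUE)

col_means   # example 5.6, data 2.6, set 5.0
col_max     # example 16, data 7, set 8
```

Note that each column's summary is then based on a different set of five rows, which is one reason you may sometimes prefer to remove incomplete rows first.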
If you want to remove all observations containing NAs, you can also use the na.omit() function. Keep in mind that removing an observation means removing the entire row of data.
# Remove NAs from our data frame
na.omit(ex)
##   example data set
## 1       1    2   4
## 3      16    1   4
## 6       6    7   8
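Dropping every incomplete row can throw away a lot of data. If only one column matters for your analysis, a sketch like the following keeps more rows; it assumes the same stand-in data frame (ex_demo) and uses illustrative names for the results.

```r
# Stand-in for the example data set, typed in from the printed output
ex_demo <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# TRUE for rows with no NA in any column
complete_rows <- complete.cases(ex_demo)

# Same rows that na.omit() keeps
all_complete <- ex_demo[complete_rows, ]

# Drop rows only when the 'data' column is missing
data_known <- ex_demo[!is.na(ex_demo$data), ]

# na.omit() also works on a plain vector
v <- c(1.2, 4.5, NA, 8.9, NA)
v_known <- as.vector(na.omit(v))

nrow(all_complete)   # 3
nrow(data_known)     # 5
v_known              # 1.2 4.5 8.9
```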
Something else you might want to do is replace those NAs with another value. Maybe you want to replace missing values with 0 (You’re 200% sure those missing values were supposed to be 0s?? 😄), or maybe you want to replace those missing values with the mean of your data to approximate what those values would be (that can be especially useful for multivariate analyses). You can subset your vector or data frame to the places where is.na() is true, and set those equal to a new value.
# Replace NAs in data frame with 0
ex[is.na(ex)] <- 0

# View data frame
ex
##   example data set
## 1       1    2   4
## 2       0    2   4
## 3      16    1   4
## 4       2    0   5
## 5       3    1   0
## 6       6    7   8

# Replace NAs in vector with the mean
v[is.na(v)] <- mean(v, na.rm = TRUE)

# View vector
v
## [1] 1.200000 4.500000 4.866667 8.900000 4.866667
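For a data frame, you would usually want each column's own mean rather than one overall value. One possible way to do that is a simple loop over the columns; this is a sketch on a hand-typed stand-in for the example data (ex_demo and filled are illustrative names), not the only approach.

```r
# Stand-in for the example data set, typed in from the printed output
ex_demo <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Work on a copy so the original data stays untouched
filled <- ex_demo

for (col in names(filled)) {
  missing <- is.na(filled[[col]])
  # Replace this column's NAs with the mean of its known values
  filled[[col]][missing] <- mean(filled[[col]], na.rm = TRUE)
}

filled
```

Working on a copy makes it easy to compare results with and without the filled-in values.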
Awesome! Now you know how to find NAs in your data, perform functions without letting NAs get in the way, and remove NAs from your data for further analysis. Soon these functions will come to you NAturally…haha. I hope you found this tutorial helpful. Happy coding!
P.S. I’d recommend listening to this song to put you in the NA-removing mood!
Also be sure to check out R-bloggers for other great tutorials on learning R