Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Removing NA’s from a Data Frame in R
The Problem
Suppose you are doing a method comparison for which some results are above or below the linear range of your assay(s). Generally, these will appear in your spreadsheet (gasp!) program as (< x) or (> y) or, in the case of our mass spectrometer, “No Peak”. When you read these data into R using read.csv()
, R will turn then into factors, which I personally find super–annoying and which inspired this conference badge (see bottom right) as I learned from University of British Columbia prof Jenny Bryan.
For this reason, when we read the data in, it is convenient to choose the option stringsAsFactors = FALSE
. In doing so, the data will be treated as strings and be in the character class. But for regression comparison purposes, we need to make the data numeric and all of the (< x) and (> y) results will be converted to NA. In this post, we want to address a few questions that follow:
- How do we find all the NA results?
- How can we replace them with a numeric (like 0)?
- How can we rid ourselves of rows containing NA?
Finding NA's
Let's read in the data which comes from a method comparison of serum aldosterone between our laboratory and Russ Grant's laboratory (LabCorp) published here. I'll read in the data with stringsAsFactors = FALSE. These are aldosterone results in pmol/L. To convert to ng/dL, divide by 27.7.
myData<-read.csv("Comparison.csv", sep=",", stringsAsFactors = FALSE) str(myData)
## 'data.frame': 96 obs. of 3 variables: ## $ Sample.Num: int 1 2 3 4 5 6 7 8 9 10 ... ## $ Aldo.Us : chr "462.3" "433.2" "37.7" "137.7" ... ## $ Aldo.Them : num 457.2 418.1 42.1 133.9 27.4 ...
head(myData)
## Sample.Num Aldo.Us Aldo.Them ## 1 1 462.3 457.2 ## 2 2 433.2 418.1 ## 3 3 37.7 42.1 ## 4 4 137.7 133.9 ## 5 5 29.4 27.4 ## 6 6 552.1 639.7
You can see the problem immediately, our data (“Aldo.Us”) is a character vector. This is not good for regression. Why did this happen? We can find out:
myData$Aldo.Us
## [1] "462.3" "433.2" "37.7" "137.7" "29.4" "552.1" "41.6" ## [8] "158.7" "1198" "478.4" "160.7" "167.9" "211.6" "493.3" ## [15] "195.6" "649.8" "644" "534.1" "212.7" "413.3" "150.7" ## [22] "451.2" "25.8" "118.8" "496.1" "486.1" "846.8" "139.9" ## [29] "No Peak" "98.3" "113.8" "230.7" "530.2" "26.6" "390.3" ## [36] "782.8" "886.7" "83.4" "44" "71.2" "657" "321.6" ## [43] "188.6" "451.2" "485.3" "No Peak" "144.9" "249.6" "682" ## [50] "601.9" "330.5" "216.6" "500.3" "20.5" "271.5" "196.7" ## [57] "309.4" "235.7" "171.7" "124.9" "293.6" "345.4" "243.5" ## [64] "75.1" "508.3" "442.4" "531.3" "317.4" "647.9" "562" ## [71] "366.5" "37.1" "231.6" "73.7" "526.3" "No Peak" "165.6" ## [78] "105.8" "77.8" "211.6" "125.8" "76.5" "58.2" "111.9" ## [85] "238.5" "31.6" "156.8" "191.7" "402.5" "108.9" "183.7" ## [92] "314.4" "90" "98.9" "144.9" "971.4"
Ahhh…it's the dreaded “No Peak”. This is what the mass spectrometer has put in its data file. So, let's force everything to numeric:
myData$Aldo.Us <- as.numeric(myData$Aldo.Us)
## Warning: NAs introduced by coercion
We see the warnings about the introduction of NAs. And we get:
myData$Aldo.Us
## [1] 462.3 433.2 37.7 137.7 29.4 552.1 41.6 158.7 1198.0 478.4 ## [11] 160.7 167.9 211.6 493.3 195.6 649.8 644.0 534.1 212.7 413.3 ## [21] 150.7 451.2 25.8 118.8 496.1 486.1 846.8 139.9 NA 98.3 ## [31] 113.8 230.7 530.2 26.6 390.3 782.8 886.7 83.4 44.0 71.2 ## [41] 657.0 321.6 188.6 451.2 485.3 NA 144.9 249.6 682.0 601.9 ## [51] 330.5 216.6 500.3 20.5 271.5 196.7 309.4 235.7 171.7 124.9 ## [61] 293.6 345.4 243.5 75.1 508.3 442.4 531.3 317.4 647.9 562.0 ## [71] 366.5 37.1 231.6 73.7 526.3 NA 165.6 105.8 77.8 211.6 ## [81] 125.8 76.5 58.2 111.9 238.5 31.6 156.8 191.7 402.5 108.9 ## [91] 183.7 314.4 90.0 98.9 144.9 971.4
summary(myData$Aldo.Us)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## 20.5 118.8 230.7 305.5 478.4 1198.0 3
Now we have 3 NAs. We want to find them and get rid of them. From the screen we could figure out where the NAs were and manually replace them. This is OK on such a small data set but when you start dealing with data sets having thousands or millions of rows, approaches like this are impractical. So, let's do it right.
If we naively try to use an equality we find out nothing.
which(myData$Aldo.Us==NA)
## integer(0)
Hunh? Whasgoinon?
This occurs because NA means “unknown”. Think about it this way. If one patient's result is NA and another patient's result is NA, then are the results equal? No, they are not (necessarily) equal, they are both unknown and so the comparison should be unknown also. This is why we do not get a result of TRUE when we ask the following question:
NA==NA
## [1] NA
So, when we ask R if unknown #1 is equal to unknown #2, it responds with “I dunno.”, or “NA”. So if we want to find the NAs, we should inquire as follows:
is.na(myData$Aldo.Us)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [23] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE ## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [45] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE ## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
or, for less verbose output:
which(is.na(myData$Aldo.Us))
## [1] 29 46 76
Hey Hey! Ho Ho! Those NAs have got to go!
Now we know where they are, in rows 29, 46, and 76. We can replace them with 0, which is OK but may pose problems if we use weighted regression (i.e. if we have a 0 in the x-data and we weight data by 1/x). Alternatively, we can delete the rows entirely.
To replace them with 0, we can write:
myData$Aldo.Us[which(is.na(myData$Aldo.Us))] <- 0
and this is equivalent:
myData$Aldo.Us[is.na(myData$Aldo.Us)] <- 0
To remove the whole corresponding row, we can write:
myDataBeGoneNA <- myData[-which(is.na(myData$Aldo.Us)),]
or:
myDataBeGoneNA <- myData[!is.na(myData$Aldo.Us),]
Complete Cases
What if there were NA's hiding all over the place in multiple columns and we wanted to banish any row containing one or more NA? In this case, the complete.cases()
function is one way to go:
complete.cases(myData)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [23] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE ## [34] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [45] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [67] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE ## [78] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [89] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
This function shows us which rows have no NAs (the ones with TRUE as the result) and which rows have NAs (the three with FALSE). We can banish all rows containing any NAs generally as follows:
myDataBeGoneNA <- myData[complete.cases(myData),]
This data set now has 93 rows:
nrow(myDataBeGoneNA)
## [1] 93
You could peruse the excluded data like this:
myData[!complete.cases(myData),]
## Sample.Num Aldo.Us Aldo.Them ## 29 29 NA 6.6 ## 46 46 NA 7.0 ## 76 76 NA 5.7
na.omit()
Another way to remove incomplete cases is the na.omit()
function (as Dr. Shannon Haymond pointed out to me). So this works too:
myDataBeGoneNA <- na.omit(myData)
Row Numbers are Actually Names
In all of these approaches, you will notice something peculiar. Even though we have excluded the three rows, the row numbering still appears to imply that there are 96 rows:
tail(myDataBeGoneNA)
## Sample.Num Aldo.Us Aldo.Them ## 91 91 183.7 170.4 ## 92 92 314.4 307.6 ## 93 93 90.0 214.0 ## 94 94 98.9 75.1 ## 95 95 144.9 129.3 ## 96 96 971.4 807.7
but if you check the dimensions, there are 93 rows:
nrow(myDataBeGoneNA)
## [1] 93
Why? This is because the row numbers are not row numbers; they are numerical row names. When you exclude a row, none of the other row names change. This was bewildering to me in the beginning. I thought my exclusions had failed somehow.
Now we can move on
Once this is done, you can go on and do your regression, which, in this case, looks like this.
Finally, if you are ever wondering what fraction of your data is comprised of NA, rather than the absolute number, you can do this as follows:
mean(is.na(myData$Aldo.Us))
## [1] 0.03125
If you applied this to the whole dataframe, you get the fraction of NA's in the whole dataframe (again–thank you Shannon):
mean(is.na(myData))
## [1] 0.01041667
Final Thought:
is.na(newunderthesun)
## [1] TRUE
-Dan
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.