Counting NAs by column in R

R | TypeThePipe

2 years ago

[This article was first published on R | TypeThePipe, and kindly contributed to R-bloggers].


Are you starting your data exploration? Do you want a quick overview of how many values are missing (NA) in each variable?

Let's write a function that benchmarks several ways of counting NAs per column:

library(microbenchmark)
library(tidyverse)

# Benchmark several ways of counting NAs per column of `dataset`
benchmark_count_na_by_column <- function(dataset) {
  microbenchmark(
    # Summary table output
    dataset %>% summary(),
    # Numeric output
    colSums(is.na(dataset)),
    sapply(dataset, function(x) sum(is.na(x))),
    # List output
    dataset %>% map(~ sum(is.na(.))),
    lapply(dataset, function(x) sum(is.na(x))),
    # Df output (funs() and summarise_each() are deprecated in recent dplyr)
    dataset %>%
      select(everything()) %>%
      summarise_all(funs(sum(is.na(.)))),
    dataset %>%
      summarise_each(funs(sum(is.na(.)))),
    # Tibble output
    dataset %>% map_df(~ sum(is.na(.)))
  )
}
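Before timing anything, it may help to see what these calls actually return. The following is a minimal base R sketch on the built-in airquality data; the variable names na_counts and na_percent are illustrative and not part of the original post, and the percentage step is an addition based on colMeans().

```r
# Count NAs per column of the built-in airquality data frame (base R only).
na_counts <- colSums(is.na(airquality))
na_counts
#   Ozone Solar.R    Wind    Temp   Month     Day
#      37       7       0       0       0       0

# Share of NAs per column as a percentage, rounded to one decimal.
# is.na() gives a logical matrix, so the column mean is the NA proportion.
na_percent <- round(100 * colMeans(is.na(airquality)), 1)
na_percent
#   Ozone Solar.R    Wind    Temp   Month     Day
#    24.2     4.6     0.0     0.0     0.0     0.0
```

Only Ozone and Solar.R contain missing values, so every approach in the benchmark should report 37 and 7 for those two columns and 0 elsewhere.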

Here is how they perform on a small dataset:

print(airquality %>% nrow()) # 153 rows
benchmark_count_na_by_column(airquality)
## Unit: microseconds
## funct               min      lq     mean  median      uq    max neval class
## summary()        1480.5 1582.60 1979.676 1897.30 2100.45 6403.2   100 table
## colSums()          24.4   38.45   47.854   44.70   53.90  152.4   100 integer
## sapply()           23.2   35.05   67.891   39.65   50.30 2494.8   100 integer
## map()             140.2  182.60  214.092  200.75  238.50  549.6   100 list
## lapply()           11.2   15.65   27.093   18.85   22.45  750.1   100 list
## summarise_all()  1996.9 2147.80 2650.223 2382.90 2798.55 8133.7   100 data.frame
## summarise_each() 2277.9 2497.05 2951.477 2898.40 3080.65 7977.2   100 data.frame
## map_df()          190.0  249.00  331.368  275.40  326.05    383   100 tbl_df
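The two data-frame variants in the benchmark rely on funs() and summarise_each(), which recent dplyr releases deprecate. A possible replacement, assuming dplyr 1.0.0 or later, is summarise() with across(). This variant was not part of the original benchmark, so no timing is claimed for it, and na_counts_df is an illustrative name.

```r
library(dplyr)

# Assumed modern equivalent of the summarise_all()/summarise_each() calls.
# across() requires dplyr >= 1.0.0 and was not timed in the original post.
na_counts_df <- airquality %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# One-row data frame with one NA count per original column.
na_counts_df
```

The result keeps the original column names, so it can be compared directly with the output of the older summarise_all() call.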

Let’s see how well they scale with a 100,000-row dataset:

big_dataset %>% nrow() # 100 000 rows
benchmark_count_na_by_column(big_dataset)
## Unit: milliseconds
## funct                 min        lq       mean    median        uq      max neval class
## summary()        113.7535 129.35070 138.716624 133.14050 143.45920 252.0149   100 table
## colSums()          4.4280   5.31080  12.502741   5.65005  18.77570 124.8206   100 integer
## sapply()           2.2452   3.03095   6.788395   3.15310  15.04010  18.6061   100 integer
## map()              2.5950   3.28390   5.760602   3.38020   3.69445  19.4527   100 list
## lapply()           2.2018   2.95700   6.219106   3.03605   3.62860  19.5514   100 list
## summarise_all()    5.0982   5.85135  10.093431   6.05940   6.87070 127.5107   100 data.frame
## summarise_each()   5.7251   6.16980  10.191426   6.33065   6.72210 125.2943   100 data.frame
## map_df()           2.6913   3.42045   7.694863   3.56720   3.89715 122.2030   100 tbl_df
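The post does not show how big_dataset was built. To try the comparison yourself, one option is to generate a synthetic data frame of the same length. The sketch below is purely an assumption: the name big_dataset_sketch, the three columns, the seed and the 10% NA share are all hypothetical, so timings on it will not match the figures above.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

n_rows <- 100000

# Hypothetical stand-in for big_dataset: two numeric columns and one
# character column (the real dataset's columns are not given in the post).
big_dataset_sketch <- data.frame(
  x = rnorm(n_rows),
  y = runif(n_rows),
  z = sample(letters, n_rows, replace = TRUE),
  stringsAsFactors = FALSE
)

# Blank out exactly 10% of the rows in each column, chosen independently.
for (col in names(big_dataset_sketch)) {
  na_rows <- sample(n_rows, size = n_rows / 10)
  big_dataset_sketch[na_rows, col] <- NA
}

nrow(big_dataset_sketch)             # 100000
colSums(is.na(big_dataset_sketch))   # 10000 NAs in each of x, y and z
```

Passing big_dataset_sketch to benchmark_count_na_by_column() should then give a rough feel for how the approaches scale on your own machine.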
