[This article was first published on R | TypeThePipe, and kindly contributed to R-bloggers].
Are you starting your data exploration? Do you want an easy overview of the NAs in each of your variables?
We create a function that benchmarks several ways of counting NAs per column:
    library(microbenchmark)
    library(tidyverse)

    benchmark_count_na_by_column <- function(dataset){
      microbenchmark(
        # Summary table output
        dataset %>% summary(),
        # Numeric output
        colSums(is.na(dataset)),
        sapply(dataset, function(x) sum(is.na(x))),
        # List output
        dataset %>% map(~sum(is.na(.))),
        lapply(dataset, function(x) sum(is.na(x))),
        # Df output
        dataset %>% select(everything()) %>% summarise_all(funs(sum(is.na(.)))),
        dataset %>% summarise_each(funs(sum(is.na(.)))),
        # Tibble output
        dataset %>% map_df(~sum(is.na(.)))
      )
    }
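A side note on the data frame variants: `funs()` and `summarise_each()` are deprecated in recent dplyr releases. If you are on dplyr 1.0.0 or later, the same per-column count can be sketched with `across()`. The helper name `count_na_across()` is made up for this example and is not part of the benchmark above, so its timings are not covered by the results in this post.

```r
# Sketch assuming dplyr >= 1.0.0 is installed; count_na_across() is a
# hypothetical helper name, not something defined in the original post.
count_na_across <- function(dataset) {
  dplyr::summarise(
    dataset,
    dplyr::across(dplyr::everything(), ~ sum(is.na(.x)))
  )
}

# Only run the example when dplyr is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(count_na_across(airquality))
}
```

Like the `summarise_all()` call, this returns a one-row data frame with one column per variable.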
Here is the performance on a small dataset:

    print(airquality %>% nrow()) # 153 rows
    benchmark_count_na_by_column(airquality)
    ## Unit: microseconds
    ## funct               min      lq     mean  median      uq    max neval class
    ## summary()        1480.5 1582.60 1979.676 1897.30 2100.45 6403.2   100 table
    ## colSums()          24.4   38.45   47.854   44.70   53.90  152.4   100 integer
    ## sapply()           23.2   35.05   67.891   39.65   50.30 2494.8   100 integer
    ## map()             140.2  182.60  214.092  200.75  238.50  549.6   100 list
    ## lapply()           11.2   15.65   27.093   18.85   22.45  750.1   100 list
    ## summarise_all()  1996.9 2147.80 2650.223 2382.90 2798.55 8133.7   100 data.frame
    ## summarise_each() 2277.9 2497.05 2951.477 2898.40 3080.65 7977.2   100 data.frame
    ## map_df()          190.0  249.00  331.368  275.40  326.05    383   100 tbl_df
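The benchmarked expressions all return NA counts. If what you want is the percentage of NAs per variable, a minimal base R sketch could look like the following. The object names `na_count`, `na_pct` and `na_overview` are illustrative and do not come from the original post.

```r
# Count of NA values per column (is.na() on a data frame gives a logical matrix).
na_count <- colSums(is.na(airquality))

# Share of NA values per column, expressed as a percentage.
na_pct <- 100 * colMeans(is.na(airquality))

# Combine both into one small overview table, one row per variable.
na_overview <- data.frame(
  na_count = na_count,
  na_pct = round(na_pct, 1)
)
print(na_overview)
```

`colMeans()` works here because the mean of a logical column is the fraction of `TRUE` values, which is the fraction of missing entries.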
Let’s see how well they scale with a 100,000-row dataset:

    big_dataset %>% nrow() # 100 000 rows
    benchmark_count_na_by_column(big_dataset)
    ## Unit: milliseconds
    ## funct                 min        lq       mean    median        uq      max neval class
    ## summary()        113.7535 129.35070 138.716624 133.14050 143.45920 252.0149   100 table
    ## colSums()          4.4280   5.31080  12.502741   5.65005  18.77570 124.8206   100 integer
    ## sapply()           2.2452   3.03095   6.788395   3.15310  15.04010  18.6061   100 integer
    ## map()              2.5950   3.28390   5.760602   3.38020   3.69445  19.4527   100 list
    ## lapply()           2.2018   2.95700   6.219106   3.03605   3.62860  19.5514   100 list
    ## summarise_all()    5.0982   5.85135  10.093431   6.05940   6.87070 127.5107   100 data.frame
    ## summarise_each()   5.7251   6.16980  10.191426   6.33065   6.72210 125.2943   100 data.frame
    ## map_df()           2.6913   3.42045   7.694863   3.56720   3.89715 122.2030   100 tbl_df
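The post does not show how `big_dataset` was built. One possibility, and this is purely an assumption for anyone who wants to try something similar, is to resample the rows of `airquality` with replacement until there are 100,000 of them. Timings from such a dataset will not necessarily match the table above.

```r
# Assumption: the original big_dataset is not shown in the post.
# This resamples airquality rows with replacement as a stand-in.
set.seed(42)  # arbitrary seed, only for reproducibility of the sketch

row_idx <- sample(nrow(airquality), size = 100000, replace = TRUE)
big_dataset <- airquality[row_idx, ]
rownames(big_dataset) <- NULL  # drop the duplicated original row names

nrow(big_dataset)

# Quick check with one of the benchmarked approaches.
sapply(big_dataset, function(x) sum(is.na(x)))
```

Resampling keeps the column types and roughly the same share of NAs per column as the small dataset, so the two benchmarks stay comparable in everything but size.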