Handling Missing Values in R using tidyr
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this post, we'll look at three functions from tidyr that are useful for handling missing values (NAs) in a dataset. Please note: this post isn't about missing value imputation.
tidyr
According to the documentation of tidyr,
The goal of tidyr is to help you create tidy data. Tidy data is data where:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Let's start by loading the tidyr library. tidyr is also one of the packages in the tidyverse.
library(tidyr)
tidyr functions
The following three tidyr functions are handy for processing missing values:
- drop_na()
- fill()
- replace_na()
Dataset with Missing Values
To get a dataset with missing values, let's take mtcars and introduce some missing values into it.
df <- mtcars
df$hp[2] <- NA
df$cyl[5] <- NA
df$gear[5] <- NA
df$mpg[10] <- NA
# counting number of missing values
paste("Number of Missing Values", sum(is.na(df)))
## [1] "Number of Missing Values 4"
# dimensions
paste("Number of Rows", nrow(df))
## [1] "Number of Rows 32"
paste("Number of Columns", ncol(df))
## [1] "Number of Columns 11"
Now we've got a dataset with missing values (NAs) in it.
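Before dropping or filling anything, it can help to see where the missing values sit. The following is a minimal sketch using only base R; the names na_per_column and rows_with_na are introduced here for illustration and are not part of the original post.

```r
# Recreate the modified copy of mtcars so this block runs on its own
df <- mtcars
df$hp[2] <- NA
df$cyl[5] <- NA
df$gear[5] <- NA
df$mpg[10] <- NA

# Missing values per column, showing only the columns that have any
na_per_column <- colSums(is.na(df))
na_per_column[na_per_column > 0]

# Number of rows that contain at least one missing value
rows_with_na <- sum(!complete.cases(df))
rows_with_na
## [1] 3
```

The 4 missing values are spread over 3 rows because row 5 has two of them (cyl and gear).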
drop_na()
drop_na() drops the rows that contain missing values.
library(dplyr) # just in case we need some dplyr verbs
## Warning: package 'dplyr' was built under R version 3.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
df_no_na <- drop_na(df)
# counting number of missing values
paste("Number of Missing Values", sum(is.na(df_no_na)))
## [1] "Number of Missing Values 0"
# dimensions
paste("Number of Rows", nrow(df_no_na))
## [1] "Number of Rows 29"
paste("Number of Columns", ncol(df_no_na))
## [1] "Number of Columns 11"
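drop_na() also accepts column names, in which case only those columns are checked for missing values. Here is a small sketch of that usage on the same modified mtcars data; the name df_hp_complete is made up for this example.

```r
library(tidyr)

# Recreate the modified copy of mtcars so this block runs on its own
df <- mtcars
df$hp[2] <- NA
df$cyl[5] <- NA
df$gear[5] <- NA
df$mpg[10] <- NA

# Drop only the rows where hp is missing; NAs in other columns are kept
df_hp_complete <- drop_na(df, hp)

nrow(df_hp_complete)
## [1] 31
sum(is.na(df_hp_complete))
## [1] 3
```

This is useful when only some columns matter for the analysis and you don't want to lose rows because of gaps elsewhere.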
fill()
fill() fills the NAs (missing values) in the selected columns with the previous or next non-missing value. Columns can be chosen with the same selection helpers as dplyr::select(), such as everything() in the example below.
The .direction argument sets where the fill value comes from: down (the default), up, downup, or updown.
This approach is quite naive, but it can be handy in many situations, for example with time series data.
df_na_filled <- df %>% fill(dplyr::everything())
# counting number of missing values
paste("Number of Missing Values", sum(is.na(df_na_filled)))
## [1] "Number of Missing Values 0"
# dimensions
paste("Number of Rows", nrow(df_na_filled))
## [1] "Number of Rows 32"
paste("Number of Columns", ncol(df_na_filled))
## [1] "Number of Columns 11"
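The mtcars example has no missing value in its first row, so the default direction happens to fill everything. The sketch below uses a small, made-up series (the readings data frame is hypothetical) to show how .direction behaves when a gap sits at the very start.

```r
library(tidyr)

# Hypothetical daily readings with gaps, including one in the first row
readings <- data.frame(
  day   = 1:6,
  value = c(NA, 10, NA, NA, 13, NA)
)

# Default direction "down": carry the last observed value forward.
# The leading NA has nothing above it, so it stays NA.
filled_down <- fill(readings, value)
filled_down$value
## [1] NA 10 10 10 13 13

# "downup": fill downward first, then fill the remaining gaps upward
filled_downup <- fill(readings, value, .direction = "downup")
filled_downup$value
## [1] 10 10 10 10 13 13
```

So a fill with the default direction does not guarantee an NA-free result; it is worth checking sum(is.na(...)) afterwards, as the post does.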
replace_na()
replace_na() is the function to use when you already know the value that the NAs should be replaced with.
Below is an example where we replace all NAs with zero (0):
df_na_replaced <- df %>% mutate_all(replace_na, 0)
# counting number of missing values
paste("Number of Missing Values", sum(is.na(df_na_replaced)))
## [1] "Number of Missing Values 0"
# dimensions
paste("Number of Rows", nrow(df_na_replaced))
## [1] "Number of Rows 32"
paste("Number of Columns", ncol(df_na_replaced))
## [1] "Number of Columns 11"
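replace_na() can also be called directly on a data frame with a named list, which allows a different replacement per column. This is a sketch only: the replacement values below are arbitrary placeholders chosen for illustration, not recommendations, and df_replaced is a name introduced here.

```r
library(tidyr)

# Recreate the modified copy of mtcars so this block runs on its own
df <- mtcars
df$hp[2] <- NA
df$cyl[5] <- NA
df$gear[5] <- NA
df$mpg[10] <- NA

# One replacement value per column (illustrative values only)
df_replaced <- replace_na(df, list(hp = 0, cyl = 4, gear = 3, mpg = 0))

sum(is.na(df_replaced))
## [1] 0
df_replaced$cyl[5]
## [1] 4
```

Columns that are not named in the list are left untouched, so any NAs in them would remain.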
Alternatively, we could have restricted the replacement to numeric (continuous) columns and replaced only their NAs with zero, like this:
df_na_replaced <- df %>% mutate_if(is.numeric, replace_na,0)
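In dplyr 1.0.0 and later, the scoped verbs mutate_all() and mutate_if() are superseded by across(). They still work, but the same two replacements could be written as in the sketch below, assuming a dplyr version that provides across(). The names df_all and df_num are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Recreate the modified copy of mtcars so this block runs on its own
df <- mtcars
df$hp[2] <- NA
df$cyl[5] <- NA
df$gear[5] <- NA
df$mpg[10] <- NA

# Counterpart of mutate_all(replace_na, 0)
df_all <- df %>%
  mutate(across(everything(), ~ replace_na(.x, 0)))

# Counterpart of mutate_if(is.numeric, replace_na, 0)
df_num <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))

sum(is.na(df_all))
## [1] 0
sum(is.na(df_num))
## [1] 0
```

Since every column of mtcars is numeric, both variants give the same result here; they would differ on a dataset that also has character or factor columns.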
Hopefully, this post has thrown some light on these three tidyr functions for handling missing values: drop_na(), fill(), and replace_na().
If you liked this, please subscribe to my Language-agnostic Data Science Newsletter and share it with your friends!