Filtering a data frame by condition on multiple columns
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Some times you need to filter a data frame applying the same condition over multiple columns. Obviously you could explicitly write the condition over every column, but that’s not very handy.
For those situations, it is much better to use filter_at
in combination with all_vars
.
Imagine we have the famous iris dataset with some attributes missing and want to get rid of those observations with any missing value.
# # A tibble: 10 x 6 # rowid Sepal.Length Sepal.Width Petal.Length Petal.Width Species # <int> <dbl> <dbl> <dbl> <dbl> <fct> # 1 1 5.1 NA 1.4 0.2 setosa # 2 2 NA 3 1.4 NA setosa # 3 3 4.7 3.2 1.3 0.2 setosa # 4 4 NA 3.1 1.5 0.2 setosa # 5 5 5 3.6 1.4 0.2 setosa # 6 6 5.4 3.9 1.7 0.4 setosa # 7 7 4.6 3.4 1.4 0.3 setosa # 8 8 NA 3.4 1.5 0.2 setosa # 9 9 4.4 2.9 1.4 0.2 setosa # 10 10 NA NA NA NA setosa
We could write the condition on every column, but that would cumbersome:
iris %>% filter(!is.na(Sepal.Length) & !is.na(Sepal.Width) & !is.na(Petal.Length) & !is.na(Petal.Width))
Instead, we just have to select the columns we will filter on and apply the condition:
features <- iris %>% names() %>% keep(~ str_detect(.,"[.]")) iris %>% filter_at(vars(features), all_vars(!is.na(.))) # # A tibble: 5 x 6 # rowid Sepal.Length Sepal.Width Petal.Length Petal.Width Species # <int> <dbl> <dbl> <dbl> <dbl> <fct> # 1 3 4.7 3.2 1.3 0.2 setosa # 2 5 5 3.6 1.4 0.2 setosa # 3 6 5.4 3.9 1.7 0.4 setosa # 4 7 4.6 3.4 1.4 0.3 setosa # 5 9 4.4 2.9 1.4 0.2 setosa
Here we have used the function all_vars
in the predicate to explicit that
every feature must satisfy the condition.
To be honest, for that purpose it would have been easier to simply use iris %>% na.omit()
.
But what if we wanted the opposite? Keeping only the rows with all the selected features missing is as easy as changing the predicate part:
iris %>% filter_at(vars(features), all_vars(is.na(.))) # # A tibble: 1 x 6 # rowid Sepal.Length Sepal.Width Petal.Length Petal.Width Species # <int> <dbl> <dbl> <dbl> <dbl> <fct> # 1 10 NA NA NA NA setosa
Another option is to apply the condition on any feature. That’s where any_vars
comes handy. Here we keep only the observations with at least one missing feature:
iris %>% filter_at(vars(features), any_vars(is.na(.))) # # A tibble: 5 x 6 # rowid Sepal.Length Sepal.Width Petal.Length Petal.Width Species # <int> <dbl> <dbl> <dbl> <dbl> <fct> # 1 1 5.1 NA 1.4 0.2 setosa # 2 2 NA 3 1.4 NA setosa # 3 4 NA 3.1 1.5 0.2 setosa # 4 8 NA 3.4 1.5 0.2 setosa # 5 10 NA NA NA NA setosa
Also, there are some other fancy ways to manipulate data frames with the filter family. One trick is using contains()
or starts_with()
to select the variables:
iris %>% filter_at(vars(contains("Length")), all_vars(. >= 1.4)) # # A tibble: 5 x 6 # rowid Sepal.Length Sepal.Width Petal.Length Petal.Width Species # <int> <dbl> <dbl> <dbl> <dbl> <fct> # 1 1 5.1 NA 1.4 0.2 setosa # 2 5 5 3.6 1.4 0.2 setosa # 3 6 5.4 3.9 1.7 0.4 setosa # 4 7 4.6 3.4 1.4 0.3 setosa # 5 9 4.4 2.9 1.4 0.2 setosa
Another example is applying the condition on columns that satisfy certain condition with filter_if
(notice the rowid
fetaure here):
iris %>% filter_if(is.numeric, any_vars(. > 5)) # # A tibble: 6 x 6 # rowid Sepal.Length Sepal.Width Petal.Length Petal.Width Species # <int> <dbl> <dbl> <dbl> <dbl> <fct> # 1 1 5.1 NA 1.4 0.2 setosa # 2 6 5.4 3.9 1.7 0.4 setosa # 3 7 4.6 3.4 1.4 0.3 setosa # 4 8 NA 3.4 1.5 0.2 setosa # 5 9 4.4 2.9 1.4 0.2 setosa # 6 10 NA NA NA NA setosa
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.