Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data wrangling. It’s the process of getting your raw data transformed into a format that’s easier to work with for analysis.
It’s not the sexiest or the most exciting work.
In our dreams, all datasets come to us perfectly formatted and ready for all kinds of sophisticated analysis! In real life, not so much.
It’s estimated that as much as 75% of a data scientist’s time is spent data wrangling. To be an effective data scientist, you need to be good at this, and you need to be FAST.
One of the most basic data wrangling tasks is filtering data. Starting from a large dataset, and reducing it to a smaller, more manageable dataset, based on some criteria.
Think of filtering your sock drawer by color, and pulling out only the black socks.
Whenever I need to filter in R, I turn to the dplyr filter
function.
As is often the case in programming, there are many ways to filter in R. But the dplyr filter
function is by far my favorite, and it’s the method I use the vast majority of the time.
Why do I like it so much? It has a user-friendly syntax, is easy to work with, and it plays very nicely with the other dplyr functions.
A brief introduction to dplyr
Before I go into detail on the dplyr filter
function, I want to briefly introduce dplyr as a whole to give you some context.
dplyr is a cohesive set of data manipulation functions that will help make your data wrangling as painless as possible.
dplyr, at its core, consists of 5 functions, all serving a distinct data wrangling purpose:
filter()
selects rows based on their valuesmutate()
creates new variablesselect()
picks columns by namesummarise()
calculates summary statisticsarrange()
sorts the rows
The beauty of dplyr is that the syntax of all of these functions is very similar, and they all work together nicely.
If you master these 5 functions, you’ll be able to handle nearly any data wrangling task that comes your way. But we need to tackle them one at a time, so now: let’s learn to filter in R using dplyr!
Loading Our Data
In this post, I’ll be using the diamonds
dataset, a dataset built into the ggplot package, to illustrate the best use of the dplyr filter
function. To start, let’s take a look at the data:
library(dplyr) library(ggplot2) diamonds ## # A tibble: 53,940 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 ## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 ## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 ## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 ## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 ## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 ## # … with 53,930 more rows
We can see that the dataset gives characteristics of individual diamonds, including their carat, cut, color, clarity, and price.
Our First dplyr Filter Operation
I’m a big fan of learning by doing, so we’re going to dive in right now with our first dplyr filter
operation.
From our diamonds
dataset, we’re going to filter only those rows where the diamond cut is ‘Ideal’:
filter(diamonds, cut == 'Ideal') ## # A tibble: 21,551 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 ## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 ## 4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68 ## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78 ## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75 ## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76 ## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44 ## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72 ## 10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63 ## # … with 21,541 more rows
As you can see, every diamond in the returned data frame is showing a cut of ‘Ideal’. It worked! We’ll cover exactly what’s happening here in more detail, but first let’s briefly review how R works with logical and relational operators, and how we can use those to efficiently filter in R.
A brief aside on logical and relational operators in R and dplyr
In dplyr, filter takes in 2 arguments:
- The dataframe you are operating on
- A conditional expression that evaluates to
TRUE
orFALSE
In the example above, we specified diamonds
as the dataframe, and cut == 'Ideal'
as the conditional expression
Conditional expression? What am I talking about?
Under the hood, dplyr filter
works by testing each row against your conditional expression and mapping the results to TRUE
and FALSE
. It then selects all rows that evaluate to TRUE
.
In our first example above, we checked that the diamond cut was Ideal with the conditional expression cut == 'Ideal'
. For each row in our data frame, dplyr checked whether the column cut
was set to 'Ideal'
, and returned only those rows where cut == 'Ideal'
evaluated to TRUE
.
In our first filter, we used the operator ==
to test for equality. That’s not the only way we can use dplyr to filter our data frame, however. We can use a number of different relational operators to filter in R.
Relational operators are used to compare values. In R generally (and in dplyr specifically), those are:
==
(Equal to)!=
(Not equal to)<
(Less than)<=
(Less than or equal to)>
(Greater than)>=
(Greater than or equal to)
These are standard mathematical operators you’re used to, and they work as you’d expect. One quick note: make sure you use the double equals sign (==
) for comparisons! By convention, a single equals sign (=
) is used to assign a value to a variable, and a double equals sign (==
) is used to check whether two values are equal. Using a single equals sign will often give an error message that is not intuitive, so make sure you check for this common error!
dplyr can also make use of the following logical operators to string together multiple different conditions in a single dplyr filter
call!
!
(logical NOT)&
(logical AND)|
(logical OR)
There are two additional operators that will often be useful when working with dplyr to filter:
%in%
(Checks if a value is in an array of multiple values)is.na()
(Checks whether a value is NA)
In our first example above, we tested for equality when we said cut == 'Ideal'
. Now, let’s expand our capabilities with different relational parameters in our filter:
filter(diamonds, price > 2000) ## # A tibble: 29,733 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.7 Ideal E SI1 62.5 57 2757 5.7 5.72 3.57 ## 2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52 ## 3 0.7 Ideal G VS2 61.6 56 2757 5.7 5.67 3.5 ## 4 0.71 Very Good E VS2 62.4 57 2759 5.68 5.73 3.56 ## 5 0.78 Very Good G SI2 63.8 56 2759 5.81 5.85 3.72 ## 6 0.7 Good E VS2 57.5 58 2759 5.85 5.9 3.38 ## 7 0.7 Good F VS1 59.4 62 2759 5.71 5.76 3.4 ## 8 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07 ## 9 0.73 Very Good E SI1 61.6 59 2760 5.77 5.78 3.56 ## 10 0.8 Premium H SI1 61.5 58 2760 5.97 5.93 3.66 ## # … with 29,723 more rows
Here, we select only the diamonds where the price is greater than 2000.
filter(diamonds, cut != 'Ideal') ## # A tibble: 32,389 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 ## 4 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 ## 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ## 6 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 ## 7 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 ## 8 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 ## 9 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 ## 10 0.3 Good J SI1 64 55 339 4.25 4.28 2.73 ## # … with 32,379 more rows
And here, we select all the diamonds whose cut is NOT equal to ‘Ideal’. Note that this is the exact opposite of what we filtered before.
You can use <
, >
, <=
, >=
, ==
, and !=
in similar ways to filter your data. Try a few examples on your own to get comfortable with the different filtering options!
A note on storing your results
By default, dplyr filter will perform the operation you ask and then print the result to the screen. If you prefer to store the result in a variable, you’ll need to assign it as follows:
e_diamonds <- filter(diamonds, color == 'E')
Note that you can also overwrite the dataset (that is, assign the result back to the diamonds
data frame) if you don’t want to retain the unfiltered data. In this case I want to keep it, so I’ll store this result in e_diamonds
. In any case, it’s always a good idea to preview your dplyr filter
results before you overwrite any data!
Filtering Numeric Variables
Numeric variables are the quantitative variables in a dataset. In the diamonds dataset, this includes the variables carat and price, among others. When working with numeric variables, it is easy to filter based on ranges of values. For example, if we wanted to get any diamonds priced between 1000 and 1500, we could easily filter as follows:
filter(diamonds, price >= 1000 & price <= 1500) ## # A tibble: 5,511 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.38 Very Good E VVS2 61.8 56 1000 4.66 4.68 2.88 ## 2 0.39 Very Good F VS1 57.1 61 1000 4.86 4.91 2.79 ## 3 0.38 Very Good E VS1 61.5 58 1000 4.64 4.69 2.87 ## 4 0.38 Premium E VS1 60.7 59 1000 4.65 4.7 2.84 ## 5 0.38 Ideal E VS1 61.6 56 1000 4.65 4.67 2.87 ## 6 0.53 Very Good G SI2 62.5 55 1000 5.14 5.19 3.23 ## 7 0.570 Very Good I SI2 62.1 57 1000 5.29 5.33 3.3 ## 8 0.38 Ideal E VS1 61.9 56 1000 4.63 4.67 2.88 ## 9 0.5 Good E SI2 63.2 61 1000 5.02 5.05 3.18 ## 10 0.3 Ideal D VVS1 61.3 57 1000 4.29 4.32 2.64 ## # … with 5,501 more rows
In general, when working with numeric variables, you’ll most often make use of the inequality operators, >
, <
, >=
, and <=
. While it is possible to use the ==
and !=
operators with numeric variables, I generally recommend against it.
The issue with using ==
is that it will only return true of the value is exactly equal to what you’re testing for. If the dataset you’re testing against consists of integers, this is possible, but if you’re dealing with decimals, this will often break down. For example, 1.0100000001 == 1.01
will evaluate to FALSE
. This is technically true, but it’s easy to get into trouble with decimal precision. I never use ==
when working with numerical variables unless the data I am working with consists of integers only!
Filtering Categorical Variables
Categorical variables are non-quantitative variables. In our example dataset, the columns cut, color, and clarity are categorical variables. In contrast to numerical variables, the inequalities >
, <
, >=
and <=
have no meaning here. Instead, you’ll make frequent use of the ==
, !=
, and %in%
operators when filtering categorical variables.
Above, we filtered the dataset to include only the diamonds whose cut was Ideal using the ==
operator. Let’s say that we wanted to expand this filter to also include diamonds where the cut is Premium. To accomplish this, we would use the %in%
operator:
filter(diamonds, cut %in% c('Ideal', 'Premium')) ## # A tibble: 35,342 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 3 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 ## 4 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 ## 5 0.22 Premium F SI1 60.4 61 342 3.88 3.84 2.33 ## 6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 ## 7 0.2 Premium E SI2 60.2 62 345 3.79 3.75 2.27 ## 8 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68 ## 9 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68 ## 10 0.24 Premium I VS1 62.5 57 355 3.97 3.94 2.47 ## # … with 35,332 more rows
How does this work? First, we create a vector of our desired cut options, c('Ideal', 'Premium')
. Then, we use %in% to filter only those diamonds whose cut is in that vector. dplyr will filter out BOTH those diamonds whose cut is Ideal AND those diamonds whose cut is Premium. The vector you check against for the %in% function can be arbitrarily long, which can be very useful when working with categorical data.
It’s also important to note that the vector can be defined before you perform the dplyr filter operation:
cuts_to_include <- c('Good', 'Very Good', 'Ideal', 'Premium') filter(diamonds, cut %in% cuts_to_include) ## # A tibble: 52,330 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 ## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 ## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 ## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 ## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 ## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 ## 9 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 ## 10 0.3 Good J SI1 64 55 339 4.25 4.28 2.73 ## # … with 52,320 more rows
This helps to increase the readability of your code when you’re filtering against a larger set of potential options. This also means that if you have an existing vector of options from another source, you can use this to filter your dataset. This can come in very useful as you start working with multiple datasets in a single analysis!
Chaining together multiple filtering operations with logical operators
The real power of the dplyr filter function is in its flexibility. Using the logical operators &, |, and !, we can group many filtering operations in a single command to get the exact dataset we want!
Let’s say we want to select all diamonds where the cut is Ideal and the carat is greater than 1:
filter(diamonds, cut == 'Ideal' & carat > 1) ## # A tibble: 5,662 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 1.01 Ideal I I1 61.5 57 2844 6.45 6.46 3.97 ## 2 1.02 Ideal H SI2 61.6 55 2856 6.49 6.43 3.98 ## 3 1.02 Ideal I I1 61.7 56 2872 6.44 6.49 3.99 ## 4 1.02 Ideal J SI2 60.3 54 2879 6.53 6.5 3.93 ## 5 1.01 Ideal I I1 61.5 57 2896 6.46 6.45 3.97 ## 6 1.02 Ideal I I1 61.7 56 2925 6.49 6.44 3.99 ## 7 1.14 Ideal J SI1 60.2 57 3045 6.81 6.71 4.07 ## 8 1.02 Ideal H SI2 58.8 57 3142 6.61 6.55 3.87 ## 9 1.06 Ideal I SI2 62.8 55 3146 6.51 6.46 4.07 ## 10 1.02 Ideal I VS2 62.8 57 3148 6.45 6.39 4.03 ## # … with 5,652 more rows
BOTH conditions must evaluate to TRUE
for the data to be selected. That is, the cut must be Ideal, and the carat must be greater than 1.
You don’t need to limit yourself to two conditions either. You can have as many as you want! Let’s say we also wanted to make sure the color of the diamond was E. We can extend our example:
filter(diamonds, cut == 'Ideal' & carat > 1 & color == 'E') ## # A tibble: 531 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 1.25 Ideal E I1 60.9 56 3276 6.95 6.91 4.22 ## 2 1.01 Ideal E I1 62 57 3388 6.37 6.41 3.96 ## 3 1.01 Ideal E I1 62 57 3450 6.41 6.37 3.96 ## 4 1.02 Ideal E SI2 62.3 56 3455 6.42 6.37 3.98 ## 5 1.04 Ideal E SI2 59 57 3588 6.65 6.6 3.91 ## 6 1.13 Ideal E I1 62 55 3729 6.66 6.7 4.14 ## 7 1.09 Ideal E SI2 59.4 57 3760 6.74 6.65 3.98 ## 8 1.13 Ideal E I1 62 55 3797 6.7 6.66 4.14 ## 9 1.12 Ideal E SI2 60.9 57 3864 6.66 6.6 4.04 ## 10 1.1 Ideal E I1 61.9 56 3872 6.59 6.63 4.09 ## # … with 521 more rows
What if we wanted to select rows where the cut is ideal OR the carat is greater than 1? Then we’d use the | operator!
filter(diamonds, cut == 'Ideal' | carat > 1) ## # A tibble: 33,391 x 10 ## carat cut color clarity depth table price x y z ## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> ## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 ## 2 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46 ## 3 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 ## 4 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68 ## 5 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78 ## 6 0.33 Ideal I SI2 61.2 56 403 4.49 4.5 2.75 ## 7 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76 ## 8 0.23 Ideal G VS1 61.9 54 404 3.93 3.95 2.44 ## 9 0.32 Ideal I SI1 60.9 55 404 4.45 4.48 2.72 ## 10 0.3 Ideal I SI2 61 59 405 4.3 4.33 2.63 ## # … with 33,381 more rows
Any time you want to filter your dataset based on some combination of logical statements, this is possibly using the dplyr filter
function and R’s built-in logical parameters. You just need to figure out how to combine your logical expressions to get exactly what you want!
Conclusion
dplyr filter
is one of my most-used functions in R in general, and especially when I am looking to filter in R. With this article you should have a solid overview of how to filter a dataset, whether your variables are numerical, categorical, or a mix of both. Practice what you learned right now to make sure you cement your understanding of how to effectively filter in R using dplyr!
Did you find this post useful? I frequently write tutorials like this one to help you learn new skills and improve your data science. If you want to be notified of new tutorials, sign up here!
I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my consulting website or contact me at michael@michaeltothanalytics.com!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.