Filter data frame rows
We often want to operate only on a specific subset of rows of a data frame. The dplyr filter()
function provides a flexible way to extract the rows of interest based on multiple conditions.
- Use the filter() function to select the rows of a data frame that fulfill a specified condition
- Filter a data frame by multiple conditions
filter(my_data_frame, condition)
filter(my_data_frame, condition_one, condition_two, ...)
The filter() function
The filter() function takes a data frame and one or more filtering expressions as input parameters. It processes the data frame and keeps only the rows that fulfill all of the given filtering expressions. These expressions act as rules that decide, row by row, whether a row is kept. In most cases, they are based on relational operators such as == or >. As an example, we could filter the pres_results data frame and keep only the rows where the state variable is equal to "CA" (California):
filter(pres_results, state == "CA")
# A tibble: 11 x 6
    year state total_votes   dem   rep  other
   <dbl> <chr>       <dbl> <dbl> <dbl>  <dbl>
 1  1976 CA        7803770 0.480 0.497 0.0230
 2  1980 CA        8582938 0.359 0.527 0.114
 3  1984 CA        9505041 0.413 0.575 0.0122
 4  1988 CA        9887065 0.476 0.511 0.0131
 5  1992 CA       11131721 0.460 0.326 0.213
 6  1996 CA       10019469 0.511 0.382 0.107
 7  2000 CA       10965822 0.534 0.417 0.0490
 8  2004 CA       12421353 0.543 0.444 0.0117
 9  2008 CA       13561900 0.610 0.370 0.0188
10  2012 CA       13038547 0.602 0.371 0.0246
11  2016 CA       14181595 0.617 0.316 0.0581
In the output, we can compare the election results in California for different years.
As another example, we could filter the pres_results data frame and keep only those rows where the dem variable (the share of votes for the Democratic Party, stored as a proportion between 0 and 1) is greater than 0.85:
filter(pres_results, dem > 0.85)
# A tibble: 7 x 6
   year state total_votes   dem    rep   other
  <dbl> <chr>       <dbl> <dbl>  <dbl>   <dbl>
1  1984 DC         211288 0.854 0.137  0.00886
2  1996 DC         185726 0.852 0.0934 0.0513
3  2000 DC         201894 0.852 0.0895 0.0563
4  2004 DC         227586 0.892 0.0934 0.0125
5  2008 DC         265853 0.925 0.0653 0.00582
6  2012 DC         293764 0.909 0.0728 0.0155
7  2016 DC         312575 0.905 0.0407 0.0335
The output lists every state and election year in which the Democratic Party got over 85% of the votes. All seven rows belong to the District of Columbia (also known as Washington, D.C.), so we could say that the Democratic Party has a solid voter base there.
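Since pres_results is not defined in this excerpt, the two filters above can be tried on a small stand-in. The following is a minimal sketch, assuming only that dplyr is installed; toy_results and its values are made up for illustration and are not the real election data. It also shows one behavior worth knowing: rows for which the condition evaluates to NA are dropped, not kept.

```r
library(dplyr)

# Hypothetical toy data (NOT the real pres_results data set)
toy_results <- data.frame(
  year  = c(2012, 2012, 2016, 2016),
  state = c("CA", "DC", "CA", "DC"),
  dem   = c(0.60, 0.91, NA, 0.90)
)

# Relational operator on a character column: keep rows where state is "CA"
ca_rows <- filter(toy_results, state == "CA")

# Relational operator on a numeric column: keep rows where dem exceeds 0.85.
# For the row with dem = NA the condition is NA rather than TRUE,
# so filter() drops that row.
high_dem <- filter(toy_results, dem > 0.85)

print(ca_rows)
print(high_dem)
```

If rows with missing values should be kept as well, the condition has to say so explicitly, for example with is.na(dem) | dem > 0.85.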
Exercise: Use filter() with a single expression
The gapminder dataset contains economic and demographic data about various countries since 1952. Inspect the data for a single year by using the filter() function.
- Apply the filter() function to the gapminder dataset
- Keep only the rows where the year is equal to 2007
Note that the dplyr and gapminder packages are already loaded.
Quiz: filter() Function
Which of the following statements about the filter() function are correct?
- Relational operators, such as == or >, are frequently part of the filtering expressions.
- The filter() function comes in the dplyr package.
- Only numeric variables can be filtered.
- The filter() function works only on data frames, not on tibbles.
Multiple filter expressions
filter(my_data_frame, condition)
filter(my_data_frame, condition_one, condition_two, ...)
The filter() function can also take multiple filtering rules as input. Rules separated by commas are combined as if they were joined with the & operator: a row is included in the output only if it fulfills all of the rules. In the following example, we filter the pres_results data frame for all rows where the state variable is equal to "CA" and the year variable is equal to 2016:
filter(pres_results, state == "CA", year == 2016)
# A tibble: 1 x 6
   year state total_votes   dem   rep  other
  <dbl> <chr>       <dbl> <dbl> <dbl>  <dbl>
1  2016 CA       14181595 0.617 0.316 0.0581
We get a single row as output, containing the 2016 US presidential election results for the state of California.
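The equivalence between comma-separated rules and the & operator can be checked directly. The sketch below again uses a made-up toy_results data frame (hypothetical values, not the real pres_results data) and assumes only that dplyr is installed. For contrast, it also shows the | operator, which keeps a row if at least one of the rules is fulfilled.

```r
library(dplyr)

# Hypothetical toy data (NOT the real pres_results data set)
toy_results <- data.frame(
  year  = c(2012, 2012, 2016, 2016),
  state = c("CA", "DC", "CA", "DC"),
  dem   = c(0.60, 0.91, 0.62, 0.90)
)

# Two rules separated by a comma: both must be fulfilled
with_comma <- filter(toy_results, state == "CA", year == 2016)

# The same rules written as one expression joined with &
with_and <- filter(toy_results, state == "CA" & year == 2016)

# For contrast: | keeps rows that fulfill at least one of the rules
either <- filter(toy_results, state == "CA" | year == 2016)

print(with_comma)
print(with_and)
print(either)
```

Separating rules with commas tends to read better once there are more than two of them, while & and | are needed when "and" and "or" have to be mixed in one expression.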
Exercise: Use filter() with multiple rules
The gapminder dataset contains economic and demographic data about various countries since 1952. Filter the tibble and inspect which countries had a life expectancy over 80 years in the year 2007! The required packages are already loaded.
- Use the filter() function on the gapminder tibble.
- Filter all rows where the year variable is equal to 2007 and the life expectancy lifeExp is greater than 80!
Exercise
The gapminder dataset contains economic and demographic data about various countries since 1952. Filter the gapminder tibble and inspect which countries had a population of over 1,000,000,000 in the year 2007! The required packages are already loaded.
- Use the filter() function on the gapminder tibble.
- Filter all rows where the year variable is equal to 2007 and the population pop is greater than 1000000000!
Filter data frame rows is an excerpt from the course Introduction to R, which is available for free at quantargo.com