Sort data frames by columns
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
To find the rows of interest in a data frame, it often helps to order the rows by specific columns. The dplyr arrange() function supports ordering data frames by multiple columns in ascending and descending order.
- Use the arrange() function to sort data frames.
- Sort data frames by multiple columns using arrange().
arrange(<data>, <column>)
arrange(<data>, <column_1>, <column_2>, ...)
The arrange() function with a single column
The arrange() function orders the rows of a data frame. It takes a data frame or a tibble as the first parameter and, as additional parameters, the names of the columns by which the rows should be ordered. Suppose we want to answer the question: Which states had the highest percentage of Republican voters in the 2016 US presidential election? To answer it, the following example uses the pres_results_2016 data frame, which contains results only for the 2016 US presidential election. We arrange() the data frame based on the rep column (the Republican share of the votes):
arrange(pres_results_2016, rep)
# A tibble: 51 x 6
   year state total_votes   dem    rep  other
  <dbl> <chr>       <dbl> <dbl>  <dbl>  <dbl>
1  2016 DC         312575 0.905 0.0407 0.0335
2  2016 HI         437664 0.610 0.294  0.0958
3  2016 VT         320467 0.557 0.298  0.0737
# … with 48 more rows
As the output shows, the data frame is sorted in ascending order based on the rep column. However, we would prefer descending order, so that we can immediately see the state with the highest rep value. To sort by a column in descending order, wrap that column in the desc() function inside arrange():
arrange(pres_results_2016, desc(rep))
# A tibble: 51 x 6
   year state total_votes   dem   rep  other
  <dbl> <chr>       <dbl> <dbl> <dbl>  <dbl>
1  2016 WV         713051 0.265 0.686 0.0489
2  2016 WY         258788 0.216 0.674 0.0830
3  2016 OK        1452992 0.289 0.653 0.0575
# … with 48 more rows
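To try both orderings without the full pres_results_2016 data, here is a minimal sketch on a hand-built tibble. The name results_sample is made up for illustration, and its six rows are copied from the outputs shown above, so it is only a small stand-in for the real data frame.

```r
library(dplyr)

# Hypothetical sample: six state/rep pairs copied from the outputs above
results_sample <- tibble(
  state = c("DC", "HI", "VT", "WV", "WY", "OK"),
  rep   = c(0.0407, 0.294, 0.298, 0.686, 0.674, 0.653)
)

# Ascending order is the default: lowest rep share first
lowest_first <- arrange(results_sample, rep)

# Wrapping the column in desc() reverses the order: highest rep share first
highest_first <- arrange(results_sample, desc(rep))

lowest_first$state
highest_first$state
```

Note that arrange() returns a new tibble and leaves results_sample itself unchanged, which is why the results are assigned to new names here.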
Arranging works not only on numeric values but on character values as well. In that case, dplyr sorts the rows in alphabetical order. We can arrange by character columns just like numeric ones:
arrange(pres_results_2016, state)
# A tibble: 51 x 6
   year state total_votes   dem   rep  other
  <dbl> <chr>       <dbl> <dbl> <dbl>  <dbl>
1  2016 AK         318608 0.366 0.513 0.0928
2  2016 AL        2123372 0.344 0.621 0.0254
3  2016 AR        1130635 0.337 0.606 0.0577
# … with 48 more rows
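Character columns can also be combined with desc(). The following sketch assumes a small hand-built tibble (states_sample is a made-up name) whose state codes and rep values are copied from the outputs above. It sorts once from A to Z and once in reverse.

```r
library(dplyr)

# Hypothetical sample: state codes and rep values copied from the outputs above
states_sample <- tibble(
  state = c("WV", "AL", "DC", "AK", "HI", "AR"),
  rep   = c(0.686, 0.621, 0.0407, 0.513, 0.294, 0.606)
)

# Alphabetical order, A to Z
a_to_z <- arrange(states_sample, state)

# desc() also accepts character columns: reverse alphabetical order
z_to_a <- arrange(states_sample, desc(state))

a_to_z$state
z_to_a$state
```

The codes here are all uppercase, so the result does not depend on how upper- and lowercase letters are ordered relative to each other.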
Exercise: Use arrange() based on a single column
The gapminder_2007 dataset contains economic and demographic data about various countries for the year 2007. Arrange the tibble and inspect which country had the lowest life expectancy (lifeExp) in 2007. The dplyr package is already loaded.
- Apply the arrange() function on the gapminder_2007 tibble.
- Order the tibble based on the lifeExp column.
Exercise: Use arrange() in combination with desc()
The gapminder_2007 dataset contains economic and demographic data about various countries for the year 2007. Arrange the tibble and inspect which countries had the largest population in 2007. The dplyr package is already loaded.
- Apply the arrange() function on the gapminder_2007 tibble.
- Sort the tibble in descending order based on the pop column.
The arrange() function with multiple columns
We can use the arrange() function on multiple columns as well. In this case, the order of the columns in the function parameters sets a hierarchy of ordering. The function starts by ordering the rows based on the first column given. If several rows share the same value in that column, the function decides their order based on the second column. If rows are still tied, the function decides based on the third column (if one is given), and so on.
In the following example we use the pres_results_subset data frame, which contains election results only for the states "TX" (Texas), "UT" (Utah) and "FL" (Florida). First we sort the data frame in ascending order based on the year column. Then we add a second level and order the data frame based on the dem column:
arrange(pres_results_subset, year, dem)
# A tibble: 33 x 6
   year state total_votes   dem   rep   other
  <dbl> <chr>       <dbl> <dbl> <dbl>   <dbl>
1  1976 UT         541218 0.336 0.624 0.0392
2  1976 TX        4071884 0.511 0.480 0.00817
3  1976 FL        3150631 0.519 0.466 0.0143
# … with 30 more rows
As the output shows, the data frame is ordered overall by the year column. However, when the value of year is the same, the order of the rows is decided by the dem column.
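The tie-breaking behaviour can be reproduced on a small scale. The sketch below assumes a hand-built tibble, two_years (a made-up name), that mixes the three 1976 rows and three of the 2016 rows shown in the outputs above. It is not the pres_results_subset data frame itself. It also shows that desc() can be applied to one column without affecting the others.

```r
library(dplyr)

# Hypothetical sample: 1976 and 2016 rows copied from the outputs above
two_years <- tibble(
  year  = c(2016, 1976, 2016, 1976, 1976, 2016),
  state = c("HI", "TX", "DC", "UT", "FL", "VT"),
  dem   = c(0.610, 0.511, 0.905, 0.336, 0.519, 0.557)
)

# First level: year (ascending). Rows with the same year are ordered by dem.
by_year_dem <- arrange(two_years, year, dem)

# desc() applies per column: newest year first, dem still ascending within a year
newest_first <- arrange(two_years, desc(year), dem)

by_year_dem
newest_first
```

In both results the rows of a given year stay together. Only the order of the year groups changes between the two calls.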
Exercise: Use arrange() based on multiple columns
The gapminder_2007 tibble contains economic and demographic data about various countries for the year 2007. Arrange the tibble and inspect, for each continent, which countries had the highest life expectancy in 2007. The dplyr package is already loaded.
- Apply the arrange() function on the gapminder_2007 tibble.
- Order the tibble based on the continent column.
- For rows with the same continent, sort the tibble in descending order based on the lifeExp column.
Quiz: arrange() Function
Which of the following statements are true about the arrange() function?
- The arrange() function orders the rows of a data frame.
- To arrange() the values of a column in ascending order, we need to use the asc() function.
- To arrange() the values of a column in descending order, we need to use the desc() function.
- You can only arrange() a data frame based on one column.
Sort data frames by columns is an excerpt from the course Introduction to R, which is available for free at quantargo.com