R:case4base – Sorting data with base R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
In this post in the R:case4base series we will examine sorting (ordering) data in base R. We will learn to sort our data based on one or multiple columns, with ascending or descending order and as always look at alternatives to base R, namely the tidyverse’s dplyr and data.table to show how we can achieve the same results.
It is recommended to first have a look at the post on subsetting to understand the concepts underlying the sorting process in more detail.
Contents
How to use this article
- This article is best used with an R session opened in a window next to it – you can test and play with the code yourself instantly while reading. Assuming the author did not fail miserably, the code will work as-is even with vanilla R, no packages or setup needed – it is a
case4base
after all! - If you have no time for reading, you can click here to get just the code with commentary
First, let’s read in yearly data on gross disposable income of household in the EU countries into R (click here to download):
gdi <- read.csv( stringsAsFactors = FALSE , url("https://jozefhajnala.gitlab.io/r/post/data/ESA2010_GDI.csv") ) head(gdi[, 1:6, drop = FALSE]) ## country Y.1995 Y.1996 Y.1997 Y.1998 Y.1999 ## 1 EU 28 NA NA NA NA 5982392.8 ## 2 Euro area 19 NA NA NA NA 4393727.3 ## 3 Belgium 140734.1 141599.4 145023.2 149705.2 153804.0 ## 4 Bulgaria 1036.0 1468.1 12367.4 14921.1 16052.8 ## 5 Czech Republic 894042.0 1030001.0 1153966.0 1223783.0 1280040.0 ## 6 Denmark 566363.0 578102.0 591416.0 621236.0 614893.0
Please note that the figures in the data provided by Eurostat are presented in millions of euros for euro area countries, euro area and EU aggregates and in millions of national currency otherwise. This makes comparing the results between countries difficult, since one would need to do a proper time-dependent currency conversion and potentially inflation adjustment to get comparable data.
The goal of the article is therefore not really in presenting these concrete results, but to focus on the technical aspects and usefulness of the presented methods.
Subsetting as a mechanism for sorting data
Sorting a data frame is loosely coupled with subsetting. To get the rows of a data frame in order reverse to the current one, we can just subset the rows with an index that goes from the last row to the very first (or safer, zeroth) like so:
gdi_reversed_rows <- gdi[nrow(gdi):0, ]
We can take a very similar approach to reverse order the columns:
gdi_reversed_cols <- gdi[, ncol(gdi):0]
Or both rows and columns at the same time. We also add the drop = FALSE
for safety here as we omitted it in the 2 above examples for readability:
gdi_reversed <- gdi[nrow(gdi):0, ncol(gdi):0, drop = FALSE] head(gdi_reversed) ## Y.2016 Y.2015 Y.2014 Y.2013 Y.2012 Y.2011 Y.2010 ## 35 NA NA NA NA NA NA NA ## 34 NA 1631795 1438281.4 1268729.8 1081744.9 971545.3 807128.5 ## 33 458641 447094 449119.3 437596.6 428131.2 420404.9 412363.1 ## 32 1627136 1606745 1496128.0 1419380.0 1347970.0 1272065.0 1204442.0 ## 31 NA NA 1055733.5 980494.9 934077.3 872900.3 798916.7 ## 30 1330854 1298475 1269177.0 1219699.0 1195227.0 1160813.0 1151812.0 ## Y.2009 Y.2008 Y.2007 Y.2006 Y.2005 Y.2004 Y.2003 ## 35 NA NA NA NA NA NA NA ## 34 689431.6 NA NA NA NA NA NA ## 33 404446.9 399834.1 389468.0 368868.0 352620.1 341709.9 337742.9 ## 32 1150829.0 1105563.0 1021911.0 943515.0 975153.0 894892.0 854026.0 ## 31 858678.9 909995.1 827339.5 681058.3 631210.9 536194.9 478645.8 ## 30 1101109.0 1080225.0 1063178.0 1005630.0 966175.0 926670.0 893528.0 ## Y.2002 Y.2001 Y.2000 Y.1999 Y.1998 Y.1997 Y.1996 Y.1995 ## 35 NA NA NA NA NA NA NA NA ## 34 NA NA NA NA NA NA NA NA ## 33 335845.6 336581.4 326269.3 312478.7 303239.5 296324.6 291208.4 287865.4 ## 32 800130.0 727228.0 704697.0 660196.0 630865.0 582597.0 549694.0 522981.0 ## 31 447572.6 400145.0 369181.0 NA NA NA NA NA ## 30 857352.0 829908.0 789615.0 737419.0 715396.0 691951.0 656455.0 618959.0 ## country ## 35 Serbia ## 34 Turkey ## 33 Switzerland ## 32 Norway ## 31 Iceland ## 30 United Kingdom
Sorting data by contents of a column
To order the rows (countries) by GDI in 2016, we use the function order
, which finds the permutation that rearranges the values into ascending order and save that order into a variable called rowidx
. Then we simply use rowidx
to subset the rows of gdi
in the order we wanted:
rowidx <- order(gdi[, "Y.2016"]) rowidx ## [1] 13 8 16 18 17 26 27 4 9 28 24 22 3 21 33 11 6 23 14 30 12 32 7 ## [24] 29 5 2 1 10 15 19 20 25 31 34 35 gdi_sorted <- gdi[rowidx, , drop = FALSE] # We can of course do it in one go: gdi_sorted <- gdi[order(gdi[, "Y.2016"]), , drop = FALSE] # Look at the 2 relevant columns of the result gdi_sorted[, c(1, 23)] ## country Y.2016 ## 13 Croatia 0.00 ## 8 Estonia 12548.30 ## 16 Latvia 15737.79 ## 18 Luxembourg 20155.80 ## 17 Lithuania 24743.49 ## 26 Slovenia 24756.63 ## 27 Slovakia 48882.91 ## 4 Bulgaria 60237.00 ## 9 Ireland 97318.90 ## 28 Finland 126590.00 ## 24 Portugal 128789.39 ## 22 Austria 214980.60 ## 3 Belgium 243825.50 ## 21 Netherlands 357383.00 ## 33 Switzerland 458641.00 ## 11 Spain 698701.00 ## 6 Denmark 1091542.00 ## 23 Poland 1136916.00 ## 14 Italy 1142273.40 ## 30 United Kingdom 1330854.00 ## 12 France 1425435.00 ## 32 Norway 1627136.00 ## 7 Germany 2019917.00 ## 29 Sweden 2402587.00 ## 5 Czech Republic 2523229.00 ## 2 Euro area 19 6736686.43 ## 1 EU 28 9454683.60 ## 10 Greece NA ## 15 Cyprus NA ## 19 Hungary NA ## 20 Malta NA ## 25 Romania NA ## 31 Iceland NA ## 34 Turkey NA ## 35 Serbia NA
To order in descending order, we can use decreasing = TRUE
, to see NA
s first we can use na.last = FALSE
rowidx <- order(gdi[, "Y.2016"], decreasing = TRUE, na.last = FALSE) gdi[rowidx, c(1, 23), drop = FALSE] ## country Y.2016 ## 10 Greece NA ## 15 Cyprus NA ## 19 Hungary NA ## 20 Malta NA ## 25 Romania NA ## 31 Iceland NA ## 34 Turkey NA ## 35 Serbia NA ## 1 EU 28 9454683.60 ## 2 Euro area 19 6736686.43 ## 5 Czech Republic 2523229.00 ## 29 Sweden 2402587.00 ## 7 Germany 2019917.00 ## 32 Norway 1627136.00 ## 12 France 1425435.00 ## 30 United Kingdom 1330854.00 ## 14 Italy 1142273.40 ## 23 Poland 1136916.00 ## 6 Denmark 1091542.00 ## 11 Spain 698701.00 ## 33 Switzerland 458641.00 ## 21 Netherlands 357383.00 ## 3 Belgium 243825.50 ## 22 Austria 214980.60 ## 24 Portugal 128789.39 ## 28 Finland 126590.00 ## 9 Ireland 97318.90 ## 4 Bulgaria 60237.00 ## 27 Slovakia 48882.91 ## 26 Slovenia 24756.63 ## 17 Lithuania 24743.49 ## 18 Luxembourg 20155.80 ## 16 Latvia 15737.79 ## 8 Estonia 12548.30 ## 13 Croatia 0.00
Sorting by multiple vectors with different order
That looks good, but we may want to order the rows that have NA
as GDI in 2016 alphabetically by country (or generalize even further). To use multiple vectors for ordering is also very simple:
rowidx <- order(gdi[, "Y.2016"], gdi[, "country"]) head(gdi[rowidx, c(1, 23), drop = FALSE]) ## country Y.2016 ## 13 Croatia 0.00 ## 8 Estonia 12548.30 ## 16 Latvia 15737.79 ## 18 Luxembourg 20155.80 ## 17 Lithuania 24743.49 ## 26 Slovenia 24756.63
To order by multiple columns differently, for numeric vectors we can use a simple -
, since negated numeric vector will order in reverse order. To order our GDI dataset by GDI in 2016 descending and then by country alphabetically:
rowidx <- order(-gdi[, "Y.2016"], gdi[, "country"]) gdi[rowidx, c(1, 23), drop = FALSE] ## country Y.2016 ## 1 EU 28 9454683.60 ## 2 Euro area 19 6736686.43 ## 5 Czech Republic 2523229.00 ## 29 Sweden 2402587.00 ## 7 Germany 2019917.00 ## 32 Norway 1627136.00 ## 12 France 1425435.00 ## 30 United Kingdom 1330854.00 ## 14 Italy 1142273.40 ## 23 Poland 1136916.00 ## 6 Denmark 1091542.00 ## 11 Spain 698701.00 ## 33 Switzerland 458641.00 ## 21 Netherlands 357383.00 ## 3 Belgium 243825.50 ## 22 Austria 214980.60 ## 24 Portugal 128789.39 ## 28 Finland 126590.00 ## 9 Ireland 97318.90 ## 4 Bulgaria 60237.00 ## 27 Slovakia 48882.91 ## 26 Slovenia 24756.63 ## 17 Lithuania 24743.49 ## 18 Luxembourg 20155.80 ## 16 Latvia 15737.79 ## 8 Estonia 12548.30 ## 13 Croatia 0.00 ## 15 Cyprus NA ## 10 Greece NA ## 19 Hungary NA ## 31 Iceland NA ## 20 Malta NA ## 25 Romania NA ## 35 Serbia NA ## 34 Turkey NA
For non-numeric vectors, we can take advantage of the xtfrm
function, which returns a numeric vector which will sort in the same order as the one provided to it. Then we just use -
to get a vector that will order in reverse order. To order our GDI dataset by GDI ascending in 2016 and then by country reverse-alphabetically:
rowidx <- order(gdi[, "Y.2016"], -xtfrm(gdi[, "country"])) gdi[rowidx, c(1, 23), drop = FALSE] ## country Y.2016 ## 13 Croatia 0.00 ## 8 Estonia 12548.30 ## 16 Latvia 15737.79 ## 18 Luxembourg 20155.80 ## 17 Lithuania 24743.49 ## 26 Slovenia 24756.63 ## 27 Slovakia 48882.91 ## 4 Bulgaria 60237.00 ## 9 Ireland 97318.90 ## 28 Finland 126590.00 ## 24 Portugal 128789.39 ## 22 Austria 214980.60 ## 3 Belgium 243825.50 ## 21 Netherlands 357383.00 ## 33 Switzerland 458641.00 ## 11 Spain 698701.00 ## 6 Denmark 1091542.00 ## 23 Poland 1136916.00 ## 14 Italy 1142273.40 ## 30 United Kingdom 1330854.00 ## 12 France 1425435.00 ## 32 Norway 1627136.00 ## 7 Germany 2019917.00 ## 29 Sweden 2402587.00 ## 5 Czech Republic 2523229.00 ## 2 Euro area 19 6736686.43 ## 1 EU 28 9454683.60 ## 34 Turkey NA ## 35 Serbia NA ## 25 Romania NA ## 20 Malta NA ## 31 Iceland NA ## 19 Hungary NA ## 10 Greece NA ## 15 Cyprus NA
Alternatives to base R
Using the tidyverse
The dplyr
package comes with a set of very user-friendly functions that are very easy to use, especially in an interactive setting where we know the column names up front, so we can take advantage of the non-standard evaluation:
library(dplyr) gdi %>% arrange(Y.2016, desc(country)) %>% select(1, 23) ## country Y.2016 ## 1 Croatia 0.00 ## 2 Estonia 12548.30 ## 3 Latvia 15737.79 ## 4 Luxembourg 20155.80 ## 5 Lithuania 24743.49 ## 6 Slovenia 24756.63 ## 7 Slovakia 48882.91 ## 8 Bulgaria 60237.00 ## 9 Ireland 97318.90 ## 10 Finland 126590.00 ## 11 Portugal 128789.39 ## 12 Austria 214980.60 ## 13 Belgium 243825.50 ## 14 Netherlands 357383.00 ## 15 Switzerland 458641.00 ## 16 Spain 698701.00 ## 17 Denmark 1091542.00 ## 18 Poland 1136916.00 ## 19 Italy 1142273.40 ## 20 United Kingdom 1330854.00 ## 21 France 1425435.00 ## 22 Norway 1627136.00 ## 23 Germany 2019917.00 ## 24 Sweden 2402587.00 ## 25 Czech Republic 2523229.00 ## 26 Euro area 19 6736686.43 ## 27 EU 28 9454683.60 ## 28 Turkey NA ## 29 Serbia NA ## 30 Romania NA ## 31 Malta NA ## 32 Iceland NA ## 33 Hungary NA ## 34 Greece NA ## 35 Cyprus NA
If we need to provide the names of the columns instead, we can use the arrange_at
:
gdi %>% arrange_at("country", desc) %>% arrange_at("Y.2016") %>% select(1, 23) ## country Y.2016 ## 1 Croatia 0.00 ## 2 Estonia 12548.30 ## 3 Latvia 15737.79 ## 4 Luxembourg 20155.80 ## 5 Lithuania 24743.49 ## 6 Slovenia 24756.63 ## 7 Slovakia 48882.91 ## 8 Bulgaria 60237.00 ## 9 Ireland 97318.90 ## 10 Finland 126590.00 ## 11 Portugal 128789.39 ## 12 Austria 214980.60 ## 13 Belgium 243825.50 ## 14 Netherlands 357383.00 ## 15 Switzerland 458641.00 ## 16 Spain 698701.00 ## 17 Denmark 1091542.00 ## 18 Poland 1136916.00 ## 19 Italy 1142273.40 ## 20 United Kingdom 1330854.00 ## 21 France 1425435.00 ## 22 Norway 1627136.00 ## 23 Germany 2019917.00 ## 24 Sweden 2402587.00 ## 25 Czech Republic 2523229.00 ## 26 Euro area 19 6736686.43 ## 27 EU 28 9454683.60 ## 28 Turkey NA ## 29 Serbia NA ## 30 Romania NA ## 31 Malta NA ## 32 Iceland NA ## 33 Hungary NA ## 34 Greece NA ## 35 Cyprus NA
Using data.table
There are multiple ways to achieve the desired results with data.table, the one syntactically similar to base R is:
library(data.table) gdidt <- as.data.table(gdi) gdidt[order(Y.2016, -country), c(1, 23)] ## country Y.2016 ## 1: Croatia 0.00 ## 2: Estonia 12548.30 ## 3: Latvia 15737.79 ## 4: Luxembourg 20155.80 ## 5: Lithuania 24743.49 ## 6: Slovenia 24756.63 ## 7: Slovakia 48882.91 ## 8: Bulgaria 60237.00 ## 9: Ireland 97318.90 ## 10: Finland 126590.00 ## 11: Portugal 128789.39 ## 12: Austria 214980.60 ## 13: Belgium 243825.50 ## 14: Netherlands 357383.00 ## 15: Switzerland 458641.00 ## 16: Spain 698701.00 ## 17: Denmark 1091542.00 ## 18: Poland 1136916.00 ## 19: Italy 1142273.40 ## 20: United Kingdom 1330854.00 ## 21: France 1425435.00 ## 22: Norway 1627136.00 ## 23: Germany 2019917.00 ## 24: Sweden 2402587.00 ## 25: Czech Republic 2523229.00 ## 26: Euro area 19 6736686.43 ## 27: EU 28 9454683.60 ## 28: Turkey NA ## 29: Serbia NA ## 30: Romania NA ## 31: Malta NA ## 32: Iceland NA ## 33: Hungary NA ## 34: Greece NA ## 35: Cyprus NA ## country Y.2016
Another option is to take advantage of the setorderv
method provided by data.table. The important distinction is that this will sort the existing data.table in place, changing the source object. The other methods used above leave the source object untouched:
# This will sort the gdidt by reference - changing the input object setorderv(gdidt, c("Y.2016", "country"), c(1, -1), na.last = TRUE) # So we now just subset the (already sorted) gdidt gdidt[, c(1, 23)] ## country Y.2016 ## 1: Croatia 0.00 ## 2: Estonia 12548.30 ## 3: Latvia 15737.79 ## 4: Luxembourg 20155.80 ## 5: Lithuania 24743.49 ## 6: Slovenia 24756.63 ## 7: Slovakia 48882.91 ## 8: Bulgaria 60237.00 ## 9: Ireland 97318.90 ## 10: Finland 126590.00 ## 11: Portugal 128789.39 ## 12: Austria 214980.60 ## 13: Belgium 243825.50 ## 14: Netherlands 357383.00 ## 15: Switzerland 458641.00 ## 16: Spain 698701.00 ## 17: Denmark 1091542.00 ## 18: Poland 1136916.00 ## 19: Italy 1142273.40 ## 20: United Kingdom 1330854.00 ## 21: France 1425435.00 ## 22: Norway 1627136.00 ## 23: Germany 2019917.00 ## 24: Sweden 2402587.00 ## 25: Czech Republic 2523229.00 ## 26: Euro area 19 6736686.43 ## 27: EU 28 9454683.60 ## 28: Turkey NA ## 29: Serbia NA ## 30: Romania NA ## 31: Malta NA ## 32: Iceland NA ## 33: Hungary NA ## 34: Greece NA ## 35: Cyprus NA ## country Y.2016
Quick benchmarking
For a quick overview, lets look at a basic benchmark without package loading overhead for each of the mentioned packages. To do the benchmarking, we will use a very slightly modified flights
data frame provided by Hadley Wickham’s nycflights13 package.
library(data.table) library(dplyr) dataurl <- "https://jozefhajnala.gitlab.io/r/post/data/" flights <- readRDS(url(paste0(dataurl, "r006/flights.rds"))) flightsdt <- as.data.table(flights) bench <- microbenchmark::microbenchmark( times = 100, base_order = {flights[order(flights[, "flight"], -xtfrm(flights[, "carrier"])), ] }, dt_oder = {flightsdt[order(flight, -carrier), ] }, dplyr_nse = {flights %>% arrange(flight, desc(carrier)) }, dplyr_scoped = {flights %>% arrange_at("carrier", desc) %>% arrange_at("flight") } )
Under our particular circumstances, base R’s method seems to be the lowest of the options with data.table being the fastest.
TL;DR - Just want the code
No time for reading? Click here to get just the code with commentary
References
- post on subsetting on this blog
- documentation for base::order
- dplyr’s arrange function reference
- introduction to data.table
- original eurostat data source
Did you find the article helpful or interesting? Help others find it by sharing
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.