model cyl vs gear mpg disp #> 7 Duster 360 8 0 3 14.3 360.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 10 Merc 280 6 1 4 19.2 167.6 #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 9 Merc 230 4 1 4 22.8 140.8 #> 8 Merc 240D 4 1 4 24.4 146.7 #> 11 Cadillac Fleetwood 8 0 3 NA NA #> 12 Honda Civic 4 1 4 NA NA # order by two columns, first descending, second ascending: mycars[order(-mycars$cyl, mycars$mpg), ] #> model cyl vs gear mpg disp #> 7 Duster 360 8 0 3 14.3 360.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 11 Cadillac Fleetwood 8 0 3 NA NA #> 6 Valiant 6 1 3 18.1 225.0 #> 10 Merc 280 6 1 4 19.2 167.6 #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 9 Merc 230 4 1 4 22.8 140.8 #> 8 Merc 240D 4 1 4 24.4 146.7 #> 12 Honda Civic 4 1 4 NA NA Important to note is that, as default, NAs are sorted to bottom of the data.frame (read more about sorting NA values below). Base R’s order() has an argument na.last which is set to TRUE by default and sorts NA to the bottom: mycars[order(mycars$mpg),] #> model cyl vs gear mpg disp #> 7 Duster 360 8 0 3 14.3 360.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 10 Merc 280 6 1 4 19.2 167.6 #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 9 Merc 230 4 1 4 22.8 140.8 #> 8 Merc 240D 4 1 4 24.4 146.7 #> 11 Cadillac Fleetwood 8 0 3 NA NA #> 12 Honda Civic 4 1 4 NA NA Setting na.last = FALSE sorts NAs to the top: mycars[order(mycars$mpg, na.last = FALSE),] #> model cyl vs gear mpg disp #> 11 Cadillac Fleetwood 8 0 3 NA NA #> 12 Honda Civic 4 1 4 NA NA #> 7 Duster 360 8 0 3 14.3 360.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 10 Merc 280 6 1 4 19.2 167.6 #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 9 Merc 230 4 1 4 22.8 140.8 #> 8 Merc 240D 4 1 4 24.4 146.7 A bit surprisingly na.last can also be set NA which will remove NAs from the vector or data.frame making it a combined filter and ordering operation: mycars[order(mycars$mpg, na.last = NA),] #> model cyl vs gear mpg disp #> 7 Duster 360 8 0 3 14.3 360.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 10 Merc 280 6 1 4 19.2 167.6 #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 9 Merc 230 4 1 4 22.8 140.8 #> 8 Merc 240D 4 1 4 24.4 146.7 For some users this behavior might not be very intuitive, which is why splitting ordering and filtering operations is preferable for code clarity. This is especially relevant for users coming from ‘dplyr’ or ‘data.table’. dplyr::arrange() always orders NAs last and doesn’t come with an option to change this behavior. data.table::setorder() has an argument na.last, but (1) it can only be set TRUE or FALSE and (2) it defaults to the former which is the opposite of what base::order() does: mycarsDT model cyl vs gear mpg disp #> 1: Cadillac Fleetwood 8 0 3 NA NA #> 2: Honda Civic 4 1 4 NA NA #> 3: Duster 360 8 0 3 14.3 360.0 #> 4: Valiant 6 1 3 18.1 225.0 #> 5: Hornet Sportabout 8 0 3 18.7 360.0 #> 6: Merc 280 6 1 4 19.2 167.6 #> 7: Mazda RX4 6 0 4 21.0 160.0 #> 8: Mazda RX4 Wag 6 0 4 21.0 160.0 #> 9: Hornet 4 Drive 6 1 3 21.4 258.0 #> 10: Datsun 710 4 1 4 22.8 108.0 #> 11: Merc 230 4 1 4 22.8 140.8 #> 12: Merc 240D 4 1 4 24.4 146.7 How different vector types are sorted and some more information about what exactly happens, when we call df[order(variable), ] can be found in the info box below. How does the above syntax work? Lets look at a three-column data.frame dat x y z #> 1 100 a TRUE #> 2 1 c FALSE #> 3 10 b TRUE The output of order(dat$x) shows us the row numbers in which we would have to order our data.frame to make the values in dat$x run from smallest to highest. order(dat$x) #> [1] 2 3 1 We can read this as: “The second row should come first, the third row should come second, and the first row should come last”. To actually order the rows according to this logic, we subset the data.frame by its rows according to the logic df[row_index, ]: dat x y z #> 2 1 c FALSE #> 3 10 b TRUE #> 1 100 a TRUE Now the rows are in the desired order, but note, that the row indices still correspond to the original indices. This makes it (relatively) easy to restore the old ordering: dat x y z #> 1 100 a TRUE #> 2 1 c FALSE #> 3 10 b TRUE Note hat the logic of ordering is the same for the most common data types. order() sorts the values from smallest to largest. For integer and double vectors (including Dates and date times, like POSIXct) this is pretty straightforward. Also for character vectors the logic is simple: "a" is “smaller” than "b", so the ascending order goes from A to Z, with small coming before capital letters. We can always verify how character values relate: "a" > "A" #> [1] FALSE For logical vectors FALSE can be read as 0 and TRUE as 1 which is the ascending order in which the values will be sorted. Although this makes perfectly sense, given that this is the way logical vectors are coerced to numeric, we will see later that this might be confusing when specifying explict values to sort by. Finally, the only special case are factor variables. Here the sorting follows the factor levels(): dat$y x y z #> 3 10 b TRUE #> 1 100 a TRUE #> 2 1 c FALSE 2. Ordering by a character vector with matching names Sometimes sorting by numeric and character variables in ascending or descending order is not enough. There are cases where we have a given non-alphabetical order of names which we want to apply to our data. In this case we use factor variables. Let’s assume we have a given order of model names that we want to sort our data by. Then we have two choices. We either transform the mycars$model column into a factor and supply our desired order of names as factor levels (see info box: “the logic of ordering rows” above). # a character vector with matching names my_vec model cyl vs gear mpg disp #> 3 Datsun 710 4 1 4 22.8 108.0 #> 9 Merc 230 4 1 4 22.8 140.8 #> 8 Merc 240D 4 1 4 24.4 146.7 #> 12 Honda Civic 4 1 4 NA NA #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 10 Merc 280 6 1 4 19.2 167.6 #> 7 Duster 360 8 0 3 14.3 360.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 11 Cadillac Fleetwood 8 0 3 NA NA Forgetting to unname() our list of vectors can lead to problems, which will be elaborated in the warning box below. The documentation of order() explicitly contains a warning, reminding us to unname() data.frames that we pass to do.call("order", df). In most cases, forgetting to unname() a data.frame won’t get us in trouble. If, however, one of the column names corresponds to an argument of order(), that is na.last, decreasing or method, forgetting to unname() the data.frame will in most cases do something we don’t expect, and probably even notice. Let’s create a small toy data.frame with three columns, x, y and decreasing, the latter indicating whether x - y is greater or less than zero: dat x y decreasing #> 1 3 5 FALSE #> 2 3 2 TRUE #> 3 6 4 TRUE When we want to order by all columns in ascending order, using the do.call() approach showed above, we get the following result: dat[do.call("order", dat), ] #> x y decreasing #> 1 3 5 FALSE #> 2 3 2 TRUE #> 3 6 4 TRUE When do.call() is constructing the call, it is actually evaluating this call: #> order(list(x = c(3, 3, 6), y = c(5, 2, 4), decreasing = c(FALSE, #> TRUE, TRUE))) As we can see, the third column is passed to order()s decreasing argument. It is interesting to note that this doesn’t seem to throw an error, although we order by two vectors and pass three Boolean values to decreasing. When we unname() our data.frame before passing it to do.call(), we get the correct (and slightly different) result: dat[do.call("order", unname(dat)), ] #> x y decreasing #> 2 3 2 TRUE #> 1 3 5 FALSE #> 3 6 4 TRUE Under the hood, this evaluates to the correct call: #> order(list(c(3, 3, 6), c(5, 2, 4), c(FALSE, TRUE, TRUE))) To sum-up: it is good practice to unname() data.frames when passing them to do.call(). 6. Ordering by a list or vector of column names Similar to sorting a data.frame by all variables, we sometimes have a vector of variables names we want to sort by. Here we can apply the same approach as above and use do.call("order", my_df[,mycols]) on our data.frame. Let’s further assume that we want to sort some columns ascending and some descending. In this case we combine both arguments, the vectors to sort by and their decreasing order, in a list and supply it to do.call("order", our_list_of_arguments): # Let's say we have the names of the columns ... # ... we want to order by in a vector mycols 9 Merc 230 4 1 4 22.8 140.8 #> 10 Merc 280 6 1 4 19.2 167.6 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 7 Duster 360 8 0 3 14.3 360.0 #> 11 Cadillac Fleetwood 8 0 3 NA NA #> 12 Honda Civic 4 1 4 NA NA Summing up: Ordering rows in base R Ordering in base R boils down to subsetting a data.frame by itself in a different order. We create this new order either by applying order() directly to one or several variables or expressions, or by wrapping it in a do.call() together with a list of arguments. While the former can be considered an easy, straightforward operation, the later requires quite some knowledge about constructing calls with do.call() and the possible pitfalls we might encounter - think of: unname(). Nevertheless, once useRs have understood the advanced concept of do.call() and how to use it, more advanced ordering operations can be tackled easily well. ‘data․table’ When it comes to orderings rows ‘data.table’ is not much different than base R. Most of the ordering operations introduced above can be applied almost identically on a data.table. While the syntax resembles base R, ‘data.table’ is using its own implementation of order() under the hood, data.table:::forder(), which is optimized and much faster compared to base R. In this section we will first look at how to use order() on the seven examples from above. The aim is to stay close to base R, but account for ‘data.table’s syntax specific features. Apart from order(), ‘data.table’ comes with its two own ordering functions, setorder() and setorderv(), which modify a data.table object by reference - that is without making a copy. This makes them more memory efficient compared to the already optimized implementation of data.table:::forder(). The following code chucks use a data.table version of our data: library(data.table) mycarsDT 1: Duster 360 8 0 3 14.3 360.0 #> 2: Valiant 6 1 3 18.1 225.0 #> 3: Hornet Sportabout 8 0 3 18.7 360.0 #> 4: Merc 280 6 1 4 19.2 167.6 #> 5: Mazda RX4 6 0 4 21.0 160.0 #> 6: Mazda RX4 Wag 6 0 4 21.0 160.0 #> 7: Hornet 4 Drive 6 1 3 21.4 258.0 #> 8: Datsun 710 4 1 4 22.8 108.0 #> 9: Merc 230 4 1 4 22.8 140.8 #> 10: Merc 240D 4 1 4 24.4 146.7 #> 11: Cadillac Fleetwood 8 0 3 NA NA #> 12: Honda Civic 4 1 4 NA NA Note that, unlike base::order(), ‘data.table’s implementation doesn’t keep track of the ordered rows. The rownames range from 1 to 12 like before (in base R the rownames showed for each row where it was originally coming from, allowing us to restore the order). With the above syntax features in mind, we can rewrite the first five ordering examples as follows: # 1. Ordering by one or several variables mycarsDT[order(mpg)] mycarsDT[order(-cyl, mpg)] # 2. Ordering by a character vector with matching names mycarsDT[order(factor(model, levels = my_vec))] # 3. Ordering by a simple expression mycarsDT[order(model != "Hornet Sportabout")] # 4. Ordering by a complex expression (positive numeric column) mycarsDT[order(vs, ifelse(vs == 1, -mpg, mpg))] # 4. Ordering by a complex expression (character column) mycarsDT[order(vs, ifelse(vs == 1, -xtfrm(model), xtfrm(model)))] # 5. Ordering by all columns of a data.frame mycarsDT[do.call("order", unname(mycarsDT[, -1]))] # 1. Ordering by one or several variables mycarsDT[order(mpg)] #> model cyl vs gear mpg disp #> 1: Duster 360 8 0 3 14.3 360.0 #> 2: Valiant 6 1 3 18.1 225.0 #> 3: Hornet Sportabout 8 0 3 18.7 360.0 #> 4: Merc 280 6 1 4 19.2 167.6 #> 5: Mazda RX4 6 0 4 21.0 160.0 #> 6: Mazda RX4 Wag 6 0 4 21.0 160.0 #> 7: Hornet 4 Drive 6 1 3 21.4 258.0 #> 8: Datsun 710 4 1 4 22.8 108.0 #> 9: Merc 230 4 1 4 22.8 140.8 #> 10: Merc 240D 4 1 4 24.4 146.7 #> 11: Cadillac Fleetwood 8 0 3 NA NA #> 12: Honda Civic 4 1 4 NA NA mycarsDT[order(-cyl, mpg)] #> model cyl vs gear mpg disp #> 1: Duster 360 8 0 3 14.3 360.0 #> 2: Hornet Sportabout 8 0 3 18.7 360.0 #> 3: Cadillac Fleetwood 8 0 3 NA NA #> 4: Valiant 6 1 3 18.1 225.0 #> 5: Merc 280 6 1 4 19.2 167.6 #> 6: Mazda RX4 6 0 4 21.0 160.0 #> 7: Mazda RX4 Wag 6 0 4 21.0 160.0 #> 8: Hornet 4 Drive 6 1 3 21.4 258.0 #> 9: Datsun 710 4 1 4 22.8 108.0 #> 10: Merc 230 4 1 4 22.8 140.8 #> 11: Merc 240D 4 1 4 24.4 146.7 #> 12: Honda Civic 4 1 4 NA NA # 2. Ordering by a character vector with matching names mycarsDT[order(factor(model, levels = my_vec))] #> model cyl vs gear mpg disp #> 1: Hornet Sportabout 8 0 3 18.7 360.0 #> 2: Cadillac Fleetwood 8 0 3 NA NA #> 3: Valiant 6 1 3 18.1 225.0 #> 4: Hornet 4 Drive 6 1 3 21.4 258.0 #> 5: Mazda RX4 6 0 4 21.0 160.0 #> 6: Mazda RX4 Wag 6 0 4 21.0 160.0 #> 7: Honda Civic 4 1 4 NA NA #> 8: Datsun 710 4 1 4 22.8 108.0 #> 9: Duster 360 8 0 3 14.3 360.0 #> 10: Merc 240D 4 1 4 24.4 146.7 #> 11: Merc 230 4 1 4 22.8 140.8 #> 12: Merc 280 6 1 4 19.2 167.6 # 3. Ordering by a simple expression mycarsDT[order(model != "Hornet Sportabout")] #> model cyl vs gear mpg disp #> 1: Hornet Sportabout 8 0 3 18.7 360.0 #> 2: Mazda RX4 6 0 4 21.0 160.0 #> 3: Mazda RX4 Wag 6 0 4 21.0 160.0 #> 4: Datsun 710 4 1 4 22.8 108.0 #> 5: Hornet 4 Drive 6 1 3 21.4 258.0 #> 6: Valiant 6 1 3 18.1 225.0 #> 7: Duster 360 8 0 3 14.3 360.0 #> 8: Merc 240D 4 1 4 24.4 146.7 #> 9: Merc 230 4 1 4 22.8 140.8 #> 10: Merc 280 6 1 4 19.2 167.6 #> 11: Cadillac Fleetwood 8 0 3 NA NA #> 12: Honda Civic 4 1 4 NA NA # 4. Ordering by a complex expression (positive numeric column) mycarsDT[order(vs, ifelse(vs == 1, -mpg, mpg))] #> model cyl vs gear mpg disp #> 1: Duster 360 8 0 3 14.3 360.0 #> 2: Hornet Sportabout 8 0 3 18.7 360.0 #> 3: Mazda RX4 6 0 4 21.0 160.0 #> 4: Mazda RX4 Wag 6 0 4 21.0 160.0 #> 5: Cadillac Fleetwood 8 0 3 NA NA #> 6: Merc 240D 4 1 4 24.4 146.7 #> 7: Datsun 710 4 1 4 22.8 108.0 #> 8: Merc 230 4 1 4 22.8 140.8 #> 9: Hornet 4 Drive 6 1 3 21.4 258.0 #> 10: Merc 280 6 1 4 19.2 167.6 #> 11: Valiant 6 1 3 18.1 225.0 #> 12: Honda Civic 4 1 4 NA NA # 4. Ordering by a complex expression (character column) mycarsDT[order(vs, ifelse(vs == 1, -xtfrm(model), xtfrm(model)))] #> model cyl vs gear mpg disp #> 1: Cadillac Fleetwood 8 0 3 NA NA #> 2: Duster 360 8 0 3 14.3 360.0 #> 3: Hornet Sportabout 8 0 3 18.7 360.0 #> 4: Mazda RX4 6 0 4 21.0 160.0 #> 5: Mazda RX4 Wag 6 0 4 21.0 160.0 #> 6: Valiant 6 1 3 18.1 225.0 #> 7: Merc 280 6 1 4 19.2 167.6 #> 8: Merc 240D 4 1 4 24.4 146.7 #> 9: Merc 230 4 1 4 22.8 140.8 #> 10: Hornet 4 Drive 6 1 3 21.4 258.0 #> 11: Honda Civic 4 1 4 NA NA #> 12: Datsun 710 4 1 4 22.8 108.0 # 5. Ordering by all columns of a data.frame mycarsDT[do.call("order", unname(mycarsDT[, -1]))] #> model cyl vs gear mpg disp #> 1: Datsun 710 4 1 4 22.8 108.0 #> 2: Merc 230 4 1 4 22.8 140.8 #> 3: Merc 240D 4 1 4 24.4 146.7 #> 4: Honda Civic 4 1 4 NA NA #> 5: Mazda RX4 6 0 4 21.0 160.0 #> 6: Mazda RX4 Wag 6 0 4 21.0 160.0 #> 7: Valiant 6 1 3 18.1 225.0 #> 8: Hornet 4 Drive 6 1 3 21.4 258.0 #> 9: Merc 280 6 1 4 19.2 167.6 #> 10: Duster 360 8 0 3 14.3 360.0 #> 11: Hornet Sportabout 8 0 3 18.7 360.0 #> 12: Cadillac Fleetwood 8 0 3 NA NA Note that the code chunk above shows only the ordering operation without assignment. To actually transform the data.table object we would need to assign the calls above to a new (or the same) object name. When using a vector of column names to subset a data.table, as we did in the sixth example, we need to precede the vector containing the column names (here: mycol) with a double dot .. to tell ‘data.table’ that we are looking for an external vector and not a column named mycol inside our data.table. # 6. Ordering by a list or vector of column names mycols 3: Merc 240D 4 1 4 24.4 146.7 #> 4: Merc 230 4 1 4 22.8 140.8 #> 5: Merc 280 6 1 4 19.2 167.6 #> 6: Hornet 4 Drive 6 1 3 21.4 258.0 #> 7: Hornet Sportabout 8 0 3 18.7 360.0 #> 8: Datsun 710 4 1 4 22.8 108.0 #> 9: Valiant 6 1 3 18.1 225.0 #> 10: Duster 360 8 0 3 14.3 360.0 #> 11: Cadillac Fleetwood 8 0 3 NA NA #> 12: Honda Civic 4 1 4 NA NA data.table’s setorder functions ‘data.table’ comes with its own two ordering functions, setorder() and setorderv(), which modify a data.table object “by reference”, that is without making a copy. This is especially helpful when we are dealing with data that takes up a lot of memory and where we want to avoid unnecessary copies. As first argument, x, both functions take a data.table. As second argument setorder() uses the ellipsis ... which allows us to supply one or several bare column names to order by. setorder() sorts in ascending order as default and allows the minus symbol - as prefix to sort a column in decreasing order. setorder() can be straight forward applied to our first example: # 1. Ordering by one or several variables setorder(mycarsDT, mpg) setorder(mycarsDT, -mpg, cyl, na.last = TRUE) setorderv()s second argument is cols which takes a character vector of column names and defaults to the column names of the data.table supplied in x. To specify an ascending or descending order we can supply a numeric vector of 1 and -1 to the order argument. setorderv() can be applied to the examples 5. (ordering by all columns) and 6. (ordering by a vector of column names): # 5. Ordering by all columns of a data.frame setorderv(mycarsDT, cols = names(mycarsDT)[-1], na.last = TRUE) # 6. Ordering by a list or vector of column names setorderv(mycarsDT, cols = mycols, order = c(-1,1), na.last = TRUE) Unlike base::order(), both functions default to sorting NAs first. We need to set na.last = TRUE to reproduce our base R examples from above. Unfortunately, setorder() doesn’t support arbitrary expressions in the ellipsis .... We must only use bare column names and, optionally, a minus symbol as prefix -. All other expressions will throw an error. Similarly, setorderv() only accepts a character vector of column names. So there is no straightforward way to apply either function to the examples 2., 3., 4. and 7. from above. # Examples 2., 3. and 4. aren't working: setorder(mycarsDT, factor(model, levels = my_vec), na.last = TRUE) #> Error in setorderv(x, cols, order, na.last): some columns are not in the data.table: factor,my_vec setorder(mycarsDT, model != "Hornet Sportabout", na.last = TRUE) #> Error in setorderv(x, cols, order, na.last): some columns are not in the data.table: !=,Hornet Sportabout setorder(mycarsDT, ifelse(vs == 1, -mpg, mpg), na.last = TRUE) #> Error in setorderv(x, cols, order, na.last): some columns are not in the data.table: ifelse,vs == 1,-mpg We can, however, come up with a workaround to harness ‘data.table’s power of memory efficiently modifying a data.table by reference. In all four example the workaround is the same. We use setorder() and pass a modified data.table to it in which we create a new (or several) column(s) by reference. This new column contains the ordering logic. We use the extraction function [ right after setorder() to again delete the newly created column(s). Let’s take a look at example 3., ordering by a simple logicl expression. # 3. Ordering by a simple expression setorder( mycarsDT[, ord := model != "Hornet Sportabout"], vs, ord, na.last = TRUE )[, ord := NULL] # setorder() doesn't return the data, but changes it "in place" # so to look at the reordered data we have to print it: mycarsDT #> model cyl vs gear mpg disp #> 1: Hornet Sportabout 8 0 3 18.7 360.0 #> 2: Mazda RX4 6 0 4 21.0 160.0 #> 3: Mazda RX4 Wag 6 0 4 21.0 160.0 #> 4: Duster 360 8 0 3 14.3 360.0 #> 5: Cadillac Fleetwood 8 0 3 NA NA #> 6: Merc 240D 4 1 4 24.4 146.7 #> 7: Datsun 710 4 1 4 22.8 108.0 #> 8: Merc 230 4 1 4 22.8 140.8 #> 9: Hornet 4 Drive 6 1 3 21.4 258.0 #> 10: Merc 280 6 1 4 19.2 167.6 #> 11: Valiant 6 1 3 18.1 225.0 #> 12: Honda Civic 4 1 4 NA NA Exchanging the expression after the walrus operator := allows us to apply this approach to the other examples, like ordering by factor levels on the fly factor(model, levels = my_vec) or by a complex expression like ifelse(vs == 1, -mpg, mpg). The approach above can be slightly adapted to help us with example no. 7, ordering by a vector of matching patterns. Here we first create three new variable names to order by order_cols % arrange(mpg) #> # A tibble: 12 × 6 #> # Groups: cyl [3] #> model cyl vs gear mpg disp #> #> 1 Duster 360 8 0 3 14.3 360 #> 2 Valiant 6 1 3 18.1 225 #> 3 Hornet Sportabout 8 0 3 18.7 360 #> 4 Merc 280 6 1 4 19.2 168. #> 5 Mazda RX4 6 0 4 21 160 #> 6 Mazda RX4 Wag 6 0 4 21 160 #> 7 Hornet 4 Drive 6 1 3 21.4 258 #> 8 Datsun 710 4 1 4 22.8 108 #> 9 Merc 230 4 1 4 22.8 141. #> 10 Merc 240D 4 1 4 24.4 147. #> 11 Cadillac Fleetwood 8 0 3 NA NA #> 12 Honda Civic 4 1 4 NA NA Here we’d probably expected the data to be sorted by mpg in ascending order within each group of cyl: mycars_tbl %>% arrange(cyl, mpg) #> # A tibble: 12 × 6 #> model cyl vs gear mpg disp #> #> 1 Datsun 710 4 1 4 22.8 108 #> 2 Merc 230 4 1 4 22.8 141. #> 3 Merc 240D 4 1 4 24.4 147. #> 4 Honda Civic 4 1 4 NA NA #> 5 Valiant 6 1 3 18.1 225 #> 6 Merc 280 6 1 4 19.2 168. #> 7 Mazda RX4 6 0 4 21 160 #> 8 Mazda RX4 Wag 6 0 4 21 160 #> 9 Hornet 4 Drive 6 1 3 21.4 258 #> 10 Duster 360 8 0 3 14.3 360 #> 11 Hornet Sportabout 8 0 3 18.7 360 #> 12 Cadillac Fleetwood 8 0 3 NA NA However, arrange() has an argument .by_group which is set to FALSE as default. Changing this to TRUE will make arrange work in the way we expected it: mycars_tbl %>% group_by(cyl) %>% arrange(mpg, .by_group = TRUE) #> # A tibble: 12 × 6 #> # Groups: cyl [3] #> model cyl vs gear mpg disp #> #> 1 Datsun 710 4 1 4 22.8 108 #> 2 Merc 230 4 1 4 22.8 141. #> 3 Merc 240D 4 1 4 24.4 147. #> 4 Honda Civic 4 1 4 NA NA #> 5 Valiant 6 1 3 18.1 225 #> 6 Merc 280 6 1 4 19.2 168. #> 7 Mazda RX4 6 0 4 21 160 #> 8 Mazda RX4 Wag 6 0 4 21 160 #> 9 Hornet 4 Drive 6 1 3 21.4 258 #> 10 Duster 360 8 0 3 14.3 360 #> 11 Hornet Sportabout 8 0 3 18.7 360 #> 12 Cadillac Fleetwood 8 0 3 NA NA The following code chunks use a tibble version of our data: library(dplyr) library(purrr) mycars_tbl % arrange(mpg) mycars_tbl %>% arrange(desc(cyl), mpg) Since arrange() accepts not only bare column names, but any arbitrary expressions we can easily rewrite the examples 2. to 4. as follows: (Note that the code chunks below show only the ordering operation without assignment. To actually transform the data object we would need to assign the calls above to a new, or the same, object name.) # 2. Ordering by a character vector with matching names my_vec % arrange(factor(model, levels = my_vec)) # 3. Ordering by a simple expression mycars_tbl %>% arrange(model != "Hornet Sportabout") # 4. Ordering by a complex expression (positive numeric column) mycars_tbl %>% arrange(vs, ifelse(vs == 1, desc(mpg), mpg)) # 2. Ordering by a character vector with matching names my_vec % arrange(factor(model, levels = my_vec)) #> # A tibble: 12 × 6 #> model cyl vs gear mpg disp #> #> 1 Hornet Sportabout 8 0 3 18.7 360 #> 2 Cadillac Fleetwood 8 0 3 NA NA #> 3 Valiant 6 1 3 18.1 225 #> 4 Hornet 4 Drive 6 1 3 21.4 258 #> 5 Mazda RX4 6 0 4 21 160 #> 6 Mazda RX4 Wag 6 0 4 21 160 #> 7 Honda Civic 4 1 4 NA NA #> 8 Datsun 710 4 1 4 22.8 108 #> 9 Duster 360 8 0 3 14.3 360 #> 10 Merc 240D 4 1 4 24.4 147. #> 11 Merc 230 4 1 4 22.8 141. #> 12 Merc 280 6 1 4 19.2 168. # 3. Ordering by a simple expression mycars_tbl %>% arrange(model != "Hornet Sportabout") #> # A tibble: 12 × 6 #> model cyl vs gear mpg disp #> #> 1 Hornet Sportabout 8 0 3 18.7 360 #> 2 Mazda RX4 6 0 4 21 160 #> 3 Mazda RX4 Wag 6 0 4 21 160 #> 4 Datsun 710 4 1 4 22.8 108 #> 5 Hornet 4 Drive 6 1 3 21.4 258 #> 6 Valiant 6 1 3 18.1 225 #> 7 Duster 360 8 0 3 14.3 360 #> 8 Merc 240D 4 1 4 24.4 147. #> 9 Merc 230 4 1 4 22.8 141. #> 10 Merc 280 6 1 4 19.2 168. #> 11 Cadillac Fleetwood 8 0 3 NA NA #> 12 Honda Civic 4 1 4 NA NA # 4. Ordering by a complex expression (positive numeric column) mycars_tbl %>% arrange(vs, ifelse(vs == 1, desc(mpg), mpg)) #> # A tibble: 12 × 6 #> model cyl vs gear mpg disp #> #> 1 Duster 360 8 0 3 14.3 360 #> 2 Hornet Sportabout 8 0 3 18.7 360 #> 3 Mazda RX4 6 0 4 21 160 #> 4 Mazda RX4 Wag 6 0 4 21 160 #> 5 Cadillac Fleetwood 8 0 3 NA NA #> 6 Merc 240D 4 1 4 24.4 147. #> 7 Datsun 710 4 1 4 22.8 108 #> 8 Merc 230 4 1 4 22.8 141. #> 9 Hornet 4 Drive 6 1 3 21.4 258 #> 10 Merc 280 6 1 4 19.2 168. #> 11 Valiant 6 1 3 18.1 225 #> 12 Honda Civic 4 1 4 NA NA Similarly to what we have seen in base R and ‘data.table’, the case of ordering by a complex ifelse condition which is applied to a character column (or a numeric column that contains positive and negative values) is also in ‘dplyr’ a bit trickier. Just using desc() on one part of the ifelse condition will not yield the desired result: mycars_tbl %>% arrange(vs, ifelse(vs == 1, desc(model), model)) #> # A tibble: 12 × 6 #> model cyl vs gear mpg disp #> #> 1 Cadillac Fleetwood 8 0 3 NA NA #> 2 Duster 360 8 0 3 14.3 360 #> 3 Hornet Sportabout 8 0 3 18.7 360 #> 4 Mazda RX4 6 0 4 21 160 #> 5 Mazda RX4 Wag 6 0 4 21 160 #> 6 Merc 240D 4 1 4 24.4 147. #> 7 Merc 280 6 1 4 19.2 168. #> 8 Valiant 6 1 3 18.1 225 #> 9 Datsun 710 4 1 4 22.8 108 #> 10 Honda Civic 4 1 4 NA NA #> 11 Hornet 4 Drive 6 1 3 21.4 258 #> 12 Merc 230 4 1 4 22.8 141. The reason for this is that under the hood desc() is a wrapper of -xtfrm() which is why the ifelse statement will coerce the result of the later with our original vector: ifelse(mycars_tbl$vs == 1, desc(mycars_tbl$model), mycars_tbl$model) #> [1] "Mazda RX4" "Mazda RX4 Wag" "-2" #> [4] "-5" "Hornet Sportabout" "-12" #> [7] "Duster 360" "-10" "-9" #> [10] "-11" "Cadillac Fleetwood" "-4" We can either wrap model in xtfrm() or, for readability, we can create a helper function which just wraps xtfrm(): # helper function for readability asc % arrange(vs, ifelse(vs == 1, desc(model), asc(model))) #> # A tibble: 12 × 6 #> model cyl vs gear mpg disp #> #> 1 Cadillac Fleetwood 8 0 3 NA NA #> 2 Duster 360 8 0 3 14.3 360 #> 3 Hornet Sportabout 8 0 3 18.7 360 #> 4 Mazda RX4 6 0 4 21 160 #> 5 Mazda RX4 Wag 6 0 4 21 160 #> 6 Valiant 6 1 3 18.1 225 #> 7 Merc 280 6 1 4 19.2 168. #> 8 Merc 240D 4 1 4 24.4 147. #> 9 Merc 230 4 1 4 22.8 141. #> 10 Hornet 4 Drive 6 1 3 21.4 258 #> 11 Honda Civic 4 1 4 NA NA #> 12 Datsun 710 4 1 4 22.8 108 When programmatically ordering rows with ‘dplyr’ we can use across() inside arrange(). across() lets us use either tidy-select syntax to select one or several columns. For example when we want to order by all columns except model, we can use across(!model) inside arrange(): # 5. Ordering by all columns of a data.frame mycars_tbl %>% arrange(across(!model)) #> # A tibble: 12 × 6 #> model cyl vs gear mpg disp #> #> 1 Datsun 710 4 1 4 22.8 108 #> 2 Merc 230 4 1 4 22.8 141. #> 3 Merc 240D 4 1 4 24.4 147. #> 4 Honda Civic 4 1 4 NA NA #> 5 Mazda RX4 6 0 4 21 160 #> 6 Mazda RX4 Wag 6 0 4 21 160 #> 7 Valiant 6 1 3 18.1 225 #> 8 Hornet 4 Drive 6 1 3 21.4 258 #> 9 Merc 280 6 1 4 19.2 168. #> 10 Duster 360 8 0 3 14.3 360 #> 11 Hornet Sportabout 8 0 3 18.7 360 #> 12 Cadillac Fleetwood 8 0 3 NA NA Another option that across() offers is to use tidy-select helper functions, like all_of(), which allows us to pass a character vector of column names to order by: # 6. Ordering by a list or vector of column names mycols % arrange(across(all_of(mycols))) #> # A tibble: 12 × 6 #> model cyl vs gear mpg disp #> #> 1 Duster 360 8 0 3 14.3 360 #> 2 Valiant 6 1 3 18.1 225 #> 3 Hornet Sportabout 8 0 3 18.7 360 #> 4 Merc 280 6 1 4 19.2 168. #> 5 Mazda RX4 6 0 4 21 160 #> 6 Mazda RX4 Wag 6 0 4 21 160 #> 7 Hornet 4 Drive 6 1 3 21.4 258 #> 8 Datsun 710 4 1 4 22.8 108 #> 9 Merc 230 4 1 4 22.8 141. #> 10 Merc 240D 4 1 4 24.4 147. #> 11 Honda Civic 4 1 4 NA NA #> 12 Cadillac Fleetwood 8 0 3 NA NA The usability of working programmatically with arrange() stops with the option to provide a character vector of column names. When we further want to specify which columns should be sorted in ascending or descending order, things get a little bit more complicated. In this case we should first construct a named vector, below arg_vec, containing TRUE or FALSE for “descending” or not. It should be named after the column names we want to order by. We can then use this vector in an anonymous function in which we subset it with dplyr::cur_column() in an if clause saying: if TRUE then use your current column values in descending order desc(.x) or else just use the current column values as they are .x. # 6. Ordering by vectors of column names and descending argument desc_vec # A tibble: 12 × 6 #> model cyl vs gear mpg disp #> #> 1 Merc 240D 4 1 4 24.4 147. #> 2 Datsun 710 4 1 4 22.8 108 #> 3 Merc 230 4 1 4 22.8 141. #> 4 Hornet 4 Drive 6 1 3 21.4 258 #> 5 Mazda RX4 6 0 4 21 160 #> 6 Mazda RX4 Wag 6 0 4 21 160 #> 7 Merc 280 6 1 4 19.2 168. #> 8 Hornet Sportabout 8 0 3 18.7 360 #> 9 Valiant 6 1 3 18.1 225 #> 10 Duster 360 8 0 3 14.3 360 #> 11 Honda Civic 4 1 4 NA NA #> 12 Cadillac Fleetwood 8 0 3 NA NA Finally, the last example of ordering by a vector of matching patterns, is actually easier to tackle than it looks like. ‘dplyr’ offers us two ways of doing this. 1․ Pack all the magic in one pipe: We can take our base R approach with lapply() and grepl() use purrr::map() with a lambda function ~ instead and pipe the result into dplyr::bind_cols(). This last part is needed, since arrange() accepts data.frames in the ellipsis argument, but not a list. # 7. Ordering by a vector of matching patterns my_pattern % arrange( map(my_pattern, ~ !grepl(paste0("^", .x), model) ) %>% bind_cols() ) #> # A tibble: 12 × 6 #> model cyl vs gear mpg disp #> #> 1 Mazda RX4 6 0 4 21 160 #> 2 Mazda RX4 Wag 6 0 4 21 160 #> 3 Merc 240D 4 1 4 24.4 147. #> 4 Merc 230 4 1 4 22.8 141. #> 5 Merc 280 6 1 4 19.2 168. #> 6 Hornet 4 Drive 6 1 3 21.4 258 #> 7 Hornet Sportabout 8 0 3 18.7 360 #> 8 Datsun 710 4 1 4 22.8 108 #> 9 Valiant 6 1 3 18.1 225 #> 10 Duster 360 8 0 3 14.3 360 #> 11 Cadillac Fleetwood 8 0 3 NA NA #> 12 Honda Civic 4 1 4 NA NA 2․ Create an index list and splice it into arrange: Alternatively we can proceed in two steps. First create an index list, similar to our base R approach. Then, instead of using do.call we can splice the list as arguments to arrange() using the triple bang operator !!!. # 7. Ordering by a vector of matching patterns idx_ls % arrange(!!! idx_ls) # same result as above Summing up: Ordering rows with ‘dplyr’ The examples above show that ‘dplyr’ offers a very intuitive API for ordering rows. In most cases a call to arrange is enough to get our desired result. When ordering rows programmatically, ‘dplyr’ has us covered with across() and ‘tidyselect’ helper functions, like all_of(), which can be used inside arrange. ‘dplyr’ feels definitely more beginner-friendly than base R, since arrange() covers more common use cases, and across() seems to have less conceptional overhead compared to base’s do.call. However, this holds true only up to a certain grade of complexity. For the more advanced examples splicing !!! and subsetting with cur_colum() were needed, which do not differ much in terms of conceptional overhead. pandas To conclude this post, let’s look at how python’s ‘pandas’ library tackles our seven ordering challenges. First, lets import ‘pandas’ and read in the data: import pandas as pd mycars = pd.read_csv("mycars.csv", index_col = 0) There are several ways of reordering rows in a pandas DataFrame. Here we will focus on the sort_values() method for DataFrame objects. sort_values() can work in both directions, ordering rows or, alternatively, ordering columns. With our examples from above in mind, we will only look at ordering rows, leaving the axis argument in its default value 0. With this default setting, sort_values() takes a column name or list of column names to sort by. We can further specify for each column whether its values should be sorted in ascending or descending order by passing a list of True and False values to the ascending argument. With only those two arguments, by and ascending, we can easily handle the examples 1., 5. and 6., where we just need to specify column names and their order: # 1. sort one or several columns in ascending or descending order mycars.sort_values(by='mpg') (mycars. sort_values(by=['cyl', 'mpg'], ascending=[False, True]) ) # 5. order by all columns of a df (mycars. sort_values(by = list(mycars.columns)[1:]) ) # 6. order by list of string column names my_cols = ['mpg', 'disp'] my_order = [False, True] (mycars. sort_values(by = my_cols, ascending = my_order) ) # 1. sort one or several columns in ascending or descending order mycars.sort_values(by='mpg') #> model cyl vs gear mpg disp #> 7 Duster 360 8 0 3 14.3 360.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 10 Merc 280 6 1 4 19.2 167.6 #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 9 Merc 230 4 1 4 22.8 140.8 #> 8 Merc 240D 4 1 4 24.4 146.7 #> 11 Cadillac Fleetwood 8 0 3 NaN NaN #> 12 Honda Civic 4 1 4 NaN NaN (mycars. sort_values(by=['cyl', 'mpg'], ascending=[False, True]) ) # 5. order by all columns of a df #> model cyl vs gear mpg disp #> 7 Duster 360 8 0 3 14.3 360.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 11 Cadillac Fleetwood 8 0 3 NaN NaN #> 6 Valiant 6 1 3 18.1 225.0 #> 10 Merc 280 6 1 4 19.2 167.6 #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 9 Merc 230 4 1 4 22.8 140.8 #> 8 Merc 240D 4 1 4 24.4 146.7 #> 12 Honda Civic 4 1 4 NaN NaN (mycars. sort_values(by = list(mycars.columns)[1:]) ) # 6. order by list of string column names #> model cyl vs gear mpg disp #> 3 Datsun 710 4 1 4 22.8 108.0 #> 9 Merc 230 4 1 4 22.8 140.8 #> 8 Merc 240D 4 1 4 24.4 146.7 #> 12 Honda Civic 4 1 4 NaN NaN #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 10 Merc 280 6 1 4 19.2 167.6 #> 7 Duster 360 8 0 3 14.3 360.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 11 Cadillac Fleetwood 8 0 3 NaN NaN my_cols = ['mpg', 'disp'] my_order = [False, True] (mycars. sort_values(by = my_cols, ascending = my_order) ) #> model cyl vs gear mpg disp #> 8 Merc 240D 4 1 4 24.4 146.7 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 9 Merc 230 4 1 4 22.8 140.8 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 10 Merc 280 6 1 4 19.2 167.6 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 7 Duster 360 8 0 3 14.3 360.0 #> 11 Cadillac Fleetwood 8 0 3 NaN NaN #> 12 Honda Civic 4 1 4 NaN NaN Things get more complicated when we want to order by an expression which is not a column name. In our second example we want to order by a list of matching string names, ideally without transforming the data itself. Here we can use sort_values() key argument. key takes a function and applies it to all columns specified in by before ordering. The idea is to order by the model column, and apply a lambda function to it, that first turns it into a categorical variable and then sets our list of matching names cat_ls as new categories before ordering. Like in R, ‘pandas’ sorts categorical variables according to the order of their categories (in R: levels). # 2. order by list of strings with matching names cat_ls = ["Hornet Sportabout", "Cadillac Fleetwood", "Valiant", "Hornet 4 Drive", "Mazda RX4", "Mazda RX4 Wag", "Honda Civic", "Datsun 710", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"] (mycars. sort_values(by='model', key=lambda x: x .astype('category').cat.set_categories(cat_ls) ) ) #> model cyl vs gear mpg disp #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 11 Cadillac Fleetwood 8 0 3 NaN NaN #> 6 Valiant 6 1 3 18.1 225.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 12 Honda Civic 4 1 4 NaN NaN #> 3 Datsun 710 4 1 4 22.8 108.0 #> 7 Duster 360 8 0 3 14.3 360.0 #> 8 Merc 240D 4 1 4 24.4 146.7 #> 9 Merc 230 4 1 4 22.8 140.8 #> 10 Merc 280 6 1 4 19.2 167.6 A different way of tackling this problem is to set the model column as index (similar to rownames in R). Then reindex the data with our list of matching names, cat_ls, and finally resetting the index, so that model becomes a regular column again. # 2. order by list of strings with matching names (mycars. set_index('model'). reindex(cat_ls). reset_index() ) # output as above This is probably the ‘pandas’ way to go about this problem, but it is important that we know about the key argument and how to use it. If an expression is related to values of a column, like in example no. 3, where we want to sort one value to the top, applying a simple lambda function is probably the easiest approach. # 3. order by a simple logical expression (mycars. sort_values(by='model', key=lambda x: x != "Hornet Sportabout" ) ) #> model cyl vs gear mpg disp #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 7 Duster 360 8 0 3 14.3 360.0 #> 8 Merc 240D 4 1 4 24.4 146.7 #> 9 Merc 230 4 1 4 22.8 140.8 #> 10 Merc 280 6 1 4 19.2 167.6 #> 11 Cadillac Fleetwood 8 0 3 NaN NaN #> 12 Honda Civic 4 1 4 NaN NaN Also example no. 7, ordering by a list of matching patterns, can be solved by using the key argument. However, this case is quite complex. Apart from the pattern we want to sort by, my_pattern, we need a dictionary to bring the patterns into an order, custom_dict. We then order by model, use the string replace method to replace the full model name with the name of the pattern, and finally map this transformed column over our custom dictionary. All values which are not matched by the dictionary are NaN and are automatically sorted last. # 7. order by a list of matching patterns my_pattern = ["Mazda", "Merc", "Hornet"] custom_dict = {k: v for v, k in enumerate(my_pattern)} (mycars. sort_values(by='model', key=lambda x: x .str.replace('^('+'|'.join(my_pattern)+').*', r"\1", regex = True) .map(custom_dict) ) ) #> model cyl vs gear mpg disp #> 1 Mazda RX4 6 0 4 21.0 160.0 #> 2 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 8 Merc 240D 4 1 4 24.4 146.7 #> 9 Merc 230 4 1 4 22.8 140.8 #> 10 Merc 280 6 1 4 19.2 167.6 #> 4 Hornet 4 Drive 6 1 3 21.4 258.0 #> 5 Hornet Sportabout 8 0 3 18.7 360.0 #> 3 Datsun 710 4 1 4 22.8 108.0 #> 6 Valiant 6 1 3 18.1 225.0 #> 7 Duster 360 8 0 3 14.3 360.0 #> 11 Cadillac Fleetwood 8 0 3 NaN NaN #> 12 Honda Civic 4 1 4 NaN NaN Finally, the most troublesome challenge in ‘pandas’ is no. 4: ordering by a complex expression. In this example we wanted to sort the model names in descending order when the engine is v-shaped, vs == 0, and in ascending order if it’s normally shaped, vs == 1. Inspired by this answer on StackOverflow, the idea is to first create an empty output DataFrame, below mycars2. We then loop over mycars grouped by vs and create a True or False variable, orderg, that checks if the group name grp_name is 0 or not. We then sort each group by='model' and pass the orderg flag to the ascending argument before appending the data to our output object. Finally, we drop and reset the index of mycars2 to restore the format of our inital data: # 4. order by a complex expression mycars2 = pd.DataFrame(data=None, columns=mycars.columns) for grp_name, grp_dat in mycars.groupby(['vs']): orderg = (grp_name==0) # ascending order if vs == 0 mycars2 = pd.concat([mycars2, grp_dat. reset_index(). sort_values(by='model', ascending=orderg)] ) # final tweaks mycars2 = (mycars2. drop('index', axis=1). reset_index(drop=True) ) mycars2 #> model cyl vs gear mpg disp #> 0 Cadillac Fleetwood 8 0 3 NaN NaN #> 1 Duster 360 8 0 3 14.3 360.0 #> 2 Hornet Sportabout 8 0 3 18.7 360.0 #> 3 Mazda RX4 6 0 4 21.0 160.0 #> 4 Mazda RX4 Wag 6 0 4 21.0 160.0 #> 5 Valiant 6 1 3 18.1 225.0 #> 6 Merc 280 6 1 4 19.2 167.6 #> 7 Merc 240D 4 1 4 24.4 146.7 #> 8 Merc 230 4 1 4 22.8 140.8 #> 9 Hornet 4 Drive 6 1 3 21.4 258.0 #> 10 Honda Civic 4 1 4 NaN NaN #> 11 Datsun 710 4 1 4 22.8 108.0 Summing up: Ordering rows with ‘pandas’: ‘pandas’ sort_values() method has a strong programmatic interface similar to data.table::setorderv(). This makes it easy to work with, when we have a lists of column names and Boolean values to specify the sorting direction. Ordering by expressions other than column names seems to be more effort in ‘pandas’ compared to R. As long as the expressions can be derived from column values, sort_values() key argument can help us tackle most of the ordering challenges. However, once the expression we want to order by is not directly related to a specific column, we have to fall back to a classic for loop, which feels somewhat clunky compared to what we have seen in R. Wrap-up This post turned out to be almost a book chapter on ordering rows. I hope you enjoyed it. If you have a better approach to one of the examples above or if you have a special ordering challenge that I haven’t considered, let me know in the comments below or via Twitter, Mastodon or Github. Session Info #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.2.1 (2022-06-23) #> os macOS Big Sur ... 10.16 #> system x86_64, darwin17.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Berlin #> date 2023-02-22 #> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> data.table * 1.14.2 2021-09-27 [1] CRAN (R 4.2.0) #> dplyr * 1.1.0 2023-01-29 [1] CRAN (R 4.2.0) #> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.2.0) #> reticulate * 1.26 2022-08-31 [1] CRAN (R 4.2.0) #> #> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library #> #> ─ Python configuration ─────────────────────────────────────────────────────── #> python: /usr/local/Caskroom/miniconda/base/bin/python3.9 #> libpython: /usr/local/Caskroom/miniconda/base/lib/libpython3.9.dylib #> pythonhome: /usr/local/Caskroom/miniconda/base:/usr/local/Caskroom/miniconda/base #> version: 3.9.12 (main, Apr 5 2022, 01:53:17) [Clang 12.0.0 ] #> numpy: /usr/local/Caskroom/miniconda/base/lib/python3.9/site-packages/numpy #> numpy_version: 1.22.3 #> #> NOTE: Python version was forced by RETICULATE_PYTHON #> #> ────────────────────────────────────────────────────────────────────────────── " />

The ultimate guide to ordering rows in R

[This article was first published on R | Tim Tiefenbach, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Intro

Sorting rows in a data.frame is generally considered a straightforward task, which it mostly is – until it isn’t. It seems that the operation of ordering rows doesn’t get much attention in introductory books on data science. Both R for Data Science and Python for Data Analysis only touch the subject very briefly. So this post comes to the rescue.

Below we look at several ordering operations and examine how the three big paradigms in R, base R, ‘data.table’ and ‘dplyr’, compare in tackling different ordering operations.

We will look at seven challenges loosely ordered by their increasing complexity.

We are going to order rows according to …

  1. … one or several columns in ascending or descending order.
  2. … a character vector with matching names.
  3. … a simple expression.
  4. … a complex expression.
  5. … all columns in a data.frame.
  6. … a vector of column names.
  7. … a vector of matching patterns.

This post concludes by comparing how we would tackle the same problems in Python’s ‘pandas’ library.

Let’s start with the setup.

Setup

We take R’s built-in mtcars data, extract a couple of rows and columns to make it more compact, and introduce some NAs to get an understanding of what’s happening when the data includes missing values.

mycols <- c("cyl", "vs", "gear", "mpg", "disp")
myrows <- which(rownames(mtcars) %in% c("Cadillac Fleetwood", "Honda Civic"))
mycars <- mtcars[c(1:10, myrows), mycols]

mycars <- data.frame(model = rownames(mycars),
                     data.frame(mycars, row.names=NULL))

mycars[mycars$model %in% c("Cadillac Fleetwood", "Honda Civic"), c("mpg", "disp")] <- NA

mycars
#>                 model cyl vs gear  mpg  disp
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 6             Valiant   6  1    3 18.1 225.0
#> 7          Duster 360   8  0    3 14.3 360.0
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 9            Merc 230   4  1    4 22.8 140.8
#> 10           Merc 280   6  1    4 19.2 167.6
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 12        Honda Civic   4  1    4   NA    NA

Finally let’s save this data to a csv file which we will read in later in Python:

write.csv(mycars, "mycars.csv")
# available at:
# read.csv("https://raw.githubusercontent.com/TimTeaFan/tt_website/main/content/post/2023-ordering-rows/mycars.csv")

base R

1. Ordering by one or several variables

Ordering rows of a data.frame in base R is simple: we subset the rows of a data.frame with the order() function called on one or more variables.

# order by one column ascending
mycars[order(mycars$mpg), ]
#>                 model cyl vs gear  mpg  disp
#> 7          Duster 360   8  0    3 14.3 360.0
#> 6             Valiant   6  1    3 18.1 225.0
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 10           Merc 280   6  1    4 19.2 167.6
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 9            Merc 230   4  1    4 22.8 140.8
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 12        Honda Civic   4  1    4   NA    NA

# order by two columns, first descending, second ascending:
mycars[order(-mycars$cyl, mycars$mpg), ]
#>                 model cyl vs gear  mpg  disp
#> 7          Duster 360   8  0    3 14.3 360.0
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 6             Valiant   6  1    3 18.1 225.0
#> 10           Merc 280   6  1    4 19.2 167.6
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 9            Merc 230   4  1    4 22.8 140.8
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 12        Honda Civic   4  1    4   NA    NA

Important to note is that, as default, NAs are sorted to bottom of the data.frame (read more about sorting NA values below).

Base R’s order() has an argument na.last which is set to TRUE by default and sorts NA to the bottom:

mycars[order(mycars$mpg),]
#>                 model cyl vs gear  mpg  disp
#> 7          Duster 360   8  0    3 14.3 360.0
#> 6             Valiant   6  1    3 18.1 225.0
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 10           Merc 280   6  1    4 19.2 167.6
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 9            Merc 230   4  1    4 22.8 140.8
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 12        Honda Civic   4  1    4   NA    NA

Setting na.last = FALSE sorts NAs to the top:

mycars[order(mycars$mpg, na.last = FALSE),]
#>                 model cyl vs gear  mpg  disp
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 12        Honda Civic   4  1    4   NA    NA
#> 7          Duster 360   8  0    3 14.3 360.0
#> 6             Valiant   6  1    3 18.1 225.0
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 10           Merc 280   6  1    4 19.2 167.6
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 9            Merc 230   4  1    4 22.8 140.8
#> 8           Merc 240D   4  1    4 24.4 146.7

A bit surprisingly na.last can also be set NA which will remove NAs from the vector or data.frame making it a combined filter and ordering operation:

mycars[order(mycars$mpg, na.last = NA),]
#>                model cyl vs gear  mpg  disp
#> 7         Duster 360   8  0    3 14.3 360.0
#> 6            Valiant   6  1    3 18.1 225.0
#> 5  Hornet Sportabout   8  0    3 18.7 360.0
#> 10          Merc 280   6  1    4 19.2 167.6
#> 1          Mazda RX4   6  0    4 21.0 160.0
#> 2      Mazda RX4 Wag   6  0    4 21.0 160.0
#> 4     Hornet 4 Drive   6  1    3 21.4 258.0
#> 3         Datsun 710   4  1    4 22.8 108.0
#> 9           Merc 230   4  1    4 22.8 140.8
#> 8          Merc 240D   4  1    4 24.4 146.7

For some users this behavior might not be very intuitive, which is why splitting ordering and filtering operations is preferable for code clarity.

This is especially relevant for users coming from ‘dplyr’ or ‘data.table’. dplyr::arrange() always orders NAs last and doesn’t come with an option to change this behavior. data.table::setorder() has an argument na.last, but (1) it can only be set TRUE or FALSE and (2) it defaults to the former which is the opposite of what base::order() does:

mycarsDT <- data.table::as.data.table(mycars)
data.table::setorder(mycarsDT, mpg)
mycarsDT
#>                  model cyl vs gear  mpg  disp
#>  1: Cadillac Fleetwood   8  0    3   NA    NA
#>  2:        Honda Civic   4  1    4   NA    NA
#>  3:         Duster 360   8  0    3 14.3 360.0
#>  4:            Valiant   6  1    3 18.1 225.0
#>  5:  Hornet Sportabout   8  0    3 18.7 360.0
#>  6:           Merc 280   6  1    4 19.2 167.6
#>  7:          Mazda RX4   6  0    4 21.0 160.0
#>  8:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  9:     Hornet 4 Drive   6  1    3 21.4 258.0
#> 10:         Datsun 710   4  1    4 22.8 108.0
#> 11:           Merc 230   4  1    4 22.8 140.8
#> 12:          Merc 240D   4  1    4 24.4 146.7

How different vector types are sorted and some more information about what exactly happens, when we call df[order(variable), ] can be found in the info box below.

How does the above syntax work? Lets look at a three-column data.frame

dat <- data.frame(x = c(100, 1, 10), y = c("a", "c", "b"), z = c(TRUE, FALSE, TRUE))
dat
#>     x y     z
#> 1 100 a  TRUE
#> 2   1 c FALSE
#> 3  10 b  TRUE

The output of order(dat$x) shows us the row numbers in which we would have to order our data.frame to make the values in dat$x run from smallest to highest.

order(dat$x)
#> [1] 2 3 1

We can read this as: “The second row should come first, the third row should come second, and the first row should come last”. To actually order the rows according to this logic, we subset the data.frame by its rows according to the logic df[row_index, ]:

dat <- dat[order(dat$x), ]
dat
#>     x y     z
#> 2   1 c FALSE
#> 3  10 b  TRUE
#> 1 100 a  TRUE

Now the rows are in the desired order, but note, that the row indices still correspond to the original indices. This makes it (relatively) easy to restore the old ordering:

dat <- dat[order(as.numeric(rownames(dat))), ]
dat
#>     x y     z
#> 1 100 a  TRUE
#> 2   1 c FALSE
#> 3  10 b  TRUE

Note hat the logic of ordering is the same for the most common data types. order() sorts the values from smallest to largest. For integer and double vectors (including Dates and date times, like POSIXct) this is pretty straightforward. Also for character vectors the logic is simple: "a" is “smaller” than "b", so the ascending order goes from A to Z, with small coming before capital letters. We can always verify how character values relate:

"a" > "A"
#> [1] FALSE

For logical vectors FALSE can be read as 0 and TRUE as 1 which is the ascending order in which the values will be sorted. Although this makes perfectly sense, given that this is the way logical vectors are coerced to numeric, we will see later that this might be confusing when specifying explict values to sort by.

Finally, the only special case are factor variables. Here the sorting follows the factor levels():

dat$y <- factor(dat$y, levels = c("b", "a", "c"))
dat[order(dat$y), ]
#>     x y     z
#> 3  10 b  TRUE
#> 1 100 a  TRUE
#> 2   1 c FALSE

2. Ordering by a character vector with matching names

Sometimes sorting by numeric and character variables in ascending or descending order is not enough. There are cases where we have a given non-alphabetical order of names which we want to apply to our data. In this case we use factor variables.

Let’s assume we have a given order of model names that we want to sort our data by. Then we have two choices. We either transform the mycars$model column into a factor and supply our desired order of names as factor levels (see info box: “the logic of ordering rows” above).

# a character vector with matching names
my_vec <- c("Hornet Sportabout", "Cadillac Fleetwood", "Valiant",
            "Hornet 4 Drive", "Mazda RX4", "Mazda RX4 Wag", "Honda Civic",
            "Datsun 710", "Duster 360", "Merc 240D", "Merc 230", "Merc 280")

mycars2 <- mycars # let's create a new copy

# transform model column into factor and use level from my_vec
mycars2$model <- factor(mycars2$model, levels = my_vec) 

mycars2[order(mycars2$model), ]
#>                 model cyl vs gear  mpg  disp
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 6             Valiant   6  1    3 18.1 225.0
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 12        Honda Civic   4  1    4   NA    NA
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 7          Duster 360   8  0    3 14.3 360.0
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 9            Merc 230   4  1    4 22.8 140.8
#> 10           Merc 280   6  1    4 19.2 167.6

However, we might not want to transform our original data and rather leave our model column untouched. In this case we can construct a factor variable “on the fly” and use it within order() without changing the data.frame itself:

mycars[order(factor(mycars$model, levels = my_vec)), ]
#>                 model cyl vs gear  mpg  disp
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 6             Valiant   6  1    3 18.1 225.0
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 12        Honda Civic   4  1    4   NA    NA
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 7          Duster 360   8  0    3 14.3 360.0
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 9            Merc 230   4  1    4 22.8 140.8
#> 10           Merc 280   6  1    4 19.2 167.6

3. Ordering by a simple expression

Apart from sorting according to one or more variables, sometimes we want to sort according to a specific expression. Let’s say we want the row "Hornet Sportabout" to be sorted to the top of our data.frame. In this case, we can construct a logical vector mycars$model != "Hornet Sportabout" returning TRUE and FALSE for each row. Passing this to order() yields the desired result:

# bring one column to the top / bottom
mycars[order(mycars$model != "Hornet Sportabout"),]
#>                 model cyl vs gear  mpg  disp
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 6             Valiant   6  1    3 18.1 225.0
#> 7          Duster 360   8  0    3 14.3 360.0
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 9            Merc 230   4  1    4 22.8 140.8
#> 10           Merc 280   6  1    4 19.2 167.6
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 12        Honda Civic   4  1    4   NA    NA

Note, that we negate the comparison with !=, since logical vectors are sorted from FALSE to TRUE (see info box “the logic of ordering rows” above).

4. Ordering by complex expressions

Above we saw how to order a data.frame by a simple logical expression. In base R we can easily extent this approach and apply more complex expressions within order(). Let’s say, for example, we want to sort our data in two groups, cars with ‘v-shaped engines’ vs == 1 and those with ‘straight engines’ vs == 1. In the first group we want to order the rows by mpg in decreasing order, and in the second group we want to order mpg in increasing order.

In this case we can supply mycars$vs as first argument to order() followed by the literal ifelse expression of the condition outlined above:

# ordering by one group and an ifelse expression on a numeric column
mycars[order(mycars$vs, ifelse(mycars$vs == 1, -mycars$mpg, mycars$mpg)), ]
#>                 model cyl vs gear  mpg  disp
#> 7          Duster 360   8  0    3 14.3 360.0
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 9            Merc 230   4  1    4 22.8 140.8
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 10           Merc 280   6  1    4 19.2 167.6
#> 6             Valiant   6  1    3 18.1 225.0
#> 12        Honda Civic   4  1    4   NA    NA

The reason why this ordering operation yields the desired output, is because mycars$mpg is a numeric variable that only contains positive values, so we can reverse the values within the ifelse clause by just prefixing mpg with a minus symbol -.

So basically we are sorting by this vector which we generate on the fly:

ifelse(mycars$vs == 1, -mycars$mpg, mycars$mpg)
#>  [1]  21.0  21.0 -22.8 -21.4  18.7 -18.1  14.3 -24.4 -22.8 -19.2    NA    NA

Let’s say we want to order the model names in ascending and descending order based on the engine shape vs. In this case we first need to transform the model column into a rank for which we can use base::xtfrm(). Then we can just reverse the so generated ranks using the minus symbol -.

# ordering by one group and an ifelse expression on a character column
mycars[order(mycars$vs,
             ifelse(mycars$vs == 1,
                    -xtfrm(mycars$model),
                    xtfrm(mycars$model)
                    )
             ), ]
#>                 model cyl vs gear  mpg  disp
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 7          Duster 360   8  0    3 14.3 360.0
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 6             Valiant   6  1    3 18.1 225.0
#> 10           Merc 280   6  1    4 19.2 167.6
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 9            Merc 230   4  1    4 22.8 140.8
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 12        Honda Civic   4  1    4   NA    NA
#> 3          Datsun 710   4  1    4 22.8 108.0

As we will see later, other libraries also have trouble handling special ordering operations with complex expressions like this.

5. Ordering by all columns of a data.frame

Another common operation is to order by all variables in a data.frame. For our toy data this means we want to first sort cyl from 4 to 8, within cyl we want the rows to be sorted according to vs and ties here should be sorted according to gear, mpg and then disp.

We could just write out all variables as we did in the examples above:

mycars[order(mycars$cyl, mycars$vs, mycars$gear, mycars$mpg, mycars$disp),]

However, this is a lot of typing. Ideally we’d prefer a more programmatic way of sorting a data.frame by all variables. In base R, we can do this with do.call("order", args = list_of_vectors_to_sort_by). do.call basically constructs and evaluates a call to the specified function, here "order", and passes the list in the args argument to the arguments of the specified call.

In our case the list of vectors to sort by is the mycars data.frame itself except for the first column model, hence mycars[,-1]. Since we want to pass our list of vectors to order‘s ellipsis ... argument, the vectors in our list should be unnamed: unname(mycars[,-1]).

# order by all columns ascending
mycars[do.call("order", unname(mycars[, -1])), ]
#>                 model cyl vs gear  mpg  disp
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 9            Merc 230   4  1    4 22.8 140.8
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 12        Honda Civic   4  1    4   NA    NA
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 6             Valiant   6  1    3 18.1 225.0
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 10           Merc 280   6  1    4 19.2 167.6
#> 7          Duster 360   8  0    3 14.3 360.0
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 11 Cadillac Fleetwood   8  0    3   NA    NA

Forgetting to unname() our list of vectors can lead to problems, which will be elaborated in the warning box below.

The documentation of order() explicitly contains a warning, reminding us to unname() data.frames that we pass to do.call("order", df).

In most cases, forgetting to unname() a data.frame won’t get us in trouble. If, however, one of the column names corresponds to an argument of order(), that is na.last, decreasing or method, forgetting to unname() the data.frame will in most cases do something we don’t expect, and probably even notice.

Let’s create a small toy data.frame with three columns, x, y and decreasing, the latter indicating whether x - y is greater or less than zero:

dat <- data.frame(x = c(3, 3, 6),
                  y = c(5, 2, 4),
                  decreasing = c(FALSE, TRUE, TRUE))
dat
#>   x y decreasing
#> 1 3 5      FALSE
#> 2 3 2       TRUE
#> 3 6 4       TRUE

When we want to order by all columns in ascending order, using the do.call() approach showed above, we get the following result:

dat[do.call("order", dat), ]
#>   x y decreasing
#> 1 3 5      FALSE
#> 2 3 2       TRUE
#> 3 6 4       TRUE

When do.call() is constructing the call, it is actually evaluating this call:

#> order(list(x = c(3, 3, 6), y = c(5, 2, 4), decreasing = c(FALSE, 
#> TRUE, TRUE)))

As we can see, the third column is passed to order()s decreasing argument. It is interesting to note that this doesn’t seem to throw an error, although we order by two vectors and pass three Boolean values to decreasing.

When we unname() our data.frame before passing it to do.call(), we get the correct (and slightly different) result:

dat[do.call("order", unname(dat)), ]
#>   x y decreasing
#> 2 3 2       TRUE
#> 1 3 5      FALSE
#> 3 6 4       TRUE

Under the hood, this evaluates to the correct call:

#> order(list(c(3, 3, 6), c(5, 2, 4), c(FALSE, TRUE, TRUE)))

To sum-up: it is good practice to unname() data.frames when passing them to do.call().

6. Ordering by a list or vector of column names

Similar to sorting a data.frame by all variables, we sometimes have a vector of variables names we want to sort by. Here we can apply the same approach as above and use do.call("order", my_df[,mycols]) on our data.frame. Let’s further assume that we want to sort some columns ascending and some descending.

In this case we combine both arguments, the vectors to sort by and their decreasing order, in a list and supply it to do.call("order", our_list_of_arguments):

# Let's say we have the names of the columns ...
# ... we want to order by in a vector
mycols <- c("mpg", "cyl")

# Then we need to construct the arguments in list form ...
# 1. only the values of the columns to sort by (therefore `unname()`)
sort_df <- unname(mycars[,mycols])
# 2. the logical values for `order()`s `decreasing` argument:
desc_ls <- list(decreasing = c(TRUE, FALSE))

# we combine the arguments inside `do.call()`
mycars[do.call("order", c(sort_df, desc_ls)),]

# the above is equivalent to 
mycars[order(-mycars$mpg, mycars$cyl),]

7. Ordering by a vector of matching patterns

In our last example we look at how to order a data.frame according to a vector of matching patterns. Let’s say we want to list all ‘Mazda’, ‘Merc’ and ‘Hornet’ cars first (in this order), and all other cars last. In this case, we can again use do.call("order", ...). As input we need a list of vectors checking if the specified pattern can be found in each row.

To do this we check with grepl() if mycars$model starts with (regex: ^) one of the three names "Mazda", "Merc" or "Hornet". Wrapping this in lapply gives us a list of logical vectors that we can use as arguments to our do.call. Since logical values are sorted from FALSE to TRUE we negate grepl with ! to sort those rows to the top.

# order by matching pattern
my_pattern <- c("Mazda", "Merc", "Hornet")

idx_ls <- lapply(my_pattern,
                 \(x) !grepl(paste0("^", x), mycars$model))

mycars[do.call("order", idx_ls), ]
#>                 model cyl vs gear  mpg  disp
#> 1           Mazda RX4   6  0    4 21.0 160.0
#> 2       Mazda RX4 Wag   6  0    4 21.0 160.0
#> 8           Merc 240D   4  1    4 24.4 146.7
#> 9            Merc 230   4  1    4 22.8 140.8
#> 10           Merc 280   6  1    4 19.2 167.6
#> 4      Hornet 4 Drive   6  1    3 21.4 258.0
#> 5   Hornet Sportabout   8  0    3 18.7 360.0
#> 3          Datsun 710   4  1    4 22.8 108.0
#> 6             Valiant   6  1    3 18.1 225.0
#> 7          Duster 360   8  0    3 14.3 360.0
#> 11 Cadillac Fleetwood   8  0    3   NA    NA
#> 12        Honda Civic   4  1    4   NA    NA

Summing up: Ordering rows in base R

Ordering in base R boils down to subsetting a data.frame by itself in a different order. We create this new order either by applying order() directly to one or several variables or expressions, or by wrapping it in a do.call() together with a list of arguments.

While the former can be considered an easy, straightforward operation, the later requires quite some knowledge about constructing calls with do.call() and the possible pitfalls we might encounter - think of: unname(). Nevertheless, once useRs have understood the advanced concept of do.call() and how to use it, more advanced ordering operations can be tackled easily well.

‘data․table’

When it comes to orderings rows ‘data.table’ is not much different than base R. Most of the ordering operations introduced above can be applied almost identically on a data.table. While the syntax resembles base R, ‘data.table’ is using its own implementation of order() under the hood, data.table:::forder(), which is optimized and much faster compared to base R.

In this section we will first look at how to use order() on the seven examples from above. The aim is to stay close to base R, but account for ‘data.table’s syntax specific features.

Apart from order(), ‘data.table’ comes with its two own ordering functions, setorder() and setorderv(), which modify a data.table object by reference - that is without making a copy. This makes them more memory efficient compared to the already optimized implementation of data.table:::forder().

The following code chucks use a data.table version of our data:

library(data.table)

mycarsDT <- as.data.table(mycars)

data.table specific syntax

One decisive difference between base R is that ‘data.table’ supports non-standard evaluation (NSE) within the subsetting / extracting [ expression.

This means we can refer bare column names like mpg instead of mycarsDT$mpg:

mycarsDT[order(mpg), ] # <-- this would throw an error in base R

Further, in data.table’s extraction function [ the j argument is optional, which is why we don’t need the trailing comma:

mycarsDT[order(mpg)]
#>                  model cyl vs gear  mpg  disp
#>  1:         Duster 360   8  0    3 14.3 360.0
#>  2:            Valiant   6  1    3 18.1 225.0
#>  3:  Hornet Sportabout   8  0    3 18.7 360.0
#>  4:           Merc 280   6  1    4 19.2 167.6
#>  5:          Mazda RX4   6  0    4 21.0 160.0
#>  6:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  7:     Hornet 4 Drive   6  1    3 21.4 258.0
#>  8:         Datsun 710   4  1    4 22.8 108.0
#>  9:           Merc 230   4  1    4 22.8 140.8
#> 10:          Merc 240D   4  1    4 24.4 146.7
#> 11: Cadillac Fleetwood   8  0    3   NA    NA
#> 12:        Honda Civic   4  1    4   NA    NA

Note that, unlike base::order(), ‘data.table’s implementation doesn’t keep track of the ordered rows. The rownames range from 1 to 12 like before (in base R the rownames showed for each row where it was originally coming from, allowing us to restore the order).

With the above syntax features in mind, we can rewrite the first five ordering examples as follows:

# 1. Ordering by one or several variables
mycarsDT[order(mpg)]
mycarsDT[order(-cyl, mpg)]

# 2. Ordering by a character vector with matching names
mycarsDT[order(factor(model, levels = my_vec))]

# 3. Ordering by a simple expression
mycarsDT[order(model != "Hornet Sportabout")]

# 4. Ordering by a complex expression (positive numeric column)
mycarsDT[order(vs, ifelse(vs == 1, -mpg, mpg))]
# 4. Ordering by a complex expression (character column)
mycarsDT[order(vs, ifelse(vs == 1, -xtfrm(model), xtfrm(model)))]

# 5. Ordering by all columns of a data.frame
mycarsDT[do.call("order", unname(mycarsDT[, -1]))]
# 1. Ordering by one or several variables
mycarsDT[order(mpg)]
#>                  model cyl vs gear  mpg  disp
#>  1:         Duster 360   8  0    3 14.3 360.0
#>  2:            Valiant   6  1    3 18.1 225.0
#>  3:  Hornet Sportabout   8  0    3 18.7 360.0
#>  4:           Merc 280   6  1    4 19.2 167.6
#>  5:          Mazda RX4   6  0    4 21.0 160.0
#>  6:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  7:     Hornet 4 Drive   6  1    3 21.4 258.0
#>  8:         Datsun 710   4  1    4 22.8 108.0
#>  9:           Merc 230   4  1    4 22.8 140.8
#> 10:          Merc 240D   4  1    4 24.4 146.7
#> 11: Cadillac Fleetwood   8  0    3   NA    NA
#> 12:        Honda Civic   4  1    4   NA    NA
mycarsDT[order(-cyl, mpg)]
#>                  model cyl vs gear  mpg  disp
#>  1:         Duster 360   8  0    3 14.3 360.0
#>  2:  Hornet Sportabout   8  0    3 18.7 360.0
#>  3: Cadillac Fleetwood   8  0    3   NA    NA
#>  4:            Valiant   6  1    3 18.1 225.0
#>  5:           Merc 280   6  1    4 19.2 167.6
#>  6:          Mazda RX4   6  0    4 21.0 160.0
#>  7:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  8:     Hornet 4 Drive   6  1    3 21.4 258.0
#>  9:         Datsun 710   4  1    4 22.8 108.0
#> 10:           Merc 230   4  1    4 22.8 140.8
#> 11:          Merc 240D   4  1    4 24.4 146.7
#> 12:        Honda Civic   4  1    4   NA    NA

# 2. Ordering by a character vector with matching names
mycarsDT[order(factor(model, levels = my_vec))]
#>                  model cyl vs gear  mpg  disp
#>  1:  Hornet Sportabout   8  0    3 18.7 360.0
#>  2: Cadillac Fleetwood   8  0    3   NA    NA
#>  3:            Valiant   6  1    3 18.1 225.0
#>  4:     Hornet 4 Drive   6  1    3 21.4 258.0
#>  5:          Mazda RX4   6  0    4 21.0 160.0
#>  6:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  7:        Honda Civic   4  1    4   NA    NA
#>  8:         Datsun 710   4  1    4 22.8 108.0
#>  9:         Duster 360   8  0    3 14.3 360.0
#> 10:          Merc 240D   4  1    4 24.4 146.7
#> 11:           Merc 230   4  1    4 22.8 140.8
#> 12:           Merc 280   6  1    4 19.2 167.6

# 3. Ordering by a simple expression
mycarsDT[order(model != "Hornet Sportabout")]
#>                  model cyl vs gear  mpg  disp
#>  1:  Hornet Sportabout   8  0    3 18.7 360.0
#>  2:          Mazda RX4   6  0    4 21.0 160.0
#>  3:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  4:         Datsun 710   4  1    4 22.8 108.0
#>  5:     Hornet 4 Drive   6  1    3 21.4 258.0
#>  6:            Valiant   6  1    3 18.1 225.0
#>  7:         Duster 360   8  0    3 14.3 360.0
#>  8:          Merc 240D   4  1    4 24.4 146.7
#>  9:           Merc 230   4  1    4 22.8 140.8
#> 10:           Merc 280   6  1    4 19.2 167.6
#> 11: Cadillac Fleetwood   8  0    3   NA    NA
#> 12:        Honda Civic   4  1    4   NA    NA

# 4. Ordering by a complex expression (positive numeric column)
mycarsDT[order(vs, ifelse(vs == 1, -mpg, mpg))]
#>                  model cyl vs gear  mpg  disp
#>  1:         Duster 360   8  0    3 14.3 360.0
#>  2:  Hornet Sportabout   8  0    3 18.7 360.0
#>  3:          Mazda RX4   6  0    4 21.0 160.0
#>  4:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  5: Cadillac Fleetwood   8  0    3   NA    NA
#>  6:          Merc 240D   4  1    4 24.4 146.7
#>  7:         Datsun 710   4  1    4 22.8 108.0
#>  8:           Merc 230   4  1    4 22.8 140.8
#>  9:     Hornet 4 Drive   6  1    3 21.4 258.0
#> 10:           Merc 280   6  1    4 19.2 167.6
#> 11:            Valiant   6  1    3 18.1 225.0
#> 12:        Honda Civic   4  1    4   NA    NA
# 4. Ordering by a complex expression (character column)
mycarsDT[order(vs, ifelse(vs == 1, -xtfrm(model), xtfrm(model)))]
#>                  model cyl vs gear  mpg  disp
#>  1: Cadillac Fleetwood   8  0    3   NA    NA
#>  2:         Duster 360   8  0    3 14.3 360.0
#>  3:  Hornet Sportabout   8  0    3 18.7 360.0
#>  4:          Mazda RX4   6  0    4 21.0 160.0
#>  5:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  6:            Valiant   6  1    3 18.1 225.0
#>  7:           Merc 280   6  1    4 19.2 167.6
#>  8:          Merc 240D   4  1    4 24.4 146.7
#>  9:           Merc 230   4  1    4 22.8 140.8
#> 10:     Hornet 4 Drive   6  1    3 21.4 258.0
#> 11:        Honda Civic   4  1    4   NA    NA
#> 12:         Datsun 710   4  1    4 22.8 108.0

# 5. Ordering by all columns of a data.frame
mycarsDT[do.call("order", unname(mycarsDT[, -1]))]
#>                  model cyl vs gear  mpg  disp
#>  1:         Datsun 710   4  1    4 22.8 108.0
#>  2:           Merc 230   4  1    4 22.8 140.8
#>  3:          Merc 240D   4  1    4 24.4 146.7
#>  4:        Honda Civic   4  1    4   NA    NA
#>  5:          Mazda RX4   6  0    4 21.0 160.0
#>  6:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  7:            Valiant   6  1    3 18.1 225.0
#>  8:     Hornet 4 Drive   6  1    3 21.4 258.0
#>  9:           Merc 280   6  1    4 19.2 167.6
#> 10:         Duster 360   8  0    3 14.3 360.0
#> 11:  Hornet Sportabout   8  0    3 18.7 360.0
#> 12: Cadillac Fleetwood   8  0    3   NA    NA

Note that the code chunk above shows only the ordering operation without assignment. To actually transform the data.table object we would need to assign the calls above to a new (or the same) object name.

When using a vector of column names to subset a data.table, as we did in the sixth example, we need to precede the vector containing the column names (here: mycol) with a double dot .. to tell ‘data.table’ that we are looking for an external vector and not a column named mycol inside our data.table.

# 6. Ordering by a list or vector of column names
mycols <- c("mpg", "cyl")
sort_df <- unname(mycarsDT[, ..mycols])
mycarsDT[do.call("order", sort_df)]

However, our base R example was a bit more complex, since we also provided a logical vector to order()s decreasing argument. ‘data.table’s implementation of order(), data.table:::forder(), does only allow vectors of length one, which is why we can’t reproduce the full example from above using do.call() inside mycarsDT:

desc_ls <- list(decreasing = c(TRUE, FALSE))
mycarsDT[do.call("order", c(sort_df, desc_ls))]
#> Error: isTRUEorFALSE(decreasing) is not TRUE

We will see below, that there is a better way of achieving the desired outcome. Finally, our last example can again be reproduced very similar to our base R approach.

# 7. Ordering by a vector of matching patterns
my_pattern <- c("Mazda", "Merc", "Hornet")

idx_ls <- lapply(my_pattern,
                 \(x) !grepl(paste0("^", x), mycarsDT$model))

mycarsDT[do.call("order", idx_ls)]
#>                  model cyl vs gear  mpg  disp
#>  1:          Mazda RX4   6  0    4 21.0 160.0
#>  2:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  3:          Merc 240D   4  1    4 24.4 146.7
#>  4:           Merc 230   4  1    4 22.8 140.8
#>  5:           Merc 280   6  1    4 19.2 167.6
#>  6:     Hornet 4 Drive   6  1    3 21.4 258.0
#>  7:  Hornet Sportabout   8  0    3 18.7 360.0
#>  8:         Datsun 710   4  1    4 22.8 108.0
#>  9:            Valiant   6  1    3 18.1 225.0
#> 10:         Duster 360   8  0    3 14.3 360.0
#> 11: Cadillac Fleetwood   8  0    3   NA    NA
#> 12:        Honda Civic   4  1    4   NA    NA

data.table’s setorder functions

‘data.table’ comes with its own two ordering functions, setorder() and setorderv(), which modify a data.table object “by reference”, that is without making a copy. This is especially helpful when we are dealing with data that takes up a lot of memory and where we want to avoid unnecessary copies.

As first argument, x, both functions take a data.table. As second argument setorder() uses the ellipsis ... which allows us to supply one or several bare column names to order by. setorder() sorts in ascending order as default and allows the minus symbol - as prefix to sort a column in decreasing order. setorder() can be straight forward applied to our first example:

# 1. Ordering by one or several variables
setorder(mycarsDT, mpg)
setorder(mycarsDT, -mpg, cyl, na.last = TRUE)

setorderv()s second argument is cols which takes a character vector of column names and defaults to the column names of the data.table supplied in x. To specify an ascending or descending order we can supply a numeric vector of 1 and -1 to the order argument. setorderv() can be applied to the examples 5. (ordering by all columns) and 6. (ordering by a vector of column names):

# 5. Ordering by all columns of a data.frame
setorderv(mycarsDT,
          cols = names(mycarsDT)[-1],
          na.last = TRUE)

# 6. Ordering by a list or vector of column names
setorderv(mycarsDT,
          cols = mycols,
          order = c(-1,1),
          na.last = TRUE)

Unlike base::order(), both functions default to sorting NAs first. We need to set na.last = TRUE to reproduce our base R examples from above.

Unfortunately, setorder() doesn’t support arbitrary expressions in the ellipsis .... We must only use bare column names and, optionally, a minus symbol as prefix -. All other expressions will throw an error. Similarly, setorderv() only accepts a character vector of column names. So there is no straightforward way to apply either function to the examples 2., 3., 4. and 7. from above.

# Examples 2., 3. and 4. aren't working:
setorder(mycarsDT, factor(model, levels = my_vec), na.last = TRUE)
#> Error in setorderv(x, cols, order, na.last): some columns are not in the data.table: factor,my_vec
setorder(mycarsDT, model != "Hornet Sportabout", na.last = TRUE)
#> Error in setorderv(x, cols, order, na.last): some columns are not in the data.table: !=,Hornet Sportabout
setorder(mycarsDT, ifelse(vs == 1, -mpg, mpg), na.last = TRUE)
#> Error in setorderv(x, cols, order, na.last): some columns are not in the data.table: ifelse,vs == 1,-mpg

We can, however, come up with a workaround to harness ‘data.table’s power of memory efficiently modifying a data.table by reference.

In all four example the workaround is the same. We use setorder() and pass a modified data.table to it in which we create a new (or several) column(s) by reference. This new column contains the ordering logic. We use the extraction function [ right after setorder() to again delete the newly created column(s).

Let’s take a look at example 3., ordering by a simple logicl expression.

# 3. Ordering by a simple expression
setorder(
  mycarsDT[, ord := model != "Hornet Sportabout"],
  vs,
  ord,
  na.last = TRUE
  )[, ord := NULL]

# setorder() doesn't return the data, but changes it "in place"
# so to look at the reordered data we have to print it:
mycarsDT 
#>                  model cyl vs gear  mpg  disp
#>  1:  Hornet Sportabout   8  0    3 18.7 360.0
#>  2:          Mazda RX4   6  0    4 21.0 160.0
#>  3:      Mazda RX4 Wag   6  0    4 21.0 160.0
#>  4:         Duster 360   8  0    3 14.3 360.0
#>  5: Cadillac Fleetwood   8  0    3   NA    NA
#>  6:          Merc 240D   4  1    4 24.4 146.7
#>  7:         Datsun 710   4  1    4 22.8 108.0
#>  8:           Merc 230   4  1    4 22.8 140.8
#>  9:     Hornet 4 Drive   6  1    3 21.4 258.0
#> 10:           Merc 280   6  1    4 19.2 167.6
#> 11:            Valiant   6  1    3 18.1 225.0
#> 12:        Honda Civic   4  1    4   NA    NA

Exchanging the expression after the walrus operator := allows us to apply this approach to the other examples, like ordering by factor levels on the fly factor(model, levels = my_vec) or by a complex expression like ifelse(vs == 1, -mpg, mpg).

The approach above can be slightly adapted to help us with example no. 7, ordering by a vector of matching patterns. Here we first create three new variable names to order by order_cols <- paste0("ord", 1:3). Then we use setorderv(), pass a modified version of the mycarsDT to it, in which we actually create our three new order_cols. The lapply call on the right side of the walrus operator is the same as from our base R example above. We tell setorderv to order by our newly created columns and then delete those columns again right after the execution of setorderv() with [, (order_cols) := NULL].

# 7. Ordering by a vector of matching patterns
order_cols <- paste0("ord", 1:3)

setorderv(
  mycarsDT[, (order_cols) :=
             lapply(my_pattern, \(x) !grepl(paste0("^", x), model))],
  cols = order_cols,
  na.last = TRUE
  )[, (order_cols) := NULL]

Summing up: Ordering rows with ‘data.table’

Above we saw that using order() on data.table is very similar to using order() on a data.frame in base R. Due to ‘data.table’s special syntax the calls are less verbose and thanks to its own implementation of order() the performance is much faster compared to base R.

With order() all but one of the challenges from above could be tackled. The only problem we encountered was that data.table:::forder() doesn’t allow vectors of length greater than one in the decreasing argument.

Apart from order() ‘data.table’ has two special ordering functions setorder() and setorderv() which are more memory efficient. Both functions can be easily applied when we want to order by columns in increasing or decreasing order. However, when using expressions other than column names, workarounds are needed to use the full potential of ‘data.table’ setorder functions.

dplyr

‘dplyr’ is known to be a “consistent grammar of data manipulation”. It comes with several so called one-table verbs, which cover the most basic data operations - and ordering rows is one of them.

To order rows of a tibble or data.frame we use dplyr::arrange(). The first argument is the data.frame we want to order, and the second argument is the ellipsis ... allowing us to provide one or several expressions to order by.

As default dplyr::arrange() orders columns in ascending order. To reverse this, we can wrap column names in dplyr::desc(). NA are always sorted last, and there is no argument to change this behavior. Finally, arrange() is one of the few one-table verbs that ignores groupings of a data.frame, but this behavior can be changed (see info box below).

Although somewhat counterintuitive dplyr::arrange() does ignore (but preserve) groupings of a data.frame:

mycars_tbl %>% 
  group_by(cyl) %>% 
  arrange(mpg)
#> # A tibble: 12 × 6
#> # Groups:   cyl [3]
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Duster 360             8     0     3  14.3  360 
#>  2 Valiant                6     1     3  18.1  225 
#>  3 Hornet Sportabout      8     0     3  18.7  360 
#>  4 Merc 280               6     1     4  19.2  168.
#>  5 Mazda RX4              6     0     4  21    160 
#>  6 Mazda RX4 Wag          6     0     4  21    160 
#>  7 Hornet 4 Drive         6     1     3  21.4  258 
#>  8 Datsun 710             4     1     4  22.8  108 
#>  9 Merc 230               4     1     4  22.8  141.
#> 10 Merc 240D              4     1     4  24.4  147.
#> 11 Cadillac Fleetwood     8     0     3  NA     NA 
#> 12 Honda Civic            4     1     4  NA     NA

Here we’d probably expected the data to be sorted by mpg in ascending order within each group of cyl:

mycars_tbl %>%  
  arrange(cyl, mpg)
#> # A tibble: 12 × 6
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Datsun 710             4     1     4  22.8  108 
#>  2 Merc 230               4     1     4  22.8  141.
#>  3 Merc 240D              4     1     4  24.4  147.
#>  4 Honda Civic            4     1     4  NA     NA 
#>  5 Valiant                6     1     3  18.1  225 
#>  6 Merc 280               6     1     4  19.2  168.
#>  7 Mazda RX4              6     0     4  21    160 
#>  8 Mazda RX4 Wag          6     0     4  21    160 
#>  9 Hornet 4 Drive         6     1     3  21.4  258 
#> 10 Duster 360             8     0     3  14.3  360 
#> 11 Hornet Sportabout      8     0     3  18.7  360 
#> 12 Cadillac Fleetwood     8     0     3  NA     NA

However, arrange() has an argument .by_group which is set to FALSE as default. Changing this to TRUE will make arrange work in the way we expected it:

mycars_tbl %>% 
  group_by(cyl) %>% 
  arrange(mpg, .by_group = TRUE)
#> # A tibble: 12 × 6
#> # Groups:   cyl [3]
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Datsun 710             4     1     4  22.8  108 
#>  2 Merc 230               4     1     4  22.8  141.
#>  3 Merc 240D              4     1     4  24.4  147.
#>  4 Honda Civic            4     1     4  NA     NA 
#>  5 Valiant                6     1     3  18.1  225 
#>  6 Merc 280               6     1     4  19.2  168.
#>  7 Mazda RX4              6     0     4  21    160 
#>  8 Mazda RX4 Wag          6     0     4  21    160 
#>  9 Hornet 4 Drive         6     1     3  21.4  258 
#> 10 Duster 360             8     0     3  14.3  360 
#> 11 Hornet Sportabout      8     0     3  18.7  360 
#> 12 Cadillac Fleetwood     8     0     3  NA     NA

The following code chunks use a tibble version of our data:

library(dplyr)
library(purrr)
mycars_tbl <- as_tibble(mycars)

Similar to ‘data.table’, ‘dplyr’ uses non-standard evaluation, which is why we can use bare column names inside arrange() without referring to our data.frame with mycars_tbl$.

# 1. Ordering by one or several variables
mycars_tbl %>% 
  arrange(mpg)

mycars_tbl %>% 
  arrange(desc(cyl), mpg)

Since arrange() accepts not only bare column names, but any arbitrary expressions we can easily rewrite the examples 2. to 4. as follows:

(Note that the code chunks below show only the ordering operation without assignment. To actually transform the data object we would need to assign the calls above to a new, or the same, object name.)

# 2. Ordering by a character vector with matching names
my_vec <- c("Hornet Sportabout", "Cadillac Fleetwood", "Valiant",
             "Hornet 4 Drive", "Mazda RX4", "Mazda RX4 Wag", "Honda Civic",
             "Datsun 710", "Duster 360", "Merc 240D", "Merc 230", "Merc 280")

mycars_tbl %>% 
  arrange(factor(model, levels = my_vec))

# 3. Ordering by a simple expression
mycars_tbl %>% 
  arrange(model != "Hornet Sportabout")

# 4. Ordering by a complex expression (positive numeric column)
mycars_tbl %>% 
  arrange(vs, ifelse(vs == 1, desc(mpg), mpg))
# 2. Ordering by a character vector with matching names
my_vec <- c("Hornet Sportabout", "Cadillac Fleetwood", "Valiant",
             "Hornet 4 Drive", "Mazda RX4", "Mazda RX4 Wag", "Honda Civic",
             "Datsun 710", "Duster 360", "Merc 240D", "Merc 230", "Merc 280")

mycars_tbl %>% 
  arrange(factor(model, levels = my_vec))
#> # A tibble: 12 × 6
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Hornet Sportabout      8     0     3  18.7  360 
#>  2 Cadillac Fleetwood     8     0     3  NA     NA 
#>  3 Valiant                6     1     3  18.1  225 
#>  4 Hornet 4 Drive         6     1     3  21.4  258 
#>  5 Mazda RX4              6     0     4  21    160 
#>  6 Mazda RX4 Wag          6     0     4  21    160 
#>  7 Honda Civic            4     1     4  NA     NA 
#>  8 Datsun 710             4     1     4  22.8  108 
#>  9 Duster 360             8     0     3  14.3  360 
#> 10 Merc 240D              4     1     4  24.4  147.
#> 11 Merc 230               4     1     4  22.8  141.
#> 12 Merc 280               6     1     4  19.2  168.

# 3. Ordering by a simple expression
mycars_tbl %>% 
  arrange(model != "Hornet Sportabout")
#> # A tibble: 12 × 6
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Hornet Sportabout      8     0     3  18.7  360 
#>  2 Mazda RX4              6     0     4  21    160 
#>  3 Mazda RX4 Wag          6     0     4  21    160 
#>  4 Datsun 710             4     1     4  22.8  108 
#>  5 Hornet 4 Drive         6     1     3  21.4  258 
#>  6 Valiant                6     1     3  18.1  225 
#>  7 Duster 360             8     0     3  14.3  360 
#>  8 Merc 240D              4     1     4  24.4  147.
#>  9 Merc 230               4     1     4  22.8  141.
#> 10 Merc 280               6     1     4  19.2  168.
#> 11 Cadillac Fleetwood     8     0     3  NA     NA 
#> 12 Honda Civic            4     1     4  NA     NA

# 4. Ordering by a complex expression (positive numeric column)
mycars_tbl %>% 
  arrange(vs, ifelse(vs == 1, desc(mpg), mpg))
#> # A tibble: 12 × 6
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Duster 360             8     0     3  14.3  360 
#>  2 Hornet Sportabout      8     0     3  18.7  360 
#>  3 Mazda RX4              6     0     4  21    160 
#>  4 Mazda RX4 Wag          6     0     4  21    160 
#>  5 Cadillac Fleetwood     8     0     3  NA     NA 
#>  6 Merc 240D              4     1     4  24.4  147.
#>  7 Datsun 710             4     1     4  22.8  108 
#>  8 Merc 230               4     1     4  22.8  141.
#>  9 Hornet 4 Drive         6     1     3  21.4  258 
#> 10 Merc 280               6     1     4  19.2  168.
#> 11 Valiant                6     1     3  18.1  225 
#> 12 Honda Civic            4     1     4  NA     NA

Similarly to what we have seen in base R and ‘data.table’, the case of ordering by a complex ifelse condition which is applied to a character column (or a numeric column that contains positive and negative values) is also in ‘dplyr’ a bit trickier.

Just using desc() on one part of the ifelse condition will not yield the desired result:

mycars_tbl %>% 
  arrange(vs, ifelse(vs == 1, desc(model), model))
#> # A tibble: 12 × 6
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Cadillac Fleetwood     8     0     3  NA     NA 
#>  2 Duster 360             8     0     3  14.3  360 
#>  3 Hornet Sportabout      8     0     3  18.7  360 
#>  4 Mazda RX4              6     0     4  21    160 
#>  5 Mazda RX4 Wag          6     0     4  21    160 
#>  6 Merc 240D              4     1     4  24.4  147.
#>  7 Merc 280               6     1     4  19.2  168.
#>  8 Valiant                6     1     3  18.1  225 
#>  9 Datsun 710             4     1     4  22.8  108 
#> 10 Honda Civic            4     1     4  NA     NA 
#> 11 Hornet 4 Drive         6     1     3  21.4  258 
#> 12 Merc 230               4     1     4  22.8  141.

The reason for this is that under the hood desc() is a wrapper of -xtfrm() which is why the ifelse statement will coerce the result of the later with our original vector:

ifelse(mycars_tbl$vs == 1, desc(mycars_tbl$model), mycars_tbl$model)
#>  [1] "Mazda RX4"          "Mazda RX4 Wag"      "-2"                
#>  [4] "-5"                 "Hornet Sportabout"  "-12"               
#>  [7] "Duster 360"         "-10"                "-9"                
#> [10] "-11"                "Cadillac Fleetwood" "-4"

We can either wrap model in xtfrm() or, for readability, we can create a helper function which just wraps xtfrm():

# helper function for readability
asc <- function(x) xtfrm(x)

mycars_tbl %>% 
  arrange(vs, ifelse(vs == 1, desc(model), asc(model)))
#> # A tibble: 12 × 6
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Cadillac Fleetwood     8     0     3  NA     NA 
#>  2 Duster 360             8     0     3  14.3  360 
#>  3 Hornet Sportabout      8     0     3  18.7  360 
#>  4 Mazda RX4              6     0     4  21    160 
#>  5 Mazda RX4 Wag          6     0     4  21    160 
#>  6 Valiant                6     1     3  18.1  225 
#>  7 Merc 280               6     1     4  19.2  168.
#>  8 Merc 240D              4     1     4  24.4  147.
#>  9 Merc 230               4     1     4  22.8  141.
#> 10 Hornet 4 Drive         6     1     3  21.4  258 
#> 11 Honda Civic            4     1     4  NA     NA 
#> 12 Datsun 710             4     1     4  22.8  108

When programmatically ordering rows with ‘dplyr’ we can use across() inside arrange(). across() lets us use either tidy-select syntax to select one or several columns. For example when we want to order by all columns except model, we can use across(!model) inside arrange():

# 5. Ordering by all columns of a data.frame
mycars_tbl %>% 
  arrange(across(!model))
#> # A tibble: 12 × 6
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Datsun 710             4     1     4  22.8  108 
#>  2 Merc 230               4     1     4  22.8  141.
#>  3 Merc 240D              4     1     4  24.4  147.
#>  4 Honda Civic            4     1     4  NA     NA 
#>  5 Mazda RX4              6     0     4  21    160 
#>  6 Mazda RX4 Wag          6     0     4  21    160 
#>  7 Valiant                6     1     3  18.1  225 
#>  8 Hornet 4 Drive         6     1     3  21.4  258 
#>  9 Merc 280               6     1     4  19.2  168.
#> 10 Duster 360             8     0     3  14.3  360 
#> 11 Hornet Sportabout      8     0     3  18.7  360 
#> 12 Cadillac Fleetwood     8     0     3  NA     NA

Another option that across() offers is to use tidy-select helper functions, like all_of(), which allows us to pass a character vector of column names to order by:

# 6. Ordering by a list or vector of column names
mycols <- c("mpg", "cyl")

mycars_tbl %>% 
  arrange(across(all_of(mycols)))
#> # A tibble: 12 × 6
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Duster 360             8     0     3  14.3  360 
#>  2 Valiant                6     1     3  18.1  225 
#>  3 Hornet Sportabout      8     0     3  18.7  360 
#>  4 Merc 280               6     1     4  19.2  168.
#>  5 Mazda RX4              6     0     4  21    160 
#>  6 Mazda RX4 Wag          6     0     4  21    160 
#>  7 Hornet 4 Drive         6     1     3  21.4  258 
#>  8 Datsun 710             4     1     4  22.8  108 
#>  9 Merc 230               4     1     4  22.8  141.
#> 10 Merc 240D              4     1     4  24.4  147.
#> 11 Honda Civic            4     1     4  NA     NA 
#> 12 Cadillac Fleetwood     8     0     3  NA     NA

The usability of working programmatically with arrange() stops with the option to provide a character vector of column names. When we further want to specify which columns should be sorted in ascending or descending order, things get a little bit more complicated.

In this case we should first construct a named vector, below arg_vec, containing TRUE or FALSE for “descending” or not. It should be named after the column names we want to order by. We can then use this vector in an anonymous function in which we subset it with dplyr::cur_column() in an if clause saying: if TRUE then use your current column values in descending order desc(.x) or else just use the current column values as they are .x.

# 6. Ordering by vectors of column names and descending argument
desc_vec <- c(TRUE, FALSE)
arg_vec <- set_names(desc_vec, mycols)

mycars_tbl %>% 
  arrange(across(all_of(mycols),
                 ~ if (arg_vec[cur_column()]) desc(.x) else .x))
#> # A tibble: 12 × 6
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Merc 240D              4     1     4  24.4  147.
#>  2 Datsun 710             4     1     4  22.8  108 
#>  3 Merc 230               4     1     4  22.8  141.
#>  4 Hornet 4 Drive         6     1     3  21.4  258 
#>  5 Mazda RX4              6     0     4  21    160 
#>  6 Mazda RX4 Wag          6     0     4  21    160 
#>  7 Merc 280               6     1     4  19.2  168.
#>  8 Hornet Sportabout      8     0     3  18.7  360 
#>  9 Valiant                6     1     3  18.1  225 
#> 10 Duster 360             8     0     3  14.3  360 
#> 11 Honda Civic            4     1     4  NA     NA 
#> 12 Cadillac Fleetwood     8     0     3  NA     NA

Finally, the last example of ordering by a vector of matching patterns, is actually easier to tackle than it looks like.

‘dplyr’ offers us two ways of doing this.

1․ Pack all the magic in one pipe:
We can take our base R approach with lapply() and grepl() use purrr::map() with a lambda function ~ instead and pipe the result into dplyr::bind_cols(). This last part is needed, since arrange() accepts data.frames in the ellipsis argument, but not a list.

# 7. Ordering by a vector of matching patterns
my_pattern <- c("Mazda", "Merc", "Hornet")

mycars_tbl %>% 
  arrange(
    map(my_pattern,
        ~ !grepl(paste0("^", .x), model)
        ) %>% bind_cols()
    )
#> # A tibble: 12 × 6
#>    model                cyl    vs  gear   mpg  disp
#>    <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4              6     0     4  21    160 
#>  2 Mazda RX4 Wag          6     0     4  21    160 
#>  3 Merc 240D              4     1     4  24.4  147.
#>  4 Merc 230               4     1     4  22.8  141.
#>  5 Merc 280               6     1     4  19.2  168.
#>  6 Hornet 4 Drive         6     1     3  21.4  258 
#>  7 Hornet Sportabout      8     0     3  18.7  360 
#>  8 Datsun 710             4     1     4  22.8  108 
#>  9 Valiant                6     1     3  18.1  225 
#> 10 Duster 360             8     0     3  14.3  360 
#> 11 Cadillac Fleetwood     8     0     3  NA     NA 
#> 12 Honda Civic            4     1     4  NA     NA

2․ Create an index list and splice it into arrange:
Alternatively we can proceed in two steps. First create an index list, similar to our base R approach. Then, instead of using do.call we can splice the list as arguments to arrange() using the triple bang operator !!!.

# 7. Ordering by a vector of matching patterns
idx_ls <- map(my_pattern,
              ~ !grepl(paste0("^", .x), mycars_tbl$model)
              )
 
mycars_tbl %>% 
  arrange(!!! idx_ls)
# same result as above

Summing up: Ordering rows with ‘dplyr’

The examples above show that ‘dplyr’ offers a very intuitive API for ordering rows. In most cases a call to arrange is enough to get our desired result. When ordering rows programmatically, ‘dplyr’ has us covered with across() and ‘tidyselect’ helper functions, like all_of(), which can be used inside arrange. ‘dplyr’ feels definitely more beginner-friendly than base R, since arrange() covers more common use cases, and across() seems to have less conceptional overhead compared to base’s do.call. However, this holds true only up to a certain grade of complexity. For the more advanced examples splicing !!! and subsetting with cur_colum() were needed, which do not differ much in terms of conceptional overhead.

pandas

To conclude this post, let’s look at how python’s ‘pandas’ library tackles our seven ordering challenges. First, lets import ‘pandas’ and read in the data:

import pandas as pd
mycars = pd.read_csv("mycars.csv", index_col = 0)

There are several ways of reordering rows in a pandas DataFrame. Here we will focus on the sort_values() method for DataFrame objects.

sort_values() can work in both directions, ordering rows or, alternatively, ordering columns. With our examples from above in mind, we will only look at ordering rows, leaving the axis argument in its default value 0.

With this default setting, sort_values() takes a column name or list of column names to sort by. We can further specify for each column whether its values should be sorted in ascending or descending order by passing a list of True and False values to the ascending argument.

With only those two arguments, by and ascending, we can easily handle the examples 1., 5. and 6., where we just need to specify column names and their order:

# 1. sort one or several columns in ascending or descending order
mycars.sort_values(by='mpg')

(mycars.
 sort_values(by=['cyl', 'mpg'],
             ascending=[False, True])
)

# 5. order by all columns of a df

(mycars.
 sort_values(by = list(mycars.columns)[1:])
)

# 6. order by list of string column names

my_cols = ['mpg', 'disp']
my_order = [False, True]
 
(mycars.
 sort_values(by = my_cols,
             ascending = my_order)
)
# 1. sort one or several columns in ascending or descending order
mycars.sort_values(by='mpg')

#>                  model  cyl  vs  gear   mpg   disp
#> 7           Duster 360    8   0     3  14.3  360.0
#> 6              Valiant    6   1     3  18.1  225.0
#> 5    Hornet Sportabout    8   0     3  18.7  360.0
#> 10            Merc 280    6   1     4  19.2  167.6
#> 1            Mazda RX4    6   0     4  21.0  160.0
#> 2        Mazda RX4 Wag    6   0     4  21.0  160.0
#> 4       Hornet 4 Drive    6   1     3  21.4  258.0
#> 3           Datsun 710    4   1     4  22.8  108.0
#> 9             Merc 230    4   1     4  22.8  140.8
#> 8            Merc 240D    4   1     4  24.4  146.7
#> 11  Cadillac Fleetwood    8   0     3   NaN    NaN
#> 12         Honda Civic    4   1     4   NaN    NaN

(mycars.
 sort_values(by=['cyl', 'mpg'],
             ascending=[False, True])
)

# 5. order by all columns of a df

#>                  model  cyl  vs  gear   mpg   disp
#> 7           Duster 360    8   0     3  14.3  360.0
#> 5    Hornet Sportabout    8   0     3  18.7  360.0
#> 11  Cadillac Fleetwood    8   0     3   NaN    NaN
#> 6              Valiant    6   1     3  18.1  225.0
#> 10            Merc 280    6   1     4  19.2  167.6
#> 1            Mazda RX4    6   0     4  21.0  160.0
#> 2        Mazda RX4 Wag    6   0     4  21.0  160.0
#> 4       Hornet 4 Drive    6   1     3  21.4  258.0
#> 3           Datsun 710    4   1     4  22.8  108.0
#> 9             Merc 230    4   1     4  22.8  140.8
#> 8            Merc 240D    4   1     4  24.4  146.7
#> 12         Honda Civic    4   1     4   NaN    NaN

(mycars.
 sort_values(by = list(mycars.columns)[1:])
)

# 6. order by list of string column names

#>                  model  cyl  vs  gear   mpg   disp
#> 3           Datsun 710    4   1     4  22.8  108.0
#> 9             Merc 230    4   1     4  22.8  140.8
#> 8            Merc 240D    4   1     4  24.4  146.7
#> 12         Honda Civic    4   1     4   NaN    NaN
#> 1            Mazda RX4    6   0     4  21.0  160.0
#> 2        Mazda RX4 Wag    6   0     4  21.0  160.0
#> 6              Valiant    6   1     3  18.1  225.0
#> 4       Hornet 4 Drive    6   1     3  21.4  258.0
#> 10            Merc 280    6   1     4  19.2  167.6
#> 7           Duster 360    8   0     3  14.3  360.0
#> 5    Hornet Sportabout    8   0     3  18.7  360.0
#> 11  Cadillac Fleetwood    8   0     3   NaN    NaN

my_cols = ['mpg', 'disp']
my_order = [False, True]
 
(mycars.
 sort_values(by = my_cols,
             ascending = my_order)
)

#>                  model  cyl  vs  gear   mpg   disp
#> 8            Merc 240D    4   1     4  24.4  146.7
#> 3           Datsun 710    4   1     4  22.8  108.0
#> 9             Merc 230    4   1     4  22.8  140.8
#> 4       Hornet 4 Drive    6   1     3  21.4  258.0
#> 1            Mazda RX4    6   0     4  21.0  160.0
#> 2        Mazda RX4 Wag    6   0     4  21.0  160.0
#> 10            Merc 280    6   1     4  19.2  167.6
#> 5    Hornet Sportabout    8   0     3  18.7  360.0
#> 6              Valiant    6   1     3  18.1  225.0
#> 7           Duster 360    8   0     3  14.3  360.0
#> 11  Cadillac Fleetwood    8   0     3   NaN    NaN
#> 12         Honda Civic    4   1     4   NaN    NaN

Things get more complicated when we want to order by an expression which is not a column name. In our second example we want to order by a list of matching string names, ideally without transforming the data itself.

Here we can use sort_values() key argument. key takes a function and applies it to all columns specified in by before ordering. The idea is to order by the model column, and apply a lambda function to it, that first turns it into a categorical variable and then sets our list of matching names cat_ls as new categories before ordering. Like in R, ‘pandas’ sorts categorical variables according to the order of their categories (in R: levels).

# 2. order by list of strings with matching names
cat_ls = ["Hornet Sportabout", "Cadillac Fleetwood", "Valiant",
          "Hornet 4 Drive", "Mazda RX4", "Mazda RX4 Wag", "Honda Civic",
          "Datsun 710", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"]
 
(mycars.
 sort_values(by='model',
             key=lambda x: x
             .astype('category').cat.set_categories(cat_ls)
             )
)

#>                  model  cyl  vs  gear   mpg   disp
#> 5    Hornet Sportabout    8   0     3  18.7  360.0
#> 11  Cadillac Fleetwood    8   0     3   NaN    NaN
#> 6              Valiant    6   1     3  18.1  225.0
#> 4       Hornet 4 Drive    6   1     3  21.4  258.0
#> 1            Mazda RX4    6   0     4  21.0  160.0
#> 2        Mazda RX4 Wag    6   0     4  21.0  160.0
#> 12         Honda Civic    4   1     4   NaN    NaN
#> 3           Datsun 710    4   1     4  22.8  108.0
#> 7           Duster 360    8   0     3  14.3  360.0
#> 8            Merc 240D    4   1     4  24.4  146.7
#> 9             Merc 230    4   1     4  22.8  140.8
#> 10            Merc 280    6   1     4  19.2  167.6

A different way of tackling this problem is to set the model column as index (similar to rownames in R). Then reindex the data with our list of matching names, cat_ls, and finally resetting the index, so that model becomes a regular column again.

# 2. order by list of strings with matching names
(mycars.
 set_index('model').
 reindex(cat_ls).
 reset_index()
)
# output as above

This is probably the ‘pandas’ way to go about this problem, but it is important that we know about the key argument and how to use it. If an expression is related to values of a column, like in example no. 3, where we want to sort one value to the top, applying a simple lambda function is probably the easiest approach.

# 3. order by a simple logical expression
(mycars.
 sort_values(by='model',
             key=lambda x: x != "Hornet Sportabout"
             )
)

#>                  model  cyl  vs  gear   mpg   disp
#> 5    Hornet Sportabout    8   0     3  18.7  360.0
#> 1            Mazda RX4    6   0     4  21.0  160.0
#> 2        Mazda RX4 Wag    6   0     4  21.0  160.0
#> 3           Datsun 710    4   1     4  22.8  108.0
#> 4       Hornet 4 Drive    6   1     3  21.4  258.0
#> 6              Valiant    6   1     3  18.1  225.0
#> 7           Duster 360    8   0     3  14.3  360.0
#> 8            Merc 240D    4   1     4  24.4  146.7
#> 9             Merc 230    4   1     4  22.8  140.8
#> 10            Merc 280    6   1     4  19.2  167.6
#> 11  Cadillac Fleetwood    8   0     3   NaN    NaN
#> 12         Honda Civic    4   1     4   NaN    NaN

Also example no. 7, ordering by a list of matching patterns, can be solved by using the key argument. However, this case is quite complex.

Apart from the pattern we want to sort by, my_pattern, we need a dictionary to bring the patterns into an order, custom_dict. We then order by model, use the string replace method to replace the full model name with the name of the pattern, and finally map this transformed column over our custom dictionary. All values which are not matched by the dictionary are NaN and are automatically sorted last.

# 7. order by a list of matching patterns
my_pattern = ["Mazda", "Merc", "Hornet"]
 
custom_dict = {k: v for v, k in enumerate(my_pattern)}
 
(mycars.
 sort_values(by='model',
             key=lambda x: x
             .str.replace('^('+'|'.join(my_pattern)+').*',
                          r"\1",
                          regex = True)
             .map(custom_dict)
             )
)

#>                  model  cyl  vs  gear   mpg   disp
#> 1            Mazda RX4    6   0     4  21.0  160.0
#> 2        Mazda RX4 Wag    6   0     4  21.0  160.0
#> 8            Merc 240D    4   1     4  24.4  146.7
#> 9             Merc 230    4   1     4  22.8  140.8
#> 10            Merc 280    6   1     4  19.2  167.6
#> 4       Hornet 4 Drive    6   1     3  21.4  258.0
#> 5    Hornet Sportabout    8   0     3  18.7  360.0
#> 3           Datsun 710    4   1     4  22.8  108.0
#> 6              Valiant    6   1     3  18.1  225.0
#> 7           Duster 360    8   0     3  14.3  360.0
#> 11  Cadillac Fleetwood    8   0     3   NaN    NaN
#> 12         Honda Civic    4   1     4   NaN    NaN

Finally, the most troublesome challenge in ‘pandas’ is no. 4: ordering by a complex expression. In this example we wanted to sort the model names in descending order when the engine is v-shaped, vs == 0, and in ascending order if it’s normally shaped, vs == 1.

Inspired by this answer on StackOverflow, the idea is to first create an empty output DataFrame, below mycars2. We then loop over mycars grouped by vs and create a True or False variable, orderg, that checks if the group name grp_name is 0 or not. We then sort each group by='model' and pass the orderg flag to the ascending argument before appending the data to our output object. Finally, we drop and reset the index of mycars2 to restore the format of our inital data:

# 4. order by a complex expression
mycars2 = pd.DataFrame(data=None, columns=mycars.columns)

for grp_name, grp_dat in mycars.groupby(['vs']):
    orderg = (grp_name==0) # ascending order if vs == 0
    mycars2 = pd.concat([mycars2,
                         grp_dat.
                          reset_index().
                          sort_values(by='model', ascending=orderg)]
                        )
 
# final tweaks        
mycars2 = (mycars2.
 drop('index', axis=1).
 reset_index(drop=True)
)
mycars2

#>                  model cyl vs gear   mpg   disp
#> 0   Cadillac Fleetwood   8  0    3   NaN    NaN
#> 1           Duster 360   8  0    3  14.3  360.0
#> 2    Hornet Sportabout   8  0    3  18.7  360.0
#> 3            Mazda RX4   6  0    4  21.0  160.0
#> 4        Mazda RX4 Wag   6  0    4  21.0  160.0
#> 5              Valiant   6  1    3  18.1  225.0
#> 6             Merc 280   6  1    4  19.2  167.6
#> 7            Merc 240D   4  1    4  24.4  146.7
#> 8             Merc 230   4  1    4  22.8  140.8
#> 9       Hornet 4 Drive   6  1    3  21.4  258.0
#> 10         Honda Civic   4  1    4   NaN    NaN
#> 11          Datsun 710   4  1    4  22.8  108.0

Summing up: Ordering rows with ‘pandas’:

‘pandas’ sort_values() method has a strong programmatic interface similar to data.table::setorderv(). This makes it easy to work with, when we have a lists of column names and Boolean values to specify the sorting direction. Ordering by expressions other than column names seems to be more effort in ‘pandas’ compared to R. As long as the expressions can be derived from column values, sort_values() key argument can help us tackle most of the ordering challenges. However, once the expression we want to order by is not directly related to a specific column, we have to fall back to a classic for loop, which feels somewhat clunky compared to what we have seen in R.

Wrap-up

This post turned out to be almost a book chapter on ordering rows. I hope you enjoyed it. If you have a better approach to one of the examples above or if you have a special ordering challenge that I haven’t considered, let me know in the comments below or via Twitter, Mastodon or Github.

Session Info
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2023-02-22
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package    * version date (UTC) lib source
#>  data.table * 1.14.2  2021-09-27 [1] CRAN (R 4.2.0)
#>  dplyr      * 1.1.0   2023-01-29 [1] CRAN (R 4.2.0)
#>  purrr      * 1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
#>  reticulate * 1.26    2022-08-31 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ─ Python configuration ───────────────────────────────────────────────────────
#>  python:         /usr/local/Caskroom/miniconda/base/bin/python3.9
#>  libpython:      /usr/local/Caskroom/miniconda/base/lib/libpython3.9.dylib
#>  pythonhome:     /usr/local/Caskroom/miniconda/base:/usr/local/Caskroom/miniconda/base
#>  version:        3.9.12 (main, Apr  5 2022, 01:53:17)  [Clang 12.0.0 ]
#>  numpy:          /usr/local/Caskroom/miniconda/base/lib/python3.9/site-packages/numpy
#>  numpy_version:  1.22.3
#>  
#>  NOTE: Python version was forced by RETICULATE_PYTHON
#> 
#> ──────────────────────────────────────────────────────────────────────────────
To leave a comment for the author, please follow the link and comment on their blog: R | Tim Tiefenbach.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)