Site icon R-bloggers

Custom comparison function for sorting data in R

[This article was first published on r – log Fold Change, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Many languages allow you to use a custom comparison function in sorting. R is not an exception, but it is not entirely straightforward – it requires you to define a new class and overload certain operators. Here is how to do it.

Consider the following example. You have a certain number of paired values, for example

v <- list(a=c(2,1), b=c(1,3), d=c(1,1), e=c(2,3))

The job is to order these pairs in the following way. Given two pairs, p1=(x1, y1) and p2=(x2, y2), p1 < p2 iff one of the following conditions is fulfilled: either x1 < x2 and y1 <= y2, or x1 <= x2 and y1 < y2. The point is that if we draw lines, where one end of the line is at the height x1, and the other end is at the height y1, we want to sort these lines only if they do not cross — at most, only if one of their ends overlaps (but not both, because then the lines would be identical):

On the figure above, left panel, p1 < p2, because one of the ends is below the end of the other line (x1 < x2 and y1=y2). Of course, if y1 < y2 the inequality still holds. On the other hand, the right panel shows a case where we cannot resolve the comparison; the lines cross, so we should treat them as equal.

If now we have a list of such pairs and want to order it, we will have a problem. Here is the thing: the desired order is {d, a, b, e}. The element d=(1,1) is clearly smaller (as defined above) than all the others. However, b=(1,3) is not smaller than a=(2,1), and a is not smaller than b; that means, that a is equal to b, and their order should not be modified.

There is no way to do that with regular tools such as order, especially since x and y may not only be on different scales — they might be even completely different data types! One might be a numeric vector, the other a character string. Or, possibly, a type of requisite from Monty Python (with a defined relation stating that a banana is less than a gun). We must use a custom comparator.

For this, we need to notice that the R functions sort and order rely on the function xtfrm. This in turns relies on the methods ==, &gt; and [, defined for a given class. For numeric vectors, for example, these give what you would expect.

Our v vector is a list with elements which are pairs of numbers. For this type of data, there is no comparison defined; and comparing two pairs of numbers results with a vector of two logical numbers, which is not what we want.

> v[1] < v[2]
Error in v[1] < v[2] : comparison of these types is not implemented
> v[[1]] < v[[2]]
[1] FALSE  TRUE

R, however, is an object oriented language (even if it does not always feel like that). Comparisons (“, ==) are generic functions and it is possible to define (or redefine) them for any class of objects. So here is the plan: we invent a new class for the object v, and define custom comparisons for the elements of this class of objects. Remember that if we define a function which name consists of a generic (like "plot" or "["), a dot, and a name of the class, we are defining the method for the given class:

## make v an object of class "foo"
class(v) <- "foo"

## to use the "extract" ([) method, 
## we need to momentarily change the class of x, because 
## otherwise we will end up in an endless loop
'[.foo' <- function(x, i) {
    class(x) <- "list"
    x <- x[i]
    class(x) <- "foo"
    x
}

## define ">" as stated above
## the weird syntax results from the fact that a and b
## are lists with one element, this element being a vector 
## of a pair of numbers
'>.foo' <- function(a,b) {
a <- a[[1]]
b <- b[[1]]
ifelse( (a[1] > b[1] && a[2] >= b[2])
                     ||
        (a[1] >= b[1] && a[2] > b[2]), TRUE, FALSE)
}

## if we can't find a difference, then there is no difference
'==.foo' <- function(a, b) 
    ifelse(a > b || b > a, FALSE, TRUE)

## we don't need that, but for the sake of completeness...
'<.foo' <- function(a, b) b > a

This will now do exactly what we want:

> v["a"] == v["b"]
[1] TRUE
> v["a"] > v["d"]
[1] TRUE
> sort(v)
$d
[1] 1 1

$a
[1] 2 1

$b
[1] 1 3

$e
[1] 2 3

attr(,"class")
[1] "foo"

To leave a comment for the author, please follow the link and comment on their blog: r – log Fold Change.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.