
On indexing operators and composition


In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a deep understanding of array indexing and of the interpretation that becomes available once we realize indexing is an instance of function composition (or an example of permutation groups or semigroups: some very deep yet accessible pure mathematics).


A permutation of indices

In this article I will be working hard to convince you a very fundamental true statement is in fact true: array indexing is associative; and to simultaneously convince you that you should still consider this amazing (as it is a very strong claim with very many consequences). Array indexing respecting associative transformations should not be a priori intuitive to the general programmer, as array indexing code is rarely re-factored or transformed, so programmers tend to have little experience with the effect. Consider this article an exercise to build the experience to make this statement a posteriori obvious, and hence something you are more comfortable using and relying on.

R’s array indexing notation is really powerful, so we will use it for our examples. This is going to be long (because I am trying to slow the exposition down enough to see all the steps and relations) and hard to follow without working examples (say with R), and working through the logic with pencil and a printout (math is not a spectator sport). I can’t keep all the steps in my head without paper, so I don’t really expect readers to keep all the steps in their heads without paper (though I have tried to organize the flow of this article and signal intent often enough to make this readable).

R’s square-bracket array index operator

In R, array or vector indexing is commonly denoted by the square-bracket operator “[]”. For example, if we have an array of values we can read them off as follows:

array <- c('a', 'b', 'x', 'y')
print(array[1])
# [1] "a"

print(array[2])
# [1] "b"

print(array[3])
# [1] "x"

print(array[4])
# [1] "y"

A cool thing about R’s array indexing operator is: you can pass in arrays or vectors of indices and get many results back at the same time:

print(array[c(2,3)])
# [1] "b" "x"

You can even use this notation on the left-hand side (LHS) during assignment:

array[c(2,3)] <- 'zzz'
print(array)
# [1] "a"   "zzz" "zzz" "y"  

This ability to address any number of elements is the real power of R’s array operator. However, if you know you only want one value I strongly suggest always using R’s double-square-bracket operator “[[]]”, which confirms you are selecting exactly one element and is also the correct operator when dealing with lists.
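For example (a small illustration, using the array vector from above, which after the assignment is c("a", "zzz", "zzz", "y")):

print(array[[2]])
# [1] "zzz"

# a multi-element index is rejected, which catches a whole class of bugs
tryCatch(
  array[[c(2, 3)]],
  error = function(e) print(paste("caught:", conditionMessage(e)))
)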

Let’s get back to the single-square bracket “[]” and its vectorized behavior.

Application: computing ranks

Let’s use the square bracket to work with ranks.

Consider the following data.frame.

d <- data.frame(x= c('d', 'a', 'b', 'c'),
                origRow= 1:4,
                stringsAsFactors= FALSE)
print(d)

#   x origRow
# 1 d       1
# 2 a       2
# 3 b       3
# 4 c       4

Suppose we want to compute the rank of the x-values. This is easy as R has a built-in rank-calculating function:

print(rank(d$x))
# [1] 4 1 2 3

Roughly (and ignoring control of the treatment of ties) rank calculation can also be accomplished by sorting the data so d$x is ordered, ranking in this trivial configuration (just writing down an increasing sequence), and then returning the data to its original order. We are going to use R’s order() command. This calculates a permutation that puts the data into sorted order (in this article all permutations are represented as arrays of length n containing each of the integers from 1 through n exactly once). order() works as follows:

ord <- order(d$x)
print(ord)
# [1] 2 3 4 1

print(d$x[ord])
# [1] "a" "b" "c" "d"

The rank calculation written in terms of order() then looks like the following:

d2 <- d[ord, ]
d2$rankX <- 1:nrow(d2)
d3 <- d2[order(d2$origRow), ]
print(d3$rankX)
# [1] 4 1 2 3

And we again have the rankings.

Returning to the original order

Of particular interest are the many ways we can return d2 to the original d$origRow order. My absolute favorite way is indexing the left-hand side as in:

d4 <- d2 # scratch frame to ready for indexed assignment
d4[ord, ] <- d2 # invert by assignment
print(d4)

#   x origRow rankX
# 2 d       1     4
# 3 a       2     1
# 4 b       3     2
# 1 c       4     3

The idea is d2 <- d[ord, ] applies the permutation represented by ord and d4[ord, ] <- d2 undoes the permutation represented by ord. The notation is so powerful it almost looks like declarative programming (and a lot like the explicit fixed-point operators we were able to write in R here).

Let’s see that again:

print(ord)
# [1] 2 3 4 1

invOrd <- numeric(length(ord)) # empty vector to ready for indexed assignment
invOrd[ord] <- 1:length(ord)   # invert by assignment

print(invOrd[ord])
# [1] 1 2 3 4

print(invOrd)
# [1] 4 1 2 3

We used the assignment invOrd[ord] <- 1:length(ord) to “say” we wanted invOrd[ord] to be “1 2 3 4” and it is “1 2 3 4“. This means invOrd looks like an inverse of ord, which is why it can undo the ord permutation. We can get d2 into the correct order by writing d2[invOrd, ].
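As a quick check (using the d2 and invOrd values computed above):

print(d2[invOrd, ])
#   x origRow rankX
# 1 d       1     4
# 2 a       2     1
# 3 b       3     2
# 4 c       4     3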

Why did that indexing work?

To work out why the above transformations are correct we will need a couple of transform rules (both to be established later!):

  1. Indexing is associative: (x[a])[b] == x[a[b]] for any vector x and index vectors a and b.
  2. invOrd is a two-sided inverse of ord: ord[invOrd] == invOrd[ord] == 1:n.

The above follow from the fact that compositions of permutations (here seen as compositions of indexing) form a group, similar to function composition, but let’s work up to that.

Here is our reasoning to show d4 has its rows back in the original order of those of d:

  1. d2$origRow == (1:n)[ord] (as d2 == d[ord, ]), so d2$origRow == ord (as (1:n)[ord] == ord).
  2. d4$origRow[ord] == d2$origRow (by the left hand side assignment), so d4$origRow[ord] == ord (by the last step).

    If we could just cancel off the pesky “ord” from both sides of this equation we would be done. That is in fact how we continue, bringing in rules that justify the cancellation.

  3. (d4$origRow[ord])[invOrd] == d4$origRow[ord[invOrd]] (by associativity, which we will prove later!). So we can convert (d4$origRow[ord])[invOrd] == ord[invOrd] (derivable from the last step) into d4$origRow[ord[invOrd]] == ord[invOrd].
  4. Substituting in ord[invOrd] == 1:n (the other big thing we will show is: ord[invOrd] == invOrd[ord] == 1:n) yields d4$origRow == 1:n. This demonstrates d4‘s rows must be back in the right order (that which we were trying to show).
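These steps can be checked numerically with the objects computed above:

print(ord[invOrd])
# [1] 1 2 3 4

print(d4$origRow)
# [1] 1 2 3 4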

So all that remains is to discuss associativity, and also show why invOrd[ord] == 1:n (which was established by the assignment invOrd[ord] <- 1:length(ord)) implies ord[invOrd] == 1:n (what we actually used in our argument).

Some operator notation

To make later steps easier, let’s introduce some R-operator notation.
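Here is a minimal sketch of the two operators used below: the indexing-composition operator %[]% (so a %[]% b stands for a[b]) and a function-composition operator %.% (taken here as f %.% g meaning the function x -> f(g(x))). The pipe-like operators %+%, %|>%, and magrittr’s %>% mentioned later appear only for comparison.

# composition by indexing: a %[]% b means a[b]
`%[]%` <- function(a, b) { a[b] }

# function composition: (f %.% g)(x) means f(g(x))
`%.%` <- function(f, g) { function(x) { f(g(x)) } }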

With %[]% our claim about inverses is written as follows. For permutations of 1:n: invOrd %[]% ord == 1:n implies ord %[]% invOrd == 1:n. (Note we do not have a %[]% b == b %[]% a in general, as shown by a <- c(2,1,3); b <- c(1,3,2).)
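For instance (using the %[]% sketch above):

a <- c(2,1,3)
b <- c(1,3,2)

print(a %[]% b)
# [1] 2 3 1

print(b %[]% a)
# [1] 3 1 2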

Semigroup axioms

In operator notation we claim the following are true (not all of which we will confirm here). Call an element “a permutation” if it is an array of length n containing each integer from 1 through n exactly once. Call an element “a sequenced selection” if it is an array of length n containing only integers from 1 through n (possibly with repetitions). n will be held at a single value throughout. All permutations are sequenced selections.

Permutations and sequenced selections can be confirmed to obey the following axioms:

  1. For all a, b sequenced selections we have a %[]% b is itself a sequenced selection on 1:n. If both a, b are permutations then a %[]% b is also a permutation.
  2. (this is the big one) For all a, b, c sequenced selections we have: ( a %[]% b ) %[]% c == a %[]% ( b %[]% c ).
  3. For all a that are sequenced selections we have: (1:n) %[]% a == a and a %[]% (1:n) == a. We call 1:n the multiplicative identity and it is often denoted as e <- 1:n.
  4. For all “a” a permutation there exists La, Ra permutations of 1:n such that La %[]% a == 1:n and a %[]% Ra == 1:n. Also, we can derive La == Ra from the earlier axioms.

The above are essentially the axioms defining a mathematical object called a semigroup (for the sequenced selections) or group (for the permutations).


A famous realization of a permutation group: Rubik’s Cube.

Checking our operator “%[]%” obeys axioms 1 and 3 is fairly mechanical. Confirming axiom 4 roughly follows from the fact you can sort arrays (i.e. that order() works). Axiom 2 (associativity) is the amazing one. A lot of the power of groups comes from the associativity and a lot of the math of things like Monads is just heroic work trying to retain useful semigroup-like properties.
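A quick spot check along these lines (again using the %[]% sketch; p, q, and r are random permutations of 1:n drawn with sample()):

n <- 4
e <- 1:n
p <- sample(n)
q <- sample(n)
r <- sample(n)

# axiom 2: associativity
print(all( (p %[]% q) %[]% r == p %[]% (q %[]% r) ))
# [1] TRUE

# axiom 3: 1:n acts as the identity
print(all( e %[]% p == p ) && all( p %[]% e == p ))
# [1] TRUE

# axiom 4: order() supplies a right inverse
Rp <- order(p)
print(all( p %[]% Rp == e ))
# [1] TRUE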


Nim: an associative irreversible state space.

A bit more on inverses

Suppose we had confirmed all of the above axioms, then our remaining job would be to confirm that invOrd[ord] == 1:n implies ord[invOrd] == 1:n. This is easy in the %[]% notation.

The argument is as follows:

  1. Our assignment established invOrd %[]% ord == 1:n.
  2. By axiom 4 there is a permutation Rord with ord %[]% Rord == 1:n.
  3. Using axioms 3 and 2: invOrd == invOrd %[]% (1:n) == invOrd %[]% ( ord %[]% Rord ) == ( invOrd %[]% ord ) %[]% Rord == (1:n) %[]% Rord == Rord.
  4. Therefore ord %[]% invOrd == ord %[]% Rord == 1:n, which is what we needed.

Note:

An important property of %[]% and %.% is that they are fully associative over their values (permutations and functions respectively). We can safely re-parenthesize them (causing different execution order and different intermediate results) without changing outcomes. This is in contrast to %+%, %|>%, and %>% which are only partially associative (yet pick up most of their power from properly managing what associativity they do have).

%>% works well because it associates in the same direction as R’s parser. We don’t write parentheses in “-4 %>% abs %>% sqrt” because in this case it is unambiguous that the parentheses are implicitly “( -4 %>% abs ) %>% sqrt” (by R’s left-associative parsing of user operators), as “-4 %>% ( abs %>% sqrt )” would throw an exception (so that could not be how R interpreted “-4 %>% abs %>% sqrt”, which did not throw an exception). So it isn’t that both associations are equal (they are not), it is that only one of them is “well formed” and that one happens to be the way R’s parser works. It isn’t that R’s parser is magically looking ahead to solve this, it is just that the conventions match.
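For example (assuming the magrittr package is attached to supply %>%):

library(magrittr)

print(-4 %>% abs %>% sqrt)
# [1] 2

# the right-associated grouping is not well formed and throws
tryCatch(
  -4 %>% ( abs %>% sqrt ),
  error = function(e) print("right-associated form threw an exception")
)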

%[]% is also neat in that values have a nice interpretation as functions over values. All the other operators are either more about functions (%.%) or more about values (%+%, %|>%, and %>%).

The amazing emergence of full associativity

Now consider the following two (somewhat complicated) valid R expressions involving permutations a, b, and c:

  1. a[b[c]]: which means calculate x <- b[c] and then calculate a[x].
  2. (a[b])[c]: which means calculate y <- a[b] and then calculate y[c].

Consider this as a possible bar bet with programming friends: can they find vectors a, b, and c that are permutations of 1:n (or even just length-n vectors consisting of any values taken from the integers 1:n) where the above two calculations disagree?

For example we could try the following:

n <- 4
a <- c(1,3,2,4)
b <- c(4,3,2,1)
c <- c(2,3,4,1)
x <- b[c]

print(a[x])

# [1] 2 3 1 4

y <- a[b]

print(y[c])
# [1] 2 3 1 4

# whoa, they are equal

It is an amazing fact that, for the types of values we are discussing, we always have:

     a[b[c]] == (a[b])[c]

This is what we claim when we claim:

     a %[]% ( b %[]% c )  ==  ( a %[]% b ) %[]% c

(i.e., when we claim associativity).

The above means we can omit parentheses and unambiguously write “a %[]% b %[]% c”, as both common ways of inserting the parentheses yield equivalent values (though they specify different execution orders and entail different intermediate results).
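Using the a, b, and c from the example above (and the %[]% sketch):

print(a %[]% b %[]% c)
# [1] 2 3 1 4

print(all( a %[]% b %[]% c == a[b[c]] ))
# [1] TRUE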

Nested Indexing is an Instance of Function Composition

We can confirm associativity by working through all the details of array indexing (which would be a slog), or we can (as we will here) confirm it by an appeal to algebra. Either way the above claim is true, but it is sufficiently subtle that you certainly will not believe it without some more experience (which we will try to supply here). Associativity of indexing is unintuitive mostly because it is unfamiliar; one rarely sees code re-factored based on associativity.

As we mentioned above, each different association or parenthesization specifies a different calculation, with different intermediate values, but both result in the same value.

The proof is as follows. Consider each sequenced selection a, b, c as a function that maps the integers 1:n to the integers 1:n (with a(i) defined as a[i], and similarly for b and c). Then (a %[]% b)[i] == a[b[i]] == a(b(i)), so sequenced selections composed with the operator %[]% behave exactly as functions composed under %.%. Function composition is famously fully associative, therefore (by the parallel behavior between %.% and %[]%) we know %[]% is fully associative.
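This correspondence can also be seen directly in code (as_fn below is a hypothetical helper that converts a sequenced selection into the function it represents; %[]% and %.% as sketched earlier, and a and b as in the example above):

# convert a sequenced selection v into the function i -> v[i]
as_fn <- function(v) { function(i) { v[i] } }

# composing the functions matches composing by indexing
f <- as_fn(a) %.% as_fn(b)   # the function i -> a[b[i]]
print(sapply(1:4, f))
# [1] 4 2 3 1

print(a %[]% b)
# [1] 4 2 3 1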

Function composition being fully associative usually comes as a surprise to non-mathematicians; I doubt most users of math regularly think about it, which is part of why full associativity can be hard to grasp at first.

Some nice writing on proving function composition is associative can be found here: “Is composition of functions associative?”. Function composition is a very learnable concept (and well worth thinking about). Do not worry if it takes you a long time to get comfortable with it. Nobody understands it quickly the first time (though it is a very sensible topic deeply understood by very many mathematical teachers and writers).

Conclusion

In this writeup we had the rare pleasure of showing two different implementations of a concept (nested indexing) are equivalent. In programming there are very few operations that are so regular and interchangeable. This is why I advocate design choices that preserve referential transparency (the principle that you can safely substitute values for variables, which is one of the few things that lets us reason about programs) to keep as many of these opportunities as practical.

At this point I hope you find the vectorized square-bracket as nifty as I do. It allows some very succinct expressions of powerful sorting and permutation steps. The “find the inverse by putting the square-bracket on the left side” trick is one of my favorites, and is actually quite useful in arranging data for analysis (especially ordered data such as time series, or when you need to work with ranks or quantiles). It always seems “a bit magic.” It really is a bit magic, but it is also formally correct and reliable.
