In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a very deep understanding of array indexing and the deep interpretation available when we realize indexing is an instance of function composition (or an example of permutation groups or semigroups: some very deep yet accessible pure mathematics).
In this article I will be working hard to convince you that a very fundamental statement is in fact true: array indexing is associative; and to simultaneously convince you that you should still consider this amazing (as it is a very strong claim with very many consequences). Array indexing respecting associative transformations is not a priori intuitive to the general programmer, as array indexing code is rarely re-factored or transformed, so programmers tend to have little experience with the effect. Consider this article an exercise to build the experience to make this statement a posteriori obvious, and hence something you are more comfortable using and relying on.
R's array indexing notation is really powerful, so we will use it for our examples. This is going to be long (because I am trying to slow the exposition down enough to see all the steps and relations) and hard to follow without working examples (say with R) and without working through the logic with pencil and a printout (math is not a spectator sport). I can't keep all the steps in my head without paper, so I don't really expect readers to keep all the steps in their heads without paper (though I have tried to organize the flow of this article and signal intent often enough to make this readable).
R’s square-bracket array index operator
In R, array or vector indexing is commonly denoted by the square-bracket "[]". For example if we have an array of values we can read them off as follows:
array <- c('a', 'b', 'x', 'y')
print(array[1])
# [1] "a"
print(array[2])
# [1] "b"
print(array[3])
# [1] "x"
print(array[4])
# [1] "y"
A cool thing about R's array indexing operator is: you can pass in arrays or vectors of values and get many results back at the same time:
print(array[c(2,3)])
# [1] "b" "x"
You can even use this notation on the left-hand side (LHS) during assignment:
array[c(2,3)] <- 'zzz'
print(array)
# [1] "a"   "zzz" "zzz" "y"
This ability to address any number of elements is the real power of R's array operator. However, if you know you only want one value I strongly suggest always using R's double-square operator "[[]]", which confirms you are selecting exactly one element and is also the correct operator when dealing with lists.
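For example (a small illustrative sketch; the exact error message from the last line depends on your R version), "[[]]" hands back a single element and refuses a multi-element request:

print(array[[2]])      # exactly one element ("zzz" after the assignment above)
# [1] "zzz"
print(array[2:3])      # single bracket happily returns several elements
# [1] "zzz" "zzz"
tryCatch(array[[2:3]], # double bracket signals an error instead of returning two values
         error = function(e) print("[[ ]] refused to select more than one element"))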
Let's get back to the single-square bracket "[]" and its vectorized behavior.
Application: computing ranks
Let’s use the square bracket to work with ranks.
Consider the following data.frame.
d <- data.frame(x= c('d', 'a', 'b', 'c'),
                origRow= 1:4,
                stringsAsFactors= FALSE)
print(d)
#   x origRow
# 1 d       1
# 2 a       2
# 3 b       3
# 4 c       4
Suppose we want to compute the rank of the x-values. This is easy as R has a built-in rank-calculating function:
print(rank(d$x))
# [1] 4 1 2 3
Roughly (and ignoring controlling treatment of ties) rank calculation can also be accomplished by sorting the data so d$x is ordered, ranking in this trivial configuration (just writing an increasing sequence), and then returning the data to its original order. We are going to use R's order() command. This calculates a permutation such that the data is in sorted order (in this article all permutations are represented as arrays of length n containing each of the integers from 1 through n exactly once). order() works as follows:
ord <- order(d$x)
print(ord)
# [1] 2 3 4 1
print(d$x[ord])
# [1] "a" "b" "c" "d"
The rank calculation written in terms of order() then looks like the following:
d2 <- d[ord, ]
d2$rankX <- 1:nrow(d2)
d3 <- d2[order(d2$origRow), ]
print(d3$rankX)
# [1] 4 1 2 3
And we again have the rankings.
Returning to the original order
Of particular interest are the many ways we can return d2 to the original d$origRow-order. My absolute favorite way is indexing the left-side as in:
d4 <- d2        # scratch frame to ready for indexed assignment
d4[ord, ] <- d2 # invert by assignment
print(d4)
#   x origRow rankX
# 2 d       1     4
# 3 a       2     1
# 4 b       3     2
# 1 c       4     3
The idea is d2 <- d[ord, ] applies the permutation represented by ord and d4[ord, ] <- d2 undoes the permutation represented by ord. The notation is so powerful it almost looks like declarative programming (and a lot like the explicit fixed-point operators we were able to write in R here).
Let’s see that again:
print(ord)
# [1] 2 3 4 1
invOrd <- numeric(length(ord)) # empty vector to ready for indexed assignment
invOrd[ord] <- 1:length(ord)   # invert by assignment
print(invOrd[ord])
# [1] 1 2 3 4
print(invOrd)
# [1] 4 1 2 3
We used the assignment invOrd[ord] <- 1:length(ord) to "say" we wanted invOrd[ord] to be "1 2 3 4", and it is "1 2 3 4". This means invOrd looks like an inverse of ord, which is why it can undo the ord permutation. We can get d2 into the correct order by writing d2[invOrd, ].
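A quick check of that last claim (re-using d2 and invOrd from above; d5 is just a scratch name for the re-ordered copy, and only the row order of the data matters here, not the row names):

d5 <- d2[invOrd, ]  # apply the inverse permutation by indexing on the right
print(d5$origRow)
# [1] 1 2 3 4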
Why did that indexing work?
To work out why the above transformations are correct we will need a couple of transform rules (both to be established later!):
- Associativity of indexing: (a[b])[c] = a[b[c]] (a, b, and c all permutations of 1:n).
- Equivalence of left and right inverses: a[b] == 1:n if and only if b[a] == 1:n (a and b both permutations of 1:n). Note one does not have a[b] == b[a] in general (check with a <- c(2,1,3); b <- c(1,3,2)).
The above follow from the fact that permutations under composition (here seen as composition of indexing) form a group, much as functions do under function composition; but let's work up to that.
Here is our reasoning to show d4 has its rows back in the original order of those of d:
- d2$origRow == (1:n)[ord] (as d2 == d[ord, ]), so d2$origRow == ord (as (1:n)[ord] == ord).
- d4$origRow[ord] == d2$origRow (by the left hand side assignment), so d4$origRow[ord] == ord (by the last step).
- If we could just cancel off the pesky "ord" from both sides of this equation we would be done. That is in fact how we continue, bringing in rules that justify the cancellation.
- (d4$origRow[ord])[invOrd] == d4$origRow[ord[invOrd]] (by associativity, which we will prove later!). So we can convert (d4$origRow[ord])[invOrd] == ord[invOrd] (derivable from the last step) into d4$origRow[ord[invOrd]] == ord[invOrd].
- Substituting in ord[invOrd] == 1:n (the other big thing we will show is: ord[invOrd] == invOrd[ord] == 1:n) yields d4$origRow == 1:n. This demonstrates d4's rows must be back in the right order (which is what we were trying to show).
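Each of these equalities can be checked numerically against the objects we already built (a small sanity-check sketch; n is just nrow(d) here):

n <- nrow(d)
stopifnot(all(d2$origRow == ord))        # step 1: d2's origRow is exactly ord
stopifnot(all(d4$origRow[ord] == ord))   # step 2: consequence of the indexed assignment
stopifnot(all(ord[invOrd] == 1:n))       # the cancellation we rely on
stopifnot(all(d4$origRow == 1:n))        # conclusion: d4's rows are back in original order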
So all that remains is to discuss associativity, and also show why invOrd[ord] == 1:n (which was established by the assignment invOrd[ord] <- 1:length(ord)) implies ord[invOrd] == 1:n (what we actually used in our argument).
Some operator notation
To make later steps easier, let's introduce some R-operator notation. Define:
- An array indexing operator: `%[]%` <- function(a,b) { a[b] }
- A function composition operator: `%.%` <- function(f,g) { function(x) f(g(x)) } (this is pure function composition, and allows us to write abs(sin(1:4)) as (abs %.% sin)(1:4)).
- A function application operator: `%+%` <- function(f,g) { f(g) } (which is only right-associative, as in abs %+% ( sin %+% 1:4 ) == abs(sin(1:4))). This notation is interesting as it moves us towards "point free notation", though to move all the way there we would need a point-free function abstraction operator.
- A reverse application operator such as "magrittr::`%>%`" or the following imitation: `%|>%` <- function(f,g) { g(f) } (which can write abs(sin(1:4)) as 1:4 %|>% sin %|>% abs, the notation being in reference to F#'s as mentioned here). Here we left out the parentheses not because %|>% is fully associative (it is not), but because it is left-associative, which is also the order R decides to evaluate user operators (so 1:4 %|>% sin %|>% abs is just shorthand for ( 1:4 %|>% sin ) %|>% abs).
With %[]% our claim about inverses is written as follows. For permutations on 1:n: ord %[]% invOrd == 1:n implies invOrd %[]% ord == 1:n. (Again we do not have a %[]% b == b %[]% a in general, as shown by a <- c(2,1,3); b <- c(1,3,2).)
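That non-commutativity example is worth running once (a quick sketch, assuming %[]% as defined above):

a <- c(2,1,3)
b <- c(1,3,2)
print(a %[]% b)   # a[b]
# [1] 2 3 1
print(b %[]% a)   # b[a] -- a different result, so %[]% is not commutative
# [1] 3 1 2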
Semigroup axioms
In operator notation we claim the following are true (not all of which we will confirm here). Call an element "a permutation" if it is an array of length n containing each integer from 1 through n exactly once. Call an element "a sequenced selection" if it is an array of length n containing only integers from 1 through n (possibly with repetitions). n will be held at a single value throughout. All permutations are sequenced selections.
Permutations and sequenced selections can be confirmed to obey the following axioms:
- For all a, b sequenced selections we have a %[]% b is itself a sequenced selection. If both a, b are permutations then a %[]% b is also a permutation.
- (this is the big one) For all a, b, c sequenced selections we have: ( a %[]% b ) %[]% c == a %[]% ( b %[]% c ).
- For all a that are sequenced selections we have: (1:n) %[]% a == a and a %[]% (1:n) == a. We call 1:n the multiplicative identity and it is often denoted as e <- 1:n.
- For all "a" a permutation there exist La, Ra permutations of 1:n such that La %[]% a == 1:n and a %[]% Ra == 1:n. Also, we can derive La == Ra from the earlier axioms.
The above are essentially the axioms defining a mathematical object called a semigroup (for the sequenced selections) or group (for the permutations).
Checking our operator "%[]%" obeys axioms 1 and 3 is fairly mechanical. Confirming axiom 4 roughly follows from the fact you can sort arrays (i.e. that order() works). Axiom 2 (associativity) is the amazing one. A lot of the power of groups comes from the associativity, and a lot of the math of things like Monads is just heroic work trying to retain useful semigroup-like properties.
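The axioms are easy to spot-check numerically. Here is a small randomized sketch (a check on examples, not a proof) that tries axioms 2, 3, and 4 on randomly generated permutations and sequenced selections, using order() to produce the inverse required by axiom 4:

set.seed(2525)   # an arbitrary seed, just for reproducibility
n <- 5
e <- 1:n         # the identity
for(rep in 1:100) {
  a <- sample.int(n, n)                   # a random permutation
  b <- sample.int(n, n)                   # another random permutation
  c <- sample.int(n, n, replace = TRUE)   # a random sequenced selection
  stopifnot(all(( a %[]% b ) %[]% c == a %[]% ( b %[]% c )))   # axiom 2 (associativity)
  stopifnot(all(e %[]% a == a), all(a %[]% e == a))            # axiom 3 (identity)
  La <- order(a)                          # sorting gives an inverse of the permutation a
  stopifnot(all(La %[]% a == e), all(a %[]% La == e))          # axiom 4 (inverses)
}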
A bit more on inverses
Suppose we had confirmed all of the above axioms; then our remaining job would be to confirm that invOrd[ord] == 1:n implies ord[invOrd] == 1:n. This is easy in the %[]% notation.
The argument is as follows:
- By axiom 4 we know there exist L and R such that L %[]% ord == 1:n and ord %[]% R == 1:n. In fact we have already found L == invOrd. So if we can show L == R we are done.
- Expand L %[]% ord %[]% R two ways using axiom 2 (associativity):
  L %[]% ord %[]% R == ( L %[]% ord ) %[]% R == (1:n) %[]% R == R
  L %[]% ord %[]% R == L %[]% ( ord %[]% R ) == L %[]% (1:n) == L
- So: L == R == invOrd as required.
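For our running example we can see the two inverses coincide (a quick sketch re-using ord and invOrd; order(ord) hands us a right inverse because it sorts ord):

L <- invOrd
R <- order(ord)   # a right inverse of ord, found by sorting
print(ord %[]% R)
# [1] 1 2 3 4
print(all(L == R))
# [1] TRUE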
Note: An important property of %[]% and %.% is that they are fully associative over their values (permutations and functions respectively). We can safely re-parenthesize them (causing different execution order and different intermediate results) without changing outcomes. This is in contrast to %+%, %|>%, and %>%, which are only partially associative (yet pick up most of their power from properly managing what associativity they do have).
%>% works well because it associates in the same direction as R's parser. We don't write parentheses in "-4 %>% abs %>% sqrt" because in this case it is unambiguous that the parentheses are implicitly "( -4 %>% abs ) %>% sqrt" (by R's left-associative user-operator parsing), as "-4 %>% ( abs %>% sqrt )" would throw an exception (so, post hoc ergo propter hoc, that could not be how R interpreted "-4 %>% abs %>% sqrt", as the latter did not throw an exception). So it isn't that both associations are equal (they are not); it is that only one of them is "well formed", and that one happens to be the way R's parser works. It isn't that R's parser is magically looking ahead to solve this; it is just that the conventions match.
%[]% is also neat in that its values have a nice interpretation as functions over values. All the other operators are either more about functions (%.%) or more about values (%+%, %|>%, and %>%).
The amazing emergence of full associativity
Now consider the following two (somewhat complicated) valid R expressions involving permutations a, b, and c:
- a[b[c]]: which means calculate x <- b[c] and then calculate a[x].
- (a[b])[c]: which means calculate y <- a[b] and then calculate y[c].
Consider this as a possible bar bet with programming friends: can they find vectors a, b, and c that are permutations of 1:n (or even just length-n vectors consisting of any combination of values taken from the integers 1:n) where the above two calculations disagree?

For example we could try the following:
n <- 4
a <- c(1,3,2,4)
b <- c(4,3,2,1)
c <- c(2,3,4,1)
x <- b[c]
print(a[x])
# [1] 2 3 1 4
y <- a[b]
print(y[c])
# [1] 2 3 1 4
# whoa, they are equal
It is an amazing fact that for the types of values we are discussing we always have:

a[b[c]] == (a[b])[c]
This is what we claim when we claim:
a %[]% ( b %[]% c ) == ( a %[]% b ) %[]% c
(i.e., when we claim associativity).
The above means we can neglect parentheses and unambiguously write "a %[]% b %[]% c", as both common ways of inserting the parentheses yield equivalent values (though they specify different execution orders and entail different intermediate results).
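Using the a, b, and c from the example above (a quick check; R itself parses the un-parenthesized form left to right):

print(a %[]% b %[]% c)    # R parses this as ( a %[]% b ) %[]% c
# [1] 2 3 1 4
print(all(a %[]% b %[]% c == a %[]% ( b %[]% c )))
# [1] TRUE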
Nested Indexing is an Instance of Function Composition
We can confirm associativity by working through all the details of array indexing (which would be a slog), or we can (as we will here) confirm it by an appeal to algebra. Either way the above claim is true, but it is sufficiently subtle that you certainly will not believe it without some more experience (which we will try to supply here). Associativity of indexing is unintuitive mostly because it is unfamiliar; one rarely sees code re-factored based on associativity.
As we mentioned above, each different association or parenthesization specifies a different calculation with different intermediate values, but both result in the same final value.
The proof is as follows. Consider each sequenced selection a, b, c as a function that maps the integers 1:n to the integers 1:n (with a(i) defined as a[i], and similarly for b and c). Some inspection shows sequenced selections composed with the operator %[]% must behave just as functions composed under %.%. Function composition is famously fully associative, therefore (by the parallel behavior between %.% and %[]%) we know %[]% is fully associative.
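A small sketch to make that correspondence concrete. Here as_fun is a hypothetical helper name (not from the original article) that wraps a sequenced selection as the function it represents; %.% and %[]% are the operators defined earlier:

as_fun <- function(v) { function(i) v[i] }   # view a sequenced selection as a function on 1:n

a <- c(1,3,2,4)
b <- c(4,3,2,1)
c <- c(2,3,4,1)

composed <- as_fun(a) %.% as_fun(b) %.% as_fun(c)   # compose the three functions
print(composed(1:4))        # apply the composed function to every index
# [1] 2 3 1 4
print(a %[]% b %[]% c)      # composing the indexings gives the same result
# [1] 2 3 1 4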
Function composition being fully associative usually comes as a surprise to non-mathematicians. I doubt most users of math regularly think about this. Here is why function composition being fully associative is hard to grasp.
- Function composition notation in standard mathematics (pre lambda-calculus) is actually written informally (even in "formal" proofs) and often confuses the reader with notation that seems to imply the function composition operator is "`%+%` <- function(f,g) { f(g) }" (which is the application operator, not the composition operator). The function composition operator is the more ungainly "`%.%` <- function(f,g) { function(x) f(g(x)) }".
- Function composition being fully associative is a very strong statement. So strong one should not accept it without proof and working examples. So in that sense it is not intuitive.
- Function composition being associative follows very quickly from the proper axioms. In particular associativity is proven by simply substituting in a value for a variable (more on this later). This is by design: axioms are adjusted until fundamental concepts are near them. In fact Gian-Carlo Rota wrote "Great definitions are what mathematics contributes to the world" (Indiscrete Thoughts, p. 46). However, "by definition" proofs tend to be so short and move so fast that they look like mere notational tricks (and are thus hard to internalize).
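To see how short the substitution proof is, here is the expansion written with our %.% operator, followed by a numeric spot check (the comments carry the derivation; each step just substitutes the definition of %.%):

# ((f %.% g) %.% h)(x)
#   == (f %.% g)(h(x))    # substitute the definition of %.% for the outer composition
#   == f(g(h(x)))         # substitute the definition of %.% for the inner composition
#
# (f %.% (g %.% h))(x)
#   == f((g %.% h)(x))    # substitute the definition of %.% for the outer composition
#   == f(g(h(x)))         # substitute the definition of %.% for the inner composition
#
# Both parenthesizations expand to the same expression, so %.% is associative.

f <- abs; g <- sin; h <- sqrt
x <- 1:4
print(all(((f %.% g) %.% h)(x) == (f %.% (g %.% h))(x)))
# [1] TRUE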
Some nice writing on proving function composition is associative can be found here: "Is composition of functions associative?". Function composition is a very learnable concept (and well worth thinking about). Do not worry if it takes you a long time to get comfortable with it. Nobody understands it quickly the first time (though it is a very sensible topic deeply understood by very many mathematical teachers and writers).
Conclusion
In this writeup we had the rare pleasure of showing two different implementations of a concept (nested indexing) are equivalent. In programming there are very few operations that are so regular and interchangeable. This is why I advocate design choices that preserve referential transparency (the statement that you can safely substitute values for variables, which is one of the few things that lets us reason about programs) to keep as many of these opportunities as practical.
At this point I hope you find the vectorized square-bracket as nifty as I do. It allows some very succinct expressions of powerful sorting and permutation steps. The "find the inverse by putting the square-bracket on the left side" is one of my favorite coding tricks, and it is actually quite useful in arranging data for analysis (especially ordered data such as time series, or when you need to work with ranks or quantiles). It always seems "a bit magic." It really is a bit magic, but it is also formally correct and reliable.