Site icon R-bloggers

Now You’re Thinking with Arrays

[This article was first published on rstats on Irregularly Scheduled Programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I keep hearing the assertion that “writing APL/Haskell/etc… makes you think differently” and I kept wondering why I agreed with the statement but at the same time didn’t think too much of it. I believe I’ve figured out that it’s because I happened to have been using Array-aware languages this whole time! It turns out R is an even better language for beginners than I thought.

Let’s start with some basics. A “scalar” value is just a number by itself. That might have some units that may or may not be represented well in what you’re doing, but it’s a single value on its own, like 42. A “vector” in R is just a collection of these “scalar” values and is constructed with the c() operator

c(3, 4, 5, 6)
## [1] 3 4 5 6

Going right back to basics, the [1] output at the start of the line indicates the index of the element directly to its right, in this case the first element. If we had more elements, then the newline starts with the index of the first element on that line. Here I’ve set the line width smaller than usual so that it wraps sooner

1:42
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
## [13] 13 14 15 16 17 18 19 20 21 22 23 24
## [25] 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42

The quirk of how R works with vectors is the there aren’t actually any scalar values – if you try to create a vector with only a single element, it’s still a vector

x <- c(42)
x
## [1] 42
is.vector(x)
## [1] TRUE

(note the [1] indicating the first index of the vector x and the vector TRUE). Even if you don’t try to make it a vector, it still is one

x <- 42
x
## [1] 42
is.vector(x)
## [1] TRUE

Mike Mahoney has a great post detailing the term “vector” and how it relates to an R vector as well as the more mathematical definition which involves constructing an “arrow” in some space so that you describe both “magnitude” and “direction” at the same time.

“Direction and magnitude”

So, we always have a vector if we have a 1-dimensional collection of data. But wait, you say, there’s also list!

x <- list(a = 42)
x
## $a
## [1] 42
is.vector(x)
## [1] TRUE

A nice try, but lists are also vectors, it’s just that they’re “recursive”

is.recursive(x)
## [1] TRUE
is.recursive(c(42))
## [1] FALSE

Fine, what about a matrix, then?

x <- matrix(1:9, nrow = 3, ncol = 3)
x
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
is.vector(x)
## [1] FALSE

No, but that still makes sense – a matrix isn’t a vector. It is however, an “array” – the naming convention in R is a bit messy because, while “matrix” and “array” are often the same thing, as the dimensions increase, more things expect an “array” class, so R tags a “matrix” with both

class(matrix())
## [1] "matrix" "array"

This recently surprised Josiah Parry leading to this post explaining some of the internal inconsistencies of this class of object.

Now that we have vectors figured out, I can get to the point of this post – that thinking about data with a “vector” or even “array” mindset works differently.

I started learning some APL because I loved some videos by code_report. The person behind those is Conor Hoekstra. I didn’t realise that I’d actually heard a CoRecursive podcast episode interviewing him, so now I need to go back and re-listen to that one. Conor also hosts The Array Cast podcast that I heard him mention in yet another of his podcasts (how do people have the time to make all of these!?!). I was listening to the latest of these; an interview with Rob Pike, one of the co-creators of Go and UTF-8 – it’s a really interesting interview full of history, insights, and a lot of serious name dropping.

Anyway, Rob is describing what it is he really likes about APL and says

“I saw a talk some time ago, I wish I could remember where it was, where somebody said, this is why programming languages are so difficult for people. Let’s say that I have a list of numbers and I want to add seven to every number on that list. And he went through about a dozen languages showing how you create a list of numbers and then add seven to it. Right? And it went on and on and on. And he said,”Wouldn’t it be nice if you could just type 7+ and then write the list of numbers?” And he said, “Well, you know what? There’s one language that actually does that, and that’s APL.” And I think there’s something really profound in that, that there’s no ceremony in APL. If you want to add two numbers together in any language, you can add two numbers together. But if you want to add a matrix and a matrix, or a matrix and a vector, or a vector and a scaler or whatever, there’s no extra ceremony involved. You just write it down.

(link to shownotes)

The talk he mentions is linked in the shownotes as “Stop Writing Dead Programs” by Jack Rusher (Strange Loop 2022) (linked to the relevant timestamp, and which I’m pretty sure I’ve watched before – it’s a great talk!) where Jack shows how to add 1 to a vector of values in a handful of languages. He demonstrates that in some languages there’s lots you need to write that has nothing to do with the problem itself; allocating memory, looping to some length, etc… then leads to demonstrating that the way to do this in APL is

1 + 1 2 3 4

with none of the overhead – just the declaration of what operation should occur.

The excitement with which Rob explains this in the podcast spoke to how important this idea is; that you can work with more than just scalar values in the mathematical sense without having to explain to the language what you mean and write a loop around a vector.

Two questions were buzzing at the back of my mind, though:

  1. Why isn’t this such a revelation to me?

  2. Is this not a common feature?

I know R does work this way because I’m very familiar with it, and perhaps that is the answer to the first question – I know R better than any other language I know, and perhaps I’ve just become accustomed to being able to do things like “add two vectors”.

a <- c(1, 2, 3, 4, 5) # or 1:5
a + 7
## [1]  8  9 10 11 12
Now you’re thinking with portals vectors!

The ideas of “add two vectors” and “add a number to a vector” are one in the same, as discussed above. The ability to do so is called “rank polymorphism” and R has a weak version of it – not everything works for every dimension, but single values, vectors, and matrices do generalise for many functions. I can add a value to every element of a matrix, too

m <- matrix(1:12, nrow = 3, ncol = 4)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
m + 7
##      [,1] [,2] [,3] [,4]
## [1,]    8   11   14   17
## [2,]    9   12   15   18
## [3,]   10   13   16   19

and adding a vector to a matrix repeats the operation over rows

m <- matrix(1, nrow = 3, ncol = 4)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    1    1    1    1
## [3,]    1    1    1    1
v <- c(11, 22, 33)
m + v
##      [,1] [,2] [,3] [,4]
## [1,]   12   12   12   12
## [2,]   23   23   23   23
## [3,]   34   34   34   34

Sidenote: the distinction between “repeat over rows” and “repeat over columns” is also discussed in the Array Cast episode – if you want to know more, there’s “leading axis theory”. R uses column-major order which is why the matrix m filled the sequential values down the first column, and why you need to specify byrow = TRUE if you want to fill the other way. It’s also why m + v repeats over rows, although if you are expecting it to repeat over columns and try to use a v with 4 elements it will (silently) work, recycling the vector v, and giving you something you didn’t expect

v <- c(11, 22, 33, 44)
m + v
##      [,1] [,2] [,3] [,4]
## [1,]   12   45   34   23
## [2,]   23   12   45   34
## [3,]   34   23   12   45

{reticulate} has a really nice explainer of the differences between R (column-major) and python (row-major), and importantly, the interop between these two.

So, is working with arrays actually so uncommon? I first thought of Julia, and since it’s much newer than R and took a lot of inspiration from a variety of languages, perhaps it works

a = [1, 2, 3, 4, 5]
a + 7
ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)
For element-wise addition, use broadcasting with dot syntax: array .+ scalar

Not quite, but the error message is extremely helpful. Julia wants to perform element-wise addition using the broadcasting operator . so it needs to be

a .+ 7
5-element Vector{Int64}:
  8
  9
 10
 11
 12

Still, that’s a “know the language” thing that’s outside of “add a number to a vector”, so no credit.

Well, what about python?

a = [1, 2, 3, 4, 5]
a + 7
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list

The canonical way, I believe, is to use a list comprehension

a = [1, 2, 3, 4, 5]
[i + 7 for i in a]
[8, 9, 10, 11, 12]

and we’re once more using a language feature that’s outside of “add a number to a vector” so again, no credit. For the pedants: there is library support for this if you use numpy

import numpy as np

a = [1, 2, 3, 4, 5]
np.array(a) + 7
array([ 8,  9, 10, 11, 12])

but I wouldn’t call that a success.

I asked ChatGPT what other languages could do this and it suggested MATLAB. Now, that’s a proprietary language I don’t have access to, but octave is an open-source alternative that is more or less the same, so I tried that

a = [1, 2, 3, 4, 5];
a
a =

   1   2   3   4   5
a + 7
ans =

    8    9   10   11   12

Okay, yeah – a win for MATLAB. I remember using MATLAB back at Uni in an early (second year?) Maths (differential equations?) course and it was probably the very first time I was actually introduced to a programming language. IIRC, “everything is a matrix” (which works out okay for engineering and maths use-cases) so this a) probably isn’t surprising that it works, and b) makes sense that it gets lumped in with the “array languages”.

Thinking back to the other programming languages I’ve learned sufficiently, I wondered how Fortran dealt with this – I used Fortran (90) for all of my PhD and postdoc calculations. I loved that Fortran had vectors (and n-dimensional arrays) without having to do any manual memory allocation, and for that reason alone it was well-suited to theoretical physics modeling. I’ve been re-learning some Fortran via Exercism, so I gave that a go

$ cat array.f90
program add_to_array

  implicit none
  integer, dimension(5) :: a

  a = (/1, 2, 3, 4, 5/)
  print *, a + 7

end program add_to_array

Compiling and running this…

$ gfortran -o array array.f90
$ ./array
           8           9          10          11          12

A win! Okay, a little ceremony to declare the vector itself, but that’s strict typing for you.

With these results at hand, I think back to the question

Why isn’t this such a revelation to me?

I learned MATLAB, Fortran, then R, over the course of about a decade, and barely touched other languages with any seriousness while doing so… I’ve been using array languages more or less exclusively all this time.

“You merely learned to use arrays, I was born in them, molded by them”

Better still, they’re all column-major array languages.

I think this is why APL seems to beautiful to me – it does what I know I want and it does it with the least amount of ceremony.

I wrote a bit about this in a previous post – that a language can hide some complexity for you, like the fact that it does need to internally do a loop over some elements in order to add two vectors, but when the language itself provides an interface where you don’t have to worry about that, things get beautiful.

At PyConAU this year there was a keynote “The Complexity of Simplicity” which reminded me a lot of another post “Complexity Has to Live Somewhere”. I think APL really nailed removing a lot of the syntax complexity of a language, leaving mainly just the operations you wish to perform. Haskell does similar but adds back in (albeit, useful) language features that involve syntax.

Of the languages I did learn first, I would have to say that R wins over MATLAB and Fortran in terms of suitability as a first programming language, but now that I recognise that the “array” way of thinking comes along with that, I really do think it has a big advantage over, say, python in terms of shaping that mindset. Sure, if you start out with numpy you may gain that same advantage, but either way I would like to think there’s a lot to be gained from starting with an “array-aware” language.

Did I overlook another language that can work so nicely with arrays? Have you reflected on how you think in terms of arrays and programming in general? Let me know in the comments or on Mastodon.


< details> < summary> devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-08-29
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.5.0   2023-06-09 [3] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.21    2023-05-05 [3] CRAN (R 4.3.0)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.6   2023-08-10 [3] CRAN (R 4.3.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  knitr         1.43    2023-05-25 [3] CRAN (R 4.3.0)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.1.2)
##  rmarkdown     2.23    2023-07-01 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.12  2023-01-11 [3] CRAN (R 4.2.2)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.1.2)
##  xfun          0.40    2023-08-09 [3] CRAN (R 4.3.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────


To leave a comment for the author, please follow the link and comment on their blog: rstats on Irregularly Scheduled Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version