Site icon R-bloggers

How to use vectorization to streamline simulations

[This article was first published on Cartesian Faith » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

While grading some homework it became apparent that many of the idioms of R are not widely known and aren’t particularly intuitive to newcomers. Two key features of R (and why I like the language so much) are vectorization and higher order functions. These features overlap with functional programming and form a powerful toolkit to implement models and run simulations with clear and concise code. A student generously gave permission to post his homework here so that I can use it as the basis for the discussion.

The starting point is the first question from Introduction to Probability written by Grinstead & Snell. The question is based on a program freely available but not in R. Instead we’ll see how easy it is to implement a simulation of this sort in R.

Modify the program CoinTosses to toss a coin n times and print out after every 100 tosses the proportion of heads minus 1/2. Do these numbers appear to approach 0 as n increases? Modify the program again to print out, every 100 times, both of the following quantities: the proportion of heads minus 1/2, and the number of heads minus half the number of tosses. Do these numbers appear to approach 0 as n increases?

Most people begin these sorts of problems by thinking about an algorithm that loops n times. For a given step some sort of calculation is performed. Many of my students took this approach. The below snippet is a slightly edited version of what one of my students submitted.

coin_toss <- function(n) {
  result <- c()
  for(i in c(1:n)) {
    ## the first flip we just assign the toss result to tosses
    if(i == 1){
      ## the optional outputs are 0 and 1.  I am assigning 1 to heads
      tosses <- sample(c(0,1),1)
    }
    else{
      ## creating a vector that has history of all tosses
      tosses <- c(tosses,sample(c(0,1),1))
    }

    ## when we reach a toss number that a multiple of 100 we output the status
    if(i %% 100 == 0){

      ## output the percent of heads away from 50%
      percent <- (sum(tosses) / length(tosses)) - 0.5

      ## output the number of heads away from half of all tosses
      number <- sum(tosses) - (length(tosses) / 2)
      result <- rbind(result, c(percent, number))
    }
  }
  result
}

Now this function does exactly what the question asks, but it’s a lot of work to get there. It’s also error prone and difficult to debug. As I edited the function to return a matrix of the results instead of printing them out, I inserted result between the wrong set of braces. Part of the reason is that the question is posed in a way that assumes a procedural algorithm. This is fine in other languages but in R it is much simpler to take advantage of the R idioms to get there a lot faster. With this approach there is no need for loops or conditional branches. There is also no need for iterative array construction. Instead everything is done in one shot using a set-theoretic approach combined with function transformations.

coin_toss <- function(n, step=100) {
  # Record number of heads at each step
  tosses <- cumsum(sample(c(0,1), n, replace=TRUE))
  # Define steps for summaries
  steps <- seq(step,n, by=step)
  # Compute summaries
  cbind(tosses[steps] / steps - .5, tosses[steps] - steps/2)
}

This new function accomplishes the same thing as the original but is much more compact. The key is that in this implementation we are leveraging the vectorization built into R.

Vectorization

Coming from an imperative (explicit operations a la a Turing machine) background, the concept of vectorization may appear alien at first. It is actually quite natural to think this way and begins by recognizing that all primitives in R are vectors. Hence a scalar is really a vector of length 1. A vector can thus represent a complete domain such that a function operates on the domain and generates the corresponding range as a single call. Hence there is no need to initialize and update variables since a vectorized function will perform the operation over every element of a vector operand in a single shot. Vectorization indicates that a function operates on a vector and not a scalar value.

Consider the arithmetic operators, which are all vectorized in R. In other languages, adding two vectors x and y together requires some sort of loop or iterative process (like tail recursion). The uninitiated will do this in R as well. Here is the same pattern as in the original coin_toss function with a simplification around the initialization of z.

x <- 1:5
y <- 6:10
z <- NULL
for (i in c(1:4)) {
  z <- c(z, x[i] + y[i])
}

The idiomatically consistent solution is to take advantage of the vectorization of the arithmetic operators and do the summation all at once.

x <- 1:5
y <- 6:10
z <- x + y

Vectorization is not limited to the arithmetic operators. The boolean operators, in addition to most of the base functions (e.g. cumsum, grep, cat, etc.) are vectorized.

Higher-Order Functions

This concept of vectorization actually ties back to functional programming and the concept of higher-order functions. The vectorization I described above is related to the map function, which takes a vector and a scalar function as arguments and applies each element of the vector to the function, returning a vector back. Our arithmetic operators are doing exactly this with all the details hidden in the implementation.

There is another type of vectorization, which is tied to the concept of fold (also called reduce). In this construction, a vector is aggregated into a single value. The most common example of this is the summation operator in mathematics, where a sequence is transformed into a scalar. The corresponding function in R looks like sum(x). Of course this operation can be generalized and used in different contexts.

Many operations preserve vectorization if constructed with care. This is what I’m doing in the last line of the function. The steps variable is a set of indices that I want to use for my summary. Since the indexing operator is similarly vectorized, I can select a subset of my vector and then perform vectorized arithmetic. The cbind function tells R to construct a compatible data structure by stacking the arguments together as column vectors. Hence it’s a quick way to construct a matrix and is equivalent to the mathematical notation that makes a matrix from two column vectors.

Now what happens if you create a custom function that isn’t vectorized? In this situation we need to explicitly use a map-like function to provide the necessary wiring. In R the map concept has many cousins named *apply, due to the many data types and structures available. The below example shows an equivalent construction using the higher-order function sapply. This function takes a vector input and passes each element to the provided function. Since I’m returning a vector, sapply will return the output as a matrix.

t(sapply(steps, function(x) c(tosses[x] / x - .5, tosses[x] - x/2)))

Fully leveraging vectorization and higher-order functions will greatly improve the readability of your code. Many simulations will become one- or two-liners. It does take some time to think of vectors as sets and functions as operating on a complete domain, but the rewards are legion. Once you embrace these idioms, you’ll discover that it is easier to implement your models because there is less time translating mathematical operations to imperative algorithms. I discuss these concepts in more detail, including understanding the properties of vectorization in my forthcoming book.


To leave a comment for the author, please follow the link and comment on their blog: Cartesian Faith » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.