
The quest for fast(er?) row-oriented workflows


Over the past few weeks I have been exploring the speed of R. It started with this video, in which I explained that R is not necessarily slower than any other interpreted language, as long as you’re using the built-in, optimized functions. However, should you write your own implementation of an algorithm, especially one that requires one (or more…) loops, it will run slowly. As I’ve also mentioned in two other videos (here and here), there are many ways to avoid loops, and you should do so whenever possible.

To continue exploring this in more detail, I’ve written two very basic implementations of a genetic algorithm. The first version uses only {tidyverse} functions and the other only base R functions. My intuition was that base would be faster, but that the code would likely be less “readable” (I discuss this code in greater detail in a series of videos; you can watch part 1 and part 2 if you’re interested in the nitty-gritty details). Code readability is quite subjective, but I think there are some general “truths” regarding it, namely that it often seems to be the case that fast code is “less” readable, and vice versa. This blog post explores this trade-off in the context of row-oriented workflows.

Once I was done writing the two versions of the genetic algorithm for the video (a {tidyverse} one and a base one), I profiled the code and realised that, yes, base was much, much faster, but also that the reason the {tidyverse} version was running so slowly came down to one single row-based operation. Trying to replace this row-based operation while remaining inside the {tidyverse} made for an interesting challenge. I will explain what I did in this blog post, so let’s first set up the example:

library(tidyverse)
library(brotools)

Let’s first generate some data. For this, I’m going to use a function I wrote for my genetic algorithm. I won’t explain how it works, so if you’re curious, you can watch the videos I mention in the introduction where this is all explained in detail:

init_pop <- function(objective_function, pop_size = 100, upper_bound = 1, lower_bound = 0){

  # Evaluate the default value of the objective function's first
  # argument to find out how many parameters there are
  parameters <- formals(objective_function)[[1]] %>%
    eval

  # Draw pop_size uniform values for each parameter,
  # giving one column per parameter
  purrr::rerun(length(parameters), runif(n = pop_size,
                                         min = lower_bound,
                                         max = upper_bound)) %>%
    dplyr::bind_cols() %>%
    janitor::clean_names()

}

This function takes another function, the objective function to be optimized, as an argument, checks how many parameters this objective function has, and generates a population of random solutions (if you don’t understand what this all means, don’t worry: what matters is that this generates a random dataset with as many columns as the objective function has arguments).
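
As an aside, purrr::rerun() has since been deprecated in more recent versions of {purrr}. A minimal sketch of an equivalent using map() (the name init_pop_map is mine, for illustration) could look like this:

# Sketch: same idea as init_pop(), but with map() over an index
# instead of the deprecated rerun()
init_pop_map <- function(objective_function, pop_size = 100, upper_bound = 1, lower_bound = 0){

  parameters <- formals(objective_function)[[1]] %>%
    eval

  purrr::map(seq_along(parameters), ~runif(n = pop_size,
                                           min = lower_bound,
                                           max = upper_bound)) %>%
    dplyr::bind_cols() %>%
    janitor::clean_names()

}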

The next function is my objective function:

my_function <- function(x = c(0, 0, 0, 0, 0, 0)){
  x1 <- x[1]
  x2 <- x[2]
  x3 <- x[3]
  x4 <- x[4]
  x5 <- x[5]
  x6 <- x[6]

  -(x1**2 + x2 - 11)**2 - (x1 + x2**2 - 7)**2 - (x3**3 + x4 - 11)**2 - (x5 + x6**2 - 7)**2
}

(This is not the same function as in the videos, where it only has two arguments.)
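
As a quick sanity check, evaluating it at the default all-zero vector should give -(-11)^2 - (-7)^2 - (-11)^2 - (-7)^2 = -340:

my_function(c(0, 0, 0, 0, 0, 0))
## [1] -340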

Let’s generate some data:

dataset <- init_pop(my_function) %>%
  as.data.frame()
head(dataset)
##           x1         x2         x3         x4        x5        x6
## 1 0.01240753 0.52216369 0.03258892 0.09722631 0.7485749 0.9275532
## 2 0.35423196 0.13766202 0.32204460 0.05759782 0.2143725 0.2391541
## 3 0.10378066 0.07571073 0.28378176 0.51448928 0.9616971 0.2702277
## 4 0.52591063 0.46718618 0.99404463 0.86059204 0.7264650 0.8083685
## 5 0.46208664 0.56516052 0.46177016 0.15401250 0.9280011 0.9152156
## 6 0.98528077 0.24898464 0.28815249 0.37093507 0.9020534 0.4355511

Now, on to the actual problem: I need to add another column containing the value of my_function(), evaluated on a per-row basis. As an example, for the first row, this would be the result of:

my_function(dataset[1, ])
##          x1
## 1 -302.8056

(Notice that the result prints as a small data frame: dataset[1, ] is a 1-row data frame, so x[1], x[2], and so on select its columns.)

Many people would probably solve this using a for loop, so let’s write a function to do just that (benchmarking will make it easier if the code is inside a function):

run_loop <- function(dataset, my_function = my_function){

  dataset$score <- 0

  for(i in seq(1, nrow(dataset))){

    dataset$score[i] <- my_function(dataset[i, ])
  }

  dataset

}


run_loop(dataset, my_function = my_function) %>%
  head
##           x1         x2         x3         x4        x5        x6     score
## 1 0.01240753 0.52216369 0.03258892 0.09722631 0.7485749 0.9275532 -302.8056
## 2 0.35423196 0.13766202 0.32204460 0.05759782 0.2143725 0.2391541 -323.4730
## 3 0.10378066 0.07571073 0.28378176 0.51448928 0.9616971 0.2702277 -311.6355
## 4 0.52591063 0.46718618 0.99404463 0.86059204 0.7264650 0.8083685 -259.7646
## 5 0.46208664 0.56516052 0.46177016 0.15401250 0.9280011 0.9152156 -286.0531
## 6 0.98528077 0.24898464 0.28815249 0.37093507 0.9020534 0.4355511 -278.4643

The advantage of loops is that you don’t really need to know a lot about R to get the job done; if you’ve learned some programming language at some point during your studies, you’ve learned about for-loops. But they’re generally slower than other methods and error-prone (think of typos, or looping over several indices, which can get quite complex…). And they’re, in my humble opinion, not always very easy to understand. That is not the case here, because this is quite a simple example, but often it can get confusing to follow what is going on.

So what would be a more “R-specific” way of doing it (specific in the sense that it is not a universal construct like a for-loop) that avoids a loop? apply() is the best candidate here:

apply(dataset, MARGIN = 1, FUN = my_function)
##   [1] -302.8056 -323.4730 -311.6355 -259.7646 -286.0531 -278.4643 -294.4777
##   [8] -298.5700 -291.3642 -308.0560 -305.0222 -275.0652 -284.7013 -273.9297
##  [15] -268.5652 -278.1445 -288.9499 -302.2236 -293.8893 -280.6697 -324.6310
##  [22] -301.8986 -283.3776 -263.1386 -279.0203 -274.6004 -242.8429 -316.1679
##  [29] -274.2564 -258.7995 -254.5350 -313.4797 -262.1684 -290.9866 -302.7581
##  [36] -306.8771 -279.7645 -288.2144 -290.5971 -256.1781 -279.2323 -292.5604
##  [43] -295.9565 -271.1590 -305.5433 -282.4037 -292.0525 -285.0635 -303.1871
##  [50] -264.1534 -253.6944 -299.2543 -252.5947 -263.6670 -257.2571 -297.9045
##  [57] -309.2746 -305.7162 -288.1244 -264.2176 -295.2711 -316.3825 -268.1744
##  [64] -286.3890 -313.4759 -309.1958 -291.4537 -305.1202 -269.3924 -305.3369
##  [71] -292.2627 -280.4522 -271.2382 -291.9913 -283.5446 -274.8310 -268.6854
##  [78] -289.4784 -278.0444 -280.8523 -264.4792 -283.4950 -287.7354 -268.5446
##  [85] -298.2402 -280.7423 -302.1886 -280.4117 -270.9505 -306.2466 -293.8185
##  [92] -249.7290 -270.2272 -282.2159 -278.4562 -270.2936 -288.2470 -245.2684
##  [99] -283.6952 -288.1078

Appending this to a dataframe can be done within a mutate() call (here again I’m encapsulating this inside a function, for benchmarking purposes):

run_apply <- function(dataset, my_function = my_function){

  dataset %>%
    mutate(score = apply(., MARGIN = 1, my_function))

}

run_apply(dataset, my_function = my_function) %>%
  head()
##           x1         x2         x3         x4        x5        x6     score
## 1 0.01240753 0.52216369 0.03258892 0.09722631 0.7485749 0.9275532 -302.8056
## 2 0.35423196 0.13766202 0.32204460 0.05759782 0.2143725 0.2391541 -323.4730
## 3 0.10378066 0.07571073 0.28378176 0.51448928 0.9616971 0.2702277 -311.6355
## 4 0.52591063 0.46718618 0.99404463 0.86059204 0.7264650 0.8083685 -259.7646
## 5 0.46208664 0.56516052 0.46177016 0.15401250 0.9280011 0.9152156 -286.0531
## 6 0.98528077 0.24898464 0.28815249 0.37093507 0.9020534 0.4355511 -278.4643

MARGIN = 1 means that the function is applied over the rows, whereas MARGIN = 2 would apply the function over the columns.
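
For instance, computing the mean of each column (a column-wise operation) would look like this:

apply(dataset, MARGIN = 2, FUN = mean)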

In terms of readability, I think that this is maybe a bit less readable than the for-loop, simply because for-loops are so ubiquitous. But it’s very simple once you understand how apply() works.

Now, what would a {tidyverse}-only approach look like? And why would you want a {tidyverse}-only approach anyway? Generally, I would argue that scripts written using {tidyverse} functions and style are easier to read. For example, I tweeted this code snippet some time ago:

blogdown::shortcode("tweet",
                    "1431718740341764099"
                    )

I don't see how one could prefer base #RStats

but then again, there's people out there who like licorice candies… pic.twitter.com/bwW74SRDrN

— Bruno Rodrigues (@brodriguesco) August 28, 2021

and in my opinion the example in my tweet clearly shows that the {tidyverse} code is more easily understood and readable. Of course, some people disagree… However, in this particular case, I’m not sure that a {tidyverse} approach would be more readable. The solution using apply() seems quite good to me. Let’s see what the {tidyverse} approach, which leverages rowwise(), looks like:

run_rowwise <- function(dataset, my_function = my_function){
  dataset %>%
    rowwise() %>%
    mutate(score = my_function(c_across(everything()))) %>%
    ungroup()
}

run_rowwise(dataset, my_function = my_function) %>%
  head()
## # A tibble: 6 × 7
##       x1     x2     x3     x4    x5    x6 score
##    <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 0.0124 0.522  0.0326 0.0972 0.749 0.928 -303.
## 2 0.354  0.138  0.322  0.0576 0.214 0.239 -323.
## 3 0.104  0.0757 0.284  0.514  0.962 0.270 -312.
## 4 0.526  0.467  0.994  0.861  0.726 0.808 -260.
## 5 0.462  0.565  0.462  0.154  0.928 0.915 -286.
## 6 0.985  0.249  0.288  0.371  0.902 0.436 -278.

This runs, but much, much more slowly than apply() (though faster than a for-loop, as we shall see). Plus, it looks much, much more complicated than the simple apply() version! So why do it like this? You even need several functions – rowwise(), c_across() and everything() – to make it work! Well, there is one use case in which this approach enables something that I don’t think is possible (or at least easily possible) with apply(): applying the function only over certain columns. For example, if you want to apply the function only over the columns whose names start with the letter “c”, you could write something like this:

mtcars %>%
  rowwise() %>%
  mutate(score = mean(c_across(starts_with("c")))) %>%
  ungroup()
## # A tibble: 32 × 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb score
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4   5  
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4   5  
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1   2.5
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1   3.5
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2   5  
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1   3.5
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4   6  
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2   3  
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2   3  
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4   5  
## # … with 22 more rows

Now this is not needed here, so apply() clearly wins in terms of readability (and speed as well). But in cases like the one above, where you need to compute only over a subset of columns, the {tidyverse} version is not only very readable, it actually offers a solution to the problem. I’m not quite sure you could solve this easily with base, but please prove me wrong.
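
For what it’s worth, one possible base sketch, which selects the columns by name before calling apply(), might look like the following; whether it counts as “easy” is debatable:

# A possible base equivalent: pick the columns whose names start
# with "c", then apply mean() over the rows of that subset
cols <- grepl("^c", names(mtcars))
head(transform(mtcars, score = apply(mtcars[cols], MARGIN = 1, FUN = mean)))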

In any case, there’s another way to approach our original problem using {tidyverse} functions, but we still need the help of a base function.

The next approach uses the fact that map() needs both a list and a function as input. As a refresher, here’s how map() works:

# We have a list

my_list <- list("a" = 2,
                "b" = 4)

# and we have a function, say sqrt, which we want to apply to each element of this list

map(my_list, sqrt)
## $a
## [1] 1.414214
## 
## $b
## [1] 2

So what we need is a way to mimic the basic approach, which works on one “element” (in this case, a row of the dataframe), and extend that idea to a “list of rows”. Now, the issue is that a dataframe is actually a list of columns, not rows. So if you’re using map() over a dataframe, you will be looping over the columns, not the rows, as in the example below:

# This applies the function class() to each column of mtcars
mtcars %>%
  map(class)
## $mpg
## [1] "numeric"
## 
## $cyl
## [1] "numeric"
## 
## $disp
## [1] "numeric"
## 
## $hp
## [1] "numeric"
## 
## $drat
## [1] "numeric"
## 
## $wt
## [1] "numeric"
## 
## $qsec
## [1] "numeric"
## 
## $vs
## [1] "numeric"
## 
## $am
## [1] "numeric"
## 
## $gear
## [1] "numeric"
## 
## $carb
## [1] "numeric"

Now the question becomes: is there a way to turn a dataframe, which is a list of columns, into a list of rows? Yes, there is, using asplit():

asplit(mtcars, MARGIN = 1) %>%
  head()
## $`Mazda RX4`
##    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
##  21.00   6.00 160.00 110.00   3.90   2.62  16.46   0.00   1.00   4.00   4.00 
## 
## $`Mazda RX4 Wag`
##     mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear 
##  21.000   6.000 160.000 110.000   3.900   2.875  17.020   0.000   1.000   4.000 
##    carb 
##   4.000 
## 
## $`Datsun 710`
##    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
##  22.80   4.00 108.00  93.00   3.85   2.32  18.61   1.00   1.00   4.00   1.00 
## 
## $`Hornet 4 Drive`
##     mpg     cyl    disp      hp    drat      wt    qsec      vs      am    gear 
##  21.400   6.000 258.000 110.000   3.080   3.215  19.440   1.000   0.000   3.000 
##    carb 
##   1.000 
## 
## $`Hornet Sportabout`
##    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
##  18.70   8.00 360.00 175.00   3.15   3.44  17.02   0.00   0.00   3.00   2.00 
## 
## $Valiant
##    mpg    cyl   disp     hp   drat     wt   qsec     vs     am   gear   carb 
##  18.10   6.00 225.00 105.00   2.76   3.46  20.22   1.00   0.00   3.00   1.00

asplit() splits a dataframe along rows (with the MARGIN argument set to 1) or along columns (with MARGIN = 2). As you can see in the code above, the mtcars dataset is now a list of rows, where each element is a single vector of values. Now that my dataframe is a list of rows, I can simply use map() to apply any function over its rows:

run_map <- function(dataset, my_function = my_function){
  dataset %>%
    mutate(score = map_dbl(asplit(., 1), .f = my_function))
}

run_map(dataset, my_function = my_function) %>%
  head()
##           x1         x2         x3         x4        x5        x6     score
## 1 0.01240753 0.52216369 0.03258892 0.09722631 0.7485749 0.9275532 -302.8056
## 2 0.35423196 0.13766202 0.32204460 0.05759782 0.2143725 0.2391541 -323.4730
## 3 0.10378066 0.07571073 0.28378176 0.51448928 0.9616971 0.2702277 -311.6355
## 4 0.52591063 0.46718618 0.99404463 0.86059204 0.7264650 0.8083685 -259.7646
## 5 0.46208664 0.56516052 0.46177016 0.15401250 0.9280011 0.9152156 -286.0531
## 6 0.98528077 0.24898464 0.28815249 0.37093507 0.9020534 0.4355511 -278.4643

So we now have 4 approaches to solve the issue:

  • a for-loop (run_loop())
  • apply() (run_apply())
  • rowwise() with c_across() (run_rowwise())
  • asplit() combined with map() (run_map())

Let’s set up a function to run some benchmarks and see which runs faster. I’ll create a list of increasingly large data frames over which I’ll run all the above functions:

list_datasets <- map(seq(2, 5), ~init_pop(objective_function = my_function,
                                          pop_size = 10^.x))

The function below will run the benchmarks over all the data frames:

run_benchmarks <- function(dataset, times = 5){
  microbenchmark::microbenchmark(
                    run_loop(dataset, my_function = my_function),
                    run_apply(dataset, my_function = my_function),
                    run_rowwise(dataset, my_function = my_function),
                    run_map(dataset, my_function = my_function),
                    times = times,
                    unit = "s"
                  )
}

I’ll run this in parallel using {furrr}:

library(furrr)

plan(multisession, workers = 2)

benchmark_results <- future_map(list_datasets, run_benchmarks)

Let’s take a look at the results:

benchmark_data <- map2(.x = benchmark_results, .y = 10^seq(2, 5), .f = ~mutate(tibble(.x), pop_size = .y)) %>%
  bind_rows() %>%
  mutate(expr = str_remove_all(expr, "\\(.*\\)")) %>%  
  group_by(expr, pop_size) %>%
  mutate(time_seconds = time/10^9) %>%
  summarise(fastest_run = min(time_seconds),
            average_run = mean(time_seconds),
            slowest_run = max(time_seconds))
## `summarise()` has grouped output by 'expr'. You can override using the `.groups` argument.
benchmark_data %>%
  ggplot(aes(y = average_run, x = pop_size)) +
  geom_ribbon(aes(ymin = fastest_run, ymax = slowest_run, fill = expr), alpha = .6) +
  geom_line(aes(group = expr, col = expr)) +
  ylab("Seconds") +
  xlab("Rows in the dataset") +
  ggtitle("Speed of rowwise operations using different methods") +
  theme_blog()

Using a for-loop for row-wise operations is clearly the slowest solution. Let’s take a closer look at the remaining 3 options:

benchmark_data %>%
  filter(!grepl("loop", expr)) %>% 
  ggplot(aes(y = average_run, x = pop_size)) +
  geom_ribbon(aes(ymin = fastest_run, ymax = slowest_run, fill = expr), alpha = .6) +
  ylab("Seconds") +
  xlab("Rows in the dataset") +
  ggtitle("Speed of rowwise operations using different methods") +
  theme_blog()

rowwise() loses here, but unless you literally have to run such code hundreds of times, it is still tolerable; it gives you enough time to browse some memes. But if you have to run such operations millions of times, you might want to look at either apply() or the other approach that uses asplit() and map(). Let’s take a closer look at these two:

benchmark_data %>%
  filter(!grepl("loop|rowwise", expr)) %>%
  ggplot(aes(y = average_run, x = pop_size)) +
  geom_ribbon(aes(ymin = fastest_run, ymax = slowest_run, fill = expr), alpha = .6) +
  geom_line(aes(group = expr, col = expr)) +
  ylab("Seconds") +
  xlab("Rows in the dataset") +
  ggtitle("Speed of rowwise operations using different methods") +
  theme_blog()

Interestingly, the fastest run using map() was faster than the fastest run using apply(), but on average, both methods seem to be equivalent. In conclusion, if you need speed and you need to compute over every column, apply() is the clear winner. But if you need row-wise operations over only a subset of columns, rowwise(), even though it is slow, seems to be the cleanest solution.

I wonder if there is a way to use c_across() with the map() approach, and potentially get the benefits of both map() (as fast as apply()) and rowwise() (computing only over certain columns…). Another subject to explore later; one possible direction is sketched below.
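
In the meantime, here is a sketch that sidesteps c_across() entirely by select()-ing the columns of interest before splitting into rows (unbenchmarked, so take it as a sketch rather than a recommendation):

# Restrict to the columns whose names start with "c" before
# splitting into rows, then map over the rows
mtcars %>%
  mutate(score = map_dbl(asplit(select(., starts_with("c")), MARGIN = 1), mean))

This reproduces the scores from the rowwise() example above.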

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!
