Site icon R-bloggers

Why you should(n’t) care about Monads if you’re an R programmer

[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction: functions

To understand Monads, I think it’s useful to first think about functions; why do we use functions? Why don’t we simply write scripts with the required operations one after the other? For instance, to compute the average height by species in a data set of individuals from the famous space opera “Star Wars”, we could very well write this code:

suppressPackageStartupMessages(library(dplyr))

data(starwars)

sum_humans <- 0
sum_others <- 0
n_humans <- 0
n_others <- 0

for(i in seq_along(1:nrow(starwars))){

  if(!is.na(unlist(starwars[i, "species"])) &
     unlist(starwars[i, "species"]) == "Human"){
    if(!is.na(unlist(starwars[i, "height"]))){
      sum_humans <- sum_humans + unlist(starwars[i, "height"])
      n_humans <- n_humans + 1
    } else {

      0

    }

  } else {
    if(!is.na(unlist(starwars[i, "height"]))){
      sum_others <- sum_others + unlist(starwars[i, "height"])
      n_others <- n_others + 1
    } else {
      0
    }
  }
}

mean_height_humans <- sum_humans/n_humans
mean_height_others <- sum_others/n_others

Well, we could do it like this, but we definitely shouldn’t:

  • what this code does is not immediately obvious. If the code blocks aren’t commented, readers of this code will have to read line by line to understand what is going on;
  • this code is not reusable. If now I need the average height by species and sex, I need to copy and paste the code, and modify it, and in some cases modify it substantially;
  • this code handles missing values in a cumbersome way, with nested if...else...s;
  • this code is not easy to test;
  • this code cannot be composed (meaning, chained) with other code without substantially altering it (to be precise, chaining and composing are two different things, strictly speaking, but for simplicity’s sake, let’s just assume it is the same. Whenever I’m talking about “composing” something, I mean “chaining” something.)

But it’s not just shortcomings, this imperative code has one advantage; it uses only some very fundamental building blocks: if...else..., for loops and that’s almost it (it does use some functions provided by a base installation of R, namely is.na(), !(), unlist() and [(), so strictly speaking, the code above is not purely imperative, but maybe closer to being procedural?).

Using functions solves all the issues from imperative programming. Here is a base solution to the problem above, using a declarative, or functional, approach:

aggregate(starwars$height,
          by = list(starwars$species == "Human"),
          FUN = \(x)(mean(x, na.rm = TRUE)))
##   Group.1        x
## 1   FALSE 172.4043
## 2    TRUE 176.6452

This code has many advantages:

  • what this code does is obvious, but only if you know what aggregate() does. But if you read its documentation you’ll know, and you’ll know every time you’ll see aggregate() unlike a loop like the loop above where you’ll have to read it each time to understand;
  • this code is reusable. Replace the data frame by another, and that’s it;
  • Missing values are now ignored easily using the na.rm argument of mean();
  • this code is easy to test (using unit tests);
  • this code can be composed, for instance like this:
aggregate(starwars$height,
          by = list(starwars$species == "Human"),
          FUN = \(x)(mean(x, na.rm = TRUE))) |>
  setNames(c("is_human", "mean_height"))
##   is_human mean_height
## 1    FALSE    172.4043
## 2     TRUE    176.6452

The issue with the functional approach (at least that’s the issue that many people I spoke to about this raise) is that… in some way people that don’t like this approach feel like they “lose” control over what’s going on. You don’t know what happens inside these functions. I remember, while working my first job, that my boss required that I don’t use any functions nor packages, but instead write all the loops explicitely, because she wanted to understand what was going on (of course, I completely ignored this request and just did as I pleased). As discussed above, the imperative approach requires minimum knowledge of the language, and almost anyone with an ounce of programming experience can understand imperative code. That’s not the case with a functional approach. Readers will have to be familiar with the individual functions like aggregate(), but also anonymous functions (I had to use \(x)(mean(x, na.rm = TRUE)) to set na.rm = TRUE, which is FALSE by default) and also |> for composition/chaining.

It may same more complex, and maybe it is, but the advantages far outweigh the shortcoming.

For completeness, here is a {dplyr} version:

starwars %>%
  group_by(is_human = species == "Human") %>%
  summarise(mean_height = mean(height, na.rm = TRUE))
## # A tibble: 3 × 2
##   is_human mean_height
##   <lgl>          <dbl>
## 1 FALSE           172.
## 2 TRUE            177.
## 3 NA              181.

{dplyr} code is even more concise than base functional code. Here again, users will have to know about the individual functions and %>%. But personally, I think that the only hurdle is understanding what %>% does, and once you know this, {dplyr} code can be understood quite easily, thanks to very explicit function names.

So functions are great. They’re easy to test, easy to document, easy to package, easy to reuse, and easy to compose. Composition is really important. For example, let’s go back to the imperative code, and put the result in a neat data frame object, like the functional solutions do:

sum_humans <- 0
sum_others <- 0
n_humans <- 0
n_others <- 0

for(i in seq_along(1:nrow(starwars))){

  if(!is.na(unlist(starwars[i, "species"])) &
     unlist(starwars[i, "species"]) == "Human"){
    if(!is.na(unlist(starwars[i, "height"]))){
      sum_humans <- sum_humans + unlist(starwars[i, "height"])
      n_humans <- n_humans + 1
    } else {

      0

    }

  } else {
    if(!is.na(unlist(starwars[i, "height"]))){
      sum_others <- sum_others + unlist(starwars[i, "height"])
      n_others <- n_others + 1
    } else {
      0
    }
  }
}

mean_height_humans <- sum_humans/n_humans
mean_height_others <- sum_others/n_others

# These two lines are new
data.frame(list("is_human" = c(TRUE, FALSE),
           "mean_height" = c(mean_height_others, mean_height_humans)))
##   is_human mean_height
## 1     TRUE    172.9400
## 2    FALSE    176.6452

It’s just two lines (right at the end), but the implications are huge; because imperative code cannot be composed, I had to write separate code to put the result into a data frame. More code that I need to write, more opportunities for mistakes. I actually did a mistake, did you notice? This kind of mistake could go unnoticed for eons. But if you use functions, you don’t have this problem, and can focus on getting (even complex) things done:

starwars %>%
  filter(skin_color == "light") %>%
  select(species, sex, mass) %>%
  group_by(sex, species) %>%
  summarise(
    total_individuals = n(),
    min_mass = min(mass, na.rm = TRUE),
    mean_mass = mean(mass, na.rm = TRUE),
    sd_mass = sd(mass, na.rm = TRUE),
    max_mass = max(mass, na.rm = TRUE)
  ) %>%
  select(-species) %>%
  tidyr::pivot_longer(-sex, names_to = "statistic", values_to = "value")
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 10 × 3
## # Groups:   sex [2]
##    sex    statistic         value
##    <chr>  <chr>             <dbl>
##  1 female total_individuals   6  
##  2 female min_mass           45  
##  3 female mean_mass          56.3
##  4 female sd_mass            16.3
##  5 female max_mass           75  
##  6 male   total_individuals   5  
##  7 male   min_mass           79  
##  8 male   mean_mass          90.5
##  9 male   sd_mass            19.8
## 10 male   max_mass          120

Needless to say, trying to write the above code using only for loops and if...else... is not something I’d wish to do, especially passing the result of all the {dplyr} calls to pivot_longer(). Creating that last data frame by hand is error prone, and there would definitely be mistakes in there.

I hope I don’t need to convince you any more that functions are great, and that one of the great things they offer is their ability to be chained, or composed. But strictly speaking, you don’t need them. You could write your code without any function whatsoever, and use the most basic building blocks there are (loops and if...else... and little more). However, doing this would result in much messier code. It’s the same with monads. You can live without them. But there will be situations where not using them will result in messier code.

One more thing: as I was writing this blog post, I happened on this tweet:

#RStats question: What wheel are they trying to re-invent? pic.twitter.com/tsoAfCHroK

— Tokhir Dadaev (@zx8754) April 10, 2022

This is a fine example of all that I’ve been discussing until now. The person who wrote this code was very likely trying to get the diagonal elements of a matrix. That person was likely a beginner in R and used for loops to try to get the answer. We have all been there; what I’m trying to articulate is this: imperative programming can be useful, but it can get messy very quickly…

When functions are not enough

Functions are awesome, but there are situations which functions simply can’t easily deal with. Situations in which you would like your functions to do a little extra more, and the only way forward you see is to rewrite them to do something totally unrelated. For example, suppose you would like to time your code. Most people would to something such as:

tic <- Sys.time()
starwars %>%
  filter(skin_color == "light") %>%
  select(species, sex, mass) %>%
  group_by(sex, species) %>%
  summarise(
    total_individuals = n(),
    min_mass = min(mass, na.rm = TRUE),
    mean_mass = mean(mass, na.rm = TRUE),
    sd_mass = sd(mass, na.rm = TRUE),
    max_mass = max(mass, na.rm = TRUE)
  ) %>%
  select(-species) %>%
  tidyr::pivot_longer(-sex, names_to = "statistic", values_to = "value")
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 10 × 3
## # Groups:   sex [2]
##    sex    statistic         value
##    <chr>  <chr>             <dbl>
##  1 female total_individuals   6  
##  2 female min_mass           45  
##  3 female mean_mass          56.3
##  4 female sd_mass            16.3
##  5 female max_mass           75  
##  6 male   total_individuals   5  
##  7 male   min_mass           79  
##  8 male   mean_mass          90.5
##  9 male   sd_mass            19.8
## 10 male   max_mass          120
toc <- Sys.time()

(running_time <- toc - tic)
## Time difference of 0.04110336 secs

You could totally do that. But now you’re back to square one. You have to deal with this tic-toc nonsense separately, have to keep track it, overburdening you mentally and polluting your code. To keep track of it, you’ll want to add the running times in a separate data frame, in which you could have all the running times of all your operations you need to run:

data.frame(list("operations" = seq(1:3),
                "running_time" = c(running_time, running_time * 2, running_time * 3)))
##   operations    running_time
## 1          1 0.04110336 secs
## 2          2 0.08220673 secs
## 3          3 0.12331009 secs

This data frame is the consequence of this tic-toc nonsense not being composable and now you have to deal with it, but you don’t want to. So what now? You might be tempted to do something like this:

tic_filter <- function(...){

  tic <- Sys.time()

  result <- filter(...)

  toc <- Sys.time()

  message("Running time: ", toc - tic)

  return(result)

}

starwars %>%
  tic_filter(species == "Human")
## Running time: 0.00474810600280762
## # A tibble: 35 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  3 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  6 Biggs D…    183    84 black      light      brown           24   male  mascu…
##  7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
##  8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
##  9 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
## 10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
## # … with 25 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

But that’s actually worse: not only do you have to change all the functions you need, and wrap them around tic-toc, but the running time is only shown as a message, so you can’t reuse it. You could then try to rewrite the function like this:

tic_filter <- function(...){

  tic <- Sys.time()

  result <- filter(...)

  toc <- Sys.time()

  running_time <- toc - tic

  list("result" = result,
       "running_time" = running_time)

}

starwars %>%
  tic_filter(species == "Human")
## $result
## # A tibble: 35 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  3 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  6 Biggs D…    183    84 black      light      brown           24   male  mascu…
##  7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
##  8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
##  9 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
## 10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
## # … with 25 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
## 
## $running_time
## Time difference of 0.004680395 secs

At least now you save the running time along with the object. But the problem of rewriting many functions remains, and these rewritten {dplyr} functions now return a list, and not a data frame anymore so something like this:

starwars %>%
  tic_filter(species == "Human") %>%
  tic_select(species, sex)

wouldn’t work, because tic_select() expects a data frame, not a list where the first element is a data frame and the second a double.

So what else can be done? Perhaps you’d be tempted to use a global variable for this:

tic_filter <- function(..., running_time = 0){

  tic <- Sys.time()

  result <- filter(...)

  toc <- Sys.time()

  running_time <<- toc - tic + running_time

  result

}

Functions written like this would save the running time in a global variable called running_time and each of them would take turns overwriting it:

running_time <- 0

one <- starwars %>%
  tic_filter(species == "Human", running_time = running_time)

running_time
## Time difference of 0.004715443 secs
two <- one %>%
  tic_select(species, sex, running_time = running_time)

running_time
## Time difference of 0.006836653 secs

(I defined tic_select() but am not showing it here.)

This has the advantage that the wrapped functions now return data frames as well, and can thus be composed/chained. But these functions are not pure functions, because they change something (the global variable running_time) outside their scope. Impure functions can be tricky; for instance here, because the code keeps overwriting the same variable, if you run the whole script and then separate chunks to try some things, running_time will keep getting incremented. Once again, you have to be extra careful and keep track of it, once again overburdening you mentally.

The solution

The solution to this problem looks like one of the previous things we tried, namely:

tic_filter <- function(...){

  tic <- Sys.time()

  result <- filter(...)

  toc <- Sys.time()

  running_time <- toc - tic

  list("result" = result,
       "running_time" = running_time)

}

While it is true that it returns a list, this function has the yuge advantage of being pure. But still, we need to solve two problems:

  • how to avoid having to rewrite every function;
  • how to compose these functions so that the output of one function can be ingested as the input of the next.

Solving the first problem consists in writing a new function that builds functions, what Hadley Wickham calls function factories. Let’s try:

timeit <- function(.f, ..., running_time = 0){

  function(..., .running_time = running_time){

    tic <- Sys.time()

    result <- .f(...)

    toc <- Sys.time()

    list(result = result,
         running_time = toc - tic + .running_time)
  }


}

timeit() is a function that takes a function (and its arguments as an input), and returns a new function. This function returns the result of the original function (.f) evaluated on its arguments (...) as well as the time it took to run as a list. You’ll notice as well that this function takes another argument, called running_time with a default value of 0. This will become useful below, for now, ignore it.

t_sqrt <- timeit(sqrt)

t_sqrt(10)
## $result
## [1] 3.162278
## 
## $running_time
## Time difference of 7.152557e-06 secs

That’s great, but we can’t compose these functions. This fails:

t_log <- timeit(log)

10 |>
  t_sqrt() |>
  t_log()
Error in .f(...) : non-numeric argument to mathematical function

because t_log() expects a number, not a list. The solution? Write another functions to help! Let’s call this function bind:

bind <- function(.l, .f, ...){

  .f(.l$result, ..., .running_time = .l$running_time)

}

bind() takes a list object returned by a timed function (.l, with elements $result and $running_time) and applies another timed function .f() to the $result element of .l as well as any further arguments ... and finally sets the running_time argument of .f equal to .l$running_time. .l$running_time is the running time of the previous timed function call, so now this running time gets added to the running time of .f (see the definition of the list of timeit()).

An example might help:

t_log <- timeit(log)

10 |>
  t_sqrt() |>
  bind(t_log)
## $result
## [1] 1.151293
## 
## $running_time
## Time difference of 8.177757e-05 secs

What’s nice with this solution, is that it works with any function:

t_filter <- timeit(filter)
t_select <- timeit(select)
t_group_by <- timeit(group_by)
t_summarise <- timeit(summarise)
t_p_longer <- timeit(tidyr::pivot_longer)

starwars %>%
  t_filter(skin_color == "light") %>% # no need to use bind here
  bind(t_select, species, sex, mass) %>%
  bind(t_group_by, sex, species) %>%
  bind(t_summarise,
    total_individuals = n(),
    min_mass = min(mass, na.rm = TRUE),
    mean_mass = mean(mass, na.rm = TRUE),
    sd_mass = sd(mass, na.rm = TRUE),
    max_mass = max(mass, na.rm = TRUE)
  ) %>%
  bind(t_select, -species) %>%
  bind(t_p_longer, -sex, names_to = "statistic", values_to = "value")
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## $result
## # A tibble: 10 × 3
## # Groups:   sex [2]
##    sex    statistic         value
##    <chr>  <chr>             <dbl>
##  1 female total_individuals   6  
##  2 female min_mass           45  
##  3 female mean_mass          56.3
##  4 female sd_mass            16.3
##  5 female max_mass           75  
##  6 male   total_individuals   5  
##  7 male   min_mass           79  
##  8 male   mean_mass          90.5
##  9 male   sd_mass            19.8
## 10 male   max_mass          120  
## 
## $running_time
## Time difference of 0.08988643 secs

There is some overhead compared to the solution that simply calls tic at the beginning of all the {dplyr} calls and then toc at the end, but this overhead becomes negligible the longer the base operations run for. And now the advantage is that you don’t have to think about keeping track of running times. Re-running separate chunks will also not interfere with the running time of any other chunk.

Monads

So here we are, ready to learn what monads are, or rather, we’re done, because you already know what monads are. The solution described before is a monad:

  • a function factory to create functions that return a special, wrapped value (here it simply was a list of elements $result and $running_time). This wrapped value is also called a monadic value.
  • a function to compose, or chain, these special functions together.

Some other pieces can be added to the list, and one would need to check so-called monadic laws to make extra sure we’re dealing with a monad, but that’s outside the scope of this blog post.

There are many monads, for instance the so-called Maybe monad, available on R thanks to Andrew McNeil who implemented this monad as an R package. I have also developed a monad for logging (which also logs execution time), which I called {chronicler}, read more about it here.

To conclude, why did I title this post why you should(n’t) care about Monads if you’re an R programmer? The reason is that you can live without monads. However, certain things will be more complex if you don’t know about monads or if you don’t want to use them, just like functions. If for some reason you don’t use functions in your code, your life will be more complicated. So should you go ahead and start using monads in your code? Well, maybe (hehe) you should, especially if you’re doing the same thing over and over again, like timing your code. Maybe using a monad to time your code could be a nice solution, especially if you’ve been burned in the past by using the other, sub-optimal solutions?

Extra reading

If this blog post was not enough to satiate your curiosity, here are some more nice resources:

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!

  • Buy me an Espresso
  • To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.