Double dispatch in R: S4-vs-vctrs

Your Name

1 year ago

[This article was first published on krzjoa, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Why do we may need double dispatch?

In most cases, when writing R scripts or even creating R packages, it is enough to use standard functions or S3 methods. However, there is one important field that forces us to consider double dispatch question: arithmetic operators.

Suppose we’d like to create a class, which fits the problem we’re currently working on. Let’s name such class beer.

beer <- function(type){
  structure(list(type  = type),class = "beer")
}

opener <- function(){
  structure(list(), class = "opener")
}

pilsner <- beer("pilnser")
my_opener <- opener()

Then, we create an operator which defines some non-standard behaviour.

if we add an opener to the beer, we get an opened_beer.
adding a numeric x, we get a case of beers (which even contain a negative number of bees, i.e. our owe…)
if second argument is different than a or opener or numeric, we get… untouched beer

Let’s demonstrate, how does it work:

`+.beer` <- function(a, b){
  if (inherits(b, "opener")) {
        return(structure(list(
          name  = paste("opened", a$name)
    ), class = "opened_beer"))
  } else if (inherits(b, "numeric")) {
    print("It's magic! You've got a case of beers!")
    return(structure(list(
        n_beers = 1 + b
    ), class = "case_of_beers"))
  } else {
    return(a)
  }
}

pilsner + my_opener

## $name
## [1] "opened "
## 
## attr(,"class")
## [1] "opened_beer"

pilsner + -0.1

## [1] "It's magic! You've got a case of beers!"

## $n_beers
## [1] 0.9
## 
## attr(,"class")
## [1] "case_of_beers"

Don’t you think, that such operations should be commutative?

my_opener + pilsner

## list()
## attr(,"class")
## [1] "opener"

What did happen here? This is an example of the way the R interpreter handles arithmetic operator. It was described with details on Hiroaki Yutani’s blog. Briefly speaking, in this particular case R engine matched method to the second argument (not to the first one), because there is no +.opener S3 method. What about such trick:

`+.opener` <- function(a, b) b + a

After that, the result is different:

my_opener + pilsner

## Warning: Incompatible methods ("+.opener", "+.beer") for "+"

## Error in my_opener + pilsner: non-numeric argument to binary operator

We crashed our function call. When both objects have the + method defined and these methods are not the same, R is trying to resolve the conflict by applying an internal +. It obviously cannot work. This case could be easily solved using more ‘ifs’ in the +.beer beer function body. But let’s face a different situation.

-0.1 + pilsner

## [1] -0.1

What a mess! Simple S3 methods are definitely not the best solution when we need the double dispatch.

S4 class: a classic approach

To civilize such code, we can use classic R approach, S4 methods. We’ll start from S4 classes declaration.

.S4_beer          <- setClass("S4_beer", representation(type = "character"))
.S4_opened_beer   <- setClass("S4_opened_beer", representation(type = "character"))
.S4_opener        <- setClass("S4_opener", representation(ID = "numeric"))
.S4_case_of_beers <- setClass("S4_case_of_beers", representation(n_beers = "numeric"))

Then, we can two otptions, how to handle + operators. I didn’t mention about it in the previous example, but both S3 and S4 operators are grouped as so-called group generic functions (learn more: S3, S4).

We can set a S4 method for a single operator and that looks as follows:

setMethod("+", c(e1 = "S4_beer", e2 = "S4_opener"),
          function(e1, e2){
  if (inherits(e2, "S4_opener")) {
        return(.S4_opened_beer(type  = paste("opened", e1@type)))
  } else if (inherits(e2, "numeric")) {
    print("It's magic! You've got a case of beers!")
    return(.S4_case_of_beers(n_beers = 1 + e2))
  } else {
    return(e1)
  }
})

setMethod("+", c(e1 = "S4_opener", e2 = "S4_beer"),
          function(e1, e2) e2 + e1)

Alternatively, we can define a method for Arith geneneric and check, what method is exactly called at the moment. I decided to use the second approach, because it’s more similar to the way the double dispatch is implemented in the vctrs library.

.S4_fun <- function(e1, e2){
  if (inherits(e2, "S4_opener")) {
        return(.S4_opened_beer(type  = paste("opened", e1@type)))
  } else if (inherits(e2, "numeric")) {
    print("It's magic! You've got a case of beers!")
    return(.S4_case_of_beers(n_beers = 1 + e2))
  } else {
    return(e1)
  }
}

setMethod("Arith", c(e1 = "S4_beer", e2 = "S4_opener"),
          function(e1, e2)
          {
            op = .Generic[[1]]
            switch(op,
                   `+`  = .S4_fun(e1, e2),
                    stop("undefined operation")
            )
})

setMethod("Arith", c(e1="S4_opener", e2="S4_beer"),
          function(e1, e2)
          { 
            op = .Generic[[1]]
            switch(op,
                   `+`  = e2 + e1,
                    stop("undefined operation")
            )
})

Let’s create our class instances and do a piece of math.

S4_pilsner <- .S4_beer(type = "Pilsner")
S4_opener <- .S4_opener(ID = 1)

S4_pilsner + S4_opener

## An object of class "S4_opened_beer"
## Slot "type":
## [1] "opened Pilsner"

S4_opener + S4_pilsner

## An object of class "S4_opened_beer"
## Slot "type":
## [1] "opened Pilsner"

Declared methods are clear, and, the most important: they work correctly.

vctrs library: a tidyverse approach

vctrs is an interesting library, thought as a remedy for a couple of R disadvantages. It delivers, among others, a custom double-dispatch system based on well-known S3 mechanism.

At the first step we declare class ‘constructors’.

library(vctrs)

.vec_beer <- function(type){
  new_vctr(.data = list(type  = type), class = "vec_beer")
}

.vec_opened_beer <- function(type){
  new_vctr(.data = list(type  = type), class = "vec_opened_beer")
}

.vec_case_of_beers <- function(n_beers){
  new_vctr(.data = list(n_beers  = n_beers), class = "vec_case_of_beers")
}

.vec_opener <- function(){
  new_vctr(.data = list(), class = "vec_opener")
}

Then, we create class instances.

vec_pilsner   <- .vec_beer("pilnser")
vec_opener <- .vec_opener()
print(class(vec_pilsner))

## [1] "vec_beer"   "vctrs_vctr"

print(class(vec_opener))

## [1] "vec_opener" "vctrs_vctr"

At the end, we write a double-dispatched methods in vctrs style. As you can see,

.fun <- function(a, b){
  if (inherits(b, "vec_opener")) {
        return(.vec_opened_beer(type  = paste("opened", a$type)))
  } else if (inherits(b, "numeric")) {
    print("It's magic! You've got a case of beers!")
    return(.vec_case_of_beers(n_beers = 1 + b))
  } else {
    return(a)
  }
}

vec_arith.vec_beer <- function(op, x, y, ...) {
  UseMethod("vec_arith.vec_beer", y)
}

vec_arith.vec_opener <- function(op, x, y, ...) {
  UseMethod("vec_arith.vec_opener", y)
}

vec_arith.vec_beer.vec_opener <- function(op, x, y, ...){
  switch(op,
         `+` = .fun(x, y),
         stop_incompatible_op(op, x, y)
  )
}

vec_arith.vec_opener.vec_beer <- function(op, x, y, ...){
  y + x
} 

vec_pilsner + vec_opener

## <vec_opened_beer[1]>
##           type 
## opened pilnser

vec_opener + vec_pilsner

## <vec_opened_beer[1]>
##           type 
## opened pilnser

It works properly, too.

Benchmark

I’ve created all the classes and methods above not only to demonstate, how to implement double dispatch in R. My main goal is to benchmark both approaches and check, which one has smaller overhead. The hardware I used for the test looks as follows:

## $vendor_id
## [1] "GenuineIntel"
## 
## $model_name
## [1] "Intel(R) Core(TM) i3 CPU       M 350  @ 2.27GHz"
## 
## $no_of_cores
## [1] 4

## 8.19 GB

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=pl_PL.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=pl_PL.UTF-8        LC_COLLATE=pl_PL.UTF-8    
##  [5] LC_MONETARY=pl_PL.UTF-8    LC_MESSAGES=en_US.utf8    
##  [7] LC_PAPER=pl_PL.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] vctrs_0.2.3
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3            benchmarkmeData_1.0.3 knitr_1.23           
##  [4] magrittr_1.5          tidyselect_0.2.5      doParallel_1.0.15    
##  [7] lattice_0.20-38       R6_2.4.0              rlang_0.4.2          
## [10] foreach_1.4.7         httr_1.4.1            stringr_1.4.0        
## [13] dplyr_0.8.3           tools_3.6.1           parallel_3.6.1       
## [16] grid_3.6.1            xfun_0.9              htmltools_0.3.6      
## [19] iterators_1.0.12      yaml_2.2.0            digest_0.6.25        
## [22] assertthat_0.2.1      tibble_2.1.3          benchmarkme_1.0.3    
## [25] crayon_1.3.4          Matrix_1.2-17         purrr_0.3.3          
## [28] codetools_0.2-16      glue_1.3.1            evaluate_0.14        
## [31] rmarkdown_1.14        stringi_1.4.3         pillar_1.4.2         
## [34] compiler_3.6.1        pkgconfig_2.0.2

It’s my good old notebook, which is not a beast.

library(microbenchmark)
library(ggplot2)

Beer + opener

bm1 <- microbenchmark(
  s4 = S4_pilsner + S4_opener,
  s3_vec = vec_pilsner + vec_opener,
  times = 1000
)

## Unit: microseconds
##    expr     min       lq      mean   median       uq      max neval
##      s4 153.292 158.2120 178.40541 161.4225 165.6375 5506.681  1000
##  s3_vec  56.686  60.1265  69.52364  68.9240  70.8830  163.278  1000

Opener + beer

bm2 <- microbenchmark(
  s4 = S4_opener + S4_pilsner,
  s3_vec = vec_opener + vec_pilsner,
  times = 1000
)

## Unit: microseconds
##    expr     min       lq      mean   median       uq      max neval
##      s4 159.512 164.6735 191.74781 168.9655 176.3165 6068.477  1000
##  s3_vec  71.110  78.5835  96.22535  86.6720  89.4015 4796.377  1000

Bonus: opener + beer vs addtion of numerics

bm3 <- microbenchmark(
  simple_R = 1 + 2,
  s4 = S4_opener + S4_pilsner,
  s3_vec = vec_opener + vec_pilsner,
  times = 1000
)

## Unit: nanoseconds
##      expr    min       lq      mean   median     uq    max neval
##  simple_R    130    344.0    697.49    744.5    857   2862  1000
##        s4 158769 164522.5 189297.35 169270.5 198120 375648  1000
##    s3_vec  74775  78395.5  94786.28  87192.5  94085 258129  1000

Conclusions

It seems that vctrs-based performs better than traditional S4 methods. Obviously, I checked only one operation and probably some edge cases may exists. However, I think that it shows us some direction, what execution time we can expect.

Further sources

If you are interesting, how to implement double-dispatched operators in S4, I encourage you to get familiar with code of the following R libraries:

Matrix
gpuR

If you are looking for some examples of vctrs, I recommend you to learn the source code of:

rray
banana (a funny toy package)

To leave a comment for the author, please follow the link and comment on their blog: krzjoa.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.