TL;DR
- C++ templates and function overloading are incompatible with R’s C API, so polymorphism must be achieved via run-time dispatch, handled explicitly by the programmer.
- The traditional technique for operating on SEXP objects in a generic manner entails a great deal of boilerplate code, which can be unsightly, unmaintainable, and error-prone.
- The desire to provide polymorphic functions which operate on vectors and matrices is common enough that Rcpp provides the utility macros RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX to simplify the process.
- Subsequently, these macros were extended to handle an (essentially) arbitrary number of arguments, provided that a C++11 compiler is used.
Background
To motivate a discussion of polymorphic functions, imagine that we desire a
function (ends) which, given an input vector x and an integer n, returns
a vector containing the first and last n elements of x concatenated.
Furthermore, we require ends to be a single interface which is capable of
handling multiple types of input vectors (integers, floating point values,
strings, etc.), rather than having a separate function for each case. How can
this be achieved?
R Implementation
A naïve implementation in R might look something like this:
[1] 1 2 3 4 6 7 8 9
[1] "a" "b" "c" "x" "y" "z"
[1] -0.560476 -0.230177 0.701356 -0.472791
The simple function above demonstrates a key feature of many dynamically-typed
programming languages, one which has undoubtedly been a significant factor in their
rise to popularity: the ability to write generic code with little-to-no
additional effort on the part of the developer. Without getting into a discussion
of the pros and cons of static vs. dynamic typing, it is evident that being able
to dispatch a single function generically on multiple object types, as opposed to,
e.g., having to manage separate implementations of ends for each vector type,
helps us to write more concise, expressive code. This being an article about Rcpp,
however, the story does not end here, and we consider how this problem might
be approached in C++, which has a much stricter type system than R.
C++ Implementation(s)
For simplicity, we begin by considering solutions in the context of a “pure”
(i.e., not called from R) C++ program. Eschewing more complicated tactics
involving run-time dispatch (virtual functions, etc.), the C++ language
provides us with two straightforward methods of achieving this at compile time:
- Function Overloading (ad hoc polymorphism)
- Templates (parametric polymorphism)
The first case can be demonstrated as follows:
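A minimal sketch of the overloading approach, assuming one overload per vector alias; the alias names ivec, dvec, and svec, and the choice to clamp n to the vector's length, are illustrative assumptions rather than details confirmed by the text:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Assumed aliases, one per supported element type.
typedef std::vector<int> ivec;
typedef std::vector<double> dvec;
typedef std::vector<std::string> svec;

// First and last n elements of an integer vector (n is clamped to the size).
ivec ends(const ivec& x, std::size_t n) {
    if (n > x.size()) n = x.size();
    ivec res(x.begin(), x.begin() + n);
    res.insert(res.end(), x.end() - n, x.end());
    return res;
}

// Same body again, differing only in the vector type.
dvec ends(const dvec& x, std::size_t n) {
    if (n > x.size()) n = x.size();
    dvec res(x.begin(), x.begin() + n);
    res.insert(res.end(), x.end() - n, x.end());
    return res;
}

// And once more for strings.
svec ends(const svec& x, std::size_t n) {
    if (n > x.size()) n = x.size();
    svec res(x.begin(), x.begin() + n);
    res.insert(res.end(), x.end() - n, x.end());
    return res;
}
```

The compiler selects the overload from the static type of the first argument, so callers write ends(x, n) regardless of the element type.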
Although the above program meets our criteria, the code duplication is profound. Being seasoned C++ programmers, we recognize this as a textbook use case for templates and refactor accordingly:
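A sketch of the templated refactoring, under the assumption that the input is a std::vector of some element type T (the clamping of n is again an illustrative choice):

```cpp
#include <cstddef>
#include <vector>

// One implementation for every element type T.
template <typename T>
std::vector<T> ends(const std::vector<T>& x, std::size_t n) {
    if (n > x.size()) n = x.size();
    std::vector<T> res(x.begin(), x.begin() + n);       // first n elements
    res.insert(res.end(), x.end() - n, x.end());        // last n elements
    return res;
}
```

Here T is deduced at each call site, so the compiler generates the per-type versions that previously had to be written by hand.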
This approach is much more maintainable as we have a single implementation
of ends rather than one implementation per typedef. With this in hand, we
now look to make our C++ version of ends callable from R via Rcpp.
Rcpp Implementation (First Attempt)
Many people, myself included, have attempted some variation of the following at one point or another:
Sadly this does not work: magical as Rcpp attributes may be, there are limits to what they can do, and at least for the time being, translating C++ template functions into something compatible with R’s C API is out of the question. Similarly, the first C++ approach from earlier is also not viable, as the C programming language does not support function overloading. In fact, C does not support any flavor of type-safe static polymorphism, meaning that our generic function must be implemented through run-time polymorphism, as touched on in Kevin Ushey’s Gallery article Dynamic Wrapping and Recursion with Rcpp.
Rcpp Implementation (Second Attempt)
Armed with the almighty TYPEOF macro and a SEXPTYPE cheat sheet, we
modify the template code like so:
[1] 1 2 3 4 6 7 8 9
[1] "a" "b" "c" "x" "y" "z"
[1] -1.067824 -0.217975 -0.305963 -0.380471
Warning in ends(list()): Invalid SEXPTYPE 19 (VECSXP).
NULL
Some key remarks:
- Following the ubiquitous Rcpp idiom, we have converted our ends template to use an integer parameter instead of a type parameter. This is a crucial point, and later on, we will exploit it to our benefit.
- The template implementation is wrapped in a namespace in order to avoid a naming conflict; this is a personal preference but not strictly necessary. Alternatively, we could get rid of the namespace and rename either the template function or the exported function (or both).
- We use the opaque type SEXP for our input / output vector since we need a single input / output type. In this particular situation, replacing SEXP with the Rcpp type RObject would also be suitable as it is a generic class capable of representing any SEXP type.
- Since we have used an opaque type for our input vector, we must cast it to the appropriate Rcpp::Vector type accordingly within each case label. (For further reference, the list of vector aliases can be found here). Finally, we could dress each return value in Rcpp::wrap to convert the Rcpp::Vector to a SEXP, but it isn’t necessary because Rcpp attributes will do this automatically (if possible).
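The dispatch pattern described in these remarks can be sketched without R or Rcpp. The following stand-alone analogue is an illustration only: Tag, Elem, Vec, Object, wrap_vec, and as_vec are hypothetical stand-ins for SEXPTYPE codes, Rcpp::Vector, SEXP, and the conversions between them, and none of them belongs to a real API.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical type codes (stand-ins for SEXPTYPE values).
enum Tag { INT_TAG, STR_TAG };

// Element type per code, and a vector class templated on the integer code
// (playing the role of Rcpp::Vector<RTYPE>).
template <int TAG> struct Elem;
template <> struct Elem<INT_TAG> { typedef int type; };
template <> struct Elem<STR_TAG> { typedef std::string type; };

template <int TAG>
struct Vec {
    std::vector<typename Elem<TAG>::type> data;
};

// Opaque object (playing the role of SEXP): a run-time tag plus erased storage.
struct Object {
    int tag;
    std::shared_ptr<void> ptr;
};

template <int TAG>
Object wrap_vec(const Vec<TAG>& v) {
    Object out = { TAG, std::make_shared<Vec<TAG> >(v) };
    return out;
}

// Unchecked cast: nothing stops a caller from asking for the wrong type.
template <int TAG>
const Vec<TAG>& as_vec(const Object& x) {
    return *static_cast<const Vec<TAG>*>(x.ptr.get());
}

namespace impl {
// Templated on an integer parameter, deduced from the typed argument.
template <int TAG>
Vec<TAG> ends(const Vec<TAG>& x, std::size_t n) {
    if (n > x.data.size()) n = x.data.size();
    Vec<TAG> res;
    res.data.assign(x.data.begin(), x.data.begin() + n);
    res.data.insert(res.data.end(), x.data.end() - n, x.data.end());
    return res;
}
} // namespace impl

// Single entry point: inspect the tag at run time, then pick the instantiation.
Object ends(const Object& x, std::size_t n) {
    switch (x.tag) {
        case INT_TAG: return wrap_vec(impl::ends(as_vec<INT_TAG>(x), n));
        case STR_TAG: return wrap_vec(impl::ends(as_vec<STR_TAG>(x), n));
        default: throw std::invalid_argument("unsupported type tag");
    }
}
```

As in the Rcpp version, the compiler cannot verify that each case label is paired with the matching cast; that correspondence is maintained by hand.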
At this point we have a polymorphic function, written in C++, and callable from
R. But that switch statement sure is an eyesore, and it will need to be
implemented every time we wish to export a generic function to R. Aesthetics
aside, a more pressing concern is that boilerplate such as this increases the
likelihood of introducing bugs into our codebase – and since we are leveraging
run-time dispatch, these bugs will not be caught by the compiler. For example,
there is nothing to prevent this from compiling:
    // ...
    case INTSXP: {
        return impl::ends(as<CharacterVector>(x), n);
    }
    // ...
In our particular case, such mistakes likely would not be too disastrous, but it should not be difficult to see how situations like this can put you (or a user of your library!) on the fast track to segfault.
Obligatory Remark on Macro Safety
The C preprocessor is undeniably one of the more controversial aspects of the C++ programming language, as its utility as a metaprogramming tool is rivaled only by its potential for abuse. A proper discussion of the various pitfalls associated with C-style macros is well beyond the scope of this article, so the reader is encouraged to explore this topic on their own. On the bright side, the particular macros that we will be discussing are sufficiently complex and limited in scope that misuse is much more likely to result in a compiler error than a silent bug, so practically speaking, one can expect a fair bit of return for relatively little risk.
Synopsis
At a high level, we summarize the RCPP_RETURN macros as follows:
- There are two separate macros for dealing with vectors and matrices, RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX, respectively.
- In either case, code is generated for the following SEXPTYPEs:
  - INTSXP (integers)
  - REALSXP (numerics)
  - RAWSXP (raw bits)
  - LGLSXP (logicals)
  - CPLXSXP (complex numbers)
  - STRSXP (characters / strings)
  - VECSXP (lists)
  - EXPRSXP (expressions)
- In C++98 mode, each macro accepts two arguments:
  - A template function
  - A SEXP object
- In C++11 mode (or higher), each macro additionally accepts zero or more arguments which are forwarded to the template function.
Finally, the template function must meet the following criteria:
- It is templated on a single, integer parameter.
- In the C++98 case, it accepts a single SEXP (or something convertible to SEXP) argument.
- In the C++11 case, it may accept more than one argument, but the first argument is subject to the previous constraint.
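To make the two-argument form concrete, here is a stand-alone sketch of a macro that generates the switch statement. DISPATCH_RETURN and DISPATCH_CASE are hypothetical names and this is not Rcpp's definition; Tag, Elem, Vec, Object, and as_vec are likewise illustrative stand-ins for the R and Rcpp types.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

enum Tag { INT_TAG, REAL_TAG, STR_TAG };  // hypothetical type codes

template <int TAG> struct Elem;
template <> struct Elem<INT_TAG>  { typedef int type; };
template <> struct Elem<REAL_TAG> { typedef double type; };
template <> struct Elem<STR_TAG>  { typedef std::string type; };

// Vector class templated on the integer code.
template <int TAG>
struct Vec { std::vector<typename Elem<TAG>::type> data; };

// Opaque object: run-time tag plus erased storage.
struct Object { int tag; std::shared_ptr<void> ptr; };

template <int TAG>
const Vec<TAG>& as_vec(const Object& x) {
    return *static_cast<const Vec<TAG>*>(x.ptr.get());
}

// One case label: cast the object to the typed vector and call FUN on it.
#define DISPATCH_CASE(CODE, FUN, OBJ) \
    case CODE: return FUN(as_vec<CODE>(OBJ));

// Hypothetical two-argument macro: a template function and an object.
#define DISPATCH_RETURN(FUN, OBJ)                                      \
    switch ((OBJ).tag) {                                               \
        DISPATCH_CASE(INT_TAG, FUN, OBJ)                               \
        DISPATCH_CASE(REAL_TAG, FUN, OBJ)                              \
        DISPATCH_CASE(STR_TAG, FUN, OBJ)                               \
        default: throw std::invalid_argument("unsupported type tag"); \
    }

namespace impl {
// Meets both criteria: one integer template parameter, one argument.
template <int TAG>
int len(const Vec<TAG>& x) { return static_cast<int>(x.data.size()); }
} // namespace impl

int len(const Object& x) {
    DISPATCH_RETURN(impl::len, x)  // no `return`: the macro supplies it
}
```

Because the case labels and casts are generated together from a single list, they cannot drift out of sync the way hand-written cases can.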
Examining our templated impl::ends function from the previous section, we see
that it meets the first requirement, but fails the second, due to its second
parameter n. Before exploring how ends might be adapted to meet the (C++98)
template requirements, it will be helpful to demonstrate correct usage with a few
simple examples.
Fixed Return Type
We consider two situations where our input type is generic, but our output type is fixed:
- Determining the length (number of elements) of a vector, in which an int is always returned.
- Determining the dimensions (number of rows and number of columns) of a matrix, in which a length-two IntegerVector is always returned.
First, our len function:
(Note that we omit the return keyword, as it is part of the macro definition.)
Testing this out on the various supported vector types:
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Similarly, creating a generic function that determines the dimensions of an input matrix is trivial:
And checking this against base::dim,
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
everything seems to be in order.
It’s worth pointing out that, for various reasons, it is possible to pass a
matrix object to an Rcpp function which calls RCPP_RETURN_VECTOR:
[1] 9
[1] 9
Although this is sensible in the case of len – and even saves us from
implementing a matrix-specific version – there may be situations where
this behavior is undesirable. To distinguish between the two object types we
can rely on the API function Rf_isMatrix:
[1] 9
<Rcpp::exception in len2(matrix(1:9, 3)): matrix objects not supported.>
We don’t have to worry about the opposite scenario, as this is already handled within Rcpp library code:
<Rcpp::not_a_matrix in dims(1:5): Not a matrix.>
Generic Return Type
In many cases our return type will correspond to our input type. For example,
exposing the Rcpp sugar function rev is trivial:
[1] 5 4 3 2 1
[[1]]
[1] 5+2i

[[2]]
[1] 4+2i

[[3]]
[1] 3+2i

[[4]]
[1] 2+2i

[[5]]
[1] 1+2i

[1] "edcba"
As a slightly more complex example, suppose we would like to write a function
to sort matrices which preserves the dimensions of the input, since
base::sort falls short of the latter stipulation:
[1] 1 2 3 4 5 6 7 8 9
There are two obstacles we need to overcome:
- The Matrix class does not implement its own sort method. However, since Matrix inherits from Vector, we can sort the matrix as a Vector and construct the result from this sorted data with the appropriate dimensions.
- As noted previously, the RCPP_RETURN macros will generate code to handle exactly 8 SEXPTYPEs; no less, no more. Some functions, like Vector::sort, are not implemented for all eight of these types, so in order to avoid a compilation error, we need to add template specializations.
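The second obstacle can be illustrated in isolation. In this stand-alone sketch (all names hypothetical, no Rcpp involved), the generic template relies on operator<, which the stand-in list element type lacks; the explicit specialization replaces the body that would otherwise fail to compile with a run-time error.

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>

// Hypothetical type codes; LIST_TAG stands in for VECSXP.
enum Tag { INT_TAG, LIST_TAG };

// Stand-in for a list element: deliberately has no operator<.
struct Opaque { int id; };

template <int TAG> struct Storage;
template <> struct Storage<INT_TAG>  { typedef std::vector<int> type; };
template <> struct Storage<LIST_TAG> { typedef std::vector<Opaque> type; };

// Generic implementation: valid for any code whose elements are comparable.
template <int TAG>
typename Storage<TAG>::type sorted(typename Storage<TAG>::type v) {
    std::sort(v.begin(), v.end());
    return v;
}

// Explicit specialization: without it, instantiating sorted<LIST_TAG>
// would be a compile-time error inside std::sort.
template <>
Storage<LIST_TAG>::type sorted<LIST_TAG>(Storage<LIST_TAG>::type) {
    throw std::invalid_argument("sort not allowed for lists.");
}

// Names every instantiation, as a generated switch over all codes would.
bool can_sort(int tag) {
    try {
        switch (tag) {
            case INT_TAG:  sorted<INT_TAG>(Storage<INT_TAG>::type());   return true;
            case LIST_TAG: sorted<LIST_TAG>(Storage<LIST_TAG>::type()); return true;
            default: return false;
        }
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

The specialization trades a compile-time failure for a run-time one, which is the appropriate behavior when the type is only known at run time.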
With this in mind, we have the following implementation of msort:
Note that elements will be sorted in column-major order since we filled our
result using this constructor. We can verify that msort
works as intended by checking a few test cases:
     [,1] [,2] [,3]
[1,]    1    7    4
[2,]    3    9    6
[3,]    5    2    8

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

[1] 1 2 3 4 5 6 7 8 9

     [,1] [,2]
[1,] "a"  "y"
[2,] "c"  "b"
[3,] "z"  "x"

     [,1] [,2]
[1,] "a"  "x"
[2,] "b"  "y"
[3,] "c"  "z"

[1] "a" "b" "c" "x" "y" "z"

List of 9
 $ : int 1
 $ : int 2
 $ : int 3
 $ : int 4
 $ : int 5
 $ : int 6
 $ : int 7
 $ : int 8
 $ : int 9
 - attr(*, "dim")= int [1:2] 3 3

<Rcpp::exception in msort(x): sort not allowed for lists.>
<simpleError in sort.int(x, na.last = na.last, decreasing = decreasing, ...): 'x' must be atomic>
Revisiting the ‘ends’ Function
Having familiarized ourselves with basic usage of the RCPP_RETURN macros, we
can return to the problem of implementing our ends function with
RCPP_RETURN_VECTOR. Just to recap the situation, the template function
passed to the macro must meet the following two criteria in C++98 mode:
- It is templated on a single, integer parameter (representing the Vector type).
- It accepts a single SEXP (or convertible to SEXP) argument.
Currently ends has the signature
meaning that the first criterion is met, but the second is not. In order to
preserve the functionality provided by the int parameter, we effectively
need to generate a new template function which has access to the user-provided
value at run-time, but without passing it as a function parameter.
The technique we are looking for is called partial function application, and it can be implemented
using one of my favorite C++ tools: the functor. Contrary to typical functor
usage, however, our implementation features a slight twist: rather than
using a template class with a non-template function call operator, as is the
case with std::greater, etc., we are
going to make operator() a template itself:
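A minimal stand-alone sketch of such a functor, using std::vector in place of the Rcpp vector classes (an assumption made so that the example compiles without R); call_unary is a hypothetical caller that, like the C++98 macros, can only pass a single argument:

```cpp
#include <cstddef>
#include <vector>

class Ends {
private:
    std::size_t n;  // bound once, at construction

public:
    explicit Ends(std::size_t n) : n(n) {}

    // The call operator itself is the template, so a single Ends object
    // can be applied to vectors of any element type.
    template <typename T>
    std::vector<T> operator()(const std::vector<T>& x) const {
        const std::size_t k = n < x.size() ? n : x.size();
        std::vector<T> res(x.begin(), x.begin() + k);
        res.insert(res.end(), x.end() - k, x.end());
        return res;
    }
};

// Hypothetical caller that only knows how to invoke a unary callable.
template <typename F, typename T>
std::vector<T> call_unary(F f, const std::vector<T>& x) {
    return f(x);
}
```

Binding n in the constructor turns the two-argument operation into a one-argument callable, which is all that partial application amounts to here.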
Not bad, right? All in all, the changes are fairly minor:
- The function body of Ends::operator() is identical to that of impl::ends.
- n is now a private data member rather than a function parameter, which gets initialized in the constructor.
- Instead of passing a free-standing template function to RCPP_RETURN_VECTOR, we pass the expression Ends(n), where n is supplied at run-time from the R session. In turn, the macro will invoke Ends::operator() on the SEXP (RObject, in our case), using the specified n value.
We can demonstrate this on various test cases:
[1] 1 2 3 4 6 7 8 9
[1] "a" "b" "c" "x" "y" "z"
[1] -0.694707 -0.207917 0.123854 0.215942
A Modern Alternative
As alluded to earlier, a more modern compiler (supporting C++11 or later)
will free us from the “single SEXP argument” restriction, which means
that we no longer have to move additional parameters into a function
object. Here is ends re-implemented using the C++11 version of
RCPP_RETURN_VECTOR (note the // [[Rcpp::plugins(cpp11)]]
attribute declaration):
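How a variadic macro can forward extra arguments may be easier to see in a stand-alone sketch. The macro below is hypothetical and is not Rcpp's definition: it places the object and any additional arguments together in __VA_ARGS__, reads the tag from the first of them, and passes the whole list through to an explicitly instantiated template. All type names are illustrative stand-ins.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

enum Tag { INT_TAG, STR_TAG };  // hypothetical type codes

template <int TAG> struct Storage;
template <> struct Storage<INT_TAG> { typedef std::vector<int> type; };
template <> struct Storage<STR_TAG> { typedef std::vector<std::string> type; };

// Opaque object: run-time tag plus erased storage.
struct Object { int tag; std::shared_ptr<void> ptr; };

// Returns the tag of the first argument and ignores the rest.
template <typename... Rest>
int first_tag(const Object& x, const Rest&...) { return x.tag; }

// Hypothetical variadic macro: everything after FUN is forwarded as-is.
#define DISPATCH_RETURN(FUN, ...)                                      \
    switch (first_tag(__VA_ARGS__)) {                                  \
        case INT_TAG: return FUN<INT_TAG>(__VA_ARGS__);                \
        case STR_TAG: return FUN<STR_TAG>(__VA_ARGS__);                \
        default: throw std::invalid_argument("unsupported type tag"); \
    }

namespace impl {
inline std::string to_text(int v) { return std::to_string(v); }
inline std::string to_text(const std::string& v) { return v; }

// Zero extra arguments.
template <int TAG>
std::size_t len(const Object& x) {
    typedef typename Storage<TAG>::type vec_t;
    return static_cast<const vec_t*>(x.ptr.get())->size();
}

// One extra argument: `i` is an ordinary function parameter.
template <int TAG>
std::string element(const Object& x, std::size_t i) {
    typedef typename Storage<TAG>::type vec_t;
    return to_text(static_cast<const vec_t*>(x.ptr.get())->at(i));
}
} // namespace impl

std::size_t len(const Object& x) {
    DISPATCH_RETURN(impl::len, x)
}

std::string element(const Object& x, std::size_t i) {
    DISPATCH_RETURN(impl::element, x, i)
}
```

With the extra parameters travelling through the macro, no function object is needed to carry them.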
[1] 1 2 3 4 6 7 8 9
[1] "a" "b" "c" "x" "y" "z"
[1] 0.379639 -0.502323 0.181303 -0.138891
The current definition of RCPP_RETURN_VECTOR and RCPP_RETURN_MATRIX allows for up
to 24 arguments to be passed, although in principle the true upper bound
depends on your compiler’s implementation of the __VA_ARGS__ macro, which
is likely greater than 24. Having said this, if you find yourself trying
to pass around more than 3 or 4 parameters at once, it’s probably time
to do some refactoring.