For more and more clients we have been using a nice coding pattern taught to us by Garrett Grolemund in his book Hands-On Programming with R: make a function that returns a list of functions. This turns out to be a classic functional programming technique: using closures to implement objects (terminology we will explain).
It is a pattern we strongly recommend, but with one caveat: it can leak references in a manner similar to the one we described here. Once you work out how to stomp out the reference leaks, the “function that returns a list of functions” pattern is really strong.
We will discuss this programming pattern and how to use it effectively.
Object oriented R
Simulating objects with the “function returning list of functions” pattern
In Hands-On Programming with R Garrett Grolemund recommends a programming pattern of building a function that returns a list of functions. This is a pretty powerful pattern that uses closures to make a convenient object oriented programming style available to the R user.
At first this might seem unnecessary: R claims to already have many object oriented systems: S3, S4, and RC. But none of these conveniently present object oriented behavior as a programmer might expect from more classic object oriented languages (C++, Java, Python, Smalltalk, Simula …).
What are “objects”?
Like it or not, object oriented programming is a programming style centered around sending messages to mutable objects. Roughly, in object oriented programming you expect the following. There are data items (called objects, best thought of as “nouns”) that carry type information, a number of values (fields, like a structure), and methods or functions (which are sometimes thought of as verbs or messages). We expect objects to implement the following:
- polymorphism: The same method or function call may have different implementations depending on the runtime type of one or more of its arguments. This allows important separation of concerns and generic composition. Users of an object claiming to model a 2d region that has an area() method don’t need to know if they are dealing with a square or a circle, and therefore can be made to work over both types of shapes (see the short S3 sketch after this list).
- encapsulation: fields can be hidden from casual outside observers. This allows changes of implementation, as well behaved outside code can restrict its interactions to working with only publicly exposed methods and fields.
- mutability: It is expected that some functions/methods are “verbs” or “messages” that cause fields in the object to change value. Immutable values are very popular in functional programming, and there certainly are such things as immutable objects. But the orientation of object oriented programming has historically been objects that change state in response to messages (such as: “increment customer count”).
- inheritance: objects can easily delegate parts of their implementation and declared method interfaces to other objects.
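As an aside on the polymorphism point above, here is a minimal sketch of the area() idea using R’s built-in S3 dispatch (the class names square and circle and their constructors are our own, purely for illustration):

area <- function(shape) UseMethod("area")   # generic: dispatches on class

# two "classes", each just a tagged list
square <- function(side) structure(list(side=side), class="square")
circle <- function(radius) structure(list(radius=radius), class="circle")

# per-class implementations
area.square <- function(shape) shape$side^2
area.circle <- function(shape) pi*shape$radius^2

# caller code works over both shapes without knowing which it has
sapply(list(square(2), circle(1)), area)
## [1] 4.000000 3.141593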
Standard R objects
None of the common object systems in R conveniently offers the majority of these behaviors. The issues are:
- S3: polymorphism is a name-lookup hack associated more with methods than with objects, there is no encapsulation, fields are immutable (as almost all R structures are), and while objects can declare more than one class there is no real inheritance.
- S4: considered an unreliable and expensive attempt to model C++’s object system. Not recommended by many R experts and style guides. For example, from the Google R style guide: “avoid S4 objects and methods when possible; never mix S3 and S4”.
- RC: a reference object system. It is so different from expectations in the rest of R that it should not be used unless you have a specific need for it.
Immutability
One thing that might surprise some readers (even those familiar with R) is that we said almost all R objects are immutable. At first glance this doesn’t seem to be the case; consider the following:
a <- list()
print(a)
## list()
a$b <- 1
print(a)
## $b
## [1] 1
The list “a” sure seemed to change. In fact it did not; this is an illusion foisted on you by R through some clever variable re-binding. Let’s look at that code more closely:
library('pryr')
a <- list()
print(address(a))
## [1] "0x1059c5dc0"
a$b <- 1
print(address(a))
## [1] "0x105230668"
R simulated a mutation or change on the object “a” by re-binding a new value (the list with the extra element) to the symbol “a” in the environment we were executing in. We see this by the address change: the name “a” is no longer referring to the same value. “Environment” is a computer science term meaning a structure that binds variable names to values. R is very unusual in that most R values are immutable while R environments are mutable (which value a variable refers to can get changed out from under you). At first glance R appears to be adding an item to our list “a”, but in fact what it is doing is changing the variable name “a” to refer to an entirely new list that has one more element.
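To see that environments really are mutable shared references (unlike lists), here is a small snippet of our own:

e1 <- new.env()
e2 <- e1                    # e2 and e1 name the same environment
assign('x', 5, envir=e2)    # mutate through one name
get('x', envir=e1)          # the change is visible through the other
## [1] 5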
This is why we say S3 objects are in fact immutable even when they appear to accept changes. The issue is that if you attempt to change an S3 object only the one reference in your current environment will see the change; any other references bound to the original value keep their binding to the original value and do not see any update. For the most part this is good. It prevents a whole slew of “oops I only wanted to update my copy during calculation but clobbered everybody else’s value” bugs. But it also means you can’t easily use S3 objects to share changing state among different parts of a program.
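For example (a small snippet of our own), a second name bound to the same list does not see the “change”:

a <- list(x=1)
b <- a        # b is bound to the same value as a
a$x <- 2      # re-binds a to a new list with x changed
b$x           # b still refers to the original, unchanged value
## [1] 1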
Closures: “poor man’s objects”
There are some cases where you do want shared changing state. Garrett uses a nice example of drawing cards; we will use a simple example of assigning sequential IDs. Consider the following code:
idSource <- function() {
  nextId <- 1
  list(nextID=function() {
    r <- nextId
    nextId <<- nextId + 1
    r
  })
}

source <- idSource()
source$nextID()
## [1] 1
source$nextID()
## [1] 2
The idea is the following: in R a fresh environment (that is, a structure binding variable names to values) is created during each function evaluation. Any function created while evaluating our outer function has access to all variables in this environment (this environment is what is called a closure). So any names that appear free in the inner function (that is, variable names that don’t have a definition in the inner function) end up referring to variables in this new environment (or one of its parents if there is no name match). Since environments are mutable, re-binding values in this secret environment gives us mutable slots. The first gotcha is the need to use <<- or assign() to effect changes in the secret environment.
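To illustrate the gotcha (a small sketch of our own): a plain <- inside the inner function creates a new local binding instead of updating the counter, so the ID never advances:

brokenIdSource <- function() {
  nextId <- 1
  list(nextID=function() {
    r <- nextId
    nextId <- nextId + 1   # local assignment; the outer nextId is untouched
    r
  })
}

broken <- brokenIdSource()
broken$nextID()
## [1] 1
broken$nextID()
## [1] 1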
This behaves a lot more like what a Java or Python programmer would expect from an object, and it is fully idiomatic R. So if you want object-like behavior this is a tempting way to get it.
Encapsulation and inheritance
So we have shared mutable state and polymorphism; what about encapsulation and inheritance?
Essentially we do have encapsulation: you can’t find the data fields unless you deliberately poke around in the function’s environment. The data fields are not obvious list elements, so we can consider them private.
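That said, a determined user can still reach in. Continuing the idSource() example above (our own snippet):

ls(envir=environment(source$nextID))
## [1] "nextId"
get('nextId', envir=environment(source$nextID))
## [1] 3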
Inheritance is a bit weaker. At best we can get what is called prototype inheritance: when we create our list of functions we start with a list of default functions (the prototype) and pass through any of them whose names are not overridden by our new functions (see the short sketch below).
This is only “safety by convention” (so a different breed of object orientedness than Java, but similar to Python and Javascript where you can examine raw fields easily).
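Here is a minimal sketch of the prototype idea (the names defaultShape and makeSquare and the use of modifyList() are our own, not from the book):

# the "prototype": a list of default methods
defaultShape <- function() {
  list(describe=function() "a shape",
       area=function() NA)
}

# a derived object: start from the prototype, override only what changes
makeSquare <- function(side) {
  obj <- defaultShape()
  modifyList(obj, list(area=function() side^2))
}

sq <- makeSquare(3)
sq$describe()   # inherited from the prototype
## [1] "a shape"
sq$area()       # overridden
## [1] 9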
Problems with R closures
There is one lingering problem with using R environments as closures: they can leak references, causing unwanted memory bloat. The reason is that, as with so many things in R, the implementation of closures is explicitly exposed to the user. This means we can’t say “a closure is the binding of free variables at the time a function was defined” (the more common usage of static or lexical closure), but instead “R functions simulate a closure by keeping an explicit reference to the environment that was active when the function was defined.” This allows weird code like the following:
f <- function() {
  print(x)
}
x <- 5
f()
## [1] 5
In many languages the inability to bind the name “x” to a value at the time of function definition would be caught as an error. With R there is no error as long as some parent of the function’s definition environment eventually binds some value to the name “x”.
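For contrast, a small sketch of our own showing that if no binding is ever supplied the failure only shows up at call time, not at definition time:

g <- function() {
  print(y)    # y is free and never defined anywhere
}
# defining g is fine; calling it is not
g()
## Error in print(y) : object 'y' not found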
But the real problem is that R keeps the whole environment around, including bits the interior function is not using. Consider the following code snippet:
library('biglm')
d <- data.frame(x=runif(100000))
d$y <- d$x>=runif(nrow(d))
formula <- 'y~x'

fitter <- function(formula,d) {
  model <- bigglm(as.formula(formula), d,
                  family=binomial(link='logit'))
  list(predict=function(newd) {
    predict(model, newdata=newd, type='response')[,1]
  })
}

model <- fitter(formula,d)
print(head(model$predict(d)))
What we have done is use biglm to build a logistic regression model. We are using the “function that returns a list of functions” pattern to build a new predict() method that remembers to set the all-important type='response' argument and uses the [,1] operator to convert biglm‘s matrix return type into the more standard numeric vector return type. I.e. we are using these function wrappers to hide many of the quirks of the particular fitter (needing a family argument during fit, needing a type argument during predict, and returning a matrix instead of a vector) without having to bring in a training control package (such as caret; caret is a good package, but you should know how to implement similar effects yourself).
The hidden problem is the following: the closure or environment of the model captures the training data causing this training data to be retained (possibly wasting a lot of memory). We can see that with the following code:
ls(envir=environment(model$predict))
## [1] "d"       "formula" "model"
This can be a big problem. A generalized linear model such as this logistic regression should really only cost storage proportional to the number of variables (in this case 1!). There is no reason to hold on to the entire data set after fitting. The leaked storage may not be obvious in all cases, as the standard R size functions don’t report space used in sub-environments, and the “use serialization to guess size” trick (length(serialize(model, NULL))) doesn’t report the size of any objects in the global environment (so we won’t see the leak in this case where we ran fitter() in the global environment, but we would see it if we had run fitter() inside a function). As we see below, the model object is large.
sizeTest1 <- function() {
  model <- fitter(formula,d)
  length(serialize(model, NULL))
}

sizeTest1()
## [1] 1227648
This is what we call a “reference leak.” R doesn’t tend to have memory leaks (it has a good garbage collector). But if you are holding a reference to an object you don’t need (and you may not even know you are holding the reference!) you have a loss of memory that feels just like a leak.
The fix is to build a new restricted environment that has only what you need. Here is the code:
#' Build a new function with a smaller environment.
#' @param f input function
#' @param varList names of the variables we are allowing to be captured in the closure
#' @return new function with closure restricted to varList
#' @export
restrictEnvironment <- function(f,varList) {
  oldEnv <- environment(f)
  newEnv <- new.env(parent=parent.env(oldEnv))
  for(v in varList) {
    assign(v,get(v,envir=oldEnv),envir=newEnv)
  }
  environment(f) <- newEnv
  f
}

fitter <- function(formula,d) {
  model <- bigglm(as.formula(formula), d,
                  family=binomial(link='logit'))
  model$family$variance <- c()
  model$family$dev.resids <- c()
  model$family$aic <- c()
  model$family$mu.eta <- c()
  model$family$initialize <- c()
  model$family$validmu <- c()
  model$family$valideta <- c()
  model$family$simulate <- c()
  environment(model$terms) <- new.env(parent=parent.env(environment()))
  list(predict=
    restrictEnvironment(function(newd) {
      predict(model, newdata=newd, type='response')[,1]
    },
    'model'))
}
The bulk of this code is us stripping large components out of the bigglm model. We have confirmed the model can still predict after this, though the summary functions are going to be broken. A lot of what we took out of the model are functions carrying environments that hold a sneak reference to our data. We are not carrying multiple copies of the data, but we are carrying multiple references, which would keep the data alive longer than we want. The part we actually want to demonstrate is the following wrapper:
restrictEnvironment(function(newd) {
  predict(model, newdata=newd, type='response')[,1]
},
'model')
What restrictEnvironment does is replace the function’s captured environment with a new one containing only the variables we listed. In this case we listed only “model”, as this is the only variable we actually want to retain a reference to. The neatening procedure is actually easy (except for when we have to clean items out of other people’s structures, as we had to here). The one pain is that since R doesn’t give you a list of the structures you need to retain (i.e. the list of unbound variable names in the inner function), you have to maintain this list by hand (which can get difficult if there are many items; if you have to list ten of them you have almost certainly forgotten one).
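One partial aid we know of (not used in the fix above): codetools::findGlobals() reports the free names a function references, which can help you draft the variable list to hand to restrictEnvironment(). A small sketch (predictFn is our own illustrative name):

library('codetools')
predictFn <- function(newd) {
  predict(model, newdata=newd, type='response')[,1]
}
findGlobals(predictFn, merge=FALSE)$variables
## [1] "model"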
Trying to remember which objects to allow in the captured closure environment. (Steve Martin “The Jerk” 1979, copyright the producers.)
Related posts:
- R examine objects tutorial
- You don’t need to understand pointers to program using R
- Trimming the Fat from glm() Models in R