I’ve had some discussions online and in the real world about this blog post and I’d like to restate why containerization is needed for reproducibility, and do so from the lens of functional programming.
When setting up a pipeline, whether you’re a functional programming enthusiast or not, you’re aiming to set it up so that the pipeline is the composition of (potentially) many referentially transparent and pure functions.
As a reminder:
- referentially transparent functions are functions that always return the same output for the same given input. So for example f(x, y) := x + y is referentially transparent, but h(x) := x + y is not. Because y is not an input of h, h will look for y in the global environment. Depending on the value of y, h(1) might equal 10 one day, but 100 the next. Let’s say that f(1, 10) is always equal to 11. Because this is true, you could replace f(1, 10) everywhere it appears with 11. But consider the following example of a function that is not referentially transparent, rnorm(). Try rnorm(1) several times… It will always give a different result! This is because rnorm() looks for the seed in the global environment and uses that to generate a random number.
- pure functions are functions without side effects. So a function just does its thing, and does not interact with anything else; doesn’t change anything in the global environment, doesn’t print anything on screen, doesn’t write anything to disk. Basically, pure functions are functions that do nothing else but computing stuff. Now this may seem limiting, and to some extent it is, so we will need to relax this a bit: we’ll be ok with functions that output stuff, but only the very last function in the pipeline will be allowed to do it.
To be pure, a function needs to be referentially transparent.
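To make the reminder concrete, here is a small sketch in R of the functions discussed above. The values assigned to y and the seed 123 are arbitrary choices made for the illustration.

```r
# Referentially transparent: the result depends only on the arguments.
f <- function(x, y) x + y

# Not referentially transparent: y is looked up outside the function.
h <- function(x) x + y

y <- 9
h_first <- h(1)    # uses y = 9
y <- 99
h_second <- h(1)   # same call, different result because y changed

f_first <- f(1, 10)
f_second <- f(1, 10)

# rnorm() reads (and updates) the hidden random number generator state,
# so repeated calls differ unless that state is reset with set.seed().
set.seed(123)      # arbitrary seed, chosen for the example
r_first <- rnorm(1)
r_second <- rnorm(1)
set.seed(123)
r_reset <- rnorm(1)
```

Here h(1) changes when y changes, f(1, 10) never does, and rnorm(1) only repeats itself once the seed, its hidden input, is fixed again.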
Ok so now that we know what referentially transparent and pure functions are, let’s explain
why we want a pipeline to be a composition of such functions.
Function composition is an operation that takes two functions g and f and returns a new function h such that h(x) = g(f(x)). Formally:
h = g ∘ f such that h(x) = g(f(x))
∘ is the composition operator. You can read g ∘ f as g after f. In R, you can get the same effect very easily with the pipe, |> or %>%:
h <- function(x) x |> f() |> g()
Schematically, f |> g can be read as f then g, which is equivalent to g after f (ok, using |> is chaining rather than composing functions, but the net effect is the same).
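As a minimal sketch of the difference between composing and chaining: compose() below is a hypothetical helper written for this example (it is not part of base R), and f and g are placeholder functions.

```r
# compose() is a hypothetical helper: it returns the function "g after f".
compose <- function(g, f) function(x) g(f(x))

# Placeholder functions for the illustration.
f <- function(x) x + 1
g <- function(x) x * 2

# Composition: build a new function from g and f.
h_composed <- compose(g, f)

# Chaining: pipe a value through f, then g (native pipe, R >= 4.1).
h_piped <- function(x) x |> f() |> g()

h_composed(3)
h_piped(3)
```

Both versions compute g(f(x)); the pipe just spells the steps out in the order they run.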
So h would be our complete pipeline, which would be the composition, or chaining, of as many functions as needed:
h <- a |> b |> c |> d ... |> z
If all the functions are pure (and referentially transparent) then we’re assured that h will always produce the same outputs for the same inputs. As stated above, z will be allowed to not be pure and actually output something (like a rendered Quarto document) to disk. Ok so that’s great and all, but why does the title of this blog post say that containerization is needed?
The problem is that all the functions we use have “hidden” inputs, and are never truly referentially transparent. These inputs are the following:
- Version of R (or whatever programming language you’re using)
- Versions of the packages you’re using
- Operating system and its version (and all the different operating system dependencies that get used at run- or compile time)
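These hidden inputs can at least be inspected from within a running session. The sketch below collects them in a list; the name hidden_inputs and the two packages queried are placeholders picked for the example, and base R’s sessionInfo() reports similar information in one call.

```r
# Collect the "hidden inputs" of the current session in one list.
hidden_inputs <- list(
  r_version  = as.character(getRversion()),
  os         = Sys.info()[["sysname"]],
  os_release = Sys.info()[["release"]],
  packages   = vapply(
    c("stats", "utils"),  # placeholder package names
    function(pkg) as.character(packageVersion(pkg)),
    character(1)
  )
)

str(hidden_inputs)
```

Recording these values does not pin them, though: it only tells you what the pipeline happened to run with.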
For example, let’s take a look at this function:
f <- function(x){ if (c(TRUE, FALSE)) x }
which will return the following on R 4.1 (which was released in May 2021):
f(1)
[1] 1
Warning message:
In if (c(TRUE, FALSE)) 1 :
  the condition has length > 1 and only the first element will be used
So a result 1 and a warning. On R 4.2.2 (the current version as of writing), the exact same call returns:
Error in if (c(TRUE, FALSE)) 1 : the condition has length > 1
These types of breaking changes are rare in R, at least to my knowledge (I’m actually looking into
this in greater detail, 2023 will likely be the year I show my findings), but in this case it
illustrates my point: code that was behaving in a certain way started behaving in another way, even
though nothing changed. What changed was the version of R, even though the function itself was pure.
This wouldn’t be so surprising if instead of f(x), the function was something like f(x, r_version). In this case, the calls above would be something like:
f(1, r_version = "4.1")
and this would always return:
[1] 1
Warning message:
In if (c(TRUE, FALSE)) 1 :
  the condition has length > 1 and only the first element will be used
but changing the call to this:
f(1, r_version = "4.2.2")
would return the error:
Error in if (c(TRUE, FALSE)) 1 : the condition has length > 1
regardless of the version of R we’re running, so our function would be referentially transparent.
Alas, this is not possible, at least not like this.
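The closest you can get from inside R itself is probably a guard that refuses to run under the wrong version. The sketch below is only an illustration: f_checked() is a hypothetical wrapper, and it cannot switch R versions, it can only fail loudly when the running version differs from the one requested.

```r
# f_checked() is a hypothetical wrapper: it makes the R version an explicit
# argument, but can only check it, not change it.
f_checked <- function(x, r_version) {
  running <- as.character(getRversion())
  if (running != r_version) {
    stop("expected R ", r_version, " but running R ", running)
  }
  x
}

# Passing the running version succeeds.
ok <- f_checked(1, r_version = as.character(getRversion()))

# Any other version is an error, captured here as a message.
bad <- tryCatch(
  f_checked(1, r_version = "0.0.1"),
  error = function(e) conditionMessage(e)
)
```

This documents the dependency on the R version, but it does not reproduce the old behaviour, which is why something outside of R has to provide the right environment.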
This is why tools like Docker, Podman (a Docker alternative) or Guix (which I learned about recently but have not used yet; as far as I know it is not a containerization solution, but a solution actually based on functional programming) are crucial to ensure that your pipeline is truly reproducible. Basically, using Docker you make the hidden inputs defined before (versions of tools and OS) explicit. Take a look at this Dockerfile:
FROM rocker/r-ver:4.1.0
RUN R -e "f <- function(x){if (c(TRUE, FALSE)) x};f(1)"
CMD ["R"]
here’s what happens when you build it:
➤ docker build -t my_pipeline .
Sending build context to Docker daemon 2.048kB
Step 1/3 : FROM rocker/r-ver:4.1.0
4.1.0: Pulling from rocker/r-ver
eaead16dc43b: Already exists
35eac095fa03: Pulling fs layer
c0088a79f8ab: Pulling fs layer
28e8d0ade0c0: Pulling fs layer
Digest: sha256:860c56970de1d37e9c376ca390617d50a127b58c56fbb807152c2e976ce02002
Status: Downloaded newer image for rocker/r-ver:4.1.0
 ---> d83268fb6cda
Step 2/3 : RUN R -e "f <- function(x){if (c(TRUE, FALSE)) x};f(1)"
 ---> Running in a158e4ab474f

R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> f <- function(x){if (c(TRUE, FALSE)) x};f(1)
[1] 1
Warning message:
In if (c(TRUE, FALSE)) x :>
>
  the condition has length > 1 and only the first element will be used
Removing intermediate container a158e4ab474f
 ---> 49e2eb20a535
Step 3/3 : CMD ["R"]
 ---> Running in ccda657c4d95
Removing intermediate container ccda657c4d95
 ---> 5a432adbe6ff
Successfully built 5a432adbe6ff
Successfully tagged my_package:latest
As you can read above, this starts the container with R version 4.1.0 and runs the code in it. We get back our result with the warning (it should be noted that in practice, you would structure your Dockerfile differently for running an actual pipeline).
This Dockerfile starts by using rocker/r-ver:4.1.0 as a basis. You can find this image in the versioned repository from the Rocker Project. This base image starts off from Ubuntu Focal Fossa (so Ubuntu version 20.04), uses R version 4.1.0 and even uses a frozen CRAN repository as of 2021-08-09. It then runs our pipeline (or in this case, our simple function) in this fixed environment. Our function essentially became f(x, os_version, r_version, packages_version) instead of just f(x). By changing the first statement of the Dockerfile:
FROM rocker/r-ver:4.1.0
to this:
FROM rocker/r-ver:3.5.0
we can even do some archaeology and run the pipeline on R version 3.5.0! This has great potential and hopefully one day Docker or a similar solution will become just another tool in the scientist’s or analyst’s toolbox.
If you want to start using Docker for your projects, I’ve written this tutorial and even a whole ebook.