
Investigating Docker and R


This post is regularly updated (cf. GH issue) and available under the URL http://bit.ly/docker-r. Last update: 11 Jan 2018.

Docker and R: How are they used and could they be used together? That is a question we regularly ask ourselves. And we try to keep up with other people’s work! In this post, we are going to share our insights with you.

Thanks to Ben Marwick for contributing to this post! You know about a project using Docker and R? Get in touch.

Dockerising R

Several implementations of R besides the one by R Core exist today, together with numerous integrations into open source and proprietary software (cf. English and German Wikipedia pages). In the following, we present the existing efforts for using open source R implementations with Docker.

Rocker

The most prominent effort in this area is the Rocker project (http://rocker-project.org/). It was initiated by Dirk Eddelbuettel and Carl Boettiger and containerises the main R implementation based on Debian. For an introduction, you may read their blog post here or follow this tutorial from rOpenSci.

With a big choice of pre-built Docker images, Rocker provides optimal solutions for those who want to run R from Docker containers. Explore it on GitHub or Docker Hub, and soon you will find out that it takes just a single command to run instances of either base R, R-devel or RStudio Server. Moreover, you can run specific versions of R or use one of the many bundles with commonly used R packages and other software, namely tidyverse and rOpenSci.
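To give you an idea, here is a minimal sketch of both variants, issued from R with system2() rather than typed in a shell. It assumes Docker is installed and the docker command is on your PATH; image details (such as the password handling of rocker/rstudio) may change over time.

    # A throw-away container from the rocker/r-base image, evaluating one
    # expression inside it; equivalent to running the docker command in a shell.
    system2("docker", c("run", "--rm", "rocker/r-base",
                        "Rscript", "-e", shQuote("mean(1:10)")))

    # RStudio Server from rocker/rstudio, available at http://localhost:8787
    # (the PASSWORD variable sets the login password for the rstudio user).
    system2("docker", c("run", "-d", "-p", "8787:8787",
                        "-e", "PASSWORD=secret", "rocker/rstudio"))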

Images are built monthly on Docker Hub, except for the devel tags, which are built nightly. Automated builds are disabled; instead, builds are triggered by cron jobs running on a third-party server (cf. GitHub comment).

Bioconductor

If you come from bioinformatics or neighboring disciplines, you might be delighted that Bioconductor provides several images based on Rocker’s rocker/rstudio images. See the help page, GitHub, and Open Hub for more information. In short, the Bioconductor core team maintains release and devel images (e.g. bioconductor/release_base2), and contributors maintain images with different levels of pre-installed packages (each in release and devel variants), which are based on Bioconductor views (e.g. bioconductor/devel_proteomics2 installs the views Proteomics and MassSpectrometryData).
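For example, along the same lines as the sketch above (and keeping in mind that these images are large), you can peek into what a view-based image ships with:

    # A sketch, again calling the Docker CLI from R: check how many packages come
    # pre-installed in the proteomics devel image (pulling it downloads several GB).
    system2("docker", c("run", "--rm", "bioconductor/devel_proteomics2",
                        "Rscript", "-e", shQuote("nrow(installed.packages())")))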

Image updates occur with each Bioconductor release, except for the devel images, which are built weekly with the latest versions of R and Bioconductor based on rocker/rstudio-daily.

CentOS-based R containers

Jonathan Lisic works on a collection of Dockerfiles building on CentOS (6 and 7) and other operating systems as an alternative to the Debian-based Rocker stack. The Dockerfiles are on GitHub: https://github.com/jlisic/R-docker-centos

MRO

Microsoft R Open (MRO) is an “enhanced R distribution”, formerly known as Revolution R Open (RRO) before Revolution Analytics was acquired by Microsoft. MRO is compatible with main R and its packages. “It includes additional capabilities for improved performance, reproducibility, and platform support.” (source); most notably these are the MRAN repository a.k.a. CRAN Time Machine, which is also used by versioned Rocker images, and the (optional) integration with Intel® Math Kernel Library (MKL) for multi-threaded performance in linear algebra operations (BLAS and LAPACK).

o2r team member Daniel created a Docker image for MRO including MKL. It is available on Docker Hub as nuest/mro, with Dockerfile on GitHub. It is inspired by the Rocker images and can be used in the same fashion. Please note the extended licenses printed at every startup for MKL.

Jonathan Lisic published a Dockerfile for a CentOS-based MRO on GitHub.

Ali Zaidi published Dockerfiles on GitHub and images on Docker Hub for Microsoft R Client, which is based on MRO.

R Client adds to MRO by including a couple of “ScaleR” machine learning algorithms and packages for parallelisation and remote computing.

Renjin

Renjin is a JVM-based interpreter for the R language for statistical computing, developed by BeDataDriven. It was developed to use existing R code seamlessly for big data analysis in cloud infrastructures, and it allows Java/Scala developers to easily combine R with all the benefits of Java and the JVM.

While it is not primarily built for interactive use on the command line, this is possible. So o2r team member Daniel created a Docker image for Renjin for you to try it out. It is available on Docker Hub as nuest/renjin, with the Dockerfile on GitHub.

pqR

pqR tries to create “a pretty quick version of R” and to fix some perceived issues in the R language. While this is a one-man project by Radford Neal, it is worth trying out such contributions to the open source community and to the discussion on what R should look like in the future (cf. a recent presentation), even if things might get personal. As you might have guessed by now, Daniel created a Docker image for you to try out pqR: it is available on Docker Hub as nuest/pqr, with the Dockerfile on GitHub.

[WIP] FastR

Also targeting performance, FastR “is an implementation of the R Language in Java atop Truffle, a framework for building self-optimizing AST interpreters.” FastR is planned as a drop-in replacement for R, but relevant limitations apply.

While GraalVM has a Docker Hub user, no images are published, probably because of licensing requirements, as can be seen in the GitHub repository oracle/docker-images: users must manually download a GraalVM release, which requires an Oracle account. The current tests are therefore available in this GitHub repository, which tries to build FastR from source on top of the newest OpenJDK Java 9.

Dockerising Research and Development Environments

So why, apart from the incredibly easy usage, adoption and transfer of typical R environments, would you want to combine R with Docker?

Ben Marwick, Associate Professor at the University of Washington, explains in this presentation that it helps you manage dependencies. It gives a computational environment that is isolated from the host, and at the same time transparent, portable, extendable and reusable. Marwick uses Docker and R for reproducible research and thus bundles up his work into a kind of Research Compendium; an instance is available here, and a template here.

Carl Boettiger, Assistant Professor at UC Berkeley, wrote in detail about using Docker for reproducibility in his ACM SIGOPS paper ‘An introduction to Docker for reproducible research, with examples from the R environment’.

Both Ben and Carl contributed case studies using Docker for research compendia in the book “The Practice of Reproducible Research – Case Studies and Lessons from the Data-Intensive Sciences”: Using R and Related Tools for Reproducible Research in Archaeology and A Reproducible R Notebook Using Docker.

An R extension you may want to dockerise is Shiny. Flavio Barros dedicated two articles on R-bloggers to this topic: Dockerizing a Shiny App and Share Shiny apps with Docker and Kitematic. The majority of talks at useR!2017 presenting real-world deployments of Shiny mentioned using dockerised Shiny applications for reasons of scalability and ease of installation.

The company Seven Bridges provides an example for a public container encapsulating a specific research environment, in this case the product Seven Bridges Platform (“a cloud-based environment for conducting bioinformatic analyses”), its tools and the Bioconductor package sevenbridges. The published image sevenbridges/sevenbridges-r includes both RStudio Server and Shiny, see the vignette “IDE Container”.

A new solution to ease the creation of Docker containers for specific research environments is containerit. It creates Dockerfiles (using Rocker base images) from R sessions, R scripts, R Markdown files or R workspace directories, including the required system dependencies. The package was presented at useR!2017 and can currently only be installed from GitHub.
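A minimal sketch of the idea, assuming containerit is installed from GitHub (the API may still change):

    library("containerit")

    # Create a Dockerfile object describing the current R session, i.e. the
    # attached packages and their system dependencies, based on a Rocker image.
    df <- dockerfile(from = utils::sessionInfo())

    print(df)                       # inspect the generated instructions
    write(df, file = "Dockerfile")  # write them out for docker build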

While Docker is made for running tools and services and providing user interfaces via web protocols (e.g. via a local port and a website opened in a browser, as with rocker/rstudio or Jupyter Notebook images), several projects try to package GUI applications in containers. Daniel explores some alternatives for running RStudio in this GitHub repository, just for the fun of it. In this particular case it may not be very sensible, because RStudio Desktop is already effectively a browser-based UI (unlike other GUI apps packaged this way), but for users who are reluctant to use a browser UI and/or a command line interface, the “Desktop in a container” approach might be useful.

Running Tests

The package dockertest makes use of the isolated environment that Docker provides: R programmers can set up test environments for their R packages and R projects, in which they can rapidly test their work in Docker containers that contain only R and the relevant dependencies, all without cluttering the development environment.

The package gitlabr does not use Docker itself, but wraps the GitLab API in R functions for easy usage. This includes starting continuous integration (CI) tests (function gl_ci_job), which GitLab can run using Docker, so the function has an argument image to select the image used to perform a CI task.

In a completely different vein, but still in the testing context, sanitizers is an R package for testing the compiler setup across different compiler versions to detect code failures in sample code. This allows testing completely different environments on the same host, without touching the well-kept development environment on the host. The package’s images are now deprecated and superseded by Rocker images (rocker/r-devel-san and rocker/r-devel-ubsan-clang).

Dockerising Documents and Workflows

Some works are dedicated to dockerising R-based documents.

The package liftr (on CRAN) lets users enhance Rmd files with YAML metadata (example), which enables rendering R Markdown documents in Docker containers. Unlike with containerit, this metadata must be written by the author of the R Markdown document.
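Roughly sketched, and assuming an R Markdown file that already carries the liftr metadata (the file name below is only a placeholder):

    library("liftr")

    # Generate a Dockerfile from the liftr fields in the document's YAML header,
    # e.g. the base image and the packages to be installed.
    lift("analysis.Rmd")

    # The document can then be rendered inside a container built from that
    # Dockerfile using liftr's render function (named drender() or
    # render_docker(), depending on the package version).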

liftr is used in the DockFlow initiative to containerise a selection of Bioconductor workflows, as presented in this poster at the BioC 2017 conference. liftr also supports Rabix, a Docker-based toolkit for portable bioinformatics workflows. That means that users can have Rabix workflows run inside the container and have the results integrated directly into the final document.

The Bioconductor package sevenbridges (see also above) has a vignette on creating reproducible reports with Docker. It recommends making analyses reproducible with a script using docopt or with a parametrised R Markdown report. The cloud-based Seven Bridges platform captures requirements, such as the needed Docker images, within its internal JSON-based workflow and “Tool” description format (example), for which the package provides helper functions to create Tools and execute them; see this example in a vignette. Docker images are used for local testing of these workflows based on Rabix (see above), where images are started automatically in the background for the user, who only uses R functions. Automated builds for workflows on Docker Hub are also encouraged.

RCloud is a collaborative data analysis and visualization platform, which you can not only try out online but also host yourself with Docker. Take a look at their Dockerfiles or try out their image rcl0ud/rcloud.

Control Docker Containers from R

Rather than running R inside Docker containers, it can be beneficial to call Docker containers from inside R. This is what the packages in this section do.

The harbor package for R (only available via GitHub) wraps the Docker commands in R functions. It may be used to control Docker containers that run either locally or remotely.
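For instance, a sketch following the package README (harbor is only on GitHub, so argument names and behaviour may change):

    library("harbor")  # installed from GitHub

    # Run a one-off container on the local Docker host; localhost is the host
    # object harbor provides for the local Docker daemon.
    con <- docker_run(localhost, image = "debian:testing", cmd = "echo hello")

    # The same functions accept remote hosts, e.g. machines set up with docker-machine.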

A more recent alternative to harbor is the package docker, which is available on CRAN with source code on GitHub. Using a DRY approach, it provides a thin layer on top of the Docker API, using the Docker SDK for Python via the package reticulate. The package is best suited for experienced Docker users, i.e. if you know the Docker commands and life cycle. However, thanks to the abstraction layer provided by the Docker SDK for Python, docker also runs on various operating systems (including Windows).
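Because it is such a thin layer, using it looks almost exactly like the Python SDK. A sketch, assuming Docker and the Python SDK are available to reticulate:

    library("docker")

    # Connect to the local Docker daemon through the Python SDK (via reticulate).
    client <- docker$from_env()

    # Run a container and capture its output, just as with the Python client.
    client$containers$run("alpine", "echo hello from R")

    # List containers known to the daemon.
    client$containers$list()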

dockermachine provides a convenient R interface to the docker-machine command, so you can easily provision local or remote/cloud machines to run containers on.

Selenium provides tools for browser automation, which are also available as Docker images. They can be used, amongst others, for testing web applications or controlling a headless web browser from your favorite programming language. In this tutorial, you can see how and why you can use the package RSelenium to interact with your Selenium containers from R.
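In short, the interaction looks like this; a sketch, assuming a Selenium container (e.g. selenium/standalone-firefox) is already running and publishing port 4444:

    library("RSelenium")

    # Connect to the Selenium server that is running inside the container.
    remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L,
                          browserName = "firefox")
    remDr$open()

    # Drive the browser inside the container from your R session.
    remDr$navigate("https://www.r-project.org/")
    remDr$getTitle()

    remDr$close()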

googleComputeEngineR provides an R interface to the Google Cloud Compute Engine API. It includes a function called docker_run that starts a Docker container in a Google Cloud VM and executes R code in it. Read this article for details and examples. There are similar ambitions to implement Docker capabilities in the analogsea package that interfaces the Digital Ocean API. googleComputeEngineR and analogsea use functions from harbor for container management.
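Sketched very roughly below; the template and argument names are assumptions based on the package documentation, and you need a configured Google Cloud project with credentials for this to run:

    library("googleComputeEngineR")

    # Start a VM that runs Docker (template name is an assumption; see the
    # package documentation for the available templates and arguments).
    vm <- gce_vm(name = "demo-vm", template = "r-base")

    # Execute R code in a container on that VM; docker_run() here comes from
    # the harbor functions bundled with googleComputeEngineR.
    docker_run(vm, image = "rocker/r-base",
               cmd = c("Rscript", "-e", "sum(1:10)"))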

R and Docker for Complex Web Applications

Docker, in general, may help you to build complex and scalable web applications with R.

If you already have a Shiny app, then Cole Brokamp’s package rize makes you just one function call away from building and viewing your dockerised Shiny application.

If you want to get serious with Shiny, take a look at ShinyProxy by Open Analytics. ShinyProxy is a Java application (see GitHub) to deploy Shiny applications. It creates a container with the Shiny app for each user to ensure scalability and isolation and has some other “enterprise” features.

Mark McCahill presented at an event at Duke University in North Carolina (USA) how he provided each of 300+ students with a private RStudio Server instance. In his presentation (PDF / MOV (398 MB)), he explains his RStudio farm in detail.

If you want to use RStudio with cloud services, you may find delight in these articles from the SAS and R blog: RStudio in the cloud with Amazon Lightsail and docker, Set up RStudio in the cloud to work with GitHub, RStudio in the cloud for dummies, 2014/2015 edition.

The platform R-hub helps R developers with solving package issues prior to submitting them to CRAN. In particular, it provides services that build packages on all CRAN-supported platforms and checks them against the latest R release. The services utilise backends that perform regular R builds inside of Docker containers. Read the project proposal for details.

The package plumber (website, repository) allows creating web services/HTTP APIs in pure R. The maintainer provides a ready-to-use Docker image, trestletech/plumber, to run/host these applications, with excellent documentation including topics such as multiple images under one port and load balancing.
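To illustrate what such an API looks like, here is a small sketch (file name and port are arbitrary); the Docker image essentially just plumbs a file like this:

    # plumber.R -- a minimal HTTP API in pure R

    #* Echo back a message
    #* @param msg The message to echo
    #* @get /echo
    function(msg = "") {
      list(message = paste("The message is:", msg))
    }

    # In a separate session or script, the API is served with:
    # library("plumber")
    # plumb("plumber.R")$run(host = "0.0.0.0", port = 8000)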

Batch processing

The package batchtools (repository, JOSS paper) provides a parallel implementation of Map for HPC with different schedulers, including Docker Swarm. A job can be executed on a Docker cluster with a single R function call, for which a Docker CLI command is constructed as a string and executed with system2(..).
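As a sketch, assuming a reachable Docker Swarm and an image that has batchtools installed (the image name below is only a placeholder):

    library("batchtools")

    # Temporary registry just for this sketch; for a real Swarm the registry's
    # file directory must be shared between the hosts.
    reg <- makeRegistry(file.dir = NA)

    # Run each job as a service on a Docker Swarm.
    reg$cluster.functions <- makeClusterFunctionsDocker(image = "rocker/r-base")

    # Each element of x becomes one job, i.e. one container on the cluster.
    batchMap(fun = function(x) x^2, x = 1:3, reg = reg)
    submitJobs(reg = reg)
    waitForJobs(reg = reg)
    reduceResultsList(reg = reg)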
