How to easily automate R analysis, modeling and development work using CI/CD, with working examples
Introduction
Automating the execution, testing and deployment of R work is a very powerful way to ensure the reproducibility, quality and overall robustness of the code that we are building, be it for data analysis and modeling purposes, developing R packages or even blogging. Modern tools also provide a free and easy-to-use way of achieving this goal.
In this post, we will show a quick and simple way to automate R data analysis and package development checking, testing and installation with GitLab CI/CD and provide example files that can be used for testing packages and deploying blogdown-based websites.
A quick overview of CI/CD and GitLab’s approach to it
In this section, we will try to introduce GitLab CI/CD and its prerequisites in very simple and practical terms, at the cost of technical precision. Terms will be linked to relevant pages for those interested in precise definitions. We will also focus on using the CI/CD provided directly on GitLab. It is also possible to use it on your own infrastructure, but this is out of the scope of this introductory post.
What is CI/CD and how to use it with GitLab
Continuous integration (CI) and Continuous deployment (CD) are IT practices that encourage checking and testing code often (e.g. on every change pushed to a repository) and being able to provide the resulting product (e.g. an application) to the users automatically.
For the purpose of this post, we will be focusing on R code and will be happy with CI/CD doing two things for us:
- detecting any code changes that we make in our repository
- automatically running a set of actions that we define when changes are detected
What is GitLab CI/CD and what can it do?
- GitLab CI/CD is a service provided by GitLab that makes using basic CI/CD easy even for non-IT professionals, such as R users
- Free for both public and private repositories hosted on GitLab
- Can execute a wide variety of tasks, ranging from executing custom scripts, deployment of Java applications, building Docker images, checking and testing R packages to publishing blogs and more
What are the prerequisites, how to make it work?
- To use GitLab CI/CD your project’s code should be hosted on GitLab
- To make it work, you need to create a YAML file called .gitlab-ci.yml with the instructions and push it to the root directory of your project’s repository
- Once you do that, the instructions in that YAML file will be executed by GitLab automatically each time you push a code change to the repository. Different triggers, such as specified times, can also be used
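To make the time-based trigger a bit more concrete, here is a minimal sketch of a job that runs only in scheduled pipelines. The job name `nightly` and the R command are made up for illustration. The schedule itself is assumed to be created in the project's pipeline schedule settings in the GitLab UI, not in the file.

```yaml
image: r-base

# Hypothetical job: runs only when the pipeline was started by a schedule,
# not on ordinary pushes
nightly:
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - R -e 'cat("Scheduled run at", format(Sys.time()), "\n")'
```

Jobs that define no `rules` at all run on pushes and, by default, in scheduled pipelines as well.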
Why GitLab? What if my code is on GitHub?
This post is by no means supposed to be an advertisement for GitLab. I chose it some time ago for 2 very simple reasons:
- It allowed for free private repositories, which is now also true for GitHub
- The CI/CD is fully integrated, with no need for other tools
If you use GitHub, the favorite CI tool for R code hosted there seems to be Travis. Some examples specific for R can be found here. You can also read a more generic Travis CI Tutorial.
The simplest example with R use
To make the post a bit less abstract and more practical, here is an overly simplified example of GitLab CI/CD used with R, which just runs the current version of R and prints the mtcars dataset:
- The repository
- The .gitlab-ci.yml file
- An overview of pipeline runs
- An example output of a run
Let’s have a look at this very simplistic .gitlab-ci.yml:

    image: r-base

    test:
      script:
        - R -e 'print(datasets::mtcars)'

We see the following:
- image: r-base tells GitLab CI/CD to use the r-base Docker image for the run – more on that later
- the rest of the YAML tells GitLab CI/CD to run a job named test, whose task is to execute a script defined as R -e 'print(datasets::mtcars)', meaning just to run R and print the mtcars dataset
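For data analysis work, the same pattern can run a script file and keep its results. The following is only a sketch: `analysis.R` and the `output/` directory are hypothetical names, and the script is assumed to write its results into `output/`.

```yaml
image: r-base

analysis:
  script:
    # analysis.R is a hypothetical script in the repository root
    # that is assumed to write its results into output/
    - mkdir -p output
    - Rscript analysis.R
  artifacts:
    # Keep whatever the script wrote so it can be downloaded from the job page
    paths:
      - output/
    expire_in: 1 week
```

The `artifacts` section is what makes the results available after the job finishes. Without it, files created during the run are discarded together with the container.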
Now let us take a look at a more useful example for developing R packages.
Example pipeline for R package testing and deployment
We can use GitLab CI/CD to automatically
- Build our package
- Perform the R CMD check and investigate if we have any errors, warnings or notes
- Run our unit tests
- Check our testing coverage and finally
- Install the package and potentially use it to perform more actions
An example .gitlab-ci.yml with a pipeline based on a Docker image to test an R package can look as follows. Note that this is most likely overkill and too verbose; one could have a much shorter pipeline for this purpose:

    image: jozefhajnala/rdev:3.4.4

    stages:
      - build
      - document
      - check
      - test
      - deploy

    variables:
      _R_CHECK_CRAN_INCOMING_: "false"
      _R_CHECK_FORCE_SUGGESTS_: "true"
      CODECOV_TOKEN: "2329aed3-de38-468c-9a06-95564363211c"

    before_script:
      - apt-get update

    buildbinary:
      stage: build
      script:
        - r -e 'devtools::build(binary = TRUE)'

    documentation:
      stage: document
      script:
        - r inst/ci/document.R

    checkerrors:
      stage: check
      script:
        - r -e 'if (!identical(devtools::check(document = FALSE, args = "--no-tests")[["errors"]], character(0))) stop("Check with Errors")'

    checkwarnings:
      stage: check
      script:
        - r -e 'if (!identical(devtools::check(document = FALSE, args = "--no-tests")[["warnings"]], character(0))) stop("Check with Warnings")'

    checknotes:
      stage: check
      script:
        - r -e 'if (!identical(devtools::check(document = FALSE, args = "--no-tests")[["notes"]], character(0))) stop("Check with Notes")'

    unittests:
      stage: test
      script:
        - r -e 'if (any(as.data.frame(devtools::test())[["failed"]] > 0)) stop("Some tests failed.")'

    codecov:
      stage: test
      script:
        - r -e 'covr::codecov()'

    install:
      stage: deploy
      script:
        - r -e 'devtools::install()'

Now let’s again take a look at the content:
- image – a Docker image to use for the pipeline
- stages – defines the ordering of job execution: jobs of the same stage are run in parallel, jobs of the next stage are run after the jobs from the previous stage complete successfully. Stages are the “columns” of the chart below.
- variables – used to pass environment variables to the jobs
- before_script – used to define commands that should be run before all jobs. An example is installing a needed R package that is not contained in our Docker image.
The rest are job definitions. Each job has a stage, which defines the stage in which it is run; multiple jobs can belong to one stage. A script essentially defines what to do. For R uses, this can usually be:
- Rscript -e '<r commands>' to execute R commands specified between the quotes
- Rscript pathtoscript.R to execute a script stored in a file
- for littler users, we can replace Rscript with r for similar purposes as above
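As a sketch of how much shorter a package-checking pipeline could be, the following uses a single job, plain `Rscript` and `R CMD` calls, and a `before_script` that installs what the image lacks. It assumes the repository root is the package directory, that the `r-base` image is used instead of the custom image above, and that all package dependencies can be installed from CRAN without extra system libraries.

```yaml
image: r-base

before_script:
  # remotes is assumed not to be present in the image, so install it first,
  # then install the dependencies listed in the package's DESCRIPTION file
  - Rscript -e 'install.packages("remotes", repos = "https://cloud.r-project.org")'
  - Rscript -e 'remotes::install_deps(dependencies = TRUE)'

check:
  script:
    # Build the source tarball and run the standard check on it
    - R CMD build .
    - R CMD check --no-manual *.tar.gz
```

Unlike the longer pipeline, this variant fails the job only when the check reports errors. Warnings and notes would still need to be read from the job log.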
Docker images for R users and developers
As we have seen, most of the pipelines start with image: <image>. This tells GitLab to use the specified Docker image for the run, which is extremely useful because a suitable Docker image will include all the software that we need to execute our analyses, modeling or other tasks, without us having to install that software within the .yaml file. Examples of such software available in an image for R use are, obviously, R itself and other dependencies such as additional packages.
If you would like to read more about Docker, Colin Fay has you covered with his Introduction to Docker for R Users. For now, let’s just assume that using this image provides GitLab with a place that has R (and all needed packages) installed and can run the specified scripts for us.
One of the great things about Docker images is that they are easy to share and adapt. A huge thank you and kudos go to Carl Boettiger and Dirk Eddelbuettel, who maintain the Rocker project which provides a collection of images suited for different R needs built on Debian.
My personal favorites from the Rocker project are
- r-ver images – providing an environment fixed in time, including using a specifically dated MRAN repository. Have a look at their Dockerfiles on GitHub. The image used by my CI pipeline for testing packages is adapted from r-ver:3.4.4.
- r-base for the current version of base R. Have a look at the Dockerfile on GitHub
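Switching between these images is a one-line change in the pipeline file. As a small illustration, the sketch below pins the run to the versioned image mentioned above and prints the R version it provides. The job name `rversion` is made up.

```yaml
# A version-tagged image keeps the R version fixed between runs;
# replacing this line with "image: r-base" would follow the current release instead
image: rocker/r-ver:3.4.4

rversion:
  script:
    - Rscript -e 'cat(R.version.string, "\n")'
```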
TL;DR: Just show it to me in action
In case you are only interested in seeing the CI/CD pipeline work in action for some R uses, you can look at:
The simplest example using R
- The repository
- The .gitlab-ci.yml file
- An overview of the pipeline runs
- An example output of a run
The mentioned R package testing
- The .gitlab-ci.yml file for the jhaddins package on branch develop
- An overview of the pipeline runs
- An example output of a successful run
- An example output of a run that discovered a NOTE in the check process
Building a Docker image
The Docker image jozefhajnala/rdev:3.4.4 used above
- is based on the following Dockerfile
- is also built with GitLab CI/CD, look at the .gitlab-ci.yml file
- the build pipeline in action
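For readers who have not seen an image build in a pipeline before, here is a generic sketch of what such a job can look like. It is not the file linked above: it assumes a Dockerfile in the repository root, a runner that allows Docker-in-Docker, and pushing to the project's own GitLab container registry through GitLab's predefined `CI_REGISTRY*` variables.

```yaml
# Hypothetical job name; the runner must permit the docker:dind service
build-image:
  image: docker:latest
  services:
    - docker:dind
  script:
    # Log in to the project's container registry with the predefined credentials
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    # Build from the Dockerfile in the repository root and push the result
    - docker build -t "$CI_REGISTRY_IMAGE:latest" .
    - docker push "$CI_REGISTRY_IMAGE:latest"
```

Depending on how the runner is configured, additional Docker-in-Docker settings may be required.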
Publishing a Hugo-based blogdown blog
Also, this blog itself is deployed on a schedule via GitLab CI/CD, using a file very similar to the following:
- Example Hugo site deployed with GitLab CI/CD
- The .gitlab-ci.yml
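For orientation, a blogdown deployment of this kind could be sketched as follows. This is an assumption-laden illustration, not the linked file: it assumes the site is published with GitLab Pages, that the `rocker/verse` image is used, and that Hugo writes the site to its default `public` directory.

```yaml
image: rocker/verse:latest

# GitLab Pages publishes the "public" directory produced by a job named "pages"
pages:
  script:
    # Installed explicitly in case the image does not already provide them
    - Rscript -e 'install.packages("blogdown", repos = "https://cloud.r-project.org")'
    - Rscript -e 'blogdown::install_hugo()'
    - Rscript -e 'blogdown::build_site()'
  artifacts:
    paths:
      - public
  rules:
    # Deploy only from the default branch (pushes and scheduled pipelines alike)
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
```

With a pipeline schedule set up for the default branch, the same job rebuilds and redeploys the site at the chosen times.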
References
- GitLab Continuous Integration documentation
- GitLab CI/CD Pipeline Configuration Reference
- Colin Fay’s Introduction to Docker for R Users
- Building an R Project with Travis CI
- Docker images for R on the Rocker Project