Reproducible data science with Nix, part 6 — CI/CD has never been easier


Warning: I highly recommend you read this blog post first, which explains in detail how to run a pipeline inside Nix. This blog post assumes that you’ve read that one, and it also helps if you’re familiar with Github Actions; if not, read this other blog post of mine as well.

This is getting ridiculous. The meme that I’m using as a header for this blog post perfectly summarises how I feel.

This will be a short blog post, because Nix makes things so easy that there’s not much to say. I wanted to see how I could use Nix on Github Actions to run a reproducible pipeline. This pipeline downloads some data, prepares it, and fits a machine learning model. It is code that I had lying around from an old video on the now deprecated {drake} package, the predecessor of {targets}.

You can find the pipeline here, and you can also take a look at the same pipeline using Docker here for comparison purposes.
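
To give you an idea of the shape of that pipeline, here is a minimal, hypothetical sketch of what such a _targets.R could look like. The data URL, formula and target names below are purely illustrative; the real script and its helper functions live in the repository linked above.

library(targets)

# Purely illustrative sketch, not the actual pipeline from the repository
tar_option_set(packages = c("tidymodels", "xgboost"))

list(
  tar_target(
    raw_data,
    # hypothetical data source standing in for the real download step
    read.csv("https://example.com/housing.csv")
  ),
  tar_target(
    clean_data,
    na.omit(raw_data)  # stand-in for the real data preparation
  ),
  tar_target(
    fitted_model,
    fit(
      boost_tree(mode = "regression") |> set_engine("xgboost"),
      price ~ .,  # hypothetical formula
      data = clean_data
    )
  )
)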

What I wanted to achieve was the following: set up a reproducible environment with Nix on my computer, work on my pipeline locally, and then have it run on Github Actions as well, but in exactly the same environment as the one I was using to develop it. In a world without Nix, this means using a mix of {renv} (or {groundhog} or {rang}) and a Docker image that ships the right version of R. I would then need to write a Github Actions workflow file that builds that Docker image, runs it and saves the outputs as artifacts. Also, in practice that image would not be exactly the same as my local environment: I would have the same versions of R and R packages, but every other system-level dependency would be a different version, unless I also used that Dockerized environment to develop locally, something I suggested you do merely 4 months ago (oooh, how blind I was!).

With Nix, not only can I pin the versions of R and R packages with a single tool, but every underlying system-level dependency gets handled by Nix as well. So if I use a package that requires, say, Java, or GDAL, or any other of these usual suspects that make installing their R bindings so tricky, Nix will handle this for me without any intervention on my part. I can also use this environment to develop locally, and then, once I’m done working locally, exactly this environment, every bit of it, will get rebuilt and used to run my code on Github Actions.

So this is the repository where you can find the code. There’s a {targets} script that defines the pipeline and a functions/ folder with some code that I wrote for said pipeline. What may be unfamiliar to you (unless you’ve been reading my Nix adventures since the beginning) is the default.nix file:

let
  pkgs = import (fetchTarball "https://github.com/NixOS/nixpkgs/archive/976fa3369d722e76f37c77493d99829540d43845.tar.gz") {};
  rpkgs = builtins.attrValues {
    inherit (pkgs.rPackages) tidymodels vetiver targets xgboost;
  };
  system_packages = builtins.attrValues {
    inherit (pkgs) R;
  };
in
  pkgs.mkShell {
    buildInputs = [ rpkgs system_packages ];
  }

These few lines of code define an environment that pulls packages from revision 976fa3369d722e76f37c77493d99829540d43845 of nixpkgs. It installs the packages {tidymodels}, {vetiver}, {targets} and {xgboost} (actually, I’m not using {vetiver} for this yet, so it could even be removed), and it also installs R. Because we’re pinning that specific revision of nixpkgs, exactly the same packages (and their dependencies) will get installed, regardless of when we build this environment. I want to stress that this file is 12 lines long and defines a complete environment. The equivalent Dockerfile is much longer, not even completely reproducible, and would have required external tools like {renv} (or the dated snapshots of the Posit CRAN mirror), as you can check here.

Let’s now turn our attention to the workflow file:

name: train_model

on:
  push:
    branches: [main]

jobs:
  targets:
    runs-on: ubuntu-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:

      - uses: actions/checkout@v3

      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@main
        with:
          logger: pretty
          log-directives: nix_installer=trace
          backtrace: full

      - name: Nix cache
        uses: DeterminateSystems/magic-nix-cache-action@main

      - name: Build development environment
        run: |
          nix-build

      - name: Check if previous runs exists
        id: runs-exist
        run: git ls-remote --exit-code --heads origin targets-runs
        continue-on-error: true

      - name: Checkout previous run
        if: steps.runs-exist.outcome == 'success'
        uses: actions/checkout@v2
        with:
          ref: targets-runs
          fetch-depth: 1
          path: .targets-runs

      - name: Restore output files from the previous run
        if: steps.runs-exist.outcome == 'success'
        run: |
          nix-shell default.nix --run "Rscript -e 'for (dest in scan(\".targets-runs/.targets-files\", what = character())) {
            source <- file.path(\".targets-runs\", dest)
            if (!file.exists(dirname(dest))) dir.create(dirname(dest), recursive = TRUE)
            if (file.exists(source)) file.rename(source, dest)
          }'"

      - name: Run model
        run: |
          nix-shell default.nix --run "Rscript -e 'targets::tar_make()'"

      - name: Identify files that the targets pipeline produced
        run: git ls-files -mo --exclude=renv > .targets-files

      - name: Create the runs branch if it does not already exist
        if: steps.runs-exist.outcome != 'success'
        run: git checkout --orphan targets-runs

      - name: Put the worktree in the runs branch if the latter already exists
        if: steps.runs-exist.outcome == 'success'
        run: |
          rm -r .git
          mv .targets-runs/.git .
          rm -r .targets-runs

      - name: Upload latest run
        run: |
          git config --local user.name "GitHub Actions"
          git config --local user.email "actions@github.com"
          rm -r .gitignore .github/workflows
          git add --all -- ':!renv'
          for file in $(git ls-files -mo --exclude=renv)
          do
            git add --force $file
          done
          git commit -am "Run pipeline"
          git push origin targets-runs

      - name: Prepare failure artifact
        if: failure()
        run: rm -rf .git .github .targets-files .targets-runs

      - name: Post failure artifact
        if: failure()
        uses: actions/upload-artifact@main
        with:
          name: ${{ runner.os }}-r${{ matrix.config.r }}-results
          path: .

The workflow file above is heavily inspired by the one you get when you run targets::tar_github_actions(). Running that function puts the following file at the root of your {targets} project. This file is a Github Actions workflow file, which means that each time you push your code to Github, the pipeline runs in the cloud. However, it requires you to use {renv} with your project so that the right packages get installed. You’ll also see a step called Install Linux dependencies, which you will have to adapt to your project.
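
For reference, that {renv}-based template is generated with a single call from the root of the {targets} project (just a reminder of where it comes from, not something you need when using Nix):

# Writes the {renv}-based template workflow, by default under .github/workflows/
targets::tar_github_actions()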

All of this can be skipped when using Nix. All that needs to be done is to install Nix itself with the nix-installer-action from Determinate Systems, and then to use the magic-nix-cache-action, which caches the downloaded packages so that we don’t have to wait for the environment to build on every push (unless we changed the environment, of course), and that’s about it. We then build the environment on Github Actions using nix-build and run the pipeline using nix-shell default.nix --run "Rscript -e 'targets::tar_make()'". All the other steps are copied almost verbatim from the file linked above; they make sure that the computed targets only get recomputed if I edit something that impacts them, and that they get pushed to a branch called targets-runs. I say copied almost verbatim because some steps must run inside R, so we need to specify that we want to use the R that is available through the Nix environment we just built.

Now, each time we push, the following happens:

  • if we didn’t change anything in default.nix, the environment gets retrieved from the cache. If we did change something, the environment gets rebuilt (or rather, only the parts that need to be rebuilt; the rest still gets retrieved from the cache)
  • if we didn’t change anything in the _targets.R pipeline itself, every target gets skipped. If we did, only the targets that need to be recomputed get recomputed (see the little snippet below if you want to preview this locally).
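
If you want to check, before pushing, which targets a change would actually invalidate, you can ask {targets} directly. A quick sketch, assuming you run it from the root of the project inside the Nix shell (nix-shell default.nix):

# Run inside the development environment started with nix-shell default.nix
targets::tar_outdated()  # lists the targets that are no longer up to date
targets::tar_make()      # rebuilds only those and skips the rest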

One last thing that I didn’t mention: on line 9 of the workflow file you’ll see this:

runs-on: ubuntu-latest

This means that the Github Actions runner will use the latest available version of Ubuntu, which is obviously not fixed. When the next LTS gets released in April 2024, this pipeline will run on Ubuntu 24.04 instead of the current LTS, version 22.04. This is usually not good practice, because we don’t want the underlying operating system to change: that could have an impact on the reproducibility of our pipeline. But with Nix, this does not matter. Remember that we are using a specific revision of nixpkgs for our pipeline, so not only do the exact same versions of R and the R packages get installed, but so does every underlying piece of software that needs to be available. We could be running this in 50 years on Ubuntu LTS 74.04 and it would still install the same stuff, run the same code and produce exactly the same results.

This is really bonkers.

Nix is an incredibly powerful tool. I’ve been exploring and using it for 3 months now, and if something impresses me more than how useful it is, it’s how little known it still is. I hope that this series of blog posts will motivate other people to learn it.

Hope you enjoyed! If you found this blog post useful, you might want to follow me on Mastodon or twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebooks. You can also watch my videos on youtube. So much content for you to consoom!
