
The prequel to the drake R package

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers].

The drake R package is a pipeline toolkit. It manages data science workflows, saves time, and adds more confidence to reproducibility. I hope it will impact the landscapes of reproducible research and high-performance computing, but I originally created it for different reasons. This post is the prequel to drake’s inception. There was struggle, and drake was the answer.

Dissertation frustration


Sisyphus. https://sites.google.com/site/sisyphusa/

My dissertation project was intense. The final computational challenge was to analyze multiple genomics datasets using an emerging method and its competitors. Even with GPU computing, which shrank days of runtime down to hours, the full battery of Markov chain Monte Carlo runs took several weeks from start to finish. I organized my workflow as an R package, and I worked in a loop:

  1. Deploy the computations.
  2. Wait a few weeks.
  3. Discover an issue.
  4. Restart from scratch.

At the time, the dominant R-focused workflow managers could not break the cycle. Knitr was designed to weave together code and prose, and by design its paradigm was not modular enough to scale to a project like mine. ProjectTemplate was mostly concerned with organization and readability. These and similar tools had only traces of the functionality I sought. For the right solution, I needed to step off the beaten path.

GNU Make

# Recipe lines must begin with a tab character, not spaces.
paper.pdf: paper.tex figure.png
	pdflatex paper.tex

figure.png: figure.R results.csv
	Rscript figure.R

results.csv: long-computation.R
	Rscript long-computation.R

GNU Make is a dependency watcher first and foremost. Its top priority is to bring results up to date with as little work as possible, and it gives you parallel computing for free. In fact, according to Karl Broman and others, Make surpasses even knitr as a helpful reproducible research tool.
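To make the "as little work as possible" idea concrete, here is a minimal R sketch of the rule Make applies to each target: rebuild when the target file is missing or older than one of its dependencies. The helper name is_outdated() is hypothetical, and the temporary files only stand in for the results.csv and long-computation.R of the Makefile above. This illustrates the idea under those assumptions; it is not how Make itself is implemented.

```r
# Hypothetical helper (not part of Make or any package): TRUE when `target`
# is missing or older than at least one of the files in `deps`.
is_outdated <- function(target, deps) {
  if (!file.exists(target)) {
    return(TRUE)
  }
  any(file.mtime(deps) > file.mtime(target))
}

# Work in a temporary directory instead of a real project.
dir <- tempfile("make-sketch-")
dir.create(dir)
script <- file.path(dir, "long-computation.R")
results <- file.path(dir, "results.csv")

# Case 1: the script exists but results.csv has never been built.
writeLines("# stand-in for a long computation", script)
missing_target <- is_outdated(results, script)

# Case 2: results.csv exists and is newer than the script.
# Timestamps are set explicitly so the demo does not depend on clock resolution.
writeLines("x,y", results)
Sys.setFileTime(script, as.POSIXct("2018-01-01 00:00:00", tz = "UTC"))
Sys.setFileTime(results, as.POSIXct("2018-01-02 00:00:00", tz = "UTC"))
fresh_target <- is_outdated(results, script)

# Case 3: the script was edited after results.csv was last built.
Sys.setFileTime(script, as.POSIXct("2018-01-03 00:00:00", tz = "UTC"))
stale_target <- is_outdated(results, script)

print(c(missing = missing_target, fresh = fresh_target, stale = stale_target))
```

Only the first and third cases call for a rebuild. In the Makefile above, the same check runs for every target in dependency order, so an edit to figure.R rebuilds figure.png and paper.pdf but leaves the expensive results.csv alone.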

My advisor, Jarad Niemi, repeatedly urged me to use Make. At the time, I was too entrenched in half-written code to transition, so I finished my graduate school work with brute force. I defended my thesis, moved out of Iowa, and because I had goofed up the paperwork, remained a graduate student for one last summer. Jarad and I used most of our remaining time to find and create better tools for future students. Jarad started a Make-based project template, and I looked for existing solutions. I liked the idea of Make, but I hoped to find something more scalable and friendlier for R-based projects.

Remake


Person pushing a boulder. https://sites.google.com/site/sisyphusa/

Rich FitzJohn’s remake package was nearly ideal. Almost totally R-focused, remake tracked changes more discerningly than Make. However, it lacked high-performance computing support, and it required a cumbersome YAML configuration file to list all the steps of the analysis. So I wrote two sidekick packages: one to deploy jobs in parallel and another to generate large remake-style YAML files. With remake and its add-ons, my post-thesis wrap-up work was steady and smooth, a breeze compared to the thesis itself. Data science projects suddenly became much more fun.

Drake


Hexagon logo. https://github.com/ropensci/drake

I originally intended to contribute to remake. I wanted to inject life back into development, and I wanted to see it on CRAN. However, I was not experienced enough with the problem, and I did not understand remake’s internals. I began drake as a learning exercise, and it quickly morphed into a beast of its own, friendly and fully scalable. Drake is by far my most gratifying project from 2017, and it is still a joy to maintain and proselytize.

A taste

# install.packages("drake")                  # Latest CRAN release, or
# devtools::install_github("ropensci/drake") # The development version
library(drake)

# The basic example explores a trend in the mtcars dataset.
load_basic_example()   # Get the code with drake_example("basic").
config <- drake_config(my_plan)
outdated(config)        # Which targets need to be (re)built?
make(my_plan)          # Build the right things.
outdated(config)        # Everything is up to date.
reg2 <- function(d){    # Change a dependency.
  d$x3 <- d$x ^ 3
  lm(y ~ x3, data = d)
}
outdated(config)        # Some targets are now out of date.
vis_drake_graph(config) # Interact with the graph. Hover, click, drag...

Acknowledgements

I started collaborating with Kirill Müller on drake at RStudio::conf(2018), and we are working to take it to an entirely new level of performance and ease of use. Drake’s future is brighter, and I am a better software developer, because of his coaching sessions over those four days alone.

I would also like to thank Kirill for his drake pitch during RStudio::conf(2018) and Jenny Bryan for including it in her workshop on “What They Forgot to Teach You About R”. Their time and generosity boosted drake’s presence and popularity overnight.

Many thanks to Ben Marwick, Julia Lowndes, and Peter Slaughter for reviewing drake for rOpenSci, and to Maëlle Salmon for such active involvement and encouragement as the editor.

Thanks also to the following people for contributing early in development.

And of course, special thanks to Jarad, who originally set me on this path.
