Make-like [1] pipelines enhance the integrity, transparency, shelf life, efficiency, and scale of large analysis projects.
With pipelines, data science feels smoother and more rewarding, and the results are worthy of more trust.
> …looking to get your project/s organised in the new year? hoping just to distract from feelings of impending doom/crushing loss of hope? I promise workflowing will make you feel better… and @wmlandau has made it SO EASY.
In targets, a data analysis pipeline is a collection of target objects that express the individual steps of the workflow, from upstream data processing to downstream R Markdown reports [5].
These targets live in a special script called _targets.R.
```r
# _targets.R file
library(targets)
tar_option_set(packages = c("biglm", "dplyr", "ggplot2", "readr", "tidyr"))

# Most workflows have custom functions to support the targets.
read_clean <- function(path) {
  path %>%
    read_csv(col_types = cols()) %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
}

fit_model <- function(data) {
  biglm(Ozone ~ Wind + Temp, data)
}

create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), bins = 12) +
    theme_gray(24)
}

# List of targets.
list(
  # airquality dataset in base R:
  tar_target(raw_data_file, "raw_data.csv", format = "file"),
  tar_target(data, read_clean(raw_data_file)),
  tar_target(fit, fit_model(data)),
  tar_target(hist, create_plot(data))
)
```
targets inspects your code and constructs a dependency graph.
```r
# R console
tar_visnetwork()
```
tar_make() runs the correct targets in the correct order.
```r
# R console
tar_make()
#> ● run target raw_data_file
#> ● run target data
#> ● run target fit
#> ● run target hist
#> ● end pipeline
```
You can store the results in the _targets/ folder (the default) or in an Amazon S3 bucket.
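Cloud storage is a matter of options. Below is a minimal sketch, assuming the AWS integration in a recent release of targets and a hypothetical bucket name:

```r
# _targets.R file (sketch: "my-analysis-bucket" is a hypothetical bucket name)
library(targets)
tar_option_set(
  repository = "aws", # Upload target data to Amazon S3 instead of _targets/.
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "my-analysis-bucket", prefix = "_targets")
  )
)
```

Either way, loading data back into R is the same: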
```r
# R console
tar_read(hist) # see also tar_load()
```
Up-to-date targets do not rerun, which saves countless hours in computationally intensive fields like machine learning, Bayesian statistics, and statistical genomics.
```r
# R console
tar_make()
#> ✓ skip target raw_data_file
#> ✓ skip target data
#> ✓ skip target fit
#> ✓ skip target hist
#> ✓ skip pipeline
```
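Conversely, when part of the workflow changes, targets reruns only what is affected. If you edit create_plot(), for example, the next run might look like this sketch:

```r
# R console (sketch: after editing create_plot(), only hist is out of date)
tar_make()
#> ✓ skip target raw_data_file
#> ✓ skip target data
#> ✓ skip target fit
#> ● run target hist
#> ● end pipeline
```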
To help workflows scale, targets adopts the classical, pedantic, function-oriented perspective of the R language [10].
> Nearly everything that happens in R results from a function call. Therefore, basic programming centers on creating and refining functions.
>
> — John Chambers
The more often you write your own functions, the nicer your experience becomes.
> I’m thinking about why this exists only in R and it may be because: 1) R’s functional approach makes it easier to detect dependencies, and 2) R uses lazy evaluation
>
> I tried building a little prototype equivalent in Julia and I think it’s possible, but above my skill level
The best way to write fewer functions is to write less code.
To write less code, we need abstraction and automation.
Target factories are package functions that return lists of pre-configured target objects, and they make specialized pipelines reusable.
```r
# script inside example.package
#' @export
read_clean <- function(path) {
  path %>%
    read_csv(col_types = cols()) %>%
    mutate(Ozone = replace_na(Ozone, mean(Ozone, na.rm = TRUE)))
}

#' @export
fit_model <- function(data) {
  biglm(Ozone ~ Wind + Temp, data)
}

#' @export
create_plot <- function(data) {
  ggplot(data) +
    geom_histogram(aes(x = Ozone), bins = 12) +
    theme_gray(24)
}

#' @title Example target factory.
#' @description Concise shorthand to express our example pipeline.
#' @details
#'   Target factories should use `tar_target_raw()`.
#'   `tar_target()` is for users, and `tar_target_raw()` is for developers.
#'   The former quotes its arguments, while the latter evaluates them.
#' @export
biglm_factory <- function(file) {
  list(
    tar_target_raw("raw_data_file", as.expression(file), format = "file"),
    tar_target_raw("data", quote(example.package::read_clean(raw_data_file))),
    tar_target_raw("fit", quote(example.package::fit_model(data))),
    tar_target_raw("hist", quote(example.package::create_plot(data)))
  )
}
```
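To see the quoting distinction concretely, the two calls below define the same target:

```r
# Two equivalent ways to define the same target:
tar_target(data, read_clean(raw_data_file))               # quotes its command for you
tar_target_raw("data", quote(read_clean(raw_data_file)))  # expects a pre-quoted expression
```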
With the factory above, our long _targets.R file suddenly collapses down to three lines.
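Assuming example.package is installed, the new _targets.R file becomes:

```r
# _targets.R file
library(targets)
library(example.package)
biglm_factory("raw_data.csv")
```

Factories like this one power the R Targetopia, an emerging ecosystem of packages that distill specialized pipelines down to a handful of function calls. The stantargets package, for example, expresses an entire simulation-based calibration study of a Stan model in a single call: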
```r
# _targets.R for simulation-based calibration to validate a Stan model.
library(targets)
library(stantargets)

generate_data <- function(n = 10L) { # n: sample size per simulated dataset.
  true_beta <- stats::rnorm(n = 1, mean = 0, sd = 1)
  x <- seq(from = -1, to = 1, length.out = n)
  y <- stats::rnorm(n, x * true_beta, 1)
  list(n = n, x = x, y = y, true_beta = true_beta)
}

list(
  tar_stan_mcmc_rep_summary(
    model,
    "model.stan", # We assume you already have a Stan model file.
    generate_data(), # Runs once per rep.
    batches = 25, # Batching reduces per-target overhead.
    reps = 40, # Number of simulation reps per batch.
    data_copy = "true_beta",
    variables = "beta",
    summaries = list(
      ~posterior::quantile2(.x, probs = c(0.025, 0.975))
    )
  )
)
```
```r
# R console
tar_visnetwork()
```
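When the pipeline finishes, the final target gathers the interval summaries from all 1000 simulations (25 batches × 40 reps). A quick calibration check might look like this sketch, assuming the combined target is named model and contains the q2.5 and q97.5 columns that posterior::quantile2() produces:

```r
# R console (sketch): proportion of simulations whose 95% interval covers the
# true value of beta. It should be close to 0.95 for a well-calibrated model.
library(dplyr)
tar_read(model) %>%
  summarize(coverage = mean(q2.5 < true_beta & true_beta < q97.5))
```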
Volunteers drive the rOpenSci review process, and each review is an act of altruism.
This was especially true for targets because of COVID-19, the overlap with the holidays, and the unusually heavy workload.
Despite the obstacles, everyone delivered incredible feedback that substantially improved targets and its documentation.
Sam Oliver and TJ Mahr served as reviewers, and Mauro Lepore served as editor.
Sam inspired a section on getting started, an overview vignette, more debugging advice, and a new tar_branches() function to show branch provenance.
TJ suggested a new chapter on functions, helped me contrast the two styles of branching, and raised interesting questions about target names.
Mauro was continuously diligent, responsive, thoughtful, and conscientious as he mediated the review process and ensured a successful outcome.
My colleague Richard Payne was a serious drake user, and he built a proprietary drake_plan() generator for our team.
His package was the major inspiration for target factories and the R Targetopia.
1. Stallman, R. (1998). GNU Make, Version 3.77. Free Software Foundation. ISBN: 1882114809.
2. Landau, W. M. (2021). The targets R package: a dynamic Make-like function-oriented pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software, 6(57), 2959. https://doi.org/10.21105/joss.02959
3. Landau, W. M. (2018). The drake R package: a pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software, 3(21), 550. https://doi.org/10.21105/joss.00550
4. FitzJohn, R. (2021). remake: Make-like build management. R package version 0.3.0.
5. Allaire, J. J., Xie, Y., McPherson, J., Luraschi, J., Ushey, K., Atkins, A., Wickham, H., Cheng, J., Chang, W., and Iannone, R. (2021). rmarkdown: Dynamic Documents for R. R package version 2.6.4. https://rmarkdown.rstudio.com
6. Stan Development Team (2012). Stan: a C++ library for probability and sampling. https://mc-stan.org
7. Cook, S. R., Gelman, A., and Rubin, D. B. (2006). Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 15(3), 675–692. http://www.jstor.org/stable/27594203
8. Talts, S., Betancourt, M., Simpson, D., Vehtari, A., and Gelman, A. (2020). Validating Bayesian inference algorithms with simulation-based calibration. http://arxiv.org/abs/1804.06788
9. Falbel, D., and Luraschi, J. (2020). torch: Tensors and Neural Networks with ‘GPU’ Acceleration. R package version 0.2.0. https://CRAN.R-project.org/package=torch