Using an R ‘template’ package to enhance reproducible research or the ‘R package syndrome’
Motivation
Have you ever had the feeling that creating your data analysis report(s) came down to looking up, copy-pasting and reusing code from previous analyses?
This approach is time-consuming and prone to errors. If you frequently analyze similar data or data types, e.g. from a standardized analysis workflow or from different experiments on the same platform, automating your report creation via an R ‘template’ package can be a very useful and time-saving step. It also lets you focus on the important part of the analysis (i.e. the experiment- or analysis-specific part).
Imagine that you need to analyze tens or hundreds of runs of data in the same format: making use of an R ‘template’ package can save you hours, days or even weeks. Along the way, reports can be adjusted, errors corrected and extensions added without much effort.
A bit of history
The reproducibility of an analysis workflow remains a complex challenge for R data scientists. The creation of dynamic and reproducible analysis reports in R started to spread with Sweave [1] by Friedrich Leisch, which made it possible to mix R and LaTeX in a single report. This was further expanded by the knitr package [2], released by Yihui Xie in 2012. The concept of child documents and parameterized (rmarkdown) reports is another step towards this goal [3]. Our approach builds on these efforts towards reproducible research by integrating report templates within an R package structure: the R ‘template’ package.
Advantages of an R ‘template’ package
Integrating report templates in an R package structure lets you take advantage of the functionalities of an R package. The R functions and report templates that wrap the entire analysis workflow can be contained in the same code structure. Bug fixes and extensions of an analysis can be tracked via the package versioning. Because your entire analysis workflow is contained within a single container, the code for the creation of the analysis report can easily be exchanged with collaborators, and a new version can be installed smoothly in a standard manner. Successive modifications of the reports can easily be applied to previous analysis reports.
How do you start the development and/or writing of your own R ‘templates’?
Start by distinguishing the data- or experiment-specific parts of the analysis (e.g. different models, visualizations) from the frequently used parts. The latter are the parts that are consistent throughout the different analyses or experiments (i.e. the code that you were copy-pasting from previous analyses). Think of e.g. the reporting of results from a standard analysis or modeling technique like linear discriminant analysis: the modeling function, visualization and results formatting are the same irrespective of the specific experiment at hand. The experiment-specific part consists of e.g. the input data, model or analysis parameters, etc.
The code that is consistent throughout the different analyses can be captured in R functions or in the equivalent R ‘template’ documents for reporting. An R ‘template’ document can indeed be seen as the equivalent of an R function for reporting purposes: it is integrated within an R package and has input parameters, potentially with default values (i.e. the most common value).
Each part of an analysis report can be contained in a separate, modular child document, integrated within the R package structure. This can be either one ‘template’ document or several (nested) ‘template’ documents, depending on personal preference regarding structuring (e.g. one document per analysis technique) as well as on the complexity of the analysis at hand. An advantage of the more complex nested approach is that parts of the analysis can be included or excluded depending on the requests of the user, the data at hand and/or the findings within the analysis. These so-called template documents can be run from outside the package, with the data and parameters specific to a certain analysis or experiment.
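The idea that a nested ‘template’ includes or excludes parts of the analysis on request can be sketched in a few lines of R. This is a minimal illustration only: the function selectChildDocuments, the parameter names and the file names of the child documents are hypothetical and not taken from an existing package.

```r
# Hypothetical helper: decide which child documents enter the report.
selectChildDocuments <- function(params = list()) {
  # default values (the most common choice), used when nothing is specified
  defaults <- list(includeExploration = TRUE, includeLDA = FALSE)
  params <- utils::modifyList(defaults, params)

  # one (hypothetical) child document per part of the analysis
  children <- c("import.Rmd", "exploration.Rmd", "lda.Rmd")

  # the data import is always needed, the other parts are optional
  keep <- c(
    TRUE,
    isTRUE(params$includeExploration),
    isTRUE(params$includeLDA)
  )
  children[keep]
}

# default report
selectChildDocuments()

# report that additionally contains the linear discriminant analysis
selectChildDocuments(list(includeLDA = TRUE))
```

Like the arguments of an R function, the defaults cover the most common case, so that only the experiment-specific deviations need to be specified.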
Our suggested approach
The approach that we recommend makes use of a main ‘master’ template, which in turn calls the (different) child document(s) contained in the R ‘template’ package (see figure). This allows the developers to easily extend the ‘template’ without running into issues with previous reports.
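The interplay between a master template and a child document could look as follows. This sketch assumes that the knitr package is installed, and it writes two tiny, made-up documents to a temporary directory so that it is self-contained; in an R ‘template’ package these documents would be shipped with the package. The file names, the object inputData and the use of the iris data are illustrative assumptions.

```r
# Sketch only: both documents are written to a temporary directory here; in an
# R 'template' package they would be shipped with the package instead.
tmpDir <- tempfile("templates")
dir.create(tmpDir)
fence <- strrep("`", 3)  # chunk delimiter of R Markdown documents

# Child document: one modular part of the report, using the object 'inputData'
childPath <- file.path(tmpDir, "child_summary.Rmd")
writeLines(c(
  "## Summary of the data",
  "",
  paste0(fence, "{r}"),
  "summary(inputData)",
  fence
), childPath)

# Master template: knits the child document and inserts its output as is
masterPath <- file.path(tmpDir, "master.Rmd")
writeLines(c(
  "# Analysis report",
  "",
  paste0(fence, "{r, results = 'asis', echo = FALSE}"),
  "res <- knitr::knit_child(childPath, quiet = TRUE)",
  "cat(res, sep = '\\n')",
  fence
), masterPath)

if (requireNamespace("knitr", quietly = TRUE)) {
  # experiment-specific input, kept in a dedicated environment
  reportEnv <- new.env()
  reportEnv$inputData <- iris
  reportEnv$childPath <- childPath

  outFile <- knitr::knit(
    input = masterPath,
    output = file.path(tmpDir, "master.md"),
    envir = reportEnv,
    quiet = TRUE
  )
  # show the resulting markdown report
  cat(readLines(outFile), sep = "\n")
}
```

Because the master template only refers to its child documents, a new part of the analysis can be added as an extra child document without touching the existing ones.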
It is advisable, and user-friendly, to create a start template that lists the required and optional input parameters for the downstream analysis. A dedicated function can be created to retrieve the path of the template(s) after the package installation; this path can then be used to render/run the document.
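Such a path function can be very short. The sketch below assumes a hypothetical package called ‘myTemplates’ that stores its templates in the folder inst/template/ of the package sources (installed as template/); both names, as well as the function name getPathTemplate, are assumptions made for illustration.

```r
# Hypothetical helper of a 'template' package: return the path of a template
# after installation. The package name 'myTemplates' and the folder 'template'
# are assumptions.
getPathTemplate <- function(file, package = "myTemplates") {
  path <- system.file("template", file, package = package)
  # system.file() returns an empty string if the package or file is not found
  if (!nzchar(path)) {
    stop("Template '", file, "' not found in package '", package, "'.")
  }
  path
}

# Usage, once such a package is installed:
# getPathTemplate("main.Rmd")
```

Failing early with an explicit message avoids a less informative error later on, when the empty path would be passed to the rendering function.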
To enhance the approach:
- progress messages can be implemented in the template to follow the report creation
- default parameters can be set at the start of each template
- session information can be included at the end of the main template to track the versions of the R package and its dependencies.
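These three enhancements could look as follows inside the code chunks of a main template. The sketch assumes that the parameters are plain R objects defined by the user before the report is run; the object names (nRows, reportTitle, inputData) and the use of the iris data are hypothetical.

```r
# Start of the template: default values, used only if the user did not
# specify the parameter before running the report
if (!exists("nRows")) nRows <- 10
if (!exists("reportTitle")) reportTitle <- "Standard analysis report"

# Progress messages, to follow the report creation in the console
message("Report '", reportTitle, "': importing the data...")
inputData <- iris  # placeholder for the real data import

message("Report '", reportTitle, "': summarizing the first ", nRows, " rows...")
print(summary(head(inputData, nRows)))

# End of the main template: versions of R and of the loaded packages
print(sessionInfo())
```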
How to use an R ‘template’ package?
The template(s) contained in the R package can be called directly in R. The parameters required for the report can either be specified directly inside the document or be passed via the render call, which provides a cleaner R environment. The latter option might be quite complicated, depending on the analysis at hand.
The report creation can also be combined with a Shiny user interface. This has the advantage that even non-(expert) R users can easily create a reproducible report without specifying anything in R at all (although your report might not lend itself to a Shiny interface). Suppose your data are in a standardized format (e.g. several runs of data with the same type of information). In this case a Shiny application can be developed with a list of possible runs, some user-specified additional parameters and a button to create and/or download the report. In this way even users without any R knowledge can create their own standardized report in a straightforward way from the same R ‘template’ package.
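Both ways of using the template can be sketched around one small wrapper function. The code below is an illustration under several assumptions: the rmarkdown and shiny packages are installed, the template declares its parameters in its YAML header and produces an HTML report, and the names createReport, ‘myTemplates’, main.Rmd, run and nTopFeatures are all hypothetical.

```r
# Hypothetical wrapper around rmarkdown::render(): the parameters are passed
# via the render call and the report is evaluated in a fresh environment.
createReport <- function(templatePath, params, outputFile) {
  if (!file.exists(templatePath)) {
    stop("Template not found: ", templatePath)
  }
  rmarkdown::render(
    input = templatePath,
    output_file = basename(outputFile),
    output_dir = normalizePath(dirname(outputFile)),
    params = params,
    envir = new.env(parent = globalenv()),
    quiet = TRUE
  )
}

# Minimal Shiny interface around the same function (sketch)
if (requireNamespace("shiny", quietly = TRUE)) {
  availableRuns <- c("run1", "run2", "run3")  # hypothetical list of runs

  ui <- shiny::fluidPage(
    shiny::selectInput("run", "Run to analyze", choices = availableRuns),
    shiny::numericInput("nTopFeatures", "Number of features to report", value = 10),
    shiny::downloadButton("report", "Create report")
  )

  server <- function(input, output) {
    output$report <- shiny::downloadHandler(
      filename = function() paste0("report_", input$run, ".html"),
      content = function(file) {
        createReport(
          # hypothetical package and template names
          templatePath = system.file("template", "main.Rmd", package = "myTemplates"),
          params = list(run = input$run, nTopFeatures = input$nTopFeatures),
          outputFile = file
        )
      }
    )
  }

  app <- shiny::shinyApp(ui = ui, server = server)
  # start the application only in an interactive R session
  if (interactive()) shiny::runApp(app)
}
```

An R user can call createReport directly, while the Shiny application exposes the same parameters as input fields, so that both routes produce the same standardized report.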
Summary of the advantages
To summarize, using an R ‘template’ package makes your analyses less prone to errors, saves time and keeps the reports consistent across different analyses. It is easy to keep track of changes, extensions and bug fixes via appropriate package versioning, which ensures the reproducibility of an entire analysis workflow. The combination of an R ‘template’ package with a Shiny application creates opportunities for non-(expert) R users to create their own report.
Is the development of an R ‘template’ worth the effort?
The development of such template reports might seem cumbersome and time-consuming, and the use of the R package structure for such purposes overkill, a.k.a. the ‘R package syndrome’. However, a lot of time and analysis reproducibility can be gained when using this package afterwards, which makes the effort worthwhile.
All the tools are already available thanks to the open R community, so do it yourself!
Our presentation at the useR! 2017 conference is available here. An example of an R ‘template’ package with a Shiny interface is available here. Feel free to contact us for more information: Laure Cougnaud (laure.cougnaud.at.openanalytics.eu), Kirsten Van Hoorde (kirsten.vanhoorde.at.openanalytics.eu).
[2] https://cran.r-project.org/web/packages/knitr/index.html.
[3] http://rmarkdown.rstudio.com/developer_parameterized_reports.html.