Great data science work should be reproducible. The ability to repeat experiments is part of the foundation for all science, and reproducible work is also critical for business applications. Team collaboration, project validation, and sustainable products presuppose the ability to reproduce work over time.
In my opinion, mastering just a handful of important tools will make reproducible work in R much easier for data scientists. R users should be familiar with version control, RStudio projects, and literate programming through R Markdown. Once these tools are mastered, the major remaining challenge is creating a reproducible environment.
An environment consists of all the dependencies required to enable your code to run correctly. This includes R itself, R packages, and system dependencies. As with many programming languages, it can be challenging to manage reproducible R environments. Common issues include:
- Code that used to run no longer runs, even though the code has not changed.
- Being afraid to upgrade or install a new package, because it might break your code or someone else’s.
- Typing install.packages in your environment doesn’t do anything, or doesn’t do the right thing.
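When one of these issues appears, it can help to look at what the environment currently contains. The following is a minimal sketch using only base R; the helper name describe_environment is hypothetical and not part of any package. It reports the R version, the library paths, the configured repositories, and the installed packages. System dependencies are not visible from inside R this way.

```r
# Hypothetical helper: summarise the parts of the environment R can see.
describe_environment <- function() {
  pkgs <- installed.packages()
  list(
    r_version     = R.version.string,   # the R interpreter in use
    library_paths = .libPaths(),        # where packages are installed and loaded from
    repos         = getOption("repos"), # where install.packages() looks for packages
    packages      = data.frame(
      package = unname(pkgs[, "Package"]),
      version = unname(pkgs[, "Version"]),
      stringsAsFactors = FALSE
    )
  )
}

env <- describe_environment()
env$r_version
env$library_paths
head(env$packages)
```

If install.packages appears to do nothing, comparing env$library_paths and env$repos between two machines is often a reasonable first check.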
These challenges can be addressed through a careful combination of tools and strategies. This post describes two use cases for reproducible environments:
- Safely upgrading packages
- Collaborating on a team
The sections below each cover a strategy to address the use case, and the necessary tools to implement each strategy. Additional use cases, strategies, and tools are presented at https://environments.rstudio.com. This website is a work in progress, but we look forward to your feedback.
Safely Upgrading Packages
Upgrading packages can be a risky affair. It is not difficult to find serious R users who have been in a situation where upgrading a package had unintended consequences. For example, the upgrade may have broken parts of their current code, or upgrading a package for one project accidentally broke the code in another project. A strategy for safely upgrading packages consists of three steps:
- Isolate a project
- Record the current dependencies
- Upgrade packages
The first step in this strategy ensures one project’s packages and upgrades
won’t interfere with any other projects. Isolating projects is accomplished by
creating per-project libraries. A tool that makes this easy is the new renv
package. Inside of your R project, simply use:
# inside the project directory
renv::init()
The second step is to record the current dependencies. This step is critical
because it creates a safety net. If the package upgrade goes poorly, you’ll be
able to revert the changes and return to the record of the working state. Again,
the renv package makes this process easy.
# record the current dependencies in a file called renv.lock
renv::snapshot()

# commit the lockfile alongside your code in version control
# and use this function to view the history of your lockfile
renv::history()

# if an upgrade goes astray, revert the lockfile
renv::revert(commit = "abc123")

# and restore the previous environment
renv::restore()
With an isolated project and a safety net in place, you can now proceed to
upgrade or add new packages, while remaining certain the current functional
environment is still reproducible. The pak package can be used to install and
upgrade packages in an interactive environment:
# upgrade packages quickly and safely
pak::pkg_install("ggplot2")
The safety net provided by the renv package relies on access to older versions
of R packages. For public packages, CRAN provides these older versions in the
CRAN archive. Organizations can use tools like RStudio Package Manager to make
multiple versions of private packages available. The “snapshot and restore”
approach can also be used to promote content to production. In fact, this
approach is exactly how RStudio Connect and shinyapps.io deploy thousands of R
applications to production each day!
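To make the role of the CRAN archive concrete, here is a minimal sketch of how an archived source package is addressed on CRAN. The helper name cran_archive_url is hypothetical, and renv handles this lookup for you during a restore; the sketch only illustrates where older versions live. Note that the current release of a package is served from src/contrib rather than from the archive.

```r
# Hypothetical helper: build the CRAN archive URL for an older package version.
cran_archive_url <- function(package, version,
                             cran = "https://cran.r-project.org") {
  sprintf("%s/src/contrib/Archive/%s/%s_%s.tar.gz",
          cran, package, package, version)
}

url <- cran_archive_url("ggplot2", "3.1.0")
url

# Installing from the archive requires network access and the tools needed
# to build a source package, so it is left commented out here:
# install.packages(url, repos = NULL, type = "source")
```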
Team Collaboration
A common challenge on teams is sharing and running code. One strategy that administrators and R users can adopt to facilitate collaboration is shared baselines. The basics of the strategy are simple:
- Administrators set up a common environment for R users by installing RStudio Server.
- On the server, administrators install multiple versions of R.
- Each version of R is tied to a frozen repository using an Rprofile.site file.
By using a frozen repository, either administrators or users can install packages while still being sure that everyone will get the same set of packages. A frozen repository also ensures that adding new packages won’t upgrade other shared packages as a side-effect. New packages and upgrades are offered to users over time through the addition of new versions of R.
Frozen repositories can be created by manually cloning CRAN, accessing a service like MRAN, or utilizing a supported product like RStudio Package Manager.
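As a rough sketch, an Rprofile.site file that ties one version of R to a frozen repository might look like the following. The repository URL is a placeholder assumption, not a real endpoint; substitute the address of your own CRAN clone or snapshot service. The file normally lives in the etc directory under that R version's home directory, so each installed version of R can point at a different snapshot.

```r
# Rprofile.site sketch: runs at startup for every user of this R version.
local({
  # Placeholder URL (assumption): a repository frozen at a fixed date.
  frozen_repo <- "https://packages.example.com/cran/2019-11-01"

  # Every install.packages() call now resolves against the frozen repository.
  options(repos = c(CRAN = frozen_repo))
})
```

Wrapping the setting in local() keeps the temporary variable out of users' global environments.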
Adaptable Strategies
The prior sections presented specific strategies for creating reproducible
environments in two common cases. The same strategy may not be appropriate for
every organization, R user, or situation. If you’re a student reporting an
error to your professor, capturing your sessionInfo() may be all you need. In
contrast, a statistician working on a clinical trial will need a robust
framework for recreating their environment. Reproducibility is not binary!
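For the lightweight end of that spectrum, a minimal sketch of capturing sessionInfo() to a file that can be attached to an error report is shown below. The helper name save_session_info and the file name are hypothetical choices for illustration.

```r
# Hypothetical helper: write the output of sessionInfo() to a text file.
save_session_info <- function(path = "session-info.txt") {
  writeLines(capture.output(print(sessionInfo())), path)
  invisible(path)
}

# Write to a temporary directory here; in practice, save next to your script.
out <- save_session_info(file.path(tempdir(), "session-info.txt"))
readLines(out, n = 3)
```

The resulting file records the R version, platform, and attached package versions, which is often enough for someone else to spot a version mismatch.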
To help pick between strategies, we’ve developed a strategy map. By answering two questions, you can quickly identify where your team falls on this map and find the nearest successful strategy. The two questions are represented on the x- and y-axes of the map:
- Do I have any restrictions on what packages can be used?
- Who is responsible for managing installed packages?
For more information on picking and using these strategies, please visit https://environments.rstudio.com. By adopting a strategy for reproducible environments, R users, administrators, and teams can solve a number of important challenges. Ultimately, reproducible work adds credibility, creating a solid foundation for research, business applications, and production systems. We are excited to be working on tools to make reproducible work in R easy and fun. We look forward to your feedback, community discussions, and future posts.