Get a “treat”: A template for reproducible research with R

An Accounting and Data Science Nerd's Corner

1 year ago

[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers].


The Open Science Data Center of TRR 266 aims to facilitate the use of open science methods in the area of accounting. One lesson that we learned over the last year is that many researchers, while generally very positive towards the principles of open science, struggle to get their projects into a shape that allows them to share them with others.

Thus, we developed a TRR reproducible empirical accounting research template (treat). This repository, while predominantly targeted at the team members of our research network, provides a structured platform for reproducible R-based research projects in general. To make it more accessible to everybody who is new to R, we also “produced” a short video series that shows you how to set up your local computing environment and how to reproduce the toy analysis contained in the repository. Based on this, you should be able to build your own research projects in a reproducible way.¹

Step 1: Install R

Every empirical project needs to use a statistical programming language. While there are great commercial alternatives, if you want to make your work reproducible, you should consider using a software environment that is freely available to everybody. For this, Julia, Python and R come to mind. We will be using R here. Installing R itself is simple. Go to https://cloud.r-project.org, then download and install the current R binaries for your platform. Do not create any menu items or task bar entries. You won’t need them as you will be using R via RStudio. (Video)

Step 2: Install RStudio

RStudio is an Integrated Development Environment (IDE) centered on R. It is one of the main reasons why we are such strong advocates of using R for reproducible research. While it is R centered, it also offers everything you need to bridge to other programming languages and tools, most notably Python. Installing RStudio is, again, straightforward: Go to https://rstudio.com/products/rstudio/download/, then download and install the free Desktop version. This time around, I suggest adding a task bar entry. You will need it. And the icon looks nice. (Video)

Step 3: Install Rtools

While not strictly required, we are firm believers that you need a stable set of cross-platform development tools to make your projects reproducible. And let’s face it, the Unix development toolchain (which is the essence of Rtools) is the standard of cross-platform development. Again, installing Rtools essentially means going to the CRAN web page (https://cran.r-project.org/bin/windows/Rtools/) and downloading the appropriate file. After installation, make sure that you add it to your PATH environment variable. While you could put it in your main path, it seems wise to limit this to R sessions so that the tools do not interact with others that you might already have available. As suggested on the download page linked above, run this snippet in the R console (the pane on the left side of RStudio):

writeLines('PATH="${RTOOLS40_HOME}\\usr\\bin;${PATH}"', con = "~/.Renviron")

It creates a hidden .Renviron file in your home directory that adds the relevant Rtools directory to the path. After restarting your R session (Session / Restart R from within RStudio), the toolchain should be available. You can verify this by opening a Terminal (tab on the left side of RStudio) and entering “which make”. It should now point you to the Rtools directory. (Video)

You only need to install Rtools when you work in a Windows environment. Linux users should have most of the necessary tools available by default and can install missing bits with their package manager. Mac OS users should install the Xcode command line tools by running xcode-select --install from within a terminal.
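
Regardless of platform, you can also check from the R console whether the current R session finds the toolchain. The following is a minimal sketch that only relies on base R; the variable name make_path is ours and the message texts are purely illustrative.

```r
# Ask the current R session where (and whether) it finds 'make' on its PATH.
# Sys.which() returns an empty string if the program cannot be found.
make_path <- Sys.which("make")

if (nzchar(make_path)) {
  message("make found at: ", make_path)
} else {
  message("make is not on the PATH of this R session")
}
```

If the second message appears on Windows, double-check the .Renviron entry and restart the R session.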

Step 4: Install Git

Software projects tend to spread out over multiple files that evolve over time. You will be trying things and decide at some point that you would rather return to a previous state. Sometimes you will have one version of your code “in production” (for us researchers, this means that it was used for a publicly available paper) while you keep working on new iterations. And, finally and maybe most importantly, you often collaborate with others on a project and work on it concurrently.

All these points make it useful to use a version control system. ‘Git’ has become the de facto standard in recent years. It makes it easy to maintain local code repositories, to clone code from public repositories like ‘GitHub’, to collaborate with others over remote repositories and to make your code publicly available.

Installing Git for Windows essentially means going to the web page https://git-scm.com/download/win and installing the relevant binary (Video). While it is fine to accept all suggested defaults, personally I would not install the Explorer addin as you will be using Git from within RStudio or the command line most of the time.

After installing Git you need to configure it. At a minimum, this means using the Terminal to set your user name and email address as follows:

git config --global user.name "John Doe"
git config --global user.email johndoe@example.com

See the first-time install section on the Pro Git Book for further information. If you already have a GitHub account (you do not need one for this tutorial), you should link it to your Git installation. See here for details.

Step 5: Clone the ‘treat’ GitHub repo into an RStudio project

After we installed the necessary tools, we can now download and explore the ‘treat’ template repository. While there are several ways of doing that, probably the easiest is to ‘clone’ the repository. For this, you do not need to have a GitHub account. Instead you create a local copy of the repository on your hard disk and set up a local repository. You can always decide to set up a remote repository for it at a later time.

The video shows you how it is done. The essential steps are:

Start a new project in RStudio (File / New Project)
Choose ‘Version Control’
Enter the URL of the ‘treat’ repository: https://github.com/trr266/treat

After you are done cloning the repository, take a minute to familiarize yourself with the README. It explains the folder structure and discusses the next steps that you need to take to run the analysis. You will see that we can already cross off point #1 of the five-point to-do list (cloning the repo). Great!

Step 6: Install R packages and TinyTex

R is built around a relatively lightweight set of base functionality and a very extensive and versatile package ecosystem. This implies that for just about every project, you will be using packages that need to be installed by everybody who wants to reproduce your code. When you prepare your code, it seems tempting to simply install the packages for the user, but we advise against it. After all, when installing something, you are interfering with the computing environment of the user. We believe that this is a decision that the user needs to take. Thus, we included a little snippet of R code in the README of the repository that installs the packages. You can simply copy and paste the code into the R console to install the packages. On a fresh R install, this will take a while as packages in turn have other packages as dependencies, causing almost an “avalanche of package installs”. See the video to see how it unfolds.

To make sure that you do not have to install packages that are already installed on your system we use the following small wrapper function that checks the installed packages first:

install_package_if_missing <- function(pkg) {
  if (! pkg %in% installed.packages()[, "Package"]) install.packages(pkg)
}
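
As a hedged variation on the same idea, the sketch below checks a whole vector of package names at once and returns only the ones that are not yet available, so that the user can decide what to install. The helper name missing_packages() and the example package names are ours and not part of the repository. Note that requireNamespace() loads the namespace of each package it finds as a side effect.

```r
# Return the subset of `pkgs` that cannot be loaded in this R installation.
missing_packages <- function(pkgs) {
  is_available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_available]
}

# 'stats' and 'utils' ship with R, so nothing should be reported as missing
missing_packages(c("stats", "utils"))

# Typical use (left commented out so that nothing is installed without the
# user's consent):
# to_install <- missing_packages(c("stats", "utils"))
# if (length(to_install) > 0) install.packages(to_install)
```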

After installing the packages, it might be a good time to install a LaTeX environment if you do not already have one installed. For this purpose, the lightweight TinyTeX distribution is extremely convenient. It can be installed from within R, along with its accompanying package, with two lines of code:

install.packages('tinytex')
tinytex::install_tinytex()

See the video for all steps in combination.

Step 7: Build the project

Now that we have completed point #2 of the project README, we can move on to actually running the analysis (video). For that, you first have to enter your WRDS credentials in the _config.csv file and save the resulting file as config.csv. When designing research projects for reproducibility, you might have certain inputs that depend on the user environment. While you can often code around them (e.g., by using relative and not absolute paths or by providing code for different platforms), sometimes you need to customize the code to the user. We think that in this case it is a good idea to store all the data that needs configuration by the user in one dedicated place instead of spreading it out across various code files.
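
To illustrate the idea of one dedicated configuration file, here is a minimal sketch of how analysis code might read it. Everything specific in it is an assumption made for illustration: the column names variable and value, the keys wrds_user and wrds_pwd, and the helper read_config(). The actual layout of _config.csv in the repository may well differ.

```r
# Read a two-column configuration file into a named list.
# Hypothetical layout: one row per setting, columns `variable` and `value`.
read_config <- function(path) {
  cfg <- read.csv(path, colClasses = "character", strip.white = TRUE)
  stopifnot(all(c("variable", "value") %in% names(cfg)))
  setNames(as.list(cfg$value), cfg$variable)
}

# Demonstration with a temporary file instead of a real config.csv
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c("variable,value", "wrds_user,jdoe", "wrds_pwd,not-a-real-password"),
  demo_path
)

cfg <- read_config(demo_path)
cfg$wrds_user
```

The point of the design is that code files only ever refer to names such as cfg$wrds_user, while the user-specific values live in a single file that never enters version control.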

Please note: You should never commit sensitive data (like your WRDS password) to a repository, even if it is a local one. The risk is that at a later point you might decide to take the repository public, which would then also make your passwords public. This is why we included config.csv in the .gitignore file. The .gitignore file specifies which files should be ignored when committing.
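
If you want to double-check this in your own projects, a rough sketch like the following might help. The helper is_listed_in_gitignore() is our own name, and it is deliberately simplistic: it only looks for an exact line in the .gitignore file and does not understand the wildcard patterns that Git supports. Running git check-ignore in the terminal gives the authoritative answer.

```r
# Check whether `file` appears as an exact entry in a .gitignore file.
# Simplification: wildcard patterns and negations are not evaluated.
is_listed_in_gitignore <- function(file, gitignore = ".gitignore") {
  if (!file.exists(gitignore)) return(FALSE)
  entries <- trimws(readLines(gitignore, warn = FALSE))
  file %in% entries
}

# Demonstration with a temporary stand-in for a project's .gitignore
demo_gitignore <- tempfile()
writeLines(c("# user-specific settings", "config.csv", ".Rhistory"), demo_gitignore)

is_listed_in_gitignore("config.csv", demo_gitignore)
is_listed_in_gitignore("_config.csv", demo_gitignore)
```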

After storing your WRDS credentials in the config.csv file, you can run the analysis either by typing “make all” in the terminal residing in the project root or by hitting “Build all” in the build tab of RStudio (upper right quadrant). This will execute several code files in a specific order and ultimately produce three output files in the output directory:

results.rda: An R data file containing the sample used in the final analysis and some R objects containing the figures, tables and regression models that are used in the paper and/or the slide deck
paper.pdf: A paper stub containing some text, figures and tables.
presentation.pdf: A slide deck using LaTeX beamer document class and our new beamer template (also stored in the repo)

The build process uses GNU ‘make’ to produce the output files. ‘make’ is a very versatile tool to create ‘targets’ based on ‘rules’ and ‘prerequisites’. It is agnostic of the programming language being used. Instead, a Makefile simply describes the dependencies that, taken together, create the output of a project. To understand the syntax of a Makefile, this might be a good starting point. For the ‘treat’ repository, the dependencies can be described as follows:²

We have two targets: A paper and a slide deck.
Both get produced by calling rmarkdown::render()
Both depend on their respective source files and the results.rda file.
results.rda is produced by sourcing the do_analysis.R code.
The do_analysis.R code depends on the data being present in the file acc_sample.rds
The data in acc_sample.rds are generated by running prepare_data.R
The prepare_data.R code depends on WRDS data being present in the file cstat_us_sample.rds
The cstat_us_sample.rds file is generated by running pull_wrds_data.R
The code file pull_wrds_data.R, finally, depends on config.csv being present.
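
For readers who have not used ‘make’ before: it decides whether a target needs to be (re)built by comparing file modification times. The sketch below mimics that rule in R for illustration only. The function needs_rebuild() is our own and is not part of the repository or of ‘make’ itself, and the demonstration uses temporary files that merely borrow file names from the list above.

```r
# A target is out of date if it does not exist or if any of its
# prerequisites has been modified more recently than the target.
needs_rebuild <- function(target, prerequisites) {
  stopifnot(all(file.exists(prerequisites)))
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(prerequisites) > file.mtime(target))
}

# Demonstration in a temporary directory
demo_dir <- tempfile("make_demo")
dir.create(demo_dir)
src <- file.path(demo_dir, "prepare_data.R")
out <- file.path(demo_dir, "acc_sample.rds")
writeLines("# code", src)

# The target does not exist yet, so it has to be built
missing_target <- needs_rebuild(out, src)

# Create the target and pretend the code was last edited before that
saveRDS(1, out)
t0 <- Sys.time() - 3600
Sys.setFileTime(out, t0)
Sys.setFileTime(src, t0 - 60)
up_to_date <- !needs_rebuild(out, src)

# Pretend the code was edited after the target was built
Sys.setFileTime(src, t0 + 60)
stale_after_edit <- needs_rebuild(out, src)
```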

The big advantage of setting up a Makefile to reflect all these dependencies is that it will make sure that, whenever you change something in the code base, all the parts that need to be rebuilt, and only those parts, will be rebuilt the next time you run make. For example, when you change the treatment of extreme observations when estimating the discretionary accruals in the file prepare_data.R, the next time you run make all only prepare_data.R, do_analysis.R and the render processes will be re-run but the WRDS data will not be pulled again. Neat!
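
To see why exactly these steps are affected, it can help to write the dependency chain down as data and walk through it. The following is only a sketch of the logic, not how ‘make’ is implemented. The names paper.Rmd and presentation.Rmd are hypothetical placeholders for the “respective source codes”, directory prefixes are omitted, and targets_to_rebuild() is our own helper.

```r
# Prerequisites of each target, transcribed from the dependency list.
# "paper.Rmd" and "presentation.Rmd" are hypothetical file names.
prerequisites <- list(
  "paper.pdf"           = c("paper.Rmd", "results.rda"),
  "presentation.pdf"    = c("presentation.Rmd", "results.rda"),
  "results.rda"         = c("do_analysis.R", "acc_sample.rds"),
  "acc_sample.rds"      = c("prepare_data.R", "cstat_us_sample.rds"),
  "cstat_us_sample.rds" = c("pull_wrds_data.R", "config.csv")
)

# Return every target that directly or indirectly depends on `changed`.
targets_to_rebuild <- function(changed, prerequisites) {
  affected <- character(0)
  frontier <- changed
  while (length(frontier) > 0) {
    # Targets that list any file of the current frontier as a prerequisite
    hit <- names(prerequisites)[
      vapply(prerequisites, function(p) any(p %in% frontier), logical(1))
    ]
    frontier <- setdiff(hit, affected)
    affected <- c(affected, frontier)
  }
  affected
}

affected <- targets_to_rebuild("prepare_data.R", prerequisites)
affected
```

The result contains acc_sample.rds, results.rda and both PDF files, but not cstat_us_sample.rds, which is why the WRDS data are left alone.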

And now?

If you made it this far, congratulations! You have successfully set up your computing environment for R and reproduced a small empirical accounting research project. A useful next step could be to modify the code to familiarize yourself with the make process and the code structure. Then you can use the ‘treat’ repository for your own projects. Finally, you could fork the repo, modify it to fit your needs and help us by issuing pull requests, improving the repository and potentially extending it to contain more statistical programming languages or approaches.

Enjoy!

¹ For ease of exposition, the following and the videos assume a Windows environment. Most of it, however, also applies to Mac OS and Linux derivatives.
² I gloss over some details here. Feel free to take a look at the Makefile if you are interested.
