My formal training in computer programming consists of two R programming labs required by my first statistics classes, and some JavaScript and database training. That’s about it. Most of my programming knowledge is self-taught.1 For a researcher who does a lot of programming but doesn’t consider programming to be the job, that’s fine… up to a point.
While I understand the languages I need well enough, I don’t know much about programming best practices2. The gaps range from function naming to code organization, along with all the tools others have created to manage projects (git, make, ctags, etc.). For short scripts and blog posts, this is fine. Even for a research paper where you’re using tools rather than making new ones, this is okay. But when projects start to get big and highly innovative, my lack of knowledge of programming practices starts to bite me in the butt.
I program with R most of the time, and I’m smart enough to program defensively, writing generalizable functions with a reasonable amount of parameterization that accept other functions as inputs, which helps compartmentalize my code and makes changing parameters easy (a sketch of the kind of function I mean appears after the list below). But there’s a lot more I can learn, and I have read articles such as
- Jon Zelner’s “Reproducibility starts at home” series
- Chris von Csefalvay’s “The Ten Rules for Defensive Programming in R”
- Robert M. Flight’s series
Not surprisingly, there is seemingly contradictory advice among them. This blog post summarizes that advice and ends with a plea for help on what to do.
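To make concrete what I mean by parameterized functions that accept other functions as inputs, here is a minimal sketch. The names (sim_power, stat_fun, gen_fun) are hypothetical, invented for illustration rather than taken from my project.

```r
# A minimal sketch of a defensive, parameterized simulation wrapper.
# The statistic and the data-generating process are both passed in as
# functions, so the same wrapper works for many statistics and contexts.
sim_power <- function(stat_fun, gen_fun, nsim = 1000, n = 100,
                      alpha = 0.05, ...) {
  stopifnot(is.function(stat_fun), is.function(gen_fun),
            nsim > 0, n > 0, alpha > 0, alpha < 1)
  rejections <- vapply(seq_len(nsim), function(i) {
    x <- gen_fun(n, ...)           # simulate one data set
    stat_fun(x)$p.value < alpha    # did the test reject?
  }, logical(1))
  mean(rejections)                 # estimated power (size, under the null)
}

# Example: empirical size of the t-test under a standard normal null
sim_power(function(x) t.test(x), rnorm, nsim = 500, n = 50)
```

Because the statistic and the data-generating process are both arguments, changing either means passing a different function, not rewriting the loop.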
My Approach So Far (What NOT To Do)
I started the current project I’m working on in early 2016 (merciful heavens, has it been that long?). My advisor didn’t tell me where it was going; he seemed to pitch it as something to work on over the winter break. But it turned into my becoming one of his collaborators (along with a former Ph.D. student of his, now a professor at the University of Waterloo) and taking charge of all things programming (that is, simulations and applications) for a research paper introducing a new statistic for change point analysis (you can read my earlier post, where I introduced the topic for the first time).
To anyone out there wondering why academics write such terrible code, let me break it down for you:
- Academics are often not trained programmers. They learned programming on their own, enough to solve their research problems.
- Academic code is often produced in the course of doing research. My understanding of professional programming is that there is usually a plan and a project coordinator, along with documents coordinating the work. While I’m new to research, I don’t think research works in that manner.
- It’s hard to plan for research. Research breaks into new territory without a clear end goal, since we don’t necessarily know where it will end.3
As the project grew in scope, the code I wrote acquired features the way a boat hull acquires barnacles. Here’s a rough description of how my project is structured (brace yourself):
- The file ChangepointCommon.r contains nearly every important function and variable in my project, save for the functions used for drawing the final PDF plots of the analysis. Not every file uses everything from ChangepointCommon.r, but it is called via source() frequently. This file has a sister file, ChangepointCommon.cpp, for holding the C++ code that underlies some R functions.
- A file called powerSimulations2.r is a script that performs all the (Monte Carlo) simulations. These simulations are extensive, and I perform them on my school’s supercomputer to take advantage of its 60+ cores and 1 TB of RAM. They pit our test statistic against multiple similar statistics in a variety of contexts, for the sake of producing power curves at various sample sizes. This script is a derivative of powerSimulations.r, which did similar work while assuming that the long-run variance of the data was known.
- Somewhere there is a file that simulated our test statistic and showed that it would converge in distribution to some random variable under a variety of different contexts. I don’t know where this file went, but it’s somewhere. At least I saved the data.
- But the file that plots the results of those simulations is dist_conv_plots.r. I guess it makes plots, but if I remember right this file is effectively deprecated by another file… or maybe I just haven’t needed to make those plots for a long time. I don’t remember why it exists.
- There’s a file lrv_est_analysis_parallel.r that runs simulations showing that long-run variance estimation is hard, examining the performance of the estimators our test statistics need. Again, due to how much simulation I want to do, it is meant to be run on the department’s supercomputer. By the way, none of the files I mentioned above can be run directly from the command line; I’ve been using source() to run them.
- powerSimulations2.r creates .rda files that contain simulation results; these files need to be combined, and null-hypothesis rejection rates need to be computed. I used to do this with a file called misc1.R that effectively saved commands I was running by hand when the job was simple, but the job soon became very involved with all those files, so it turned into an abomination that I hated with a passion. It was just two days ago that I wrote functions doing the work misc1.R did, added them to ChangepointCommon.r, then wrote an executable script, powerSimStatDataGenerator.R, that accepts CSV files containing metadata (what files to use, what the corresponding statistical methods are, how to work with those methods) and uses them to generate a data file for later use.
- Data files for particular applications are scattered everywhere, and I just have to search for them when I want to work with them. We’ve changed applications, but the executable BankTestPvalComputeEW.R works with our most recent (and highlighted) application, taking in two CSV files containing data, doing statistical analyses with them, then spitting the results of those analyses into an .rda (or is it .Rda?) file.
- Finally, the script texPlotCreation.R takes all these analyses and draws pictures from them. This file includes functions, not found in ChangepointCommon.R, that are used for generating PDF plots. The plots are saved in a folder called PaperPDFPlots, which I recently archived, since we redid all our pictures using a different method but I want to keep the pictures made the old way, and they have the same file names.
- There is no folder hierarchy. All of this is stored in a directory called ChangepointResearch. There are 149 files in that directory, following different conventions; to be fair to myself, though, a lot of them were created by LaTeX. There is, of course, the powerSimulations_old directory where I saved the old version of the simulations, and the powerSimulations_paper directory where the most recent versions are kept. There’s also PaperPDFPlots, where all the plots are stored; that directory has 280 files, all of them plots.
- Finally, there’s a directory called Notebook, containing my attempt at writing a bookdown book that would serve as a research notebook. It’s filled with .Rmd files that contain the code I was writing, along with commentary.
In short, I never organized my project well. I remember once needing to sift through someone else’s half-finished project at an old job, and hating the creator of those directories so much. If someone else attempted to recreate my project without me around, I bet they would hate me too. I sometimes hate myself when I need to make revisions, as I finished doing just a few days ago.
I need to learn how to keep this from happening again, and how to properly organize a project, even if it seems small. Sometimes small projects turn into big ones. In fact, that’s exactly what happened here.
I think part of the reason this project turned out messy is that I learned what literate programming and R Markdown are around the time I started, and I took the ideas too far. I tried to do everything in .Rnw (and then .Rmd) files. While R Markdown and Sweave are great tools, and writing code together with its documentation is great, one can take it too far. First, sharing code among documents without copying and pasting (which is bad) is difficult. Second, in the wrong hands, one can think there’s not much need for organization, since this is just a one-off document.
Of course, the naming conventions are as inconsistent as possible. I documented functions, but if you read them you’d see that documentation and even commenting conventions varied wildly. Clearly I have no style guide, no linting, no checking that inputs are as expected, and on and on and on: a zoo of bad practice. These issues seem the most tractable to resolve:
- Pick a naming convention and stick with it. Do not mix them. Perhaps consider writing your own style guide.
- Use functions whenever you can, and keep them short.
- Use packrat to manage dependencies to keep things consistent and reproducible.
- roxygen2 is the standard for function documentation in R (a small example follows this list).
- Versioning systems like git keep you from holding onto old versions out of fear of needing them in the future. Use versioning software.
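To illustrate the documentation point, here is a minimal sketch of roxygen2-style comments. The function is a generic CUSUM-type statistic I made up for the example; it is not the statistic from our paper.

```r
#' Compute a CUSUM-type change point statistic
#'
#' @param x A numeric vector containing the series to test.
#' @param lrv An estimate of the long-run variance of \code{x}.
#'
#' @return The value of the test statistic.
#' @export
cusum_stat <- function(x, lrv) {
  stopifnot(is.numeric(x), length(x) > 1, lrv > 0)
  n <- length(x)
  # Centered partial sums: S_k - (k/n) * S_n
  partial <- cumsum(x) - (seq_len(n) / n) * sum(x)
  max(abs(partial)) / sqrt(n * lrv)
}
```

Running devtools::document() turns comment blocks like this into the .Rd help files that ? serves up.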
Aside from these, though, I’m torn between two general approaches to project management, which I describe below.
Projects As Executables
I think Jon Zelner’s posts were the first I read about how one might organize a project for both reproducibility and ease of management. Mr. Zelner suggests approaching a project the way software developers approach an application, except that instead of the final product being an executable, the final product is a paper. Some of the tips Mr. Zelner provides include:
- Treat R scripts as executables. That is, your scripts should be executable from the command line after placing a shebang at the beginning and running chmod +x script.R, and they should accept command-line arguments. (I manage this using optparse, though he recommends other tools; a sketch of such a script follows this list.)
- Organize your scripts with a standard directory structure (keep all scripts in a directory like /R, all data in /data, all figures in /fig, etc.), and create a makefile that describes their relationships. The makefile keeps the project up to date, making sure that all dependencies between scripts, data files, and figures are managed and that the repercussions of changes are fed forward appropriately.
- Use knitr for report writing, but do the bulk of the work elsewhere.
- Manage reproducibility issues, like dependency issues, using packages (which I don’t want to use); I may look to packrat for this.
The projects-as-executables idea resonated with me, and I’ve worked to write fewer .Rmd files and more executable .R files. You’ll notice that many files in my project are executable; this is an attempt at implementing Mr. Zelner’s ideas. The entire GNU/Linux development pipeline is available to you for your project, and there are many tools meant to make a coder’s life easier (like make).
Months ago, when the reviewers of our paper said they wanted more revisions and the responsibility for those revisions fell squarely on me, I thought of fully implementing Mr. Zelner’s ideas, makefiles and all. But the makefile intimidated me when I looked at the web of files and their dependencies, including the 200+ PDF files whose names I don’t even know and all the data files containing power simulations. Sadly, it was only a couple of days ago that I realized how to work around that problem (spoiler: it’s not really a problem).

Of course, this would require breaking up a file like ChangepointCommon.r. A file shouldn’t include everything. There should be a separate file for statistical functions, a file for functions relevant to simulations, a file for functions that transform data, a file for functions that make plots, and so on. This is important for making sure that dependencies are handled appropriately by make; you don’t want the exhaustive simulations redone because a function that makes plots was changed. ChangepointCommon.r has been acting like a poor man’s package, and that’s not a good design.
Projects as Packages
That last sentence serves as a good segue to the other idea for managing a research project: Chris von Csefalvay’s and Robert Flight’s suggestion to handle research projects as packages. Everything is a part of a package. Functions are functions in a package. Analyses are vignettes of the package. Data are included in the /data directory of the package. Again: everything is a package.
Packages are meant to be distributed, so viewing a project as a package means preparing your project to be distributed. This can help keep you honest when programming; you’re less likely to take shortcuts when you plan on releasing it to the world. Furthermore, others can start using the tools you made and that can earn you citations or even an additional paper in J. Stat. Soft. or the R Journal. And as Csefalvay and Flight point out, when you approach projects as package development, you get access to the universe of tools meant to aid package development. The documentation you write will be more accessible to you, right from the command line.
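For what it’s worth, a minimal sketch of starting down that road might look like the following, using the usethis package; usethis is my choice of tooling for the example, and the names are hypothetical rather than anything Csefalvay or Flight prescribe.

```r
library(usethis)

create_package("~/Projects/changepointTools")  # package skeleton
use_r("cusum_stat")                # R/cusum_stat.R holds the statistic's code
use_testthat()                     # tests/ directory for unit tests
use_vignette("bank-application")   # an analysis written up as a vignette
use_data_raw("bank_data")          # scripts that build the shipped data sets
```

Once the skeleton exists, devtools::document(), devtools::check(), and devtools::build() provide the package-development toolchain they describe.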
My concern is that not everything I want to do seems to belong in a package or vignette. In particular, the intensive Monte Carlo simulations don’t belong in a package, and vignettes don’t play well with the paper my collaborators actually write or the format the journal wants.
I could get the benefits of both Csefalvay and Flight’s approach and Mr. Zelner’s approach by putting the functions and resulting data sets in a package; putting simple analyses, demonstrations, and de facto unit tests in package vignettes; and putting the rest in project directories and files that follow Mr. Zelner’s approach. But if the functions in the highly unstable package are modified, how will make know to redo the relevant analyses?
I don’t know anything about package development, so I could be overcomplicating the issue. Of course, if the work is highly innovative, like developing a new, never-before-seen statistical method (as I’m doing), then a package will eventually be needed. Perhaps it’s best to lean toward that approach for that fact alone.
Conclusion
I would like to hear others’ thoughts on this. Like I said at the beginning, I don’t know much about good practices. If you have any tips, I’d like to hear them. If there’s a way to get the best of both Mr. Zelner’s approach and Csefalvay and Flight’s, I’d especially like to hear about it. It may be too late for this project, but I want to keep these ideas in mind in the future to keep myself from making the same mistakes.
I have created a video course published by Packt Publishing entitled Training Your Systems with Python Statistical Modelling, the third volume in a four-volume set of video courses entitled, Taming Data with Python; Excelling as a Data Analyst. This course discusses how to use Python for machine learning. The course covers classical statistical methods, supervised learning including classification and regression, clustering, dimensionality reduction, and more! The course is peppered with examples demonstrating the techniques and software on real-world data and visuals to explain the concepts presented. Viewers get a hands-on experience using Python for machine learning. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.
- I did take classes from the School of Computing at the University of Utah, enough to earn a Certificate in Big Data, but they assumed programming knowledge; they didn’t teach how to program. That’s not the same thing. ↩
- Another problem is that I haven’t been taught to analyze my code algorithmically. Thus, I sometimes write super slow code. ↩
- That said, while you can’t necessarily anticipate the end structure, you could take steps to minimize how much redesigning will need to be done when the unexpected eventually surfaces; this is important to programming defensively. ↩