
How Should I Organize My R Research Projects?

[This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers.]

My formal training in computer programming consists of two R programming labs required by my first statistics classes, and some JavaScript and database training. That’s about it. Most of my programming knowledge is self-taught.[1] For a researcher who does a lot of programming but doesn’t consider programming to be the job, that’s fine… up to a point.

While I understand the languages I need well enough, I don’t know much about programming best practices.[2] This ranges from function naming to code organization, along with all the tools others have created to manage projects (git, make, ctags, etc.). For short scripts and blog posts, this is fine. Even for a research paper where you’re using tools rather than making new ones, this is okay. But when projects start to get big and highly innovative, my lack of knowledge of programming practices starts to bite me in the butt.

I program with R most of the time, and I’m smart enough to program defensively: I write generalizable functions that have a reasonable amount of parameterization and that accept other functions as inputs, which helps compartmentalize my code and makes it easy to change parameters. But there’s a lot more I can learn, and I have read several articles on the subject.

Not surprisingly, some of the advice seems contradictory. This blog post summarizes that advice and ends with a plea for help on what to do.
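As an illustration of the defensive style mentioned above, here is a minimal sketch of a simulation driver that takes other functions as inputs. The names (`simulate_rejection_rate`, `z_stat`) and the default critical value are hypothetical and are not taken from my project.

```r
# Hypothetical example: estimate how often a test rejects, by simulation.
# `stat_fun` and `rng_fun` are passed in as functions, so the same driver
# works for any statistic and any data-generating process.
simulate_rejection_rate <- function(stat_fun, rng_fun, n = 50, reps = 200,
                                    critical_value = 1.96, ...) {
  rejections <- vapply(seq_len(reps), function(i) {
    x <- rng_fun(n)                          # draw one simulated sample
    abs(stat_fun(x, ...)) > critical_value   # TRUE if the test rejects
  }, logical(1))
  mean(rejections)                           # proportion of rejections
}

# A simple z-type statistic for testing a zero mean (illustration only)
z_stat <- function(x, mu0 = 0) sqrt(length(x)) * (mean(x) - mu0) / sd(x)

set.seed(1)
rate <- simulate_rejection_rate(z_stat, rnorm, n = 30, reps = 500)
```

Swapping in a different statistic or data generator means changing an argument, not editing the driver.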

My Approach So Far (What NOT To Do)

I started the current project I’m working on in early 2016 (merciful heavens, has it been that long?). My advisor didn’t tell me where it was going; he seemed to pitch it as something to work on over the winter break. But it turned into my becoming one of his collaborators (along with a former Ph.D. student of his, now a professor at the University of Waterloo) and my taking charge of all things programming (that is, simulations and applications) for a research paper introducing a new statistic for change point analysis (you can read my earlier post, where I introduced the topic for the first time).

To anyone out there wondering why academics write such terrible code, let me break it down for you:

  1. Academics often are not trained programmers. They learned programming on their own, enough to solve their research problems.
  2. Academic code often was produced during research. My understanding of professional programming is that often a plan with a project coordinator exists, along with documents coordinating it. While I’m new to research, I don’t think research works in that manner.
  3. It’s hard to plan for research. Research breaks into new territory without a fixed end goal, since we don’t necessarily know where it will end.[3]

As the project grew in scope, the code I wrote acquired features the way a boat hull acquires barnacles. Here’s a rough description of how my project is structured (brace yourself):

In short, I never organized my project well. I remember once needing to sift through someone else’s half-finished project at an old job, and hating the creator of those directories so much. If someone else attempted to recreate my project without me around I bet they would hate me too. I sometimes hate myself when I need to make revisions, as I finished doing a few days ago.

I need to learn how not to let this happen again, and how to properly organize a project, even one that seems small. Sometimes small projects turn into big ones. In fact, that’s exactly what happened here.

I think part of the reason this project turned out messy is that I learned what literate programming and R Markdown are around the time I started, and I took the ideas too far. I tried to do everything in .Rnw (and then .Rmd) files. While R Markdown and Sweave are great tools, and writing code alongside its documentation is great, one can go too far with it. First, sharing code among documents without copy/paste (which is bad) is difficult to do. Second, it’s easy to think there’s not much need for organization since the document is a one-off.

Of course, the naming conventions are as inconsistent as possible. I documented functions, but if you read them you’d see that documentation and even commenting conventions vary wildly. Clearly I have no style guide, no linting, no checking that inputs are as expected, and on and on: a zoo of bad practice. These issues seem the more tractable ones to resolve.
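Input checking, at least, is cheap to add in base R. The following is a minimal sketch that uses `stopifnot()` at the top of a function; the helper `trimmed_mean` is hypothetical and only stands in for any function whose arguments should be validated.

```r
# Hypothetical helper: a trimmed mean that validates its inputs up front,
# so bad arguments fail immediately with an informative error.
trimmed_mean <- function(x, trim = 0.1) {
  stopifnot(
    is.numeric(x), length(x) > 0, !anyNA(x),
    is.numeric(trim), length(trim) == 1, trim >= 0, trim < 0.5
  )
  mean(x, trim = trim)
}

trimmed_mean(c(1, 2, 3, 100), trim = 0.25)  # averages the middle two values: 2.5
```

Calling `trimmed_mean("a")` stops with an error naming the failed condition instead of silently returning something strange.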

Aside from these, though, I’m torn between two general approaches to project management, which I describe below.

Projects As Executables

I think Jon Zelner’s posts were the first I read about how one may want to organize a project for both reproducibility and ease of management. Mr. Zelner suggests approaching a project the way software developers approach an application, except that the final product is a paper rather than an executable. Some of the tips Mr. Zelner provides include:

The projects-as-executables idea resonated with me and I’ve worked to write fewer .Rmd files and more executable .R files. You’ll notice that many files in my project are executable; this is an attempt at implementing Mr. Zelner’s ideas. The entire GNU/Linux development pipeline is available to you for your project, and there are many tools meant to make a coder’s life easier (like make).
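For readers who have not written one, an executable .R file might look roughly like the sketch below. The file name `run_simulation.R`, its arguments, and the toy simulation are all hypothetical; the only real pieces are `commandArgs()` and the `Rscript` front end.

```r
#!/usr/bin/env Rscript
# Hypothetical script: run_simulation.R (all names here are illustrative)
# Usage from a shell:  Rscript run_simulation.R 100 output.rds

# Turn the raw command-line strings into validated options
parse_args <- function(args) {
  if (length(args) != 2) {
    stop("usage: run_simulation.R <reps> <output-file>")
  }
  reps <- suppressWarnings(as.integer(args[1]))
  if (is.na(reps) || reps < 1) stop("<reps> must be a positive integer")
  list(reps = reps, output = args[2])
}

main <- function(args) {
  opts <- parse_args(args)
  set.seed(20160101)                               # fixed seed for reproducibility
  results <- replicate(opts$reps, mean(rnorm(50))) # toy stand-in for a simulation
  saveRDS(results, opts$output)
  invisible(opts$output)
}

# Run only when arguments were supplied on the command line; with none
# (for example when the file is source()d) only the functions are defined.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) > 0) main(args)
```

Keeping the work inside `main()` means the same file can be run from a shell, called by make, or sourced and tested interactively.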

Months ago, when the reviewers of our paper said they wanted more revisions and the responsibility for those revisions fell squarely on me, I thought of fully implementing Mr. Zelner’s ideas, with make files and all. But the make file intimidated me when I looked at the web of files and their dependencies, including the 200+ PDF files whose names I don’t even know, or all the data files containing power simulations. Sadly, only a couple days ago I realized how to work around that problem (spoiler: it’s not really a problem).

Of course, this would require breaking up a file like ChangepointCommon.r. A file shouldn’t include everything. There should be a separate file for statistical functions, a file for functions relevant to simulations, a file for functions that transform data, a file for functions that make plots, and so on. This is important to make sure that dependencies are appropriately handled by make; you don’t want the exhaustive simulations redone because a function that makes plots was changed. ChangepointCommon.r has been acting like a poor man’s package, and that’s not a good design.
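To see why the split matters, recall that make rebuilds a target when any file it depends on is newer than the target. The sketch below imitates that rule in R with file modification times; `is_stale` and the file names are hypothetical, and this is only an illustration of the idea, not a replacement for make.

```r
# Hypothetical helper mimicking make's rule: a target is out of date if it
# does not exist or if any dependency was modified more recently than it.
is_stale <- function(target, deps) {
  stopifnot(all(file.exists(deps)))
  if (!file.exists(target)) return(TRUE)
  any(file.mtime(deps) > file.mtime(target))
}

# Toy project in a temporary directory (file names are illustrative)
proj <- tempfile("proj")
dir.create(proj)
sim_code  <- file.path(proj, "simulation_functions.R")
plot_code <- file.path(proj, "plot_functions.R")
sim_out   <- file.path(proj, "power_simulations.rds")
writeLines("# simulation code", sim_code)
writeLines("# plotting code", plot_code)
saveRDS(1:3, sim_out)

# Set timestamps explicitly so the example is deterministic
now <- Sys.time()
Sys.setFileTime(sim_code, now - 60)   # simulation code is older than its output
Sys.setFileTime(sim_out, now - 30)
Sys.setFileTime(plot_code, now)       # plotting code was edited most recently

is_stale(sim_out, sim_code)                # FALSE: simulations are up to date
is_stale(sim_out, c(sim_code, plot_code))  # TRUE: lumping files forces a rerun
```

When the plotting code lives in the same file as the simulation code, the second situation is the only one available, and every plot tweak invalidates the simulations.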

Projects as Packages

That last sentence serves as a good segue to the other idea for managing a research project: Chris Csefalvay and Robert Flight’s suggestion to write packages for research projects. Csefalvay and Flight suggest that a project should be handled as a package, and that everything in it is part of that package. Functions are functions in the package. Analyses are vignettes of the package. Data are included in the /data directory of the package. Again: everything is a package.
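As a rough picture of what that layout means on disk, the sketch below builds a bare skeleton in a temporary directory using only base R. The package name `changepointpaper` and the file names are hypothetical, and the DESCRIPTION file is an incomplete stub; a real package needs more fields than are shown here.

```r
# Hypothetical package name and file names, for illustration only
pkg <- file.path(tempdir(), "changepointpaper")
for (d in c("R", "data", "vignettes")) {
  dir.create(file.path(pkg, d), recursive = TRUE, showWarnings = FALSE)
}

# Incomplete DESCRIPTION stub; a real package needs additional fields
writeLines(c(
  "Package: changepointpaper",
  "Title: Code and Data for a Research Project",
  "Version: 0.0.1"
), file.path(pkg, "DESCRIPTION"))

# Functions live in R/, analyses in vignettes/, data sets in data/
writeLines("# statistical functions go here",
           file.path(pkg, "R", "statistics.R"))
writeLines("# an analysis written as a vignette",
           file.path(pkg, "vignettes", "analysis.Rmd"))

layout <- list.files(pkg, recursive = TRUE)
print(layout)
```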

Packages are meant to be distributed, so viewing a project as a package means preparing your project to be distributed. This can help keep you honest when programming; you’re less likely to take shortcuts when you plan on releasing it to the world. Furthermore, others can start using the tools you made and that can earn you citations or even an additional paper in J. Stat. Soft. or the R Journal. And as Csefalvay and Flight point out, when you approach projects as package development, you get access to the universe of tools meant to aid package development. The documentation you write will be more accessible to you, right from the command line.

My concern is that not everything I want to do seems like it belongs in a package or vignette. Particularly, the intensive Monte Carlo simulations don’t belong in a package. Vignettes don’t play well with the paper that my collaborators actually write or the format the journal wants.

I could get the benefit of Csefalvay and Flight’s approach and Mr. Zelner’s approach by: putting the functions and resulting data sets in a package; putting simple analyses, demonstrations, and de facto unit tests in package vignettes; and putting the rest in project directories and files that follow Mr. Zelner’s approach. But if the functions in the highly-unstable package are modified, how will make know to redo the relevant analyses?

I don’t know anything about package development, so I could be overcomplicating the issue. Of course, if the work is highly innovative, like developing a new, never-before-seen statistical method (as I’m doing), then a package will eventually be needed. Perhaps it’s best to lean on the package approach anyway for that reason alone.

Conclusion

I would like to hear others’ thoughts on this. Like I said at the beginning, I don’t know much about good practices. If there are any tips, I’d like to hear them. If there’s a way to maximize the benefits of both Mr. Zelner’s and Csefalvay and Flight’s approaches, I’d especially like to hear about it. It may be too late for this project, but I’ll want to keep these ideas in mind for the future to keep myself from making the same mistakes.




  1. I did take classes from the School of Computing at the University of Utah, enough to earn a Certificate in Big Data, but they assumed programming knowledge; they didn’t teach how to program. That’s not the same thing. 
  2. Another problem is that I haven’t been taught to analyse my code algorithmically. Thus, I sometimes write super slow code.
  3. That said, while you can’t necessarily anticipate the end structure, you could take steps to minimize how much redesigning will need to be done when the unexpected eventually surfaces; this is important to programming defensively. 
