[This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
If you ever write code for scientific computing (chances are you do if you’re here), stop what you’re doing and spend 8 minutes reading this open-access paper:Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Wilson et al. Best Practices for Scientific Computing. arXiv:1210.0530 (2012). (Direct link to PDF).
The paper makes a number of good points regarding software as a tool just like any other lab equipment: it should be built, validated, and used as carefully as any other physical instrumentation. Yet most scientists who write software are self-taught, and haven’t been properly trained in fundamental software development skills.
The paper outlines ten practices every computational biologist should adopt when writing code for research computing. Most of these are the usual suspects that you’d probably guess – using version control, workflow management, writing good documentation, modularizing code into functions, unit testing, agile development, etc. One that particularly jumped out at me was the recommendation to document design and purpose, not mechanics.
We all know that good comments and documentation is critical for code reproducibility and maintenance, but inline documentation that recapitulates the code is hardly useful. Instead, we should aim to document the underlying ideas, interface, and reasons, not the implementation.
For example, the following commentary is hardly useful:
# Increment the variable “i” by one.
i = i+1< !----->< !----->
The real recommendation here is that if your code requires such substantial documentation of the actual implementation to be understandable, it’s better to spend the time rewriting the code rather than writing a lengthy description of what it does. I’m very guilty of doing this with R code, nesting multiple levels of functions and vector operations:
# It would take a paragraph to explain what this is doing.
# Better to break up into multiple lines of code.
sapply(data.frame(n=sapply(x, function(d) sum(is.na(d)))), function(dd) mean(dd))
It would take much more time to properly document what this is doing than it would take to split the operation into manageable chunks over multiple lines such that the code no longer needs an explanation. We’re not playing code golf here – using fewer lines doesn’t make you a better programmer.
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.