Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Brogramming is the art of looking good while you write code. Inverse brogramming is a silly term that I’m trying to coin for the opposite, but more important, concept: the art of writing good looking code.
At useR2013 I gave a talk on inverse brogramming in R – for those of you who weren’t there but live in North West England, I’m repeating the talk at the Manchester R User Group on 8th August. For everyone else, here is a very quick rundown of the ideas.
With modern data analysis, you really have two jobs: being a statistician and being a programmer. This is especially true with R, where pointing and clicking is mostly eschewed in favour of scripting. If you come from a statistics background, then it’s very easy to focus on just the stats to the detriment of learning any programming skills.
The thing is though, software developers have spent decades figuring out how to make writing code easier, so there are lots of tips and tricks that can make your life easier.
My software dev bible is Steve McConnell’s Code Complete. Read it! It will change your life.
The minor downside is that, although very readable, it’s about 850 pages, so it takes some getting through.
The good news is that there are a couple of simple things you can do that I think have the highest productivity-to-effort ratio.
Firstly, use a style guide. This is just a set of rules that explains what your code should look like. If your code has a consistent style, then it becomes much easier to read code that you wrote last year. If your whole team has a common style, then it becomes much easier to collaborate. Style helps you scale projects to more programmers.
There’s a style guide over the page, here.
Secondly, treat functions as a black box. You shouldn’t need to examine the source code to understand what a function does. Ideally, a function’s signature should clearly tell you what it does. The signature is the name of the function and its inputs. (Technically it includes the output as well, but that’s difficult to determine programmatically in R.)
The sig package helps you determine whether your code lives up to this ideal.
The sig function prints a function signature.
library(sig)
sig(read.csv)
## read.csv <- function(file, header = TRUE, sep = ",", quote = "\"", dec
##        = ".", fill = TRUE, comment.char = "", ...)
So far, so unexciting. The args and formals functions do much the same thing. (In fact the sig function is just a wrapper to formals with a pretty print method.)
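For comparison, here is a minimal sketch of the base R route, using only formals and args on read.csv. Nothing here comes from the sig package; the variable name read_csv_args is just an illustrative choice.

```r
# Base R only; no extra packages are assumed.
# formals() returns the argument list of a closure as a named pairlist.
read_csv_args <- formals(read.csv)

# The argument names, in the order they appear in the signature.
print(names(read_csv_args))

# Default values are stored as the elements of that list.
print(read_csv_args$header)
print(read_csv_args$sep)

# args() prints a skeleton function with the same signature.
print(args(read.csv))
```

Both calls give you the raw ingredients of a signature; what sig adds is the tidy printed form shown above.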
It gets more interesting when you look at the signatures of lots of functions together. list_sigs prints all the sigs from a file or environment.
list_sigs(pkg2env(tools))
## add_datalist <- function(pkgpath, force = FALSE)
##
## bibstyle <- function(style, envir, ..., .init = FALSE, .default =
##        TRUE)
##
## buildVignettes <- function(package, dir, lib.loc = NULL, quiet =
##        TRUE, clean = TRUE, tangle = FALSE)
##
## check_packages_in_dir <- function(dir, check_args = character(),
##        check_args_db = list(), reverse = NULL,
##        check_env = character(), xvfb = FALSE, Ncpus =
##        getOption("Ncpus", 1), clean = TRUE, ...)
Even from just the first four signatures, you can see that the tools package has a style problem. add_datalist and check_packages_in_dir are lower_under_case, buildVignettes is lowerCamelCase, and bibstyle is plain lowercase.
I don’t want to pick on the tools package – it was written by lots of people over a long time period, and S compatibility was a priority for some parts, but you really don’t want to write your own code like this.
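If you want to check naming consistency mechanically rather than by eye, something like the following rough sketch could work. classify_name is a hypothetical helper written for this example (it is not part of the sig package), and its regular expressions only cover the three styles mentioned above; anything else falls into "other".

```r
# classify_name() is a hypothetical helper, not part of the sig package.
# It labels each function name with the naming convention it matches.
classify_name <- function(x) {
  ifelse(
    grepl("^[a-z0-9]+$", x), "lowercase",
    ifelse(
      grepl("^[a-z0-9]+(_[a-z0-9]+)+$", x), "lower_under_case",
      ifelse(
        grepl("^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$", x), "lowerCamelCase",
        "other"
      )
    )
  )
}

# The four function names from the tools package shown above.
fn_names <- c(
  "add_datalist", "bibstyle", "buildVignettes", "check_packages_in_dir"
)
styles <- classify_name(fn_names)

# Tabulate how many names follow each convention.
print(table(styles))
```

A package with a consistent style should produce a table with a single dominant category. To run it over your own package, you could pass in the names returned by ls on the package environment.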
write_sigs is a variant of list_sigs that writes your sigs to a file. Here’s a game for you: print out the signatures from a package of yours and give them to a colleague. Then ask them to guess what the functions do. If they can’t guess, then you need to rethink your naming strategy.
There are two more simple metrics to identify dodgy functions. If functions have a lot of input arguments, it means that they are more complicated for users to understand. If functions are very long, then they are harder to maintain. (Would you rather hunt for a bug in a 500 line function or a 5 line function?)
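Both metrics are easy to approximate in base R. The sketch below uses two hypothetical helpers, count_args and count_lines, made up for this example. It counts lines by deparsing the function, which may not match the line count of the original source file, and the threshold of 3 arguments is arbitrary.

```r
# count_args() and count_lines() are hypothetical helpers for illustration.
# Number of input arguments: the length of the formal argument list.
count_args <- function(fn) length(formals(fn))

# Rough number of lines: the length of the deparsed function.
# This can differ from the number of lines in the original source file.
count_lines <- function(fn) length(deparse(fn))

# Two toy functions to measure.
short_fn <- function(x) x + 1
long_fn <- function(a, b, c, d) {
  total <- a + b
  total <- total + c
  total <- total + d
  total
}

fns <- list(short_fn = short_fn, long_fn = long_fn)
metrics <- data.frame(
  n_args  = vapply(fns, count_args, integer(1)),
  n_lines = vapply(fns, count_lines, integer(1))
)

# Flag the functions that exceed an (arbitrary) argument threshold.
print(metrics[metrics$n_args > 3, ])
```

The same idea scales to a whole package: build the list of functions from the package environment, then sort the resulting data frame by either column to find the worst offenders.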
The sig package also contains a sig_report function that identifies problem functions. This example uses the Hmisc package because it contains many awful mega-functions that desperately need refactoring into smaller pieces.
sig_report(
  pkg2env(Hmisc),
  too_many_args = 25,
  too_many_lines = 200
)
## The environment contains 509 variables of which 504 are functions.
## Distribution of the number of input arguments to the functions:
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
##   4  62 117  90  41  41  22  17  14  17  15  10   8   4   6   3   1
##  17  18  19  20  21  22  23  24  27  28  30  33  35  48  66
##   4   5   3   2   4   2   2   2   2   1   1   1   1   1   1
## These functions have more than 25 input args:
## [1] dotchart2                     event.chart
## [3] labcurve                      latex.default
## [5] latex.summary.formula.reverse panel.xYplot
## [7] rlegend                       transcan
## Distribution of the number of lines of the functions:
##          1          2      [3,4]      [5,8]     [9,16]    [17,32]
##          1         47         15         57         98        108
##    [33,64]   [65,128]  [129,256]  [257,512] [513,1024]
##         81         58         30          8          1
## These functions have more than 200 lines:
##  [1] areg                         aregImpute
##  [3] event.chart                  event.history
##  [5] format.df                    labcurve
##  [7] latex.default                panel.xYplot
##  [9] plot.curveRep                plot.summary.formula.reverse
## [11] print.char.list              rcspline.plot
## [13] redun                        rlegend
## [15] rm.boot                      sas.get
## [17] summary.formula              transcan