Software Carpentry

[This article was first published on Marginally significant » Rstats, and kindly contributed to R-bloggers].

I would never call myself a programmer, but as an ecologist I manage moderately big and complicated datasets, and getting the most out of them requires interacting with my computer. I taught myself most of what I need, and I have more or less succeeded at managing my data (MySQL), writing simple functions to analyze it (R), and using other people’s functions (written in Matlab or Java, neither of which I know at all). That’s nothing fancy. I don’t create software or develop complex simulations. But I still need to communicate with my Mac. For me, knowing some basic programming is the difference between weeks of pain and tears and a couple of hours performing an easy task. That’s why I signed up for a Software Carpentry boot camp.

What did I learn?

– The Shell: I rarely interact with the shell, but I have had to on three occasions in the past. It was always guesswork. All I needed to do was copy and paste some code I found on the internet to run a script written by another scientist, tweaked for my data. The tweaking part was usually OK, as the specific task came with some help from the authors, but opening the shell and figuring out where to start calling the program, or how the path structure works, and all this minor stuff was a shot in the dark. Now I have learned some of these basics (pwd, ls, cd, mkdir, cp, mv, cat, sort, and the pipe |), and while I will probably not use them much, next time I need to open the terminal (to run pandoc, maybe?) I will know where to start. A cool thing is that you can easily run R scripts*:

Rscript file.R input.txt

or shell scripts:

bash file.sh

* The R script can load the input using:

### Read in command line arguments
args <- commandArgs(trailingOnly = TRUE)
### Read in data from file set in the command line
data <- read.table(args[1],sep=",")
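
For reference, here is a minimal sketch of how the basic shell commands listed above fit together in one session. The directory and file names (swc_demo, species.txt and so on) are made up for illustration, and everything happens in a temporary directory so nothing real gets touched.

```shell
# Work in a scratch directory (all names below are hypothetical)
workdir=$(mktemp -d)
cd "$workdir"
pwd                              # print the current working directory

mkdir swc_demo                   # make a new directory
cd swc_demo                      # move into it

# Create a small text file with one genus name per line
printf 'Bombus\nApis\nOsmia\n' > species.txt

cp species.txt backup.txt        # copy a file
mv backup.txt species_old.txt    # rename (move) the copy
ls                               # list the files in this directory

cat species.txt                  # print the file contents
cat species.txt | sort           # pipe the contents into sort
sorted=$(sort species.txt)       # same result, stored in a variable
```

The pipe is the part worth remembering: the output of one command becomes the input of the next, so small tools can be chained instead of writing one big script.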

– Regular expressions: I had also used regular expressions a couple of times before, but they are not in my everyday toolbox. Seeing some examples of how they can be used was a good reminder that I am still doing some tasks in a suboptimal way (go to www.regexpal.com to play with them). I liked the:

alias grep="grep --color=auto"

to put some color in my terminal by default.
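
As a quick illustration of the kind of task where a regular expression helps, the sketch below uses grep on a small made-up file. The file name sites.txt, its contents, and the "site code, species" layout are all invented for the example.

```shell
# Hypothetical data: one record per line, "site code,species"
workdir=$(mktemp -d)
cd "$workdir"
printf 'site01,Bombus terrestris\nsite2,Apis mellifera\nplot03,Osmia rufa\nsite10,Bombus pascuorum\n' > sites.txt

# -E enables extended regular expressions.
#   ^site      the line starts with "site"
#   [0-9]{2}   followed by exactly two digits
#   ,          and then a comma
grep -E '^site[0-9]{2},' sites.txt

# -c counts matching lines instead of printing them
n_bombus=$(grep -c 'Bombus' sites.txt)
echo "$n_bombus"
```

The first command keeps only the site01 and site10 records, which is a handy way to spot codes that were typed inconsistently (site2, plot03).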

– Git: I have been using Git (well, trying to, at least) for a year or so. Git is not a hipster caprice, but a solid and tested system to keep track of what you do. However, it can be a bit complicated at the beginning. I had been using a GUI so far, but at SWC they showed me how to set up a folder and do the usual tasks from the shell. I have to say that while I like the GUI for seeing the history, it is way easier to set up a new project from the command line (git init). I also now better understand the philosophy of Git, and why staging (git add) is useful for separating files into different commits (git commit), for example. I also learnt how the .gitignore file works. Just list the files that shouldn’t be tracked in a plain text file named .gitignore, one pattern per line, using shell-style wildcards (glob patterns, not true regular expressions):

*.pdf
data_*.csv
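
To make the workflow concrete, here is a minimal sketch of setting up a project from the command line, staging files into separate commits, and ignoring files with the two patterns above. The file names, commit messages, and the user name and e-mail are placeholders, and the sketch assumes git is installed.

```shell
# Start a fresh project in a scratch directory (all file names are hypothetical)
workdir=$(mktemp -d)
cd "$workdir"
git init -q                         # turn the folder into a Git repository

# Set an identity for this repository only, so the commits below work
# even on a machine where Git has not been configured yet
git config user.name "Example User"
git config user.email "user@example.com"

# Some files: two scripts, one data file, one figure
echo 'x <- 1' > clean_data.R
echo 'y <- 2' > analysis.R
echo 'a,b'    > data_2012.csv
echo 'fake'   > figure.pdf

# Ignore PDFs and data files, using the same patterns as above
printf '*.pdf\ndata_*.csv\n' > .gitignore

# Staging lets you split the work into two separate commits
git add .gitignore clean_data.R
git commit -q -m "Add data cleaning script"
git add analysis.R
git commit -q -m "Add analysis script"

git status --short                  # prints nothing: ignored files are not listed
git log --oneline                   # shows the two commits
n_commits=$(git rev-list --count HEAD)
```

Because data_2012.csv and figure.pdf match the patterns in .gitignore, Git never offers to track them, even though they sit in the same folder as the scripts.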

– Unit tests: My friend @vgaltes (a real software dev.) is always pushing me to adopt unit testing, but it is not clear to me how to do that in ecology, where my code is most of the time ephemeral and quick. I usually test moderately complicated functions with fake datasets constructed on the fly to see if they behave as expected. This checks that the code currently does what you think it does, but it doesn’t tell you whether it will always behave this way in other situations. Ethan White’s advice was to scale up this approach: save the fake data and run the code on it every time I change something, to see if it still works. Those are regression tests (or vector tests) according to Titus Brown (read his post, it makes you think). A second step, which I am not currently doing, is Stupidity Driven Testing (basically, testing for known bugs). I need to see how to adopt that, but having bugs in your code is easier than it looks, so the more safety checks you have the better. Paraphrasing Titus:

“If you’re confident your code works, you’re probably wrong. And that should worry you.” 
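
The regression-testing idea described above (save the fake data, save the output you checked by hand once, and re-run the comparison after every change) can be sketched in a few lines of shell. This is only an illustration under assumptions: count_species is a hypothetical stand-in for the real analysis, and the file names are invented. In a real project that step would more likely be an Rscript call.

```shell
# Stand-in for the real analysis: count records per species
# (hypothetical function; replace it with a call to your own script)
count_species() {
  cut -d, -f2 "$1" | sort | uniq -c | awk '{print $2 "," $1}'
}

workdir=$(mktemp -d)
cd "$workdir"

# 1. Fake dataset, built once and saved
printf 'site1,Apis\nsite1,Bombus\nsite2,Apis\n' > fake_data.csv

# 2. Output that was checked by hand once and saved as the reference
printf 'Apis,2\nBombus,1\n' > expected.csv

# 3. The regression test: re-run the code and compare with the reference
count_species fake_data.csv > actual.csv
if diff -q expected.csv actual.csv > /dev/null; then
  result="PASS"
else
  result="FAIL"
fi
echo "regression test: $result"
```

When a bug does turn up, saving the input that triggered it as one more fake dataset, together with the corrected output, gives a test for that known bug.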

More important is probably learning from watching other people code. One of my main problems is that I don’t work closely with more experienced users that often, so I cannot learn by imitation. For example, I suck at respecting naming conventions, commenting enough, and splitting complex lines into several simpler ones, but I’ll try to get better at that. Overall, I feel as if I started using complex software by running straight away (sometimes in a clumsy way), and this course taught me how to walk properly. Thanks @ethanwhite and @bendmorris!

