Management of Research Data – a Shell+Python+Excel+R Approach

[This article was first published on manio » R, and kindly contributed to R-bloggers.]

I am a computer science researcher, usually working on both Windows and Linux systems. Windows is where I do document work, such as reading papers, browsing the internet, and writing papers with LaTeX. Linux is where I run experiments and generate results.

The Chaos

After years of messy data management and a recent bout of data chaos, I decided to sort things out. The motivation is that when you are RESEARCHING, you tune the parameters and look at the data, you tune the parameters again and look at the data, and you tune the parameters and you… Then you get lots of output from the experiments you ran, with lots of different parameters. In later runs you may have even more parameters, when you decide you had better tune more to see if there is any difference…

The Elements

When you do research, you have these on your desktop related to your experiments:

Suggestion

Driver script/driver application/platform
  1. Always make parameter changing easy. You don’t want to change a parameter and recompile the application every time you run.
  2. Commit IDs should be recorded as parameters in the formatted data. This is easy: treat them as parameters and have the driver script write them to the raw result file.
    1. You can automate it with a command like $ git log | grep commit | head -n 1 | cut -d ' ' -f 2.
  3. Using Linux environment variables to pass parameters to the application or platform is a good method when passing them as arguments is not possible.
    1. In the driver script, use something like $ export MYPARAMETER1=1000 to set the variable.
    2. In the application, read the variable with parameter1 = getenv("MYPARAMETER1");
  4. Each run of the driver application should generate one line in the output.
  5. Each driver script should generate only one raw data file. If you have more than one file, the parser needs to do more work.
  6. Always use shell scripts to automate your tests. You don't want to keep checking whether a run has finished so that you can start the next one. You want to start multiple runs, go watch TV and come back later.
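The driver suggestions above can be sketched in a few lines. This is only a minimal sketch, written in Python rather than shell so that it matches the parser language; the function names and the name=value line layout are my own assumptions, and `git rev-parse HEAD` is swapped in for the `git log` pipeline because it prints the same hash in one step.

```python
import os
import subprocess


def current_commit(repo_dir="."):
    """Return the current commit hash, or "unknown" if git cannot provide one."""
    try:
        done = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
        return done.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or repo_dir is not inside a repository
        return "unknown"


def format_raw_line(params, commit, output):
    """Build ONE raw result line: every parameter, the commit, then the output.

    Assumed layout (not from the original post): space-separated name=value
    pairs, so values must not contain spaces.
    """
    fields = [f"{name}={value}" for name, value in sorted(params.items())]
    fields.append(f"commit={commit}")
    fields.append(f"output={output}")
    return " ".join(fields)


def run_once(command, params, commit):
    """Run the application once, passing the parameters as environment variables."""
    env = dict(os.environ)
    env.update({name: str(value) for name, value in params.items()})
    done = subprocess.run(command, env=env, capture_output=True, text=True, check=True)
    return format_raw_line(params, commit, done.stdout.strip())
```

A driver loop would then append `run_once(["./myapp"], {"MYPARAMETER1": 1000}, current_commit())` to a single raw data file for each parameter setting, where `./myapp` stands for whatever application you are testing.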
Raw data files
  1. Should record all parameters and commit IDs. The standard for a good raw data file is that you can figure out exactly what the driver application and platform were.
  2. Different data files generated with the same parameters should differ in something, say, a job ID.
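One possible way to keep raw data files apart is to put the parameters and a job ID into the file name. The naming scheme below is purely an illustration, and `raw_file_name` is a hypothetical helper, not something from the original workflow.

```python
import os
import time


def raw_file_name(prefix, params, job_id=None):
    """Build a raw data file name from the parameters plus a job ID."""
    if job_id is None:
        # Assumed fallback when no scheduler job ID exists: timestamp plus PID
        job_id = f"{int(time.time())}-{os.getpid()}"
    parts = [f"{name}-{value}" for name, value in sorted(params.items())]
    return "_".join([prefix] + parts + [f"job-{job_id}"]) + ".raw"
```

Two runs with identical parameters then still end up in different files, because only the job ID part of the name changes.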
Result parser
  1. Use Python to parse data. Python is easier and more powerful than shell.
  2. The result parser should ALWAYS generate formatted data files in the standard R format, which can be imported into R directly without any human effort. The formatted data files should have a header (a name for each column).
  3. Always have a result parser. You don't want to type the result file name and grep different results manually every time. You want to type one command and have everything come out.
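As a rough sketch of such a parser, assume each raw line is a series of space-separated name=value pairs (that layout is my assumption, not something the post specifies). The function below turns those lines into a whitespace-separated table with a header row, which R's `read.table(file, header=TRUE)` can read. Runs that lack a parameter added in later runs get `NA`, which R treats as missing by default.

```python
def parse_raw_line(line):
    """Split 'name=value name=value ...' into a dict of strings."""
    return dict(field.split("=", 1) for field in line.split())


def to_r_table(raw_lines):
    """Return table text: one header row, then one row per raw line."""
    records = [parse_raw_line(line) for line in raw_lines if line.strip()]
    # Collect column names in order of first appearance, so parameters
    # that only show up in later runs are appended on the right.
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)
    rows = [" ".join(columns)]
    for record in records:
        rows.append(" ".join(record.get(name, "NA") for name in columns))
    return "\n".join(rows) + "\n"
```

Writing the returned text to a file gives you the formatted data file in one command, with no manual grepping.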
Formatted data file
  1. Consider formatted data files temporary. Put them into Excel as soon as they are generated.
Excel data file
  1. Have one sheet in Excel as a Dictionary.
  2. If the meaning of a word in the dictionary has changed, use a new word in the future.
  3. Let the Excel file be the hub of all experiment data. Consider all that data temporary (and it is). In the Excel file, each sheet holds the results from one driver script run. In each sheet, reserve the first few lines for data explanations, the graph, and the plotting script for that graph.
Misc
  1. Name every useful file appropriately and carefully. Add the file name as a word in the dictionary and explain it there if possible.
  2. Save all the useful files to a centralized sheet in Excel. You can insert any file with Insert Object.
Tips:

Use a good output format in the executable. If you have a bad output format, every time you modify some output, you have to modify the output parser script. If your output is already formatted as:

Headline …… HEADERMARKER
Dataline …… DATAMARKER
it is much easier for you to get the data you want.
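A parser for marker-tagged output can stay very small. The sketch below assumes that fields are separated by whitespace and that the marker is the last token on the line; both are my assumptions about the format above, and `extract_marked` is a hypothetical name.

```python
def extract_marked(output, header_marker="HEADERMARKER", data_marker="DATAMARKER"):
    """Pick out the header and data lines, ignoring all other output."""
    header = None
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[-1] == header_marker:
            header = fields[:-1]       # column names, marker dropped
        elif fields[-1] == data_marker:
            rows.append(fields[:-1])   # one data row, marker dropped
    return header, rows
```

Because unmarked lines are skipped, you can add debug prints to the executable without touching the parser.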
Experience:

