One app, three languages
[This article was first published on Odd Hypothesis, and kindly contributed to R-bloggers].
This past week at work I had the opportunity to code the same algorithm using each of the three scientific programming/scripting languages I’m familiar with:

1. Matlab
2. Python
3. R

The list above is the order in which the (re)coding was done, and it serves as the beginning of an answer as to why I had to (and wanted to) do such repetitive work.
Before getting into the details, first the problem: grab multiple optimized DNA sequences from an MS Excel workbook and format them as a FASTA text file for use with a webapp for rare codon analysis. Prior to seeking my help, users were manually copying the sequences (each located in one cell across multiple sheets) into an MS Word document. This was fine for 2-5 sequences, but got seriously tedious and error prone for anything >10.
Lastly, this is all done in Windows (for added craziness).
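Stripped of the spreadsheet plumbing, the core task is simply pairing each sequence with its name in FASTA format: a `>` header line followed by the sequence. As a minimal sketch in Python (with made-up example data, not the real optimizer output), that formatting step looks like:

```python
def to_fasta(records):
    # records: list of (id, sequence) pairs
    # FASTA format: a '>' header line, then the sequence on the next line
    lines = []
    for seq_id, seq in records:
        lines.append('>' + seq_id)
        lines.append(seq)
    return '\n'.join(lines) + '\n'

# hypothetical stand-ins for the optimized DNA sequences
records = [('GeneA', 'ATGGCTAGC'), ('GeneB', 'ATGTTTAAA')]
print(to_fasta(records))
```

Each of the three implementations below does exactly this, plus the Excel reading and writing around it.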
Round 1: Matlab (aka the 500lb gorilla)
Out of the (very expensive) box it has MS Excel read/write capabilities via the functions:

xlsfinfo
xlsread
xlswrite

Adding the (also expensive) Bioinformatics Toolbox gives FASTA file IO via:

fastaread
fastawrite
Code:
function CodonOptOutputCompiler(xlsFile, outPrefix)
  if nargin < 2,
    fprintf('USAGE:\n');
    fprintf('CodonOptOutputCompiler <input xls> <output prefix>\n');
    fprintf('\tinput xls\tMS Excel spreadsheet output from Codon Optimizer\n');
    fprintf('\n');
    fprintf('\toutput prefix\tPrefix for output files.\n');
    fprintf('\t             \tOutput is two files:\n');
    fprintf('\t             \t\t<output prefix>.xls\n');
    fprintf('\t             \t\t<output prefix>.fasta\n');
    return;
  end

  [status, sheets] = xlsfinfo(xlsFile);
  if ~isempty(status),
    % get sheets that have output from the optimizer - they start with '>'
    resNames = sheets(~cellfun(@isempty, regexpi(sheets, '^>')));

    % preallocate the results storage array
    out = cell(length(resNames), 2);

    for i = 1:length(resNames),
      resName = resNames{i};

      % get the sequence name
      id = regexpi(resName, '(?<id>[\w\s]+?)-DNA', 'names');
      id = id.id;

      % get the sequence
      [~, seq, ~] = xlsread(xlsFile, resName, 'B6');

      % write the id and sequence to a compiled list
      out(i, :) = {id, seq{1}};
      disp(out(i, :));
    end

    % write the compiled list to an xl spreadsheet
    [status, message] = xlswrite([outPrefix '.xls'], out, 'Sheet 1');
    if ~status,
      disp(message);
    end

    % write the compiled list to a fasta format text file
    fastaData = cell2struct(out, {'Header', 'Sequence'}, 2);
    fastawrite([outPrefix, '.fasta'], fastaData);
  else
    % there was an error
    disp(sheets)
  end
Using the Matlab Compiler (again, if you’ve paid for it) this distills nicely into a tidy command line executable.
The problems began (as usual) immediately after deployment.
First, in order to run any compiled Matlab code, users need to install the weighty (~400MB) Matlab Component Runtime (MCR), which in corporate IT-lockdown land is its own form of enjoyment.
Second, and a horrendous PITA if you ask me, the version of the MCR users need depends on the version of Matlab the code was compiled in. Worse, there is no backward compatibility. In this case, users needed version 7.16 of the runtime to correspond with my Matlab 2011b. However, the program that generated the sequences in the first place (also a Matlab compile job) was made with Matlab 2009a.
It was late in the afternoon, the IT guys were gone, and I didn’t want to have to deal with any craziness of conflicting runtimes.
Sorry Matlab, you suck.
Round 2: Python (Parseltongue, anyone?)
There’s a lot of hubbub about how NumPy/SciPy + matplotlib in distributions like Python(X,Y) can totally replace Matlab – FOR FREE! I’ve yet to really delve into that module stack. For the task at hand, bare Python has all that’s needed with a few modules (again, ALL FREE).

MS Excel spelunking:
xlrd
xlwt
FASTA file acrobatics (and much much more):
BioPython
Code:
# compiles output from the Codon Optimizer into a spreadsheet and fasta formatted file
import sys, re, xlrd, xlwt
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC

if __name__ == '__main__':
    argv = sys.argv
    if len(argv) < 3:
        print "USAGE: python %s <input xls> <output prefix>" % argv[0]
        print "    <input xls>      MS Excel spreadsheet output from Codon Optimizer\n"
        print "    <output prefix>  Prefix for output files:\n"
        print "        .xls    Stacked list of ids, sequences in a spreadsheet"
        print "        .fasta  FASTA formatted text file"
        raise ValueError
    else:
        xlsfile = argv[1]
        outpref = argv[2]

    # get the input workbook
    wbin = xlrd.open_workbook(xlsfile)

    # prepare the output workbook
    wbout = xlwt.Workbook()
    shout = wbout.add_sheet('Optimized Genes')

    sequences = list()
    for shn in wbin.sheet_names():
        # get valid sheet names
        rex = re.search('^>(?P<id>[\w\s]+?)-DNA', shn)
        if rex:
            print shn
            sho = wbin.sheet_by_name(shn)

            # get the sequence
            seq = sho.cell(rowx=5, colx=1).value  # get value in cell B6
            seqRec = SeqRecord(Seq(seq, IUPAC.unambiguous_dna))
            seqRec.id = rex.group('id')
            seqRec.description = shn.replace('>', '')
            sequences.append(seqRec)

            # write the sequence result
            shout.write(len(sequences)-1, 0, seqRec.id)
            shout.write(len(sequences)-1, 1, seq)

    print "%d sequences found." % len(sequences)

    wbout.save(outpref + '.xls')
    print "written to %s" % (outpref + '.xls')

    # write the fasta output
    fastaOut = open(outpref + '.fasta', 'w')
    SeqIO.write(sequences, fastaOut, 'fasta')
    fastaOut.close()
    print "written to %s" % (outpref + '.fasta')
As an important note, the Python code completes almost instantaneously whereas the Matlab code took at least 5 seconds (about 1 sec per worksheet in the source .xlsx file). I don’t know why Matlab takes so long to get/put data from/into an Excel workbook, but I sure hope the MathWorks engineers are working on it. Sure, this comparison is slightly unfair since Python is byte-compiled when run, but on a modern PC with 1GHz of multi-core processing power and 3GB of RAM, I expect performance, darn it.
As a small slight to the Python camp, the xlrd/xlwt modules are only compatible with the older .xls (Microsoft Excel 97-2000) format files and not the newer XML based .xlsx files. So it does require one extra step … par for the course.
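Since .xlsx files are really just zip archives while legacy .xls files use a binary OLE2 container, a script can at least fail early with a clear message before xlrd chokes on the wrong format. This is a hedged sketch of such a guard (not part of the original workflow), using only the standard library:

```python
import zipfile

def looks_like_xlsx(path):
    # .xlsx (Office Open XML) files are zip archives; binary .xls files are not,
    # so zipfile.is_zipfile() distinguishes the two without any Excel libraries
    return zipfile.is_zipfile(path)
```

A caller could then prompt the user to re-save the workbook as .xls (or convert it) instead of surfacing a cryptic traceback.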
Compiling to a console executable is made easy with Py2Exe.
Deploying is a snap – zip and email everything in the ./dist folder of where you Py2Exe’d your source code.
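For reference, the Py2Exe build that produces that ./dist folder is driven by a small setup.py. This is a generic sketch (the script filename is a hypothetical stand-in, not necessarily the file used here):

```python
# setup.py -- build with:  python setup.py py2exe
from distutils.core import setup
import py2exe  # importing registers the 'py2exe' command with distutils

# 'console' builds a command-line .exe (vs. 'windows' for GUI apps)
setup(console=['codon_opt_compiler.py'])
```

Running `python setup.py py2exe` drops the executable and its dependencies into ./dist.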
Of course, getting users that were originally in happy point and click land to work at the C: prompt kinda stopped this awesome train of free and open source progress dead in its tracks.
Round 3: R (statistics will help you find that buried treasure, er p-value)
R directly works with MS Excel files? Yup:

xlsx
FASTA files? Bioinformatics and statistics go hand in hand, so this is pretty much a given:
seqinr
Code:
# pile up sequences on multiple xlsx sheets from the codon optimizer and export:
#   - as a single list of sequences on a single sheet
#   - as a fasta file of all sequences for the codon finder
CodonOptOutputCompiler = function(xlsFile, outPrefix) {
  require(xlsx)   # for reading xlsx files
  require(seqinr) # for fasta file IO

  wbi = loadWorkbook(xlsFile)
  sheets = getSheets(wbi)

  # extract only sheets that have the right name format
  sheets = sheets[grepl('(DNA|protein)', names(sheets))]

  seq = vector('list', length=length(sheets))
  for (i in 1:length(sheets)) {
    sh = sheets[[i]]

    # the optimized sequence is in cell 'B6'
    row  = getRows(sh, rowIndex = 6)   # get row 6
    cell = getCells(row, colIndex = 2) # get column 'B'
    seq[[i]] = getCellValue(cell[[1]])
  }
  names(seq) = gsub('>', '', names(sheets))

  # output the sequences to an xlsx file
  write.xlsx(as.data.frame(do.call(rbind, seq)), sprintf('%s.xlsx', outPrefix),
             col.names=FALSE, row.names=TRUE)

  # output the sequences to a fasta txt file
  write.fasta(seq, names(seq), sprintf('%s.fasta', outPrefix))
}
As far as speed goes, the R code bested Matlab and was on par with Python. Pretty interesting and consistent with the benchmarks posted by the Julia team.
A small word of warning, the xlsx package uses the Apache POI java library in the background and does run into significant memory cloggery (at least on my workstation) when working with sheets heavily laden with rows and/or columns.
Compiling, well, I’m not completely sure it exists yet (although RCC looks interesting). Of course, who needs to compile if you can just drop this on your webserver behind rApache and a simple webform. There, user command-line aversion solved.
Wrap-up
So what did this exercise in redundancy teach me? Well, thanks to the plethora of open-source tools, there is more than one way to skin a software-deployment cat. It also shows how once-niche platforms like Python and R have come to parity with juggernauts like Matlab. Last, it has me satisfied (for now) in my programming polyglottery.