Site icon R-bloggers

categoryCompare Paper Finally Out!

[This article was first published on Deciphering life: One bit at a time :: R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

categoryCompare Paper Finally Out!

I can finally say that the publication on my Bioconductor package categoryCompare is finally published in the Bioinformatics and Computational Biology section of Frontiers in Genetics. This has been a long time coming, and I wanted to give some background on the inspiration and development of the method and software.

TL;DR

The software package has been in development in one form or another since 2010, released to Bioconductor in summer 2012, and the publication has bounced around and been revised since spring of 2013, and it is finally available to you. All of the supplementary data and methods are available as an R package on github. Version control using git was instrumental in getting this work out in a timely manner. There is still a bunch of work to do on the package.

If I did it again, I would:

Inspiration

In spring of 2010 I started as a PostDoc with Eric Rouchka. One of his collaborators, Jeff Petruska is interested in the process of collateral sprouting of neurons, especially as it compares to regeneration. Early in my PostDoc, Jeff wanted to do a gene-level comparison of his microarray data of collateral sprouting in skin compared to previously published studies with muscle.

Combing through the literature produced a number of genes differentially expressed in denervated muscle. However, when comparing with the genes resulting from the skin data, there was almost nothing in common and nothing that made sense from a functional standpoint. Skimming around the Bioconductor literature in GOStats, I was struck by Robert Gentleman's example of coloring Gene Ontology nodes by which data set they originated from. This is a very simple meta-analysis and visualization. When I tried it with the skin – muscle comparison, I got some very interesting (i.e. the Petruska group thought the results were very interpretable) results.

Note: At this point, because of the data sources (gene lists from publications with little to no original data), I was using the hypergeometric enrichment test in Category to determine significant GO terms from the two tissues.

V 0.0000001

I started developing this idea into a simple package (i.e. collection of R scripts), that was able to do at least GO term enrichment, and that could be hosted on our group webserver to enable others to make use of it. Visualization and interrogation used the imageMaps function in Robert Gentleman's original demonstration, however any number of data sets could now be compared.

categoryCompare Method – Summary

The basic method is to take gene (or really any annotated feature) lists from multiple experiments, and perform annotation (Gene Ontology Terms, KEGG Pathways, etc) enrichment (either hypergeometric or GSEA type) on each gene list, determine significant annotations from each list, and then examine which annotations come from which list.

Because this results in a lot of data to parse through, exploration of the results is facilitated by considering the annotations as a network of annotations related by the number of shared genes between them, and interacting with the networks in Cytoscape.

If you want to know more, check out the paper, or the vignette in the Bioconductor package

Others

As I developed this idea, I also started looking into other possible software implementations.

Bioconductor Package

About this time I realized that I wanted the package to fit into the Bioconductor ecosystem. This required a complete redesign and rewrite of the code, as I moved to actually used some sort of OOP model (S4 in this case), and creating an actual package.

The package was released to the wild in the fall of 2012, as part of Bioconductor v 2.10. Of course, this was the last Bioconductor release based on R 2.15, which brought some particular challenges in namespaces with the switch in R 3.0.0 the next year.

Graphviz to RCytoscape

The original code used the graphviz package to do layouts for visualization, but it was difficult to install, and did not work the way I wanted. Thankfully Paul Shannon had just developed the RCytoscape package the year previous (Shannon et al., 2013), and this enabled truly interactive visualization and passing data back and forth between R and Cytoscape.

My use of RCytoscape has actually lead to finding improvements in the package, and better use of it. I am also hoping to make much more use of RCytoscape and the igraph package and combining them in novel ways.

First Publication Attempt

Our first attempt at publication focussed on the original data that inspired the method, and comparing it with gene-gene comparisons directly. We submitted to BMC Bioinformatics, and the reviews were not favorable, and took forever to get back. We actually wondered if the reviewers had read the paper that we submitted. We gave up on this venue. BTW, our last three publications have been submitted there, and the publication process has gotten worse and worse with every submission there. I don't think our papers are getting worse over the years. I don't know if we will bother submitting to BMC Bioinformatics again.

Second Publication Attempt

The second attempt was submitting to the current home in the Bioinformatics and Computational Biology section of Frontiers in Genetics. This time, although the reviews were harsh (i.e. they were not immediately favorable), they were fair, and actually contained useful critique, and pointed to a way forward.

Unfortunately for me, revising the manuscript to address the reviewers criticisms meant a lot of work to construct theoretical examples, as well as a lot of thought in order to pare the manuscript down to make sure our primary message was clearly communicated.

Frontiers Interactive Review

I would like to note that the Frontiers interactive review system, with the ability to discuss individual points with the reviewers (still anonymously) really helped make it possible to determine which points were make or break, and discuss different ways to approach things. This was the best review experience I have ever had, and a large part of that I think was due to being able to interact with the reviewers directly, and not just through letters mediated by the editor.

I think it would be nice if Frontiers had an option for making the review history available if the authors and reviewers were agreeable to it.

Github package of supplementary materials

In the initial publication, I had included the set of scripts used for analysis, data files, results, etc. However, the amount of work required in the rewrite was so substantial, that I created an R package specifically for the analyses that went into the paper, with separate R markdown vignettes for each result type (hypothetical, two different experimental comparisons). This package has documents on how raw data was processed, as well as the semi-processed experimental data.

Publication specific branch

Due to specific changes to the categoryCompare software needed to address the reviewer comments, a publication specific branch of the development version was created (paper). This allowed me to quickly introduce code and features that the reviewers had asked for, without worrying about breaking the current development version that will be released in Bioconductor shortly.

To Do

As with most software projects, there is still plenty to be done.

Side Effects

A nice side effect of the package is that any annotation enrichment I do now, I almost always do it in the framework of categoryCompare, just because it is a lot easier to make sense of using the visualization capabilities and coupling with Cytoscape.

What Would I Do Differently?

Keeping in mind that I started this work almost 4 years ago, when I still didn't know any R, and had yet to be exposed to the reproducibility and open science movements, or knitr, or pandoc, here are some things that if I started today I would do differently (you can probably guess some of these from above):

To leave a comment for the author, please follow the link and comment on their blog: Deciphering life: One bit at a time :: R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.