A Short Introduction to Bioconductor
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Peter Hickey (@PeteHaitch)
One of the keys to R's success as a software environment for data analysis is the availability of user-contributed packages. Most useRs will be familiar with (and very grateful for) the Comprehensive R Archive Network (CRAN). The packages available on CRAN, nearly 7000 at last count, cover common data analysis tasks, such as importing data and plotting, through to more specialised tasks, such as packages for parsing data from the web, analysing financial time series data, or analysing data from clinical trials. What may be less familiar to useRs is another large R package repository and software development project, Bioconductor.
Bioconductor is an open source, open development software project that focuses on providing tools for the analysis of high-throughput genomic data, an area of research known variously as bioinformatics or computational biology. Examples of these data are sequencing the DNA of human genomes or measuring the level of expression of genes in hundreds of tumours. Recent advances in technology mean that such data are a central part of modern biological research, be it medical, agricultural, or basic science.
The Bioconductor project began in 2001 and initiated by Robert Gentleman, one of the originators of the R language. Nowadays there is a core team of nine developers, led by Martin Morgan, who develop some of the important core packages and maintain the infrastructure of the project. As with CRAN, it is the user-contributed packages that make the Bioconductor project the valuable resource that it is. There are more than 1000 software packages in the most recent Bioconductor release. In addition to these packages, Bioconductor includes more than 900 annotation packages and 200 experiment data packages. Annotation packages help streamline the oft-tedious bookkeeping and annotation of data associated with bioinformatics research while the experiment data packages contain processed data and are a valuable teaching resource.
Since its establishment, two of the main goals of the Bioconductor project have been reproducible research and high-quality documentation. In support of these aims, Bioconductor releases packages under a biannual schedule, which is tied to the most recent 'release' version of R, and each Bioconductor software package must contain a vignette. Each vignette is a document that provides a task-oriented description of package functionality, more like a book chapter than the technical and often terse function-level documentation accessible via ?
or help()
at the R console. Some of these vignettes, such as the User's Guide that accompanies the limma
package (pdf), include multiple case studies and carefully explain the statistical foundations of the methods implemented in the package. There is also a dedicated support forum containing many years worth of questions on common problems with answers from experts in the field.
Bioconductor has recently begun publishing separate “workflows”, along with teaching materials used in courses and conferences to help users learn how to analyse high-throughput biological data. These are excellent resources for those wishing to learn more about what is available in Bioconductor and how to get the most from the project. The website also hosts detailed instructions on installing Bioconductor on your local machine or trying it out a preconfigured setup using Amazon Machine Images or Docker images.
The teaching resources have been further bolstered by material from the recent Bioconductor meeting, held in Seattle, USA on July 21-22. This annual meeting is a great mix of basic science and data analysis methodology talks, presentations on interesting Bioconductor packages, and afternoon workshops where you can learn from the developers themselves. All the workshop materials, and most of the slides from the presentations, can be found here. The meeting was preceded by Developer Day, a less formal get-together including talks and brainstorming sessions about the current state and future directions of the Bioconductor project. There is also an annual European Bioconductor meeting and, for the first time, an Asia-Pacific Bioconductor Developer's Meeting and workshop, to be held as part of GIW/InCoB 2015 in Tokyo, Japan on September 8-11.
With its more specialised focus than CRAN, Bioconductor strongly encourages that package developers make use of the excellent infrastructure provided by existing Bioconductor packages. The intention is to reduce the number of times the wheel is re-invented, as well increasing the interoperability of objects and methods from different packages. The source code of these core packages can make useful reading for R developers, particularly those wishing to learn more about the S4 object oriented system. This source code can be accessed using Subversion or via the GitHub mirror of all Bioconductor packages.
Bioconductor has been, and continues to be, an incredibly useful resource for people analysing high-throughput genomic data. The development and maintainence of the project is a considerable undertaking, and there is a great debt owed to those who established the project, brought it into being, and continue its day-to-day running. But just as important is the community of users and developers. It is this community of users and developers that sees such a project succeed and be exciting to be a part of.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.