HPC for biological research
In early May I had the opportunity to attend a workshop on using high performance computing in R, hosted at NIMBioS. I've been meaning to write a summary of the meeting ever since but got sidetracked by various other projects. Since a collaborator recently asked for my notes, I finally took the time to write this post.
The meeting was jointly organized by folks from NIMBioS and the Remote Data Analysis and Visualization Center (RDAV). The idea behind the workshop was to introduce biologists dealing with big-data problems to a variety of analytical tools (mostly R) and visualization tools (R plus a few other open-source packages). The presentations were either technical (HPC resources, tools, demos) or application oriented.
Of the technical talks (HPC intro, utilities), the one I found most valuable was by Pragnesh Patel from RDAV, who did an excellent job outlining the ins and outs of running R on a cluster. Slides from his talk are available here, and a more recent summary from his useR! 2011 presentation is available here.
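To give a flavor of what running R in parallel looks like, here is a minimal sketch using the cluster interface from the parallel package (bundled with R >= 2.14; the older snow package offers essentially the same API). The `simulate_once()` function and the worker count are placeholders I made up for illustration, not anything from the talk itself:

```r
# A minimal sketch, assuming the 'parallel' package.
# simulate_once() and the number of workers are illustrative placeholders.
library(parallel)

simulate_once <- function(i) {
  # stand-in for an expensive per-replicate computation
  mean(rnorm(1e6))
}

cl <- makeCluster(4)              # launch 4 worker processes
clusterSetRNGStream(cl, 42)       # reproducible parallel random numbers
results <- parLapply(cl, 1:100, simulate_once)
stopCluster(cl)

summary(unlist(results))
```

On a real cluster the same pattern scales up by pointing `makeCluster()` at the nodes your scheduler hands you (or by using an MPI backend such as Rmpi), which was the sort of setup the talk walked through.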
On the application side, there were a couple of talks from NIMBioS scientists: one by Michael Gilchrist on evolutionary bioinformatics (pdf of slides) and another by a NIMBioS postdoc, William Godsoe, on using HPC to build species distribution models [cite]10.1093/sysbio/syq005[/cite].
In addition to R, we also discussed other open-source tools for visualizing large datasets:
- Scientific visualization using VisIt [pdf] – a tutorial on using VisIt. There is also a wiki.
- ParaView – another visualization tool. Both VisIt and ParaView can take advantage of HPC resources.
- A tutorial on R visualization – this one wasn't HPC specific; mostly examples of how to use ggplot2 (see the sketch after this list).
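For readers unfamiliar with ggplot2, here is a small self-contained example of the kind of plot such a tutorial covers. The dataset (the `diamonds` data bundled with ggplot2) and the particular aesthetics are my own choices for illustration, not taken from the workshop materials:

```r
# A small ggplot2 example; the built-in diamonds dataset
# stands in for real data.
library(ggplot2)

ggplot(diamonds, aes(x = carat, y = price, colour = cut)) +
  geom_point(alpha = 0.2) +                 # raw points, semi-transparent
  geom_smooth(method = "lm") +              # per-group linear fits
  labs(x = "Carat", y = "Price (USD)",
       title = "Diamond price by carat and cut")
```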
Although I wrote a detailed post on how to use Amazon's EC2 cloud for HPC, this workshop convinced me to use resources that the NSF already provides. TeraGrid is a portal that provides access to numerous cluster resources funded by the NSF or one of its partners. Through its XSEDE portal (which has replaced POPS), academics can request an allocation of computing time. For new and exploratory projects, there are 'starter grants' where one can get a rather generous allocation within a fairly short time; larger allocations involve a review process. If the work you are seeking time for is already being funded, the review is likely to move faster, since the project has already been favorably reviewed. Amazon's computing cluster is still a useful service, but there is no need to spend grant money elsewhere when the NSF already provides these resources. As more scientists use and acknowledge TeraGrid resources in their publications, that will provide the incentive and justification for organizations like RDAV to continue seeking funding, especially in today's budgetary climate.