Downloading DNA sequences into R
[This article was first published on The Praise of Insects, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A while ago, a friend of mine needed to download a number of different DNA sequences from Genbank, the online repository for the vast majority of DNA sequences read from all organisms by labs all over the world. This is not a problem. The “ape” package in R has a nifty function, read.GenBank(), that downloads the sequences identified by the accession numbers given to the function into a DNAbin object. Thus, read.GenBank(“AY883003”) downloads the sequence AY8833003, the internal transcribed spacer 2 gene for Anthonomus grandis, the cotton boll weevil. read.GenBank() is able to read a vector of accession numbers, making easy to download a lot of sequences if you’re willing to give it the time.Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
All well and good. Unfortunately, the base function returns only the accession number as the name of the sequence. My friend was downloading sequences of many different genes from several different species. Understandably, mere accession numbers are not particularly helpful in this situation, and more information is helpful for processing datasets such as this. Thankfully, a quick hack of the function ensured that species and gene region info could be downloaded with the sequences, solving the problem. It also extended the function’s utility significantly and in my opinion is now much more useful for phylogenetics-type work.
The resulting function is read.GB(). It currently reads the “ORGANISM”, “DEFINITION”, and “ACCESSION” fields of Genbank files which record the information regarding species identity, gene region and accession number respectively. These are stored in the resulting DNAbin object as an attribute, and can be returned in the following manner:
a<-read.GB("AY883003")
attr(a, “species”)
attr(a, “gene”)
attr(a, “accession_num”)
The current default names for the sequences are returned in a standard format: accession number|scientific name.
Full credit goes to Emmanuel Paradis who wrote the original function, and who wrote it in such a way that it was fairly painless to extend it in the manner above.
To leave a comment for the author, please follow the link and comment on their blog: The Praise of Insects.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.