Site icon R-bloggers

Using R: writing a table with odd lines (GFF track headers)

[This article was first published on There is grandeur in this view of life » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The other day, I wanted to add track lines to a GFF file, so that I could view different features as separate custom tracks in a genome browser. The need to shuffle genome coordinates between different file formats seems to occur all the time when you deal with some kind of bioinformatic data. It’s usually just text files; one just has to keep track of whether the positions should start on 0 or 1 and whether the end should include the last base or not . . .

> head(gff)

  seqname         source        feature     start       end score strand
1       5 protein_coding           mRNA 169010747 169031776     .      +
2       5 protein_coding        protein 169015421 169021641     .      +
3       5 protein_coding five_prime_UTR 169010747 169010893     .      +
4       5 protein_coding five_prime_UTR 169015398 169015420     .      +
5       5 protein_coding            CDS 169015421 169015579     .      +
6       5 protein_coding            CDS 169018052 169018228     .      +
  frame                                                     group
1     . ID=ENST00000504258;Name=CCDC99-005;Parent=ENSG00000040275
2     . ID=ENSP00000421249;Name=CCDC99-005;Parent=ENST00000504258
3     .                                    Parent=ENST00000504258
4     .                                    Parent=ENST00000504258
5     0                    Name=CDS:CCDC99;Parent=ENST00000504258
6     0                    Name=CDS:CCDC99;Parent=ENST00000504258

The above example consists of a few lines from the Ensembl human database, not the actual tracks I was interested in. Anyway, this is what I did: instead of using write.table() directly, explicitly open a file for writing, first write some track line, then write the relevant subset, and repeat.

tracks <- unique(gff$feature)
connection <- file("separate_tracks.gff", "w")
for (k in 1:length(tracks)) {
  writeLines(paste("track name=", tracks[k], sep=""), connection)
  write.table(subset(gff, feature==tracks[k]),
              sep="\t", row.names=F, col.names=F,
              quote=F, file=connection)
}
close(connection)

Postat i:med mera Tagged: gff, R

To leave a comment for the author, please follow the link and comment on their blog: There is grandeur in this view of life » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.