Fast(ish) extraction of exon locations from a BED12 file using data.table

altuna

11 years ago

[This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Here is a fast R function to extract exon locations from a BED12 file. Note that fast is a relative term, the function below is fast enough for me, may not be fast enough for others 🙂 Anyway, a BED12 file typically has locations of genomic features (those features are usually genes or transcripts if the format is BED12 ). 11th and 12th columns of the BED12 file contains information about exon sizes and exon starts, see here for more info on BED format. To be able to use that information one has to employ string functions since exon sizes and starts are concatenated to each other with commas. A straightforward way to do is to use strplit() and apply() to extract numbers located in strings, but it takes too much time if you have a large number of rows.

By using data.table I could get locations of exons for all human Ensembl transcripts in 5-6 seconds, while using apply() and strsplit() on a regular data.frame takes about 30 seconds ( see the EDIT below, this is probably because I used apply in an inefficient way, it may not be needed at all ). The function returns a data.table object which then can be transformed into IRanges or GenomicRanges objects easily. Doing the string operations and iteration in C or C++ may decrease this time even further.

EDIT: I’m not sure why this is particularly faster than a data.frame, I guess I used apply in an idiotic way on the BED data.frame. I applied strsplit row-by-row on 11th column of each row, rather than using strsplit directly on 11th column of the data.frame. Anyway, it was a good exercise but a bit useless 🙂

bed12.to.exons<-function(bed.file,skip=0)
{
  require(data.table)
  colClasses=c("factor","integer","integer","factor","integer","factor","integer","integer","integer","integer","factor","factor") 
  ref   =read.table(bed.file,header=F,skip=skip,colClasses = colClasses) # give colClasses for fast read-in
  ref.dt=data.table(ref)
 
  # apply strsplit on columns 11 and 12 to extract exon start positions and exon sizes
  b.start.size=ref.dt[, cbind(as.integer(unlist(strsplit(as.character(V12),"," ))),as.integer(unlist(strsplit(as.character(V11),"," ))))]
  rep.ref.dt=ref.dt[rep(1:nrow(ref.dt),ref[,10]),] # replicate rows occurs as many as its exon number
 
  rep.ref.dt$V3=rep.ref.dt$V2+b.start.size[,1]+b.start.size[,2] # calculate exon start and ends
  rep.ref.dt$V2=rep.ref.dt$V2+b.start.size[,1]+1
 
  return(rep.ref.dt[,1:6,with=F])
} 
 
res=bed12.to.exons(bed.file) 
system.time(bed12.to.exons(bed.file))
 
 
# a line from a bed file, just to show how it looks like
>read.table(bed.file,header=F)[1,]
      V1        V2        V3              V4 V5 V6        V7        V8 V9 V10      V11
1 chr1 223340009 223340009 ENST00000328556  0  + 223340009 223351568  0   2 105,126,
         V12
1 0,11433,

To leave a comment for the author, please follow the link and comment on their blog: Recipes, scripts and genomics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.