[This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers].
Both of these functions find overlaps between genomic intervals. The findOverlaps function is from the Bioconductor package GenomicRanges (or IRanges if you don’t need to compare intervals with an associated chromosome and strand). foverlaps is from the data.table package and is inspired by findOverlaps.
In genomics, we often have one large data set X with small interval ranges (usually sequenced reads) and another smaller data set Y with larger interval spans (usually exons, introns etc.). Generally, we are tasked with finding which intervals in X overlap with which intervals in Y.
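As a minimal sketch of this task, the example below calls findOverlaps from IRanges on a few made-up intervals. The coordinates and the names reads and features are assumptions for illustration only, not data from the original benchmark.

```r
library(IRanges)

# Toy intervals (made up for illustration): short "reads" and longer "features"
reads    <- IRanges(start = c(1, 10, 50), end = c(5, 20, 55))
features <- IRanges(start = c(3, 40),     end = c(15, 60))

# Each hit pairs the index of a read (query) with the index of a feature (subject)
hits <- findOverlaps(query = reads, subject = features)

# Collect the index pairs in a plain data frame for inspection
overlap_pairs <- data.frame(read = queryHits(hits), feature = subjectHits(hits))
print(overlap_pairs)
```

With GenomicRanges the call looks the same, except that the inputs are GRanges objects, so chromosome and strand are taken into account as well.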
In the foverlaps function, Y has to be indexed using the setkey function (we don’t have to do this for X). The key is intended to speed up finding overlaps.
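A small sketch of the keyed call, assuming both tables store their intervals in columns named start and end (these column names and the toy values are assumptions for illustration):

```r
library(data.table)

# Toy intervals (made up for illustration)
x <- data.table(start = c(1L, 10L, 50L), end = c(5L, 20L, 55L))
y <- data.table(start = c(3L, 40L),      end = c(15L, 60L))

# Only y needs a key; the last two key columns are the interval start and end
setkey(y, start, end)

# which = TRUE returns row indices (xid, yid) rather than the joined table;
# nomatch = 0L drops rows of x that overlap nothing in y
idx <- foverlaps(x, y, type = "any", which = TRUE, nomatch = 0L)
print(idx)
```

Because x has no key here, foverlaps falls back to the key columns of y to locate the interval columns in x, which is why both tables use the same column names in this sketch.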
Which one is faster?
To check this, we used the benchmark function from the rbenchmark package. It’s a simple wrapper around the system.time function.
The code below plots the execution time of both functions for increasing numbers of rows of data set X.
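The original benchmark code is not reproduced here. What follows is only a rough sketch of how such a comparison could be set up, not the authors' script: the helper make_intervals, the simulated coordinates, the interval widths, the sizes tried, and the number of replications are all assumptions.

```r
library(IRanges)
library(data.table)
library(rbenchmark)

set.seed(1)

# Hypothetical helper: n random intervals of a fixed width on a 1e6-long axis
make_intervals <- function(n, width, span = 1e6) {
  start <- sample.int(span, n, replace = TRUE)
  data.table(start = start, end = start + width - 1L)
}

# Y: a small set of long intervals, keyed for foverlaps
y <- make_intervals(1000, width = 2000L)
setkey(y, start, end)
y_ir <- IRanges(y$start, y$end)

# X: an increasing number of short intervals
sizes <- c(1e3, 1e4, 1e5)
timings <- do.call(rbind, lapply(sizes, function(n) {
  x <- make_intervals(n, width = 50L)
  x_ir <- IRanges(x$start, x$end)
  res <- benchmark(
    findOverlaps = findOverlaps(x_ir, y_ir),
    foverlaps    = foverlaps(x, y, which = TRUE, nomatch = 0L),
    replications = 5,
    columns = c("test", "elapsed")
  )
  cbind(n = n, res)
}))

# Elapsed time against the number of rows in X, one line per function
plot(elapsed ~ n, data = timings, log = "x", type = "n",
     xlab = "rows in X", ylab = "elapsed time (s)")
tests <- unique(as.character(timings$test))
for (i in seq_along(tests)) {
  lines(elapsed ~ n, data = timings[timings$test == tests[i], ],
        type = "b", lty = i)
}
legend("topleft", legend = tests, lty = seq_along(tests))
```

Timings from a sketch like this depend on the package versions, the machine, and the simulated data, so they should not be expected to match the figures discussed in this post.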
Interestingly, foverlaps is the faster of the two functions, but only when the large data set has fewer than 200k rows.
We also plotted the situation in which X and Y swap places in the arguments of both functions. In this case, foverlaps is much slower than findOverlaps almost from the beginning.
Information about my R session:
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.4 (Mavericks)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] data.table_1.9.4 rbenchmark_1.0.0 GenomicRanges_1.18.4
[4] GenomeInfoDb_1.2.4 IRanges_2.0.1 S4Vectors_0.4.0
[7] BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] chron_2.3-45 plyr_1.8.1 Rcpp_0.11.5 reshape2_1.4.1 stringr_0.6.2
[6] tools_3.1.3 XVector_0.6.0