Site icon R-bloggers

R’s xtabs for total weighted read coverage

[This article was first published on Jermdemo Raised to the Law, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Samtools and its BioPerl wrapper Bio::DB:Sam prefer to give read coverage on a depth per base pair basis. This is typically an array of depths, one for every position that has at least one read aligned. OK, works for me. But how can we quickly see which targets (in my case transcripts) have the greatest total weighted read coverage (i.e. sum every base pair of every read that aligned)?

My solution is to take that target-pos-depth information and import a table into R with at least the following columns:
targetName
depth

I added the pos column here to emphasize the base-pair granularity
         tx pos depth
1   tx500090 227     1
2   tx500090 228     1
3   tx500090 229     1
4   tx500090 230     1
5   tx500090 231     1
...
66  tx500123 184     1
67  tx500123 185     1
68  tx500123 186     1
69  tx500123 187     2
70  tx500123 188     2
71  tx500123 189     2

In R
myCoverage<-read.table("myCoverage.txt",header=TRUE)
myxTab< -xtabs(depth ~ tx,data=myCoverage)

xtabs will sum up depth-weighted positions by default (i suppose this is what tabulated contigency really means) and return an unsorted list of transcripts and their weighted coverage (total base pair read coverage)
> myxTab[1:100]
tx
tx500090 tx500123 tx500134 tx500155 tx500170 tx500178 tx500203 tx500207
      38       92      610       46      176       46       92      130
tx500273 tx500441 tx500481 tx500482 tx500501 tx500507 tx500667 tx500684
      76     2390      114    71228      762      222      542      442
tx500945 tx500955 tx501016 tx501120 tx501127 tx501169 tx501190 tx501192
    1378     3604       46       46      420      854      130      352
tx501206 tx501226 tx501229 tx501245 tx501270 tx501297 tx501390 tx501405
     244     1204      206    15926      214       46      168       46
tx501406 tx501438 tx501504 tx501694 tx501702 tx501877 tx501902 tx502238
      38     2572     7768     3274      314      298       84      198
tx502320 tx502364 tx502403 tx502414 tx502462 tx502515 tx502517 tx502519
     122       38      588       46       46       38       38      466
tx502610 tx502624 tx502680 tx502841 tx502882 tx503090 tx503192 tx503204
     206       38      168     3750       38      122       76       92
tx503416 tx503468 tx503523 tx503536 tx503571 tx503578 tx503623 tx503700
     260       38      168       38       46       46       84       38
tx503720 tx503721 tx503722 tx503788 tx503872 tx503892 tx503930 tx503970
   97112       38       38     4708       38       38     1290       84
tx503995 tx504107 tx504115 tx504346 tx504353 tx504355 tx504357 tx504398
      46      152      206       46     3416     1402      122      290
tx504434 tx504483 tx504523 tx504589 tx504612 tx504711 tx504751 tx504827
     290     8728      176       46       46       76     5644     1308
tx504828 tx504834 tx504882 tx504931 tx504952 tx505017 tx505029 tx505078
    2336      328       46    34138     1000     1838       46      474
tx505123 tx505146 tx505159 tx505184
      38   123344      160      588

this is approximately 10000x faster than using a formula like:
by(myCoverage,myCoverage$tx,function(x){sum(x$depth)}

To leave a comment for the author, please follow the link and comment on their blog: Jermdemo Raised to the Law.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.