Site icon R-bloggers

Assessing the number of mapped reads in several bam files

[This article was first published on gacatag, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently I have started to organize my commonly used functions related to quality assessment and analyzing RNAseq data into an R package. It is called rnatoolbox and it is available here. In this post I introduce getMappedReadsCount(), i.e. a function that can be used for checking the number of aligned/mapped fragments in several bam files and detecting the outliers. The outliers are the bam files with oddly high (i.e. exceeding1.5 times the interquartile) and oddly low (i.e. lower than 1.5 times the interquartile) number of mapped fragments.


The getMappedReadsCount() function is especially useful when our aim is to analyze bam files from several samples. It may be a good idea to filter out those samples that their bam files have very low mapped reads !

 In the example, I first construct a vector that has path to several bam files (from various paths on the server!) that I would like to analyze: 

library(rnatoolbox)
bam1<- scan(

    
“/proj/pehackma/ali/OBSCN_ExonUsage/encode/bamOnlyFiles.txt”, 
   
what=”character”, sep=’\n’)
bam2<- list.files(

   
path = “/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis/star_align/”,
   
pattern = “.bam$”, recursive = TRUE, full.names = TRUE)
bam3<- list.files(

   
path = “/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis2/star_align/”,
   
pattern = “.bam$”, recursive = TRUE, full.names = TRUE)
bam2<- bam2[-c(grep(“TX_”,bam2))]
bam4<- scan(

   
“/proj/pehackma/ali/OBSCN_ExonUsage/bamOnlyFiles_05_10_2022.txt”,
   
what=”character”, sep=’\n’)
bam5<- scan(

   
“/proj/pehackma/ali/OBSCN_ExonUsage/bamOnlyFiles2_05_10_2022.txt”,
   
what=”character”, sep=’\n’)
bam6<- list.files(path =

   
“/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis/star_align/”,
   
pattern = “.bam$”, recursive = TRUE, full.names = T)
bam7<- list.files(

   
path = “/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis2/star_align/”,
   
pattern = “.bam$”, recursive = TRUE, full.names=T)
bam6<- bam6[c(grep(“TX_”,bam6))]
bam7<- bam7[c(grep(“TX_”,bam7))]


bamVec<- unique(c(bam1,bam2,bam3,bam4,bam5,bam6,bam7))

Next, I define the name of the vector elements to include the sample names. This consists of extracting the file name from the full path of the files and removing everything after the “.”.

names(bamVec)<- gsub(“\\..*”,””,basename(bamVec))

Finally, I run the function to get the read counts and plot the values. Note that the chromosomes to be considered for the read/fragment counts is defined with the chr parameter. Here, chr1-22 and chrX and chrY are considered.

getReadCount( bamFiles= bamVec,
   
plotFile= “/proj/pehackma/ali/test/test_rnatoolbox/getMappedReadCount_res_3.png”,
          chr=paste(“chr”,c(1:22, “X”, “Y”), sep=””),

         
fileFormat=”png”,
          height=9, width=19, mar=c(24.1, 4.1, 1.1, 1.1), main=””,

         
cex=1.5)

# [1] 159088406 316574542 205082424 244803844 104856472 191436762 184803592
# [8] 197802126 122037534 160421014 122235780 181429548 199137268 161879702
# [15] 184806182 182604280 176479586 188499272 173812402 180758764 154386574
# [22] 184470344 175950728 178088586 169036670 176570622 194848566 228952806
# [29] 204706202 197660756 140237062 184285630 157409076 368553534 316282098
# [36] 252580714 204283090 449077092 334729570 320968158 260299694 253185212
# [43] 230665084 264365018 262947800 227354874 307361834 275600346 293804574
# [50] 278743518 220880234 329853936 266015994 280758612 219623626 234773624
# [57] 215822812 289486614 265109110 241848712 218349294 268142972 228934248
# [64] 251021610 238898076 220085138 251466112 235138264 296144082 233070926
# [71] 232363502 218421700 202184526 293047712 303007888 217222656 243148532
# [78] 281507178 163940324 152692722 197206004 118648912 163852344 185098294
# [85] 188690242 153280640

and the plot is as follows !

Similar to the default setting of boxplot, the horizontal gray lines show the interquartile ranges (i.e. the solid narrow lines), the 1.5 the interquartile range (i.e. the dashed lines) and the median (i.e. the solid thick line). The outlier sample with oddly high number of mapped read counts in shown with a red up-pointing triangle.



To leave a comment for the author, please follow the link and comment on their blog: gacatag.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version