Inferring the gender of the subjects from RNAseq BAM files

Di Cook

2 hours ago

[This article was first published on gacatag, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post I show how getGender(), function can be used for inferring the gender of the studied subjects from their binary alignment bam files. The gender can be a source of unwanted variation within the data, for which you may want to adjust your differential gene expression or splicing analysis. However, complete metadata are unfortunately not always available. Furthermore, sometimes details within metadata are incorrect or have been misplaced due to manual error. Therefore, it is a good practice to quickly double check some details within the data to either complete the missing metadata information or to make sure that the prior stages have been performed without any accidental mix-ups.

ThegetGender() function can be run on a vector that contains the path to several bam files.

In the example, I first construct a vector that has the paths to several bam files (from various places on the server!) that I would like to analyze:

library(rnatoolbox)

bam1<-scan(
“/proj/pehackma/ali/OBSCN_ExonUsage/encode/bamOnlyFiles.txt”,
what=”character”, sep=’\n’)
bam2<- list.files(
path = “/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis/star_align/”,
pattern = “.bam$”, recursive = TRUE, full.names = TRUE)
bam3<- list.files(
path = “/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis2/star_align/”,
pattern = “.bam$”, recursive = TRUE, full.names = TRUE)
bam2<- bam2[-c(grep(“TX_”,bam2))]
bam4<-scan(
“/proj/pehackma/ali/OBSCN_ExonUsage/bamOnlyFiles_05_10_2022.txt”,
what=”character”, sep=’\n’)
bam5<-scan(
“/proj/pehackma/ali/OBSCN_ExonUsage/bamOnlyFiles2_05_10_2022.txt”,
what=”character”, sep=’\n’)
bam6<- list.files(
path = “/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis/star_align/”,
pattern = “.bam$”, recursive = TRUE, full.names = T)
bam7<- list.files(
path = “/proj/pehackma/ali/OBSCN_ExonUsage/newDataAnalysis2/star_align/”,
pattern = “.bam$”, recursive = TRUE, full.names=T)
bam6<- bam6[c(grep(“TX_”,bam6))]
bam7<- bam7[c(grep(“TX_”,bam7))]

bamVec<- unique(c(bam1,bam2,bam3,bam4,bam5,bam6,bam7))
names(bamVec)<- gsub(“\\..*”,””,basename(bamVec))

The classification will be based on the number of the fragments mapping to chromosome Y (defined as the numChr parameter) vs the number of fragments mapping to chromosome Y (defined as the numChr parameter). I need to define a ratio cutoff. For now I assign 0.05. Later however I will show how a suitable cutoff can be chosen and why 0.05 makes sense for this data.

ratioCuftoff=0.05

(chrRatio<- getGender(bamFiles=bamVec,
          plotFile=”/proj/pehackma/ali/test/test_rnatoolbox/getGender_res_3.png”,
          height=600, width=1100, fileFormat=”png”,pointsize=14,
          numChr=”chrY”, denumChr=”chrX”,
          mar=c(24.1, 4.1, 1.1, 1.1),
          hLine=ratioCuftoff, main=””))

# [1] 0.01506109 0.07722695 0.01344776 0.01196316 0.02678638 0.06733287
# [7] 0.01923814 0.07279685 0.06990508 0.01842288 0.06224493 0.01863105
# [13] 0.02040690 0.01802919 0.01715347 0.01968033 0.01868028 0.03803660
# [19] 0.01974870 0.01619692 0.07069193 0.06678158 0.07593763 0.10010664
# [25] 0.07228590 0.06714948 0.07386076 0.06651051 0.06846690 0.06959353
# [31] 0.08471437 0.07470632 0.09894125 0.08619078 0.01249461 0.07840080
# [37] 0.06956370 0.01244372 0.07121727 0.06335510 0.02085689 0.10241148
# [43] 0.07971098 0.09502325 0.02094541 0.07242126 0.02400726 0.01946038
# [49] 0.02036327 0.08630780 0.01860873 0.06303455 0.01895086 0.01871783
# [55] 0.02924722 0.08475490 0.08575956 0.08299903 0.08161365 0.02532606
# [61] 0.02747711 0.09117602 0.08533186 0.11900393 0.08611137 0.02049278
# [67] 0.08114836 0.07215539 0.06997266 0.08368385 0.08486157 0.07210998
# [73] 0.09763424 0.09177915 0.08260885 0.01730037 0.07390761 0.10200900
# [79] 0.08319738 0.09293284 0.09269107 0.02454981 0.10519674 0.09717035
# [85] 0.08505843 0.09700546

and the plot looks like this:

Now let’s see how to find a good ratio cutoff. The best ratio should be where the biggest jump within the ratio values are! This is how it can be detected.

diffDf<- data.frame(ratio=sort(chrRatio),
                    consequtive_difference=c(NA,diff(sort(chrRatio))))

head(diffDf)
# ratio consequtive_difference
# 1 0.01196316                     NA
# 2 0.01244372           4.805589e-04
# 3 0.01249461           5.088199e-05
# 4 0.01344776           9.531592e-04
# 5 0.01506109           1.613323e-03
# 6 0.01619692           1.135837e-03

indJump<- which(
diffDf$consequtive_difference==
    max(diffDf$consequtive_difference, na.rm=TRUE))

We notice that after sorting (in increasing order), the biggest jump of ratio values is from 0.03803660 to 0.06224493. The sum of htese 2 values divided 2 would be a good choice which is ~ 0.05.

diffDf[c(indJump-1,indJump),]
# ratio consequtive_difference
# 31 0.03803660 0.008789386
# 32 0.06224493 0.024208324

sum(diffDf[c(indJump-1,indJump),1])/2
#[1] 0.05014077

We can Finally use the ratios and the cutoff to label males and females.

(gender<- c(“FEMALE”,”MALE”)[as.numeric(chrRatio>ratioCuftoff)+1])
#[1] “FEMALE” “MALE”   “FEMALE” “FEMALE” “FEMALE” “MALE”   “FEMALE” “MALE”
#[9] “MALE”   “FEMALE” “MALE”   “FEMALE” “FEMALE” “FEMALE” “FEMALE” “FEMALE”
#[17] “FEMALE” “FEMALE” “FEMALE” “FEMALE” “MALE”   “MALE”   “MALE”   “MALE”
#[25] “MALE”   “MALE”   “MALE”   “MALE”   “MALE”   “MALE”   “MALE”   “MALE”
#[33] “MALE”   “MALE”   “FEMALE” “MALE”   “MALE”   “FEMALE” “MALE”   “MALE”
#[41] “FEMALE” “MALE”   “MALE”   “MALE”   “FEMALE” “MALE”   “FEMALE” “FEMALE”
#[49] “FEMALE” “MALE”   “FEMALE” “MALE”   “FEMALE” “FEMALE” “FEMALE” “MALE”
#[57] “MALE”   “MALE”   “MALE”   “FEMALE” “FEMALE” “MALE”   “MALE”   “MALE”
#[65] “MALE”   “FEMALE” “MALE”   “MALE”   “MALE”   “MALE”   “MALE”   “MALE”
#[73] “MALE”   “MALE”   “MALE”   “FEMALE” “MALE”   “MALE”   “MALE”   “MALE”
#[81] “MALE”   “FEMALE” “MALE”   “MALE”   “MALE”   “MALE”

To leave a comment for the author, please follow the link and comment on their blog: gacatag.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Related