(more on) Pattern Matching for Transcription Factor Binding Sites
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have published some initial script scribblings on this task about a week ago. After another week I’m posting some better formed and annotated code. The Biostrings and BSGenomes packages are new to me and I’ve gone through many many iterations and experimenations to arrive at the as yet incomplete code below. Whilst the packages seem powerful I’m yet to get the hang of the object structures and methods for the various DNAStrings, DNAStringSets, Views etc etc..– and work out how best to functionally program with them. Indeed my effort to avoid simple loops has probably caused me all sorts of problems….ho hum.
1. The task is take a list of promoters (human upstream 1000 from BSGenomes).
2. Annotate these with accessible HGNC symbols using biomaRt
3. Create a list of DNAStrings which are putative transcription factor binding sites
4. Match the list of patterns against the list of promoters.
5. return the matches in an intelligible form.
Presently point 5 still needs a lot of work… And I need to do more work on different mismatch and insertion (indel) parameters for the match functions.
## I am firstly downloading upstream sequences using BSGenome and the latest human sequence build ## library(BSgenome) library(BSgenome.Hsapiens.UCSC.hg19) ls('package:BSgenome.Hsapiens.UCSC.hg19') ## this creates subject a so-called DNAStringSet of upstream gene sequences i.e. promoters## subject=Hsapiens$upstream1000 ## I use biomaRt here to pull out the annotation data for the refseq DNAs that are extracted from ## the BSGenome upstream sequences- especially the familiar HGNC symbol mart <- useMart('ensembl') mart<-useDataset("hsapiens_gene_ensembl",mart) atts=c('chromosome_name', 'start_position', 'strand', 'hgnc_symbol', 'ens_hs_gene', 'refseq_dna') ## this gobbledeygook just extracts a name string splits it into a list ## then gets the first element (the refseq) back and unlists it to vector nm nm=unlist(lapply(strsplit(names(subject), split='_up'), function(x)x[1])) gb= getBM(attributes=atts, filters='refseq_dna', values=nm, mart=mart) ## NB Lookout! biomaRt neither returns answers (i.e rows) for all the query values (refseqs) ## nor does it return them in the same order as the query (they are alphabetic) ## -- so you cant just reorder results and use the same index ## match loks for a first match this gets all matches OFC match.all=function(query, set) { wh=vector() for(i in query) wh=c(wh, which(set==i)) return(wh) } ## these are genes of particular interest TO ME- ie. gene symbols my_genes=c('PROX1', 'FOXC2', 'SOX18', 'ANPEP', 'NR2F2') ## so for the right index we go hgnc -> refseq -> position of refseq in BioStringSet my_refseqs=gb$refseq[match.all(my_genes, gb$hgnc_symbol)] my_refseqs_index=match.all(my_refseqs, nm) ## I make a smaller subest of interesting gene promoters 'subject2' subject2=subject[my_refseqs_index] ## the query string and its reverse complement - for now: will experiment later - ## CMARE = DNAString('TGCTGACGTCAGCA') CMARE_complement = complement(CMARE) TRE= DNAString('TGACTCA') TRE_complement= complement(TRE) CRE= DNAString('TGACGTCA') CRE_complement= complement(TRE) TMARE=DNAString('TGCTGACTCAGCA') TMARE_complement=complement(TMARE) ## this is a Degenerate MARE where IUPAC DNA code N is any base DMARE=DNAString('TGCNNNNNNNNGCA') DMARE_complement=complement(DMARE) ## I am using the IUPAC DNA code 'W' for A/T there are 3 'W' but this is just an approx guess from the lit. ## hence the name ATMARE ATMARE=DNAString('WWWTGCTGAC') ATMARE_complement=complement(ATMARE) ## make big NAMED list of these DNAStrings DNASTRLIST=list(CMARE=CMARE, CMARE_complement=CMARE_complement, TRE=TRE, TRE_complement=TRE_complement, CRE=CRE, CRE_complement=CRE_complement, TMARE=TMARE, TMARE_complement=TMARE_complement, DMARE=DMARE, DMARE_complement=DMARE_complement, ATMARE=ATMARE, ATMARE_complement=ATMARE_complement) ## extract the accession name with a bit of messing about ## nm=names(subject) nm=strsplit(nm, '_up') nm=unlist(lapply(nm, function(x)x[1])) matching=function(myDNAString, subject, max.mismatch=2) { vmatch=vmatchPattern(myDNAString, subject, max.mismatch=max.mismatch) wh=which(countIndex(vmatch)>0) for(i in wh) { if(exists('match_list')) { match_list=c(match_list, list(Views(subject[[i]], vmatch[[i]]))) names(match_list)[length(match_list)]=names(subject[i]) } if(!exists('match_list')) { match_list= list(Views(subject[[i]], vmatch[[i]])) names(match_list)[length(match_list)]=names(subject[i]) } } return(match_list) } lapply(DNASTRLIST, function(eachMARE) matching(eachMARE, subject2))
This code gives a list of results that looks something like this (attenuated) :
> test[3] $TRE $TRE$CMARE MIndex object of length 8 $TRE$CMARE_complement MIndex object of length 8 $TRE$TRE MIndex object of length 8 $TRE$TRE_complement MIndex object of length 8 $TRE$CRE MIndex object of length 8 $TRE$CRE_complement MIndex object of length 8 $TRE$TMARE MIndex object of length 8 $TRE$TMARE_complement MIndex object of length 8 $TRE$DMARE MIndex object of length 8 $TRE$DMARE_complement MIndex object of length 8 $TRE$ATMARE MIndex object of length 8 $TRE$ATMARE_complement MIndex object of length 8 $TRE$`NM_002763_up_1000_chr1_214160860_f chr1:214160860-214161859` Views on a 1000-letter DNAString subject subject: CGGCGAACTCGATCAGCTGTATCTGGGAAATGAAAAAAGAAAAAGAAAAAAAAAA...GTCCCACTGCTCCCTGCACCGCGTAAGTATCTTCTTCTTCCCCTCGTGAGTCCCT views: start end width [1] 136 142 7 [TGGCTCG] [2] 333 339 7 [TGAATCC] [3] 420 426 7 [TGTCTTA] [4] 789 795 7 [TGTCGCA] [5] 797 803 7 [AGACTCA] [6] 992 998 7 [TGAGTCC] $TRE$`NM_005251_up_1000_chr16_86599857_f chr16:86599857-86600856` Views on a 1000-letter DNAString subject subject: CCTCTAAAACCTCGATAGGTTATCCTTGACGACCCCGAGCCTGGAAACTCCCTGT...GCCAGGAGCCCGGGGCCGCCCCTCCCGCTCCCCTCCTCTCCCCCTCTGGCTCTCT views: start end width [1] 59 65 7 [TGATTAA] [2] 71 77 7 [TGATTAA] [3] 131 137 7 [TGAATCC] [4] 184 190 7 [TCACTCA] [5] 204 210 7 [TGTCCCA] [6] 249 255 7 [TTACTCT] [7] 485 491 7 [CGACTCC] [8] 510 516 7 [TGGCTCA] [9] 622 628 7 [GGGCTCA] [10] 658 664 7 [TGACCCT] [11] 992 998 7 [TGGCTCT] $TRE$`NM_018419_up_1000_chr20_62680980_r chr20:62680980-62681979` Views on a 1000-letter DNAString subject subject: CCTTCTCCAGTAGAACCATGCGTGGCAGTGGGAGGGGAGGACCTCCAGAGGGAGA...GCCCTGAGCCGCTATATCTGGGCGCCCGCCCGGCCCGAGGCCACCGCCGTCCCCA views: start end width [1] 186 192 7 [GGACCCA] [2] 202 208 7 [TGCCTCT] [3] 374 380 7 [TGTGTCA] [4] 431 437 7 [TGTCCCA] [5] 443 449 7 [GCACTCA] [6] 469 475 7 [TTACACA] [7] 509 515 7 [GGACTCA] [8] 567 573 7 [GGACTTA] [9] 712 718 7 [CGACTCC] [10] 849 855 7 [GGCCTCA] and so on and on.....
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.