Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’m trying to search for binding sites for the transcription factor MAF (i.e. TFBS for MAF) in the promoter regions of various genes. I initially started out looking at a precomputed database of binding sites MAPPER. However the TFBS models that have been used from TRANSFAC or JASPAR are quite unlike the human binding sites shown in the recent literature (the sites in the recent literature are palindromic sequences called MAREs).
Unfortunately the palindromic sites shown in the literature do not have alignment or position-weight-matrices (PWM) and they are quite variable. It would appear that these binding sites have not been derived by an unbiased screen i.e. ChIP with an antibody against a genome. Rather they have mutated a known site in a single gene promoter and tested which ones still work. Jeez!
So I cannot create my own alignments or position weight matrices. Instead I will try some cruder pattern matching and test a few mismatch and indel options (possibly with IUPAC coding later). Here is scripting for a first effort:
## I am firstly downloading upstream sequences using BSGenome and the latest human sequence build ## library(BSgenome) library(BSgenome.Hsapiens.UCSC.hg19) ls('package:BSgenome.Hsapiens.UCSC.hg19') ## this creates subject a so-called DNAStringSet of upstream gene sequences i.e. promoters## subject=Hsapiens$upstream1000 ## the query string and its reverse complement - for now: will experiment later - ## CMARE <- DNAString('TGCTGACGTCAGCA') CMARE_complement <- complement(CMARE) ## extract the accession name with a bit of messing about ## nm=names(subject) nm=strsplit(nm, '_up') nm=unlist(lapply(nm, function(x)x[1])) ## this is matching forward and complement CMARE DNAStrings to the promoters DNAStringSet ## mindexF <- vmatchPattern(CMARE, subject, max.mismatch=1) mindexR <- vmatchPattern(CMARE_complement, subject, max.mismatch=1) ## I'm using 1 mismatch here but will probably adapt later ## ## create indices whF + whR of the matches count_indexF <- countIndex(mindexF) count_indexR <- countIndex(mindexR) total_matches= sum(count_indexF)+ sum(count_indexR) # Total number of matches. whF <- which(count_indexF > 0) whR <- which(count_indexR > 0) ## next we extract the data for each and put it in a sort of list view ## result_listF=list() result_listR=list() for(i in 1: length(whF)) { result_listF[[i]]= Views(subject[[whF[i]]], mindexF[[whF[i]]]) } for(i in 1: length(whR)) { result_listR[[i]]= Views(subject[[whR[i]]], mindexR[[whR[i]]]) } ## and add the original accesion names to the correct list index names(result_listF)=nm[whF] names(result_listR)=nm[whR] result_list=c(result_listF, result_listR)
Now… I don’t like this script much as it seems a faff to do everything twice for forward and reverse strands… but I was in a hurry an didn’t have time to really understand the object classes and combine them properly. I also never like using for loops but in this case I couldn’t see any way to make it more functional stylee..
Anyway that gives a result something like this:
> result_list[1:2] $NM_181575 Views on a 1000-letter DNAString subject subject: CAGACCTGGCCTACCCCAGCCCCCACCTTAGTCTGAATCCTCAGCGTTGCGATG...AGAAGTGCACGCCGGGACCAGGATTCCGGGAGGCCGACTCCTCCCTGCCCCAC views: start end width [1] 716 729 14 [TCCTGACGTCAGCA] $NM_173826 Views on a 1000-letter DNAString subject subject: CACTTGGGGGATTTTCTGAGGACTTTGTACTAGTGGCATATGTAACTCCTGTGC...CAGGGCACCGCGGAAGGTTGAATCCACGCCCTGGCTGTGCGCAGGCGCCGTAG views: start end width [1] 929 942 14 [TGCTGACGTCAGCT]
Next I need to use biomart and Ensembl to try and get gene symbols, chromosomes, start positions, strand etc.. for the promoters. There is an R interface package biomaRt for this service. Once again however Ensembl/biomart is having a nervous breakdown– so I’ll probably continue this tomorrow.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.