Extract different characters between two strings of equal length
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In the desperate effort of understanding the secret of life it may be too simplistic to just count the differences between two strings of equal length. You might as well want to know where they differ.
You can do that recycling most of the function published in a previous post.
You can use it to compare two nucleotide sequences, two amino acid sequences or just two strings of equal length.
To facilitate analysis of DNA sequences, I gave the opportunity to ignore case (so that an “a” is equivalent to “A”) and to ignore positions carrying special characters, such as “N”, “?” or “-”.
Input:
a: first sequence (or string)
b: second sequence (or string)
exclude: character (or vector of characters) to be excluded from comparison
ignore.case: logical. If TRUE consider “a” equal to “A”. If FALSE consider “a” different from “A”.
show.excluded: logical. If TRUE, positions carrying an excluded character (such as “-”) will be returned, as NA. IF FALSE, positions in which one of the two strings carry an excluded character will be omitted.
Output:
only.diff: Matrix (or vector) reporting differences between the two strings and their position.
As usual, feedback welcome!
list.string.diff<-function(a="ATTCGA-",b="attTGTT",exclude=c("-","?"),ignore.case=TRUE) { if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.") if(ignore.case==TRUE) { a<-toupper(a) b<-toupper(b) } seq.a<-unlist(strsplit(a,split="")) seq.b<-unlist(strsplit(b,split="")) diff.d<-rbind(seq.a,seq.b) only.diff<-diff.d[,diff.d[1,]!=diff.d[2,]] pos<-which(diff.d[1,]!=diff.d[2,]) only.diff<-rbind(pos,only.diff) for(ex.loop in 1:length(exclude)) { only.diff<-only.diff[,!(only.diff["seq.a",]==exclude[ex.loop]|only.diff["seq.b",]==exclude[ex.loop])] } return(only.diff) }
It might happen that you have two DNA or amino acid sequences of different length. I suggest that you align them with clustalW and save the output in a text file. You can then use the read.alignment() function of the R package seqinr to get the two aligned sequences, padded by “-” so that they have the same length.
I paste below an example of clustalW output
CLUSTAL 2.1 multiple sequence alignment test1 MESLGVRKGAWIQEEDVLLRKCIEKYGEGKWHLVPLRAGLNRCRKSCRLRWLNYLKPDIK test2 MESLGVRKGAWIQEEDVLLRKCIEKYGEGKWHLVPLRAGLNRCRKSCRLRWLNYLKPDIK ************************************************************ test1 RGEFALDEVDLMIRLHNLLGNRWSLIAGRLPGRTANDVKNYWHSHHFKKEVQFQEEGRDK test2 RGEFALDEVDLMI--------------GRLPGRTANDVKNYWHSHHFKKEVQFQEEGRDK ************************************************************ test1 PQTHSKTKAIKPHPHKFSKALPRFELKTTAVDTFDTQVSTSRKPSSTSPQPNDDIIWWES test2 PQTHSKTKAIKPHPHKFSKALPRFELKTTAVDTFDTQVSTSSKPSSTSPQPNDDIIWWES ***************************************** ****************** test1 LLAEHAQMDQETDFSASGEMLIASLRTEETATQKKGPMDGMIEQIQGGEGDFPFDVGFWD test2 LLAEHAPMDQETDFSASGEMLIASLRTEETATQKKGPMDGMIEQIQGGEGDFPFDVGFWD ****** ***************************************************** test1 TPNTQVNHLI test2 TPNTQVNHLI **********
You can then save this alignment with the name “test.clustalw”, read it with the following function which calls both read.alignment() and list.string.diff().
comp.seq<-function(infile="test.clustalw",format="clustal", exclude=c("-","?")) { library(seqinr) pseq<-read.alignment(infile,format=format) pseq1<-as.character(pseq$seq[1]) pseq2<-as.character(pseq$seq[2]) pdiff<-list.string.diff(a=pseq1,b=pseq2,exclude=exclude) return(pdiff) }
If you try, you should get the following output
[,1] [,2] pos "162" "187" seq.a "R" "Q" seq.b "S" "P"
Some nice improvements to the function. Thanks to richierocks for his comments!
list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE, show.excluded = FALSE) { if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.") if(ignore.case) { a <- toupper(a) b <- toupper(b) } split_seqs <- strsplit(c(a, b), split = "") only.diff <- (split_seqs[[1]] != split_seqs[[2]]) only.diff[ (split_seqs[[1]] %in% exclude) | (split_seqs[[2]] %in% exclude) ] <- NA diff.info<-data.frame(which(is.na(only.diff)|only.diff), split_seqs[[1]][only.diff],split_seqs[[2]][only.diff]) names(diff.info)<-c("position","poly.seq.a","poly.seq.b") if(!show.excluded) diff.info<-na.omit(diff.info) diff.info }
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.