Extract different characters between two strings of equal length

fabiomarroni

10 years ago

[This article was first published on Fabio Marroni's Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the desperate effort of understanding the secret of life it may be too simplistic to just count the differences between two strings of equal length. You might as well want to know where they differ.

You can do that recycling most of the function published in a previous post.

You can use it to compare two nucleotide sequences, two amino acid sequences or just two strings of equal length.

To facilitate analysis of DNA sequences, I gave the opportunity to ignore case (so that an “a” is equivalent to “A”) and to ignore positions carrying special characters, such as “N”, “?” or “-”.

Input:
a: first sequence (or string)
b: second sequence (or string)
exclude: character (or vector of characters) to be excluded from comparison
ignore.case: logical. If TRUE consider “a” equal to “A”. If FALSE consider “a” different from “A”.
show.excluded: logical. If TRUE, positions carrying an excluded character (such as “-”) will be returned, as NA. IF FALSE, positions in which one of the two strings carry an excluded character will be omitted.

Output:
only.diff: Matrix (or vector) reporting differences between the two strings and their position.

As usual, feedback welcome!

list.string.diff<-function(a="ATTCGA-",b="attTGTT",exclude=c("-","?"),ignore.case=TRUE)
{
if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.")
if(ignore.case==TRUE)
{
a<-toupper(a)
b<-toupper(b)
}
seq.a<-unlist(strsplit(a,split=""))
seq.b<-unlist(strsplit(b,split=""))
diff.d<-rbind(seq.a,seq.b)
only.diff<-diff.d[,diff.d[1,]!=diff.d[2,]]
pos<-which(diff.d[1,]!=diff.d[2,])
only.diff<-rbind(pos,only.diff)
for(ex.loop in 1:length(exclude))
{
only.diff<-only.diff[,!(only.diff["seq.a",]==exclude[ex.loop]|only.diff["seq.b",]==exclude[ex.loop])]
}
return(only.diff)
}

It might happen that you have two DNA or amino acid sequences of different length. I suggest that you align them with clustalW and save the output in a text file. You can then use the read.alignment() function of the R package seqinr to get the two aligned sequences, padded by “-” so that they have the same length.
I paste below an example of clustalW output

CLUSTAL 2.1 multiple sequence alignment


test1      MESLGVRKGAWIQEEDVLLRKCIEKYGEGKWHLVPLRAGLNRCRKSCRLRWLNYLKPDIK
test2      MESLGVRKGAWIQEEDVLLRKCIEKYGEGKWHLVPLRAGLNRCRKSCRLRWLNYLKPDIK
           ************************************************************

test1      RGEFALDEVDLMIRLHNLLGNRWSLIAGRLPGRTANDVKNYWHSHHFKKEVQFQEEGRDK
test2      RGEFALDEVDLMI--------------GRLPGRTANDVKNYWHSHHFKKEVQFQEEGRDK
           ************************************************************

test1      PQTHSKTKAIKPHPHKFSKALPRFELKTTAVDTFDTQVSTSRKPSSTSPQPNDDIIWWES
test2      PQTHSKTKAIKPHPHKFSKALPRFELKTTAVDTFDTQVSTSSKPSSTSPQPNDDIIWWES
           ***************************************** ******************

test1      LLAEHAQMDQETDFSASGEMLIASLRTEETATQKKGPMDGMIEQIQGGEGDFPFDVGFWD
test2      LLAEHAPMDQETDFSASGEMLIASLRTEETATQKKGPMDGMIEQIQGGEGDFPFDVGFWD
           ****** *****************************************************

test1      TPNTQVNHLI
test2      TPNTQVNHLI
           **********

You can then save this alignment with the name “test.clustalw”, read it with the following function which calls both read.alignment() and list.string.diff().

comp.seq<-function(infile="test.clustalw",format="clustal",
                   exclude=c("-","?"))
{
library(seqinr)
pseq<-read.alignment(infile,format=format)
pseq1<-as.character(pseq$seq[1])
pseq2<-as.character(pseq$seq[2])
pdiff<-list.string.diff(a=pseq1,b=pseq2,exclude=exclude)
return(pdiff)
}

If you try, you should get the following output

      [,1]  [,2] 
pos   "162" "187"
seq.a "R"   "Q"  
seq.b "S"   "P"

Some nice improvements to the function. Thanks to richierocks for his comments!

list.string.diff <- function(a, b, exclude = c("-", "?"), ignore.case = TRUE, show.excluded = FALSE)
{
if(nchar(a)!=nchar(b)) stop("Lengths of input strings differ. Please check your input.")
if(ignore.case)
{
a <- toupper(a)
b <- toupper(b)
}
split_seqs <- strsplit(c(a, b), split = "")
only.diff <- (split_seqs[[1]] != split_seqs[[2]])
only.diff[
(split_seqs[[1]] %in% exclude) |
(split_seqs[[2]] %in% exclude)
] <- NA
diff.info<-data.frame(which(is.na(only.diff)|only.diff),
split_seqs[[1]][only.diff],split_seqs[[2]][only.diff])
names(diff.info)<-c("position","poly.seq.a","poly.seq.b")
if(!show.excluded) diff.info<-na.omit(diff.info)
diff.info
}

To leave a comment for the author, please follow the link and comment on their blog: Fabio Marroni's Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.