Generate artificial DNA or protein sequences in R in a single line of code.
[This article was first published on Computational Biology Blog in fasta format, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
To generate an artificial DNA sequence of “n” bases long with a fixed composition bias in just one line of code, just open your R prompt and type:
As you see, the alphabet is the four letter alphabet of the DNA (so, if desired, you can replace that alphabet with any other alphabet that you might need, like the amino acids 20 letter code), the next parameter is the length of the desired output sequence. rep=TRUE means that each letter of the alphabet could be repeated and finally, I think that the most useful parameter of the function sample is the prob parameter because it allows you to select the multinomial model (the proportion or "bias" of each base). For example our simulated sequence is 80% AT rich and 20% GC rich given that for the "A" base we got a probability of 0.4, for the "C" base 0.2 and so on.
Now, to check it out that our artificial sequence follows that bias, a simple plot will tell us more than thousand words:
Lets extract the proportion of each base along the sequence using the table and the length function:
Now, lets plot it doing:
And finally we got this awesome plot.
seqX <- sample(c("A","C","G","T"),10000,rep=TRUE,prob=c(0.4,0.1,0.1,0.4))
As you see, the alphabet is the four letter alphabet of the DNA (so, if desired, you can replace that alphabet with any other alphabet that you might need, like the amino acids 20 letter code), the next parameter is the length of the desired output sequence. rep=TRUE means that each letter of the alphabet could be repeated and finally, I think that the most useful parameter of the function sample is the prob parameter because it allows you to select the multinomial model (the proportion or "bias" of each base). For example our simulated sequence is 80% AT rich and 20% GC rich given that for the "A" base we got a probability of 0.4, for the "C" base 0.2 and so on.
Now, to check it out that our artificial sequence follows that bias, a simple plot will tell us more than thousand words:
Lets extract the proportion of each base along the sequence using the table and the length function:
freqX <- table(seqX)/length(seqX)
Now, lets plot it doing:
barplot(freqX,col=1:4,main="Compositional bias of seqX",xlab="Base",ylab="Base proportion")
And finally we got this awesome plot.
So, the barplot shows that effectively each base is well represented by the multinomial model and our artificial sequence is loyal to it.
Benjamin
To leave a comment for the author, please follow the link and comment on their blog: Computational Biology Blog in fasta format.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.