Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We begin with a simple introduction in sequence data. This tutorial assumes that readers have basic knowledge about the central dogma and biological molecules such as protein, DNA and RNA; and also understand that they are biopolymers naturally or synthetically produced for a specific role. Wikipedia is a great resource to get the basic introduction on these. However, I will also write a few sentences in my exercises to refresh the knowledge of the readers. Numerous programming languages have been used for the development of Bioinformatics, ranging from low end languages such as Java, C/C++ to high end scripting languages such as perl, python and the R statistical language. Bioinformatics also involves extensive database management implementation for storage, query and updating the sequence and numerical data. Most of the Bioinformatics software can be implemented either on a Windows, Mac or Linux platform. This tutorial also assumes that the reader has some understanding about R programming, RStudio and installation of packages. We will use numerous packages both common as well as strictly developed for Bioinformatics.
The open source community known as Bioconductor specifically develops the Bioinformatics tools using R for the analysis and comprehension of high-throughput genomic data. It boasts to have two releases each year, 1296 software packages, and an active user community. It has packages developed for application ranging from basic sequence alignment to recent years’ NextGen sequencing machines. We hope that we will able to discuss more about it in later chapters, when not only readers would be able to appreciate its relevance but also utilize it in their daily examples. Let us begin with some little steps by installing a few packages on our machines. I personally prefer to begin with a fresh installation to make sure their are no conflicts from the preinstalled packages.
To begin with let us first install a package known as Biostrings by running the following command on your RStudio.
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
This would take up to 5 minutes depending upon your internet connection and computer. Note that typing install.packages(“biostrings”) in RStudio might not result into success because of version issues. As usual use library(Biostrings)
command to load the package.
Deoxyribonucleic acid, or DNA, stores biological information which is expressed in form of RNA inside the cells. The two antiparallel DNA strands are termed polynucleotides and they are composed of simpler monomer units called nucleotides. Each nucleotide is composed of one of four nitrogen-containing nucleobases—either cytosine (C), guanine (G), adenine (A), or thymine (T). The randomly arranged nucleotide finally write up the code of life in the form of strict vocabulary.
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Using the following command, create a list of DNA exactly the same as the one given below, then list the outcome by writing the vector myDNA.
myDNA <- c("ATGTTGCATTCATTAATTAAGAACGACCCAATACA")
myDNA
Exercise 2
In recent years there has been exponential advancement in the field of DNA sequencing. High throughput methods have been developed to speed up the sequencing project. Still the basic DNA sequencing is a technology that reads each nucleotide step-by-step by chemical methods to decipher what order of letters A, T, G and C were placed that resulted into the specific DNA sequence. However, for this problem we have given you a short result from sequencing data. In a typical laboratory sequencing results, some of the DNA sequence could look something like this “CTGATTT-GATGGTC-NAT” where apart from ATGC we could see some dashes (skipped) and N (unknown) nucleotide. Copy these into mySequencing and print the result. Use the length()
command to find out the length. Which of the following is the length
A [1] 1
B [2] 19
C [1] 0
Exercise 3
Now use the package specific command called DNAString
to copy the seq
myDNASeq <- DNAString("CTGATTT-GATGGTC-NAT")
Now use length
command to find the length of the myDNASeq. Compare the difference between mySequencing
and myDNASeq
.
Exercise 4
Use the class()
command to analyze the datatype of the myDNA and myDNASeq. Which of the following represents the result
A
class(myDNASeq)
[1] "DNAString"
attr(,"package")
[1] "Biostrings"
> class(myDNA)
[1] "integr"
B
> class(myDNASeq)
[1] "DNAString"
attr(,"package")
[1] "Biostrings"
> class(myDNA)
[1] "character"
C None of the above
D Both
Exercise 5
In this problem let us attempt to paste the myDNA
and myDNASeq
sequences using paste
and analyse the results. Try to understand the syntax of paste
command using ?paste
command. Copy the result into pastedDNA
and print it by typing pastedDNA
.
Try to analyze the pastedDNA
using class
command.
Exercise 6
Did you notice that final result was a character
class and not the Biostring
as expected? This would make pastedDNA
not usable for biostring for any purpose. What happens is that Biostrings
introduces a new data structure hierarchy which is different than the vector
datatype of R. It has few sequence constructors such as DNAString()
, RNAString()
, AAString()
for type of biomolecules (DNA, RNA and Amino Acids). This results into the fact that Biostrings
objects cannot be compared with R strings ( myDNA == myDNASeq
is an invalid command). You will always end up with FALSE
upon attempting to compare.
Next we would look at some basic transformations in Biostrings that can be implemented on DNA data. These transformations are reverse()
, complement()
, reverseComplement()
and translate()
. Run the myDNASeq <- DNAString("CTGATTT-GATGGTC-NAT")
again. Run the complement(myDNASeq)
. Analyze the data.
Exercise 7
Did you find out what as happened in previous problem? This just created the complement of each nucleotide. The DNA usually exists as a double helix with both strands running antiparallel. Each base of ATGC is paired with some base on complementary strand. A has preference for T (and vise versa), G has preference for C (and vise versa). So the complementary strand is going to have the nucleotide that is pairs preferentially. However, the unknown nucleotide N gets written as N because the sequencer could not tell what it was.
In next problem run reverse(myDNASeq)
Exercise 8
Did you notice what has happened? Did nucletide in DNA sequences got read from back to front and not front to back? Or not. Each complementary strand is usually written from back to front because the complementary strands are anti-parallel. This because the sign of each strand is opposite. This is basically because the strand which basically is a polymer of nucleotide either has 5’ or 3’ end (more about it later). So the +ve strand runs from 5’-3’ and -ve strand runs from 3’-5’. In the next problem run reverseComplement(myDNASeq)
Exercise 9
Did you notice what happened. The DNA sequences not only got complementary but also reversed. This is how double stranded DNA exists in nature.
Run alphabetFrequency(myDNASeq)
Exercise 10
The values in problem 9 gave you the frequency of occurrence of each nucleotide in this DNA. This is an important thing to know when analyzing the DNA.
Finally run translate(myDNASeq) . This would yield the hypothetical protein sequence that myDNASeq would produce. Afterall that is one of the important role of DNA, to code for the protein. What did you get?
In our next exercises we will work little more with Biostrings to analyze the DNA at little more. Happy learning.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.