Bioinformatics Tutorial with Exercises in R (part 1)

Kalyan Srivastava

5 years ago

[This article was first published on R-exercises, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Bioinformatics is an interdisciplinary field of study that combines the field of biology with computer science to understand biological data. Bioinformatics is generally used in laboratories as an initial or final step to get the information. This information can subsequently be utilized for the wet lab practices. However, it can also be used as a standalone computational method to get the answers to relevant biological questions. Bioinformatics has greatly accelerated the discovery process and is used in both academics as well as industry to understand genomics & proteomics data, determine the protein structure, RNA structures, analyzing sequence data for molecular biological work, gene identification, to name a few applications. Over the years it has grown into a complex field of discipline requiring specific education and professional training.

We begin with a simple introduction in sequence data. This tutorial assumes that readers have basic knowledge about the central dogma and biological molecules such as protein, DNA and RNA; and also understand that they are biopolymers naturally or synthetically produced for a specific role. Wikipedia is a great resource to get the basic introduction on these. However, I will also write a few sentences in my exercises to refresh the knowledge of the readers. Numerous programming languages have been used for the development of Bioinformatics, ranging from low end languages such as Java, C/C++ to high end scripting languages such as perl, python and the R statistical language. Bioinformatics also involves extensive database management implementation for storage, query and updating the sequence and numerical data. Most of the Bioinformatics software can be implemented either on a Windows, Mac or Linux platform. This tutorial also assumes that the reader has some understanding about R programming, RStudio and installation of packages. We will use numerous packages both common as well as strictly developed for Bioinformatics.

The open source community known as Bioconductor specifically develops the Bioinformatics tools using R for the analysis and comprehension of high-throughput genomic data. It boasts to have two releases each year, 1296 software packages, and an active user community. It has packages developed for application ranging from basic sequence alignment to recent years’ NextGen sequencing machines. We hope that we will able to discuss more about it in later chapters, when not only readers would be able to appreciate its relevance but also utilize it in their daily examples. Let us begin with some little steps by installing a few packages on our machines. I personally prefer to begin with a fresh installation to make sure their are no conflicts from the preinstalled packages.

< aside class='stb-icon'>

Learn more about Bioconductor in the (Free!) online courses Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomics Assays and Case Studies in Functional Genomics offered by Harvard University.

To begin with let us first install a package known as Biostrings by running the following command on your RStudio.
source("https://bioconductor.org/biocLite.R")biocLite("Biostrings")
This would take up to 5 minutes depending upon your internet connection and computer. Note that typing install.packages(“biostrings”) in RStudio might not result into success because of version issues. As usual use library(Biostrings) command to load the package.

Deoxyribonucleic acid, or DNA, stores biological information which is expressed in form of RNA inside the cells. The two antiparallel DNA strands are termed polynucleotides and they are composed of simpler monomer units called nucleotides. Each nucleotide is composed of one of four nitrogen-containing nucleobases—either cytosine (C), guanine (G), adenine (A), or thymine (T). The randomly arranged nucleotide finally write up the code of life in the form of strict vocabulary.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Using the following command, create a list of DNA exactly the same as the one given below, then list the outcome by writing the vector myDNA.
myDNA <- c("ATGTTGCATTCATTAATTAAGAACGACCCAATACA")
myDNA

Exercise 2
In recent years there has been exponential advancement in the field of DNA sequencing. High throughput methods have been developed to speed up the sequencing project. Still the basic DNA sequencing is a technology that reads each nucleotide step-by-step by chemical methods to decipher what order of letters A, T, G and C were placed that resulted into the specific DNA sequence. However, for this problem we have given you a short result from sequencing data. In a typical laboratory sequencing results, some of the DNA sequence could look something like this “CTGATTT-GATGGTC-NAT” where apart from ATGC we could see some dashes (skipped) and N (unknown) nucleotide. Copy these into mySequencing and print the result. Use the length() command to find out the length. Which of the following is the length

A [1] 1
B [2] 19
C [1] 0

Exercise 3
Now use the package specific command called DNAString to copy the seq

myDNASeq <- DNAString("CTGATTT-GATGGTC-NAT")
Now use length command to find the length of the myDNASeq. Compare the difference between mySequencing and myDNASeq .

Exercise 4
Use the class() command to analyze the datatype of the myDNA and myDNASeq. Which of the following represents the result

A
class(myDNASeq) [1] "DNAString" attr(,"package") [1] "Biostrings" > class(myDNA) [1] "integr"
B
> class(myDNASeq) [1] "DNAString" attr(,"package") [1] "Biostrings" > class(myDNA) [1] "character"

C None of the above
D Both

Exercise 5
In this problem let us attempt to paste the myDNA and myDNASeq sequences using paste and analyse the results. Try to understand the syntax of paste command using ?paste command. Copy the result into pastedDNA and print it by typing pastedDNA .
Try to analyze the pastedDNA using class command.

Exercise 6
Did you notice that final result was a character class and not the Biostring as expected? This would make pastedDNA not usable for biostring for any purpose. What happens is that Biostrings introduces a new data structure hierarchy which is different than the vector datatype of R. It has few sequence constructors such as DNAString(), RNAString(), AAString() for type of biomolecules (DNA, RNA and Amino Acids). This results into the fact that Biostrings objects cannot be compared with R strings ( myDNA == myDNASeq is an invalid command). You will always end up with FALSE upon attempting to compare.
Next we would look at some basic transformations in Biostrings that can be implemented on DNA data. These transformations are reverse() , complement() , reverseComplement() and translate() . Run the myDNASeq <- DNAString("CTGATTT-GATGGTC-NAT") again. Run the complement(myDNASeq) . Analyze the data.

Exercise 7
Did you find out what as happened in previous problem? This just created the complement of each nucleotide. The DNA usually exists as a double helix with both strands running antiparallel. Each base of ATGC is paired with some base on complementary strand. A has preference for T (and vise versa), G has preference for C (and vise versa). So the complementary strand is going to have the nucleotide that is pairs preferentially. However, the unknown nucleotide N gets written as N because the sequencer could not tell what it was.
In next problem run reverse(myDNASeq)

Exercise 8
Did you notice what has happened? Did nucletide in DNA sequences got read from back to front and not front to back? Or not. Each complementary strand is usually written from back to front because the complementary strands are anti-parallel. This because the sign of each strand is opposite. This is basically because the strand which basically is a polymer of nucleotide either has 5’ or 3’ end (more about it later). So the +ve strand runs from 5’-3’ and -ve strand runs from 3’-5’. In the next problem run reverseComplement(myDNASeq)

Exercise 9
Did you notice what happened. The DNA sequences not only got complementary but also reversed. This is how double stranded DNA exists in nature.
Run alphabetFrequency(myDNASeq)

Exercise 10
The values in problem 9 gave you the frequency of occurrence of each nucleotide in this DNA. This is an important thing to know when analyzing the DNA.
Finally run translate(myDNASeq) . This would yield the hypothetical protein sequence that myDNASeq would produce. Afterall that is one of the important role of DNA, to code for the protein. What did you get?

In our next exercises we will work little more with Biostrings to analyze the DNA at little more. Happy learning.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.