R vs. Perl/mySQL – an applied genomics showdown
[This article was first published on bRogramming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
R vs. Perl/mySQL – an applied genomics showdown
Recently I was given an assignment for a class I’m taking that got me thinking about speed in R. This isn’t something I’m usually concerned with, but the first time I tried to run my solution (ussing plyr’s
ddply()
it was going to take all night to compute. I consulted the professor that taught me to use R and he said that if I could boil the computationally intensive portion of the program down to an apply()
command I could take advantage of the vectorized nature of R and greatly improve my speed. I constructed my solution around this methodology and had great success. The assignment as it was given is as follows:Match Regulatory Elements in the Genome to the Closest Gene
Genomic genius Ginny Genes has performed a gigantic genome-wide experiment to find regulatory elements in human embryonic stem cells. She wants to know what are the possible target genes of these regulatory elements. She has the locations of the regulatory elements in a bed-formatted, text, tab-delimited file with the following fields:
1) Chromosome
2) Start position in chromosome
3) End position in chromosome
4) Randomly assigned name
Your assignment is to help Ginny by finding for each regulatory element both the closest gene upstream and the closest gene downstream of the regulatory element.
The input file “RegulatoryElements.bed” contains the list of regulatory positions. You should open this file, read each line one at a time, search for the gene upstream and downstream closest based on the transcription start site, and print out the following information for each on a single line:
1) Chromosome of gene
2) Start position of gene
3) End position of gene
4) Name of gene (common name – name2 field)
5) Genomic strand (+ or -)
6) Name of corresponding regulatory element
My solution:
First I need to read the files described above into R. This is simple enough, but for added speed I use the
colClasses =
argument in the read.table()
function. Basically this allows me to pre-specify what class each column of the file should be read in as. This saves some time because normally R decides this for you and if you’re reading in a lot of data, this decision requires a lot of information and time. For example, if I have a file with a million rows of data, and all of the entries in column 1 are numbers except the last one, which is a character string, then R must check each entry in this column before eventually assigning it to be of class factor
. This is time consuming. Also, R by default converts any column with even one string in it as a factor, which requires some conversion. I the example I just gave of 999,999,999 numbers and 1 string, if all 999,999,999 numbers were unique, I would get a factor with a million levels. This would take some time to create. Notice below that I avoid creating factor variables except in particularly simple to convert cases where there are only a few possible levels. Reading in things like Gene Name ascharacter
columns instead of factors saves some time.# Clear memory rm(list = ls()) # Load in the file containing Regulatory Element data mydat <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/RegulatoryElements.bed", header = F, sep = "\t", colClasses = c("factor", "numeric", "numeric", "character")) names(mydat) <- c("chrom", "txStart", "txEnd", "name") # Load in Genes data genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", sep = "\t", header = F, colClasses = c("integer", "character", "factor", "factor", rep("numeric", 4), "integer", "character", "character", "integer", "character", "factor", "factor", "character")) names(genes) <- c("bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2", "cdsStartStat", "cdsEndStat", "exonFrames")
These are pretty big data frames, particularly the
genes
data frame:dim(mydat) ## [1] 129061 4 dim(genes) ## [1] 167536 16
Lets actually compare the time it takes to read the data with and without specifying my
colClasses
up front. This is easily done by wrapping the command you wish to time (in this case my call to read.table()
) inside the function system.time()
. I'll compare for just the large genes
data frame.# Time without specifying colClasses system.time(genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", sep = "\t", header = F)) ## user system elapsed ## 27.19 0.11 27.41 # Time with colClasses specified system.time(genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", sep = "\t", header = F, colClasses = c("integer", "character", "factor", "factor", rep("numeric", 4), "integer", "character", "character", "integer", "character", "factor", "factor", "character"))) ## user system elapsed ## 8.47 0.05 8.54
That's a pretty big difference!
A few more words about the
system.time()
function: This function doesn't just return a summary of the time ellapsed, it also actually executes the command wrapped inside. So after running either of the commands above I get a summary of the time ellapsed in my output and the object genes
now exists in my workspace.Now that I have my data loaded in, it's time to start mathcing regulatory elements in my data frame
mydat
to nearby genes in my huge data frame named genes
.Since it is my goal to eventually use an
apply()
function to accomplish this, and apply()
functions are most efficient when each iteration only produces one number like an index or subscript, I will create an index variable in thegenes
data frame that I will be searching through to find the nearest gene.# Create and ID variable for the genes dataframe genes$ID <- 1:nrow(genes)
Next, lets think about what we are actually going to do. For each regulatory element in
mydat
(one is described in each row), I want to find the closest gene, both upstream and downstream. For a particular regulatory element, I can immediately eliminate from my search all of the genes that are on a different chromosome. I anticipate that this will save me lots of search time by narrowing down where I'm searching. But I don't want to subset my enormousgenes
dataframe for each iteration of my solution function that I will be apply()
ing to mydat
. So here I call on a cool function split()
to break genes
into seperate data frames (and deposit them into a list) one time instead of subsetting genes
over and over again. To get a better idea of what split()
does, let's look at a simple example.First create an example data frame:
ex <- data.frame(chrom = rep(1:2, each = 13), txStart = letters) ex ## chrom txStart ## 1 1 a ## 2 1 b ## 3 1 c ## 4 1 d ## 5 1 e ## 6 1 f ## 7 1 g ## 8 1 h ## 9 1 i ## 10 1 j ## 11 1 k ## 12 1 l ## 13 1 m ## 14 2 n ## 15 2 o ## 16 2 p ## 17 2 q ## 18 2 r ## 19 2 s ## 20 2 t ## 21 2 u ## 22 2 v ## 23 2 w ## 24 2 x ## 25 2 y ## 26 2 z
Now
split()
it by the column chrom
:split(ex, ex$chrom) ## $`1` ## chrom txStart ## 1 1 a ## 2 1 b ## 3 1 c ## 4 1 d ## 5 1 e ## 6 1 f ## 7 1 g ## 8 1 h ## 9 1 i ## 10 1 j ## 11 1 k ## 12 1 l ## 13 1 m ## ## $`2` ## chrom txStart ## 14 2 n ## 15 2 o ## 16 2 p ## 17 2 q ## 18 2 r ## 19 2 s ## 20 2 t ## 21 2 u ## 22 2 v ## 23 2 w ## 24 2 x ## 25 2 y ## 26 2 z
This is a list of length 2. Each element in the list is a data frame. Notice the names of the list elements:
names(split(ex, ex$chrom)) ## [1] "1" "2"
The list element named
"1"
contains all of the rows from the original data frame (ex
) that had the value “1” in the chrom
column.Now lets diverge again from the programming and talk about the science of what we're trying to do. We are trying to find the gene closest to each transcription factor because that is the gene that the transcription factor most likely acts on. Transcription happens directionally. It always happens from the 5' end to the 3' end of the DNA strand. This would be left to right on the positive strand and right to left on the negative strand - both strands are numbered left to right, even the negative strand. How do I know this?
sum(genes$txEnd < genes$txStart) ## [1] 0
None of the
txEnd
(locant of where the gene ends) values are less than the txStart
(where the gene begins) values. If the locants were numbered in the order of the way they were transcribed, all of the txStart
sites on the negative strand would be less than the txEnd
sites on the lagging (-) strand.It only makes sense that a regulatory element would act on a gene downstream of it, so on the positive strand, we should look for the nearest
txStart
site and on the negative strand we should look for the nearest txEnd
site. This is actually the transcription start site. So if we need to search different ways depending on which strand we are on, we can further subdivide the search task. Now I want to split()
my genes
data frame by chromosome and strand. How might I go about split()
ing by 2 variables?The easiest way would be to create a new variable that is a combination of
chrom
and strand
, then split by that. An example:ex$strand <- rep(c("+", "-"), 13) ex$new_variable <- paste(ex$chrom, ex$strand) ex ## chrom txStart strand new_variable ## 1 1 a + 1 + ## 2 1 b - 1 - ## 3 1 c + 1 + ## 4 1 d - 1 - ## 5 1 e + 1 + ## 6 1 f - 1 - ## 7 1 g + 1 + ## 8 1 h - 1 - ## 9 1 i + 1 + ## 10 1 j - 1 - ## 11 1 k + 1 + ## 12 1 l - 1 - ## 13 1 m + 1 + ## 14 2 n - 2 - ## 15 2 o + 2 + ## 16 2 p - 2 - ## 17 2 q + 2 + ## 18 2 r - 2 - ## 19 2 s + 2 + ## 20 2 t - 2 - ## 21 2 u + 2 + ## 22 2 v - 2 - ## 23 2 w + 2 + ## 24 2 x - 2 - ## 25 2 y + 2 + ## 26 2 z - 2 - split(ex, ex$new_variable) ## $`1 -` ## chrom txStart strand new_variable ## 2 1 b - 1 - ## 4 1 d - 1 - ## 6 1 f - 1 - ## 8 1 h - 1 - ## 10 1 j - 1 - ## 12 1 l - 1 - ## ## $`1 +` ## chrom txStart strand new_variable ## 1 1 a + 1 + ## 3 1 c + 1 + ## 5 1 e + 1 + ## 7 1 g + 1 + ## 9 1 i + 1 + ## 11 1 k + 1 + ## 13 1 m + 1 + ## ## $`2 -` ## chrom txStart strand new_variable ## 14 2 n - 2 - ## 16 2 p - 2 - ## 18 2 r - 2 - ## 20 2 t - 2 - ## 22 2 v - 2 - ## 24 2 x - 2 - ## 26 2 z - 2 - ## ## $`2 +` ## chrom txStart strand new_variable ## 15 2 o + 2 + ## 17 2 q + 2 + ## 19 2 s + 2 + ## 21 2 u + 2 + ## 23 2 w + 2 + ## 25 2 y + 2 +
Notice again the names of this list:
names(split(ex, ex$new_variable)) ## [1] "1 -" "1 +" "2 -" "2 +"
This does the job nicely, but I can do this slightly more directly without cluttering up my workspace:
genes.list <- split(genes, paste(genes$chrom, genes$strand))
From this a list is created and each element of the list is a dataframe that is a subset of the genes dataframe. The first element is all those rows in genes which have chrom==1 and strand== -, named “chr1 -” and so on and so forth.
Next, I create 2 variables in
mydat
to pick out the appropriate element of genes.list
to search through.mydat$myvar1 <- paste(mydat$chrom, "+") mydat$myvar2 <- paste(mydat$chrom, "-")
Now each row of
mydat
is associated with 2 elements of genes.list
- the '+' and '-' data frames for the correct chromosome.Now it is time to create a function to return the ID number (which is really a row number in the original genes dataframe) of the closest gene on the positive strand. I'll create a seperate function to find the closest gene on the negative strand.
In order to
apply()
this function later, I must reference my columns by indices, not by name. For example if I wanted to apply()
a function that simply printed the chrom
for each row in mydat
, I would have to write print(x[1])
and I could not write print(x$chrom)
because x
is a vector of just one row, not a data frame with names.find.pos.indices <- function(x) { f1 <- x[[5]] f2 <- as.numeric(x[[2]]) # This is the value of the reg. element's txStart, converted to numeric (from integer) myindex1 <- genes.list[[f1]][which.min(abs(genes.list[[f1]]$txStart - f2)), "ID"] # This finds the row number within the correct subsetted dataframe that # contains the closest gene, then returns the value of ID in that row. }
Here lies the crux of this program.
apply()
functions in R take advantage of the vectorized nature of the language to do what would otherwise be done in loops in a much more efficient way. So I apply()
my function over each row of the data frame named mydat
and deposit the results into pos.indices
, a vector of integers.pos.indices <- apply(mydat, 1, find.pos.indices)
pos.indices
is just a list of rownumbers that correspond to closest genes, so I use this to attach the variables of interest in the genes
data frame to the rows of mydat
. For example if pos.indices were only the number 10, R would assign the 10th value of genes$txStart to mydat$PtxStart.mydat$PtxStart <- genes$txStart[pos.indices] mydat$PtxEnd <- genes$txEnd[pos.indices] mydat$Pname2 <- genes$name2[pos.indices] mydat$Pstrand <- genes$strand[pos.indices]
Now repeat the same procedure from above, but for the negative strand.
find.neg.indices <- function(x) { f1 <- x[[6]] f2 <- as.numeric(x[[3]]) myindex1 <- genes.list[[f1]][which.min(abs(genes.list[[f1]]$txStart - f2)), "ID"] } neg.indices <- apply(mydat, 1, find.neg.indices) mydat$NtxStart <- genes$txStart[neg.indices] mydat$NtxEnd <- genes$txEnd[neg.indices] mydat$Nname2 <- genes$name2[neg.indices] mydat$Nstrand <- genes$strand[neg.indices] # Get rid of the variables I created mydat$myvar1 <- NULL mydat$myvar2 <- NULL
Done! Now lets save the resulting data frame and see how long this entire process takes.
# Write the dataframe to a .csv file write.csv(mydat, file = "< Path/Filename.csv >")
If you want to clock the time of this program, save the code as a .R file (an R script), then enter the following command:
system.time(source("<path_to_R_script>"))
On my computer, the entire process from reading in the data to writing the resulting data frame to a .csv file took 58.42 seconds. A friend of mine used Perl to interface with a mySQL database that already had the data loaded into it and his program took over 700 seconds to accomplish the same task. This was the way we were supposed to complete the assignment for class, but instead of struggling with the unfamiliar Perl syntax I used my native language ® and had great success. Many of my classmates had run times of 7 or more hours. Needless to say, that is a far cry from under a minute.
One of the ideas of the assignment was to get us to interface with a database and I did not do this, but I was kind of proud of myself that I managed to do the entire assignment in base R with no packages.
To leave a comment for the author, please follow the link and comment on their blog: bRogramming.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.