[This article was first published on bRogramming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
For a class I’m taking this semester on genomics we’re dealing with some pretty large data and for this reason we’re learning to use mySQL. I decided to be a geek and do the assignments in R as well to demonstrate the ability of R to handle pretty large data sets quickly. Here’s our first bit of work in mySQL, solved in R:< o:p>
BIOL 525 Lecture 3: In class work
Download, create, and import the following tables either from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ or
From the Downloads folder in the Resources section of the wiki:
1) GENCODE gene table: wgEncodeGencodeCompV12.sql, wgEncodeGencodeCompV12txt.gz (or wgEncodeGencodeCompV12txt)
2) GENCODE gene attributes table: wgEncodeGencodeAttrsV12.sql, wgEncodeGencodeAttrsV12.txt.gz (or wgEncodeGencodeAttrsV12.txt)
I decided to download the .txt files from the class website, save them in my designated folder for this class and then infile them to R using the
read.table()
command. This is the simplest way of reading data into R. Note the argument sep='\t'
in the read.table()
command. This lets R know that the file is a tab-delimited file and allows it to be read in correctly. After reading in the data I manually named the columns using the information given in the .sql files on the class webpage.attributes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeAttrsV12.txt", sep = "\t") mynames <- c("geneId", "geneName", "geneType", "geneStatus", "transcriptId", "transcriptName", "transcriptType", "transcriptStatus", "havanaGeneId", "havanaTranscriptId", "ccdsId", "level", "transcriptClass") names(attributes) <- mynames genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", sep = "\t") mynames1 <- c("bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2", "cdsStartStat", "cdsEndStat", "exonFrames") names(genes) <- mynames1
Answer the following questions:
1) How many entries are in the Gencode gene table?
dim(genes)[1] ## [1] 167536
2) How many entries are in the genomic interval chr17:40,830,967-41,642,846.
nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846)) ## [1] 142
3) How many of these entries are for genes on the negative strand.
nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & strand == "-")) ## [1] 90
4) How many of these negative strand genes have more than 15 exons.
nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & strand == "-" & exonCount > 15)) ## [1] 23
5) How many uniquely named genes (name2 field) are on the negative strand and have more than 15 exons. List them.
length(unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & strand == "-" & exonCount > 15)$name2)) ## [1] 3
6) How many entries (transcripts) are there for BRCA1.
nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & strand == "-" & exonCount > 15 & name2 == "BRCA1")) ## [1] 15
** Bonus **
7) The name2 field in the wgEncodeGencodeCompV7 and the geneName field in wgEncodeGencodeAttrsV7 tables are linked. For the three genes in #5, determine all possible geneStatus from the wgEncodeGencodeAttrsV7 table.
Hint: You can ORDER BY multiple fields.
my3genes <- unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & strand == "-" & exonCount > 15)$name2) unique(subset(attributes, geneName %in% my3genes)$geneStatus) ## [1] KNOWN ## Levels: KNOWN NOVEL PUTATIVE
To leave a comment for the author, please follow the link and comment on their blog: bRogramming.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.