BLATting the internet: the most frequent gene?
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds.
Laura DeMare asks: what was the most-hit gene?
Most hit gene? APOE? MT @GenomeBrowser We BLATed the Internet! DNA sequences from 40 billion webpages mapped to hg19 goo.gl/7T2d5w—
Laura DeMare (@ldemare) January 23, 2014
First, visit the UCSC Table Browser. For the human genome hg19 build, the relevant group is “Phenotype and Literature”, the track is “Web Sequences” and the table is pubsBingBlat. Check the “genome” button under regions, enter a filename (I chose bingblat.gz, with gzip compression) and then click “get output”.
(notes: there are other “Bing” tables – but pubsBingBlat contains gene symbols, so seems easiest to work with in the first instance. The following analysis could also be done using the UCSC MySQL server.)
Open the file in R. It contains 313 510 rows. Gene symbols are in the last column, “locus”, which contains either one symbol or two separated by a comma.
bingblat <- read.table("bingblat.gz", header = T, sep = "\t", stringsAsFactors = F, comment.char = "", quote = "") dim(bingblat) # [1] 313510 27 names(bingblat) # [1] "X.bin" "chrom" "chromStart" "chromEnd" "name" # [6] "score" "strand" "thickStart" "thickEnd" "reserved" # [11] "blockCount" "blockSizes" "chromStarts" "tSeqTypes" "seqIds" # [16] "seqRanges" "publisher" "pmid" "doi" "issn" # [21] "journal" "title" "firstAuthor" "year" "impact" # [26] "classes" "locus" head(bingblat$locus) [1] "WASH2P,WASH7P" "WASH7P" "OR4F5" "OR4F5" [5] "OR4F5" "OR4F5"
If we split locus on comma, then unlist, we’ll get a vector of gene symbols and we can count them up using table. The top 10:
genes <- unlist(strsplit(bingblat$locus, ",")) genes.df <- as.data.frame(table(genes)) head(genes.df[order(genes.df$Freq, decreasing=T),], 10) # genes Freq # 5308 GAPDH 11096 # 8886 MIR4436A 4341 # 6628 IGHG1 4245 # 186 ACTB 3280 # 242 ADAM6 2466 # 7172 KIAA0125 2135 # 7704 LINC00226 2082 # 4155 ELK2AP 1714 # 15728 TP53 1398 # 6043 HBB 1380
GAPDH, by a long way. You’d guess that this is related to the frequent use of GAPDH as a “housekeeping gene” in assays for differential expression.
Here’s the full list with counts, as a public (I hope) TSV file at Dropbox.
Filed under: bioinformatics, genomics, R, statistics Tagged: blat, microsoft research, ucsc
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.