BLATting the internet: the most frequent gene?

nsaunders

8 years ago

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds.

Laura DeMare asks: what was the most-hit gene?

Most hit gene? APOE? MT @GenomeBrowser We BLATed the Internet! DNA sequences from 40 billion webpages mapped to hg19 goo.gl/7T2d5w—
Laura DeMare (@ldemare) January 23, 2014

First, visit the UCSC Table Browser. For the human genome hg19 build, the relevant group is “Phenotype and Literature”, the track is “Web Sequences” and the table is pubsBingBlat. Check the “genome” button under regions, enter a filename (I chose bingblat.gz, with gzip compression) and then click “get output”.

(notes: there are other “Bing” tables – but pubsBingBlat contains gene symbols, so seems easiest to work with in the first instance. The following analysis could also be done using the UCSC MySQL server.)

Open the file in R. It contains 313 510 rows. Gene symbols are in the last column, “locus”, which contains either one symbol or two separated by a comma.

bingblat <- read.table("bingblat.gz", header = T, sep = "\t",
                       stringsAsFactors = F, comment.char = "", quote = "")
dim(bingblat)
# [1] 313510     27

names(bingblat)
#  [1] "X.bin"       "chrom"       "chromStart"  "chromEnd"    "name"       
#  [6] "score"       "strand"      "thickStart"  "thickEnd"    "reserved"   
# [11] "blockCount"  "blockSizes"  "chromStarts" "tSeqTypes"   "seqIds"     
# [16] "seqRanges"   "publisher"   "pmid"        "doi"         "issn"       
# [21] "journal"     "title"       "firstAuthor" "year"        "impact"     
# [26] "classes"     "locus"

head(bingblat$locus)
[1] "WASH2P,WASH7P" "WASH7P"        "OR4F5"         "OR4F5"        
[5] "OR4F5"         "OR4F5"

If we split locus on comma, then unlist, we’ll get a vector of gene symbols and we can count them up using table. The top 10:

genes <- unlist(strsplit(bingblat$locus, ","))
genes.df <- as.data.frame(table(genes))
head(genes.df[order(genes.df$Freq, decreasing=T),], 10)

#           genes  Freq
# 5308      GAPDH 11096
# 8886   MIR4436A  4341
# 6628      IGHG1  4245
# 186        ACTB  3280
# 242       ADAM6  2466
# 7172   KIAA0125  2135
# 7704  LINC00226  2082
# 4155     ELK2AP  1714
# 15728      TP53  1398
# 6043        HBB  1380

GAPDH, by a long way. You’d guess that this is related to the frequent use of GAPDH as a “housekeeping gene” in assays for differential expression.

Here’s the full list with counts, as a public (I hope) TSV file at Dropbox.

Filed under: bioinformatics, genomics, R, statistics Tagged: blat, microsoft research, ucsc

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.