Updating the good old Homologene database
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Homologene
I am the author of a small package called homologene. The package wraps the identically named homologene database released by NCBI. This database includes information about genes that are homologues/orthologues of each other. It is structured as a simple table. One can simply access it by doing
library(readr) library(magrittr) library(dplyr) dir.create('homologene-files',showWarnings = FALSE) download.file(url = "ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/homologene.data", destfile = 'homologene-files/homologene.data') homologene = read_tsv('homologene-files/homologene.data', col_names = c('HID', 'Taxonomy', 'Gene.ID', 'Gene.Symbol', 'Protein.GI', 'Protein.Accession')) homologene ## # A tibble: 275,237 x 6 ## HID Taxonomy Gene.ID Gene.Symbol Protein.GI Protein.Accession ## <dbl> <dbl> <dbl> <chr> <dbl> <chr> ## 1 3 9606 34 ACADM 4557231 NP_000007.1 ## 2 3 9598 469356 ACADM 160961497 NP_001104286.1 ## 3 3 9544 705168 ACADM 109008502 XP_001101274.1 ## 4 3 9615 490207 ACADM 545503811 XP_005622188.1 ## 5 3 9913 505968 ACADM 115497690 NP_001068703.1 ## 6 3 10090 11364 Acadm 6680618 NP_031408.1 ## 7 3 10116 24158 Acadm 292494885 NP_058682.2 ## 8 3 7955 406283 acadm 390190229 NP_998175.2 ## 9 3 7227 38864 CG12262 24660351 NP_648149.1 ## 10 3 7165 1276346 AgaP_AGAP005662 58387602 XP_315683.2 ## # ... with 275,227 more rows
In the homologene package, a subset of this data is available for querying.
head(homologene::homologeneData) ## HID Gene.Symbol Taxonomy Gene.ID ## 1 3 acdh-8 6239 173979 ## 2 3 acdh-7 6239 181758 ## 3 5 acdh-12 6239 174180 ## 4 6 kat-1 6239 174161 ## 5 6 T02G5.4 6239 187992 ## 6 13 R04B3.2 6239 180550
At the time I removed the Protein.GI
and Protein.Accession
since my work had no use
for them and limited the information to 7 popular model organisms to so that package
wouldn’t be too bloated.
The package includes functions to translate orthologues between species with a one liner
homologene::homologene(c('ENO2','MOG','GZMH'), inTax = 9606, # human outTax = 10090) # mouse ## 9606 10090 9606_ID 10090_ID ## 1 ENO2 Eno2 2026 13807 ## 2 MOG Mog 4340 17441 ## 3 GZMH Gzmd 2999 14941 ## 4 GZMH Gzme 2999 14942 ## 5 GZMH Gzmg 2999 14944 ## 6 GZMH Gzmf 2999 14943
In general I like using this method because it’s fast, the syntax is simple and it’s easy to extend it at a whim. A single set of gene symbols and IDs are used to identify genes and organism level data is tied to taxonomy identifiers which makes things easier. The obvious alternative to this would be to use biomart
1 which to me, is less intuitive and clunky.
What is wrong with homologene?
Not everyone is happy with using homologene for orthology detection. The database has not been updated since 2014. This means it is not created using the latest annotations. The reference genomes have been getting updates and gene symbols change frequently. For example here you can see a list of gene name changes that happened in a 15 day period at February 9th 2019.
That list above is generated by parsing gene_info.gz file provided by NCBI. I use this file to keep an up to date list of gene symbols and their synonyms. It comes in handy when I use data released by other researchers that only include gene symbols as identifiers. Using the data here, we can identify how many genes in homologene have changed their names, assuming the NCBI ids are the same. I will be using the data in the form it is made available in the geneSyonym package.
library(geneSynonym) library(homologene) library(purrr) # get mouse gene IDs from the mouse synonym data shipped with geneSynonym gene_IDs = names(syno10090) # get mouse genes from the homologene data homologene::homologeneData %>% filter(Gene.ID %in% gene_IDs) -> mouse_genes # get fisrt names (official symbol) from the mouse synonym data syno10090[as.character(mouse_genes$Gene.ID)] %>% strsplit('\\|') %>% map_chr(1) -> first_names # set new names as a column mouse_genes$new_names = first_names mistmatched_new_names = mouse_genes %>% filter(new_names != Gene.Symbol) # how many mismatches nrow(mistmatched_new_names) ## [1] 1418 # take a peek at the first 10 head(mistmatched_new_names, 10) ## HID Gene.Symbol Taxonomy Gene.ID new_names ## 1 93 Epb4.2 10090 13828 Epb42 ## 2 479 Daf2 10090 13137 Cd55b ## 3 660 Gm3934 10090 100042625 Gstp-ps ## 4 664 Gucy1b3 10090 54195 Gucy1b1 ## 5 984 Clca3 10090 23844 Clca1 ## 6 1137 Gbas 10090 14467 Nipsnap2 ## 7 1185 Ict1 10090 68572 Mrpl58 ## 8 1189 I830012O16Rik 10090 667370 Ifit3b ## 9 1194 Cyr61 10090 16007 Ccn1 ## 10 1223 Adrbk1 10090 110355 Grk2
That is 1418 genes that have a different official symbol than they used to in 2014. And that is only in mouse. Considering this only corresponds to 6.7% of all gene symbols, this number isn’t too much but it does mean that we will be missing a few genes if we were using gene symbols for matching. Alas, since I wasn’t wise enough 4 years ago when I started my projects, there are parts of my pipelines that still rely on gene symbol matching.
Most common transformation I make is translating my lists of brain cell type markers that were created using mouse data2 to their human orthologues. I can see which markers are lost due to differences in gene symbols when making this transformation.
# this package includes my marker gene list # github.com/pavlidisLab/markerGeneProfile library(markerGeneProfile) data("mouseMarkerGenesCombined") mouseMarkerGenesCombined %>% unlist %>% unique %>% {.[. %in% mistmatched_new_names$new_names]} ## [1] "Adgrg1" "Hacd2" "Msmo1" "Nat8f4" "Pdzph1" "Arhgap45" ## [7] "Cd300c2" "Fam91a1" "Hacd4" "Siglecf" "Washc2" "Adgre1" ## [13] "Stk26" "Vsir" "Mcub" "Plpp3" "Patj" "Scn2a" ## [19] "Diaph2" "Epb41l2" "Grk2" "Nat8f5" "Naprt" "Nat8f1" ## [25] "Adgre5" "Adgrf5" "Adgrl4" "Map3k20" "Rflnb" "Tns2" ## [31] "Nectin3" "Adgre4" "Cntrl" "Hcar2" "Rp2" "Spata46" ## [37] "Cnmd" "Plpp2" "Ankub1" "Mfsd13b" "Phf24" "Cbarp" ## [43] "Cops9" "Ints6l" "Epb41l1" "Adgrb1"
That is 46 genes which corresponds to a 1.8%. This is probably a more realistic image of real world implications since not all genes are equally important. Many genes that are prone to change are pseudogenes, non-coding transcripts etc. Such genes are less likely to come up in an analysis.
Updating homologene gene IDs
One small thing we can do to improve this situation is to manually replace the old gene symbols with new ones. Assuming the gene IDs are the same, we could just replace the old names with the new. However, gene IDs are also prone to changes3 and there is a whole different file that keeps track of these changes. We need to look into that file and update the gene IDs too.
download.file(url = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_history.gz", destfile = 'homologene-files/gene_history.gz') gene_history = read_tsv('homologene-files/gene_history.gz', col_names = c('tax_id', 'GeneID', 'Discontinued_GeneID', 'Discontinued_Symbol', 'Discontinue_Date')) gene_history ## # A tibble: 11,165,348 x 5 ## tax_id GeneID Discontinued_GeneID Discontinued_Symbol Discontinue_Date ## <dbl> <chr> <dbl> <chr> <dbl> ## 1 9 - 1246494 repA1 20031113 ## 2 9 - 1246495 repA2 20031113 ## 3 9 - 1246496 leuA 20031113 ## 4 9 - 1246497 leuB 20031113 ## 5 9 - 1246498 leuC 20031113 ## 6 9 - 1246499 leuD 20031113 ## 7 9 - 1246506 yqhA 20031113 ## 8 9 - 1246507 repA2 20031113 ## 9 9 - 1246508 repA1 20031113 ## 10 24 - 24952829 jef 20150716 ## # ... with 11,165,338 more rows
One unverified assumption I am making is that once an ID is discontinued, it won’t be used for another gene. Let’s make sure that is actually the case before doing anything else
# combine all the modern IDs from the geneSynonym package (same species as the homologene package) modern_IDs = list(syno10090, syno10116, syno6239, syno7227, syno7955, syno9544, syno9606) %>% lapply(names) %>% do.call(c,.) gene_history %>% filter(Discontinued_GeneID %in% modern_IDs) ## # A tibble: 0 x 5 ## # ... with 5 variables: tax_id <dbl>, GeneID <chr>, ## # Discontinued_GeneID <dbl>, Discontinued_Symbol <chr>, ## # Discontinue_Date <dbl>
No current ID was ever discontinued. That’s a good sign. I can just trace
the history of discontinued IDs and find the current IDs. Note that when I first tried
this here were 20 IDs here. 20 IDs that were discontinued since I last updated the
geneSynonym
package. Since all of these files area updated nightly it is important
to update the databases at the same time.
discontinued_ids = homologeneData %>% filter(Gene.ID %in% gene_history$Discontinued_GeneID) dim(discontinued_ids) ## [1] 3169 4 unchanged_ids = homologeneData %>% filter(!Gene.ID %in% gene_history$Discontinued_GeneID)
It’s good to see that only a relatively small number of IDs were discontinued. We
can trace their history to get their current IDs (if it exists). To make this process
faster, we will also take a relevant subset of the gene_history
frame.
# get the earliest date where we see one of our ID change events from homologene # since homologene was updated in 2014, this should be no earlier that that date earlierst_date = gene_history %>% filter(Discontinued_GeneID %in% homologeneData$Gene.ID) %$% Discontinue_Date %>% min earlierst_date ## [1] 20100919
This is interesting. Why does homologene includes gene IDs that were discontinued 4 years before it’s creation? How many such IDs are out there
# get all discontinuation events earlier than 2014. Events are coded as integers # in YYYYMMDD format gene_history %>% filter(Discontinue_Date < 20140000 & Discontinued_GeneID %in% homologeneData$Gene.ID) ## # A tibble: 9 x 5 ## tax_id GeneID Discontinued_GeneID Discontinued_Symbol Discontinue_Date ## <dbl> <chr> <dbl> <chr> <dbl> ## 1 9544 700788 702240 DOCK1 20100919 ## 2 9544 - 707717 LOC707717 20110304 ## 3 9544 - 708767 TNNI3K 20110308 ## 4 9544 100423619 711097 MAP4 20130430 ## 5 9544 - 714619 LOC714619 20110326 ## 6 9544 - 714812 MATR3 20110401 ## 7 9544 - 715750 LOC715750 20110325 ## 8 9544 - 720334 MEF2B 20110329 ## 9 9544 712686 100424395 LOC100424395 20100928
Ok it’s not so bad. Not sure what’s the reason behind this but it doesn’t seem like a catastrophic failure.
So now we can filter the gene_history
to only include species that we are interested
in and events that happened after the earliest discontinuation date.
relevant_gene_history = gene_history %>% filter(Discontinue_Date >= earlierst_date & tax_id %in% homologene::taxData$tax_id) dim(relevant_gene_history) ## [1] 96776 5
That’s a much smaller search space. Now we can just trace the history of every single
gene with a different ID than before. I will probably end up including a version of this function to the
geneSynonym
package as pretty much all gene lists are at least a little out of date.
traceID = function(id){ print(id) event = relevant_gene_history %>% filter(Discontinued_GeneID == id) if(nrow(event)>1){ # just in case. if the same ID is discontinued twice, there is a problem... return("multiple events") } while(TRUE){ # see if the new ID is discontinued as well next_event = relevant_gene_history %>% filter(Discontinued_GeneID == event$GeneID) if(nrow(next_event)==0){ # if not, previous ID is the right one return(event$GeneID) } else if(nrow(next_event)>1){ # just in case, if the same ID is discontinued twice, there is a problem... return("multiple events") } else if(nrow(next_event) == 1){ # if the new IDs is discontinued, continue the loop and check if it has a parent event = next_event } } } discontinued_ids$Gene.ID %>% sapply(traceID) -> new_ids
Not the most efficient code probably but it’ll be run as a CRON job in the office machine at around 6 am on Sundays so I don’t really care that much about efficiency.
Lets do a few sanity checks
# are there any ids that had multiple events any(new_ids == "multiple events") ## [1] FALSE # are there any new ids that the geneSynonym package doesn't know about? all(new_ids[new_ids != '-'] %in% modern_IDs) ## [1] TRUE # how many of the discontinued IDs end up having new IDs instead of getting removed length(new_ids[new_ids != '-']) ## [1] 1401 # how many IDs are lost forever length(new_ids[new_ids == '-']) ## [1] 1768
Good.
Updating homologene gene symbols
Since we have the new ids now, we can start building our new homologeneData2
frame
that will be added to the homologene
package. I don’t want to overwrite the original data
because people might use it expecting a perfect match with the NCBI database. Also if
I messed up here I don’t want people to suffer without explicitly choosing to trust me
instead of the actual database.
# create a frame with new ids discontinued_fix = data.frame(HID = discontinued_ids$HID, Gene.Symbol = discontinued_ids$Gene.Symbol, Taxonomy = discontinued_ids$Taxonomy, Gene.ID = new_ids, stringsAsFactors = FALSE) # remove symbols that are discontinued discontinued_fix %<>% filter(Gene.ID != '-') head(discontinued_fix) ## HID Gene.Symbol Taxonomy Gene.ID ## 1 194 LOC100538014 7955 445292 ## 2 402 LOC100535497 7955 550518 ## 3 522 ltbp1 7955 562072 ## 4 592 LOC563696 7955 556619 ## 5 595 LOC101882975 7955 561906 ## 6 639 LOC100537604 7955 557240
Before going further let’s make sure we aren’t doing anything crazy by peeking at the NCBI website for the id of Itbp1 gene we see there.
All looks fine, so we can put everything together and replace the gene symbols as well
# recombine the two lists sort by homology group homologeneData2 = rbind(discontinued_fix,unchanged_ids) %>% arrange(HID) # change the names with the new names modern_symbols = list(syno10090, syno10116, syno6239, syno7227, syno7955, syno9544, syno9606) %>% lapply(function(x){ strsplit(x,split = "\\|") %>% map_chr(1) }) %>% do.call(c,.) modern_frame = tibble(modern_IDs, modern_symbols) new_symbols = modern_frame$modern_symbols[match(homologeneData2$Gene.ID, modern_frame$modern_IDs)] # a sanity check. all genes with different symbols sum(homologeneData2$Gene.Symbol != new_symbols) ## [1] 14964 # another sanity check, differences in mouse. # we did this in the beginningas well # note that the number is higher than before # since the new IDs allow more genes to be changed sum((homologeneData2$Gene.Symbol != new_symbols)[homologeneData2$Taxonomy ==10090]) ## [1] 1443 homologeneData2 %<>% mutate(Gene.Symbol = modern_frame$modern_symbols[match(Gene.ID,modern_frame$modern_IDs)]) head(homologeneData2) ## HID Gene.Symbol Taxonomy Gene.ID ## 1 3 acdh-8 6239 173979 ## 2 3 acdh-7 6239 181758 ## 3 3 CG12262 7227 38864 ## 4 3 acadm 7955 406283 ## 5 3 ACADM 9544 705168 ## 6 3 ACADM 9606 34
The new homologene dataset and associated functions to use them should be on Github in a few days. Then I will set up a CRON job to update it weekly. The CRAN version will have to be perpetually out of date which is somewhat problematic, but it can’t be helped. I may include a function that builds the current version and let users specify the version they want to use. The new syntax will probably look like this:
# translate using the old database homologene::homologene(c('ENO2','MOG','GZMH'), inTax = 9606, outTax = 10090) # translate using the new database homologene::homologene(c('ENO2','MOG','GZMH'), inTax = 9606, outTax = 10090, db = homologene2) # translate using the latest database homologene::update_homologene(path) homologene::homologene(c('ENO2','MOG','GZMH'), inTax = 9606, outTax = 10090, db = path)
Before the next CRAN update I want to include a wrapper for biomaRt that uses the same syntax as homologene to make queries in ensembl so that’ll likely be the next post.
Update:
The features are implemented into the homologene package. The syntax is
homologene(c('Mesd', 'Trp53rka', 'Cstdc4', 'Ifit3b'), inTax = 10090, outTax = 9606, db = homologeneData2)
Also all species in homologene database are in the package now.
Bibliography
1 Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A et al. BioMart and bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics 2005; 21: 3439–3440.
2 Mancarci BO, Toker L, Tripathy SJ, Li B, Rocco B, Sibille E et al. Cross-laboratory analysis of brain cell type transcriptomes with applications to interpretation of bulk tissue data. eNeuro 2017;: ENEURO–0212.
3 Information NC for B, Pike USNL of M8R, MD B, Usa 2. Gene Frequently Asked Questions. National Center for Biotechnology Information (US), 2018.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.