taxadb: A High-Performance Local Taxonomic Database Interface
Dealing with taxonomic inconsistencies within and across datasets is a fundamental challenge of ecology and evolutionary biology. Accounting for species synonyms, taxa splitting, and unification is especially important as aggregation of data across time and different data sources becomes increasingly common. One potentially powerful approach for addressing these issues is to resolve scientific names to taxonomic identifiers that follow a consistent taxonomic concept. In such a workflow, data from one of the many taxonomic providers (e.g. Integrated Taxonomic Information System [1], Catalogue of Life [2], National Center for Biotechnology Information [3]) is integrated with biodiversity datasets to identify an accepted ID for each name. Multiple tools exist to facilitate this workflow, including R’s taxize package [4], which provides an interface to the APIs of taxonomic databases. However, because API queries are slow, limited in scope, and dependent on the current state of the database, it remains difficult to resolve names to a taxonomic authority in a quick, reproducible way. taxadb seeks to address these issues using a new approach for interfacing with taxonomic data via a local database of taxonomic providers.
The goal of this post is to illustrate the ease with which taxadb can be integrated into existing data munging workflows, as well as to give a taste of the variety of other exploratory questions that are facilitated by the database backend infrastructure.
Database backend
taxadb is built around a local database of taxonomic data from seven of the largest taxonomic providers. The tables of this database are standardized across providers and include information on accepted IDs, synonym mappings, and common names when available. The database is accessible by the user through a variety of database backends. Using a local database interface allows not only for quick queries to retrieve taxon IDs, but also for queries across the whole database. As taxonomic providers are constantly updating their data, databases will be time-stamped and archived, allowing users to select the desired release for reproducible results.
taxadb framework
taxadb has three main families of functions:
- queries that return vectors: get_ids() and its complement, get_names(),
- queries that filter the underlying taxonomic data frames: filter_name(), filter_rank(), filter_id(), and filter_common(),
- database functions: td_create(), td_connect(), and taxa_tbl().
Query functions will trigger the automatic one-time setup of the local database for the chosen provider, but setup can also be triggered manually with td_create() for one or all providers.
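As a rough orientation, the three families might be combined as in the sketch below. It is not taken from the package documentation: the wrapper name taxadb_tour() is hypothetical, the provider is passed as the second positional argument as in the get_ids() calls later in this post, and the exact arguments may differ between taxadb versions. The calls are wrapped in a function so that nothing is downloaded until it is called.

```r
# Assumes the taxadb package is installed. Nothing runs until taxadb_tour()
# is called; calling it sets up the provider's local database on first use.
# taxadb_tour() is a hypothetical wrapper, not part of taxadb.
taxadb_tour <- function(provider = "col") {
  # Database functions: optional manual setup for one provider
  taxadb::td_create(provider)

  # Vector queries: names to IDs, then IDs back to names
  ids <- taxadb::get_ids(c("Aburria pipile", "Pipile pipile"), provider)
  nms <- taxadb::get_names(ids, provider)

  # Filter queries: rows of the underlying taxonomic table for a name
  rows <- taxadb::filter_name("Aburria pipile", provider)

  list(ids = ids, names = nms, rows = rows)
}
```

Calling taxadb_tour() returns the three results in a list, which makes it easy to compare the vector output of get_ids() with the data frame returned by filter_name().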
taxadb workflow
taxadb is designed for relatively painless local database setup and easy integration of taxonomic IDs into existing workflows. Consider, for example, the common scenario of merging two datasets that each follow their own taxonomy, such as matching trait data to data on IUCN status. Here we use snippets of data from the Elton Traits v1.0 database [5] and the IUCN Red List [6]. The code below assumes that the tidyverse and taxadb packages are loaded.
status_data <- read_tsv(system.file("extdata", "status_data.tsv", package="taxadb"))
iucn_name | category |
---|---|
Pipile pipile | CR |
Pipile cumanensis | LC |
Pipile cujubi | LC |
Pipile jacutinga | EN |
Megapodius decollatus | LC |
Scleroptila gutturalis | LC |
Margaroperdix madagarensis | LC |
Falcipennis falcipennis | NT |
trait_data <- read_tsv(system.file("extdata", "trait_data.tsv", package="taxadb"))
elton_name | mass |
---|---|
Aburria pipile | 1816.59 |
Aburria cumanensis | 1239.22 |
Aburria cujubi | 1195.82 |
Aburria jacutinga | 1240.96 |
Megapodius reinwardt | 666.34 |
Francolinus levalliantoides | 376.69 |
Margaroperdix madagascariensis | 245.00 |
Catreus wallichii | 1436.88 |
Falcipennis falcipennis | 685.61 |
Falcipennis canadensis | 473.65 |
The common approach in this scenario is to simply join by scientific name:
joined <- full_join(trait_data, status_data, by = c("elton_name" = "iucn_name"))
elton_name | mass | category |
---|---|---|
Aburria pipile | 1816.59 | -- |
Aburria cumanensis | 1239.22 | -- |
Aburria cujubi | 1195.82 | -- |
Aburria jacutinga | 1240.96 | -- |
Megapodius reinwardt | 666.34 | -- |
Francolinus levalliantoides | 376.69 | -- |
Margaroperdix madagascariensis | 245.00 | -- |
Catreus wallichii | 1436.88 | -- |
Falcipennis falcipennis | 685.61 | NT |
Falcipennis canadensis | 473.65 | -- |
Pipile pipile | -- | CR |
Pipile cumanensis | -- | LC |
Pipile cujubi | -- | LC |
Pipile jacutinga | -- | EN |
Megapodius decollatus | -- | LC |
Scleroptila gutturalis | -- | LC |
Margaroperdix madagarensis | -- | LC |
This results in only one match between the two datasets, Falcipennis falcipennis. However, if we resolve names first to taxonomic identifiers, which account for synonyms and taxonomic changes, we see a different story.
First we get IDs for each dataset:
traits <- trait_data %>%
  mutate(id = get_ids(elton_name, "col"))
status <- status_data %>%
  mutate(id = get_ids(iucn_name, "col"))
And join on the ID:
joined <- full_join(traits, status, by = "id")
elton_name | iucn_name | mass | category | id |
---|---|---|---|---|
Aburria pipile | Pipile pipile | 1816.59 | CR | COL:35517887 |
Aburria cumanensis | Pipile cumanensis | 1239.22 | LC | COL:35537158 |
Aburria cujubi | Pipile cujubi | 1195.82 | LC | COL:35537159 |
Aburria jacutinga | Pipile jacutinga | 1240.96 | EN | COL:35517886 |
Megapodius reinwardt | -- | 666.34 | -- | COL:35521309 |
Francolinus levalliantoides | -- | 376.69 | -- | COL:35518087 |
Margaroperdix madagascariensis | Margaroperdix madagarensis | 245.00 | LC | COL:35521355 |
Catreus wallichii | -- | 1436.88 | -- | COL:35518185 |
Falcipennis falcipennis | Falcipennis falcipennis | 685.61 | NT | COL:35521380 |
Falcipennis canadensis | -- | 473.65 | -- | COL:35521381 |
-- | Megapodius decollatus | -- | LC | COL:35537166 |
-- | Scleroptila gutturalis | -- | LC | -- |
Now we see that there are many more matches between the datasets than we previously thought. In a workflow without taxonomic identifiers, resolving these additional matches would require a significant investment of time, as each name would need to be double-checked and matched manually.
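The reason the ID join finds more matches can be reproduced without any database. The sketch below is a toy illustration in base R, not taxadb's implementation: the lookup table, the "TOY:" identifiers, and the resolve_ids() helper are all invented here, and only the species names are borrowed from the tables above.

```r
# Toy synonym table: every name, accepted or synonym, points to one ID.
# The "TOY:" IDs are invented and are not real provider identifiers.
lookup <- data.frame(
  name = c("Aburria pipile", "Pipile pipile",
           "Aburria jacutinga", "Pipile jacutinga",
           "Falcipennis falcipennis"),
  id = c("TOY:1", "TOY:1", "TOY:2", "TOY:2", "TOY:3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: ID for each name, NA when the name is unknown
resolve_ids <- function(names, table = lookup) {
  table$id[match(names, table$name)]
}

traits <- data.frame(
  elton_name = c("Aburria pipile", "Aburria jacutinga",
                 "Falcipennis falcipennis", "Catreus wallichii"),
  mass = c(1816.59, 1240.96, 685.61, 1436.88),
  stringsAsFactors = FALSE
)
status <- data.frame(
  iucn_name = c("Pipile pipile", "Pipile jacutinga",
                "Falcipennis falcipennis"),
  category = c("CR", "EN", "NT"),
  stringsAsFactors = FALSE
)

# Inner join on the raw name strings: only identical spellings match
by_name <- merge(traits, status, by.x = "elton_name", by.y = "iucn_name")

# Inner join on resolved IDs: synonyms now line up
traits$id <- resolve_ids(traits$elton_name)
status$id <- resolve_ids(status$iucn_name)
# Drop unresolved names first, because merge() would pair NA with NA
by_id <- merge(traits[!is.na(traits$id), ],
               status[!is.na(status$id), ],
               by = "id")

nrow(by_name)  # 1
nrow(by_id)    # 3
```

The explicit NA filter matters in practice: names that no provider can resolve all share the ID NA, and joins that treat NA as a matching key would pair unrelated unresolved names with each other.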
Database facilitated questions
The local database structure also allows us to ask general questions of the entire database, either across providers or across tables for one provider, that are not possible with the API interface. For example, which provider would be able to resolve the largest number of species names in our dataset?
provider_counts <- trait_data %>%
  select(elton_name) %>%
  mutate(
    gbif = get_ids(elton_name, "gbif"),
    col = get_ids(elton_name, "col"),
    itis = get_ids(elton_name, "itis"),
    ncbi = get_ids(elton_name, "ncbi"),
    wd = get_ids(elton_name, "wd"),
    iucn = get_ids(elton_name, "iucn"),
    ott = get_ids(elton_name, "ott")
  ) %>%
  select(-elton_name) %>%  # keep only the provider ID columns
  purrr::map_dbl(function(x) sum(!is.na(x))) %>%  # resolved names per provider
  tibble::enframe("provider", "ID_count")
provider | ID_count |
---|---|
gbif | 10 |
col | 10 |
itis | 10 |
ncbi | 1 |
wd | 4 |
iucn | 0 |
ott | 10 |
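The counting step in the pipeline above can be tried on its own. The following base R sketch uses a made-up data frame in place of the get_ids() results (the column names provider_a and provider_b and the ID strings are placeholders), and counts the non-missing IDs in each provider column.

```r
# Made-up stand-in for a table of resolved IDs, one column per provider.
# NA marks a name the provider could not resolve.
ids <- data.frame(
  name = c("species one", "species two", "species three"),
  provider_a = c("A:1", "A:2", NA),
  provider_b = c(NA, NA, "B:9"),
  stringsAsFactors = FALSE
)

# Drop the name column, then count non-missing entries per provider column
counts <- vapply(ids[-1], function(x) sum(!is.na(x)), integer(1))

provider_counts_toy <- data.frame(
  provider = names(counts),
  ID_count = unname(counts),
  stringsAsFactors = FALSE
)
provider_counts_toy
```

Dropping the name column before counting keeps the result to one row per provider; otherwise the names themselves would be counted as if they were a provider.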
Or, even more generally, which bird families have the most species?
bird_families <- filter_rank(name = "Aves", rank = "class", provider = "col") %>%
  filter(taxonomicStatus == "accepted", taxonRank == "species") %>%
  group_by(family) %>%
  count(sort = TRUE) %>%
  head()
family | n |
---|---|
Tyrannidae | 401 |
Thraupidae | 374 |
Psittacidae | 370 |
Trochilidae | 338 |
Muscicapidae | 314 |
Columbidae | 312 |
And which species has the most synonyms?
most_synonyms <- taxa_tbl("col") %>%
  count(acceptedNameUsageID, sort = TRUE) %>%
  head() %>%
  collect()
acceptedNameUsageID | n |
---|---|
COL:43082445 | 456 |
COL:43081989 | 373 |
COL:43124375 | 329 |
COL:43353659 | 328 |
COL:43223150 | 322 |
COL:43337824 | 307 |
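What such a count measures depends on the table layout. The toy sketch below assumes one row per name with taxonomicStatus and acceptedNameUsageID columns, as in the queries above, and assumes that an accepted name's own row carries its own ID; under that assumption a plain count per ID includes the accepted name as well as its synonyms. The names and "TOY:" IDs are invented.

```r
# Invented name table: one row per name, each pointing at an accepted ID
taxa <- data.frame(
  scientificName = c("Toyus primus", "Toyus alpha", "Toyus beta",
                     "Toyus secundus", "Toyus gamma"),
  taxonomicStatus = c("accepted", "synonym", "synonym",
                      "accepted", "synonym"),
  acceptedNameUsageID = c("TOY:1", "TOY:1", "TOY:1", "TOY:2", "TOY:2"),
  stringsAsFactors = FALSE
)

# All names per accepted ID (the accepted name plus its synonyms)
names_per_id <- sort(table(taxa$acceptedNameUsageID), decreasing = TRUE)

# Synonyms only: drop the accepted rows before counting
is_synonym <- taxa$taxonomicStatus != "accepted"
synonyms_per_id <- sort(table(taxa$acceptedNameUsageID[is_synonym]),
                        decreasing = TRUE)

names_per_id
synonyms_per_id
```

Here the ranking is the same either way, but the counts differ by one per ID, which is worth keeping in mind when reading the numbers as synonym counts.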
In addition to facilitating quick and easy incorporation of taxonomic identifiers into standard research workflows, taxadb provides direct access to the underlying database of taxonomic providers. Users can therefore use familiar syntax to ask important exploratory questions of the providers rather than being dependent upon the kinds of queries allowed by an API. By providing both a simple interface to IDs and the potential for more in-depth exploration, we hope to encourage improved inclusion and understanding of taxonomic data by the biodiversity community.
For more details on the backend options, providers, and the above examples, please see our docs. We also welcome feedback on our manuscript.
Acknowledgements
taxadb was co-developed by Carl Boettiger. The package was greatly improved by the rOpenSci peer review process and reviewers Margaret Siple and Lindsay Platt.
References
1. Retrieved [2019], from the Integrated Taxonomic Information System (ITIS) (http://www.itis.gov).
2. Roskov Y., Ower G., Orrell T., Nicolson D., Bailly N., Kirk P.M., Bourgoin T., DeWalt R.E., Decock W., Nieukerken E. van, Zarucchi J., Penev L., eds. (2019). Species 2000 & ITIS Catalogue of Life, 2019 Annual Checklist. Digital resource at www.catalogueoflife.org/annual-checklist/2019. Species 2000: Naturalis, Leiden, the Netherlands. ISSN 2405-884X. https://www.catalogueoflife.org/annual-checklist/2019/
3. Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Mizrachi I, Ostell J, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Souvorov A, Starchenko G, Tatusova TA, Wagner L, Yaschenko E, Ye J (2009). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2009 Jan;37(Database issue):D5-15. Epub 2008 Oct 21. https://doi.org/10.1093/nar/gkn741
4. Chamberlain S, Szoecs E, Foster Z, Arendsee Z, Boettiger C, Ram K, Bartomeus I, Baumgartner J, O’Donnell J, Oksanen J, Tzovaras BG, Marchand P, Tran V, Salmon M, Li G, Grenié M (2019). taxize: Taxonomic information from around the web. R package version 0.9.9, https://github.com/ropensci/taxize.
5. Wilman, H. et al. EltonTraits 1.0: Species-level foraging attributes of the world’s birds and mammals: Ecological Archives E095-178. Ecology 95, 2027–2027 (2014). https://doi.org/10.1890/13-1917.1
6. IUCN 2019. The IUCN Red List of Threatened Species. Version 2019-3. http://www.iucnredlist.org.