Use case: combining taxize and rgbif
Sure thing... This is just the sort of thing for which rOpenSci is being built.
A colleague of mine recently saw our packages in development and thought, “Hey, that could totally make my life easier.” What was made easier, you ask? This was his situation:
He had a list of ca. 1200 species of birds and wanted to first obtain the most current species names before seeking location data for occurrences of all the species.
So what tools do we need for this? We need the packages taxize and rgbif:
- taxize: The taxize package allows you to search taxonomic information across the Universal Biological Indexer and Organizer (uBio), Integrated Taxonomic Information System (ITIS), Encyclopedia of Life (EOL), the Taxonomic Name Resolution Service (TNRS), and Phylomatic.
- rgbif: The rgbif package allows you to search for and retrieve data from the Global Biodiversity Information Facility.
If you want to run this code, the entire workflow is here, as a GitHub Gist.
First step: check names
Note that, for brevity, we are using a subset of my friend’s actual dataset here: 10 species rather than the full 1200.
Let’s just wrap up all the dirty work into one function called checkname. This function uses a few taxize functions, including get_tsn and getacceptname.
library(taxize)

checkname <- function(name) {
  # name: scientific name
  # get taxonomic serial number (TSN); fall back to "no_results" if the lookup fails
  if (class(try(tsn <- get_tsn(name, "sciname", by_ = "name"), silent = TRUE)) == "try-error") {
    tsn <- "no_results"
  }
  # check accepted name
  out <- getacceptname(tsn)
  # each branch returns: status, submitted name, name to use, TSN
  if (out[[1]] == "no_results") {
    list("check_spelling", name, "check_spelling", out)
  } else if (length(out) == 2) {
    list("new_name", name, as.character(out)[[1]], as.character(out)[[2]])
  } else if (class(as.numeric(out)) == "numeric") {
    list("good_name", name, name, out)
  }
}
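The try() and class() check at the top of checkname can also be written with tryCatch, which hands back a fallback value directly. Here is a minimal base R sketch of that pattern. Note that safe_lookup and fake_lookup are hypothetical names made up for illustration; fake_lookup merely stands in for a remote call such as get_tsn and is not part of taxize.

```r
# Hypothetical helper: run a lookup function and fall back to "no_results"
# if the lookup signals an error (same intent as the try() check in checkname).
safe_lookup <- function(name, lookup) {
  tryCatch(lookup(name), error = function(e) "no_results")
}

# Hypothetical stand-in for a remote lookup; it fails for names it does not know.
# The TSN is the one reported for Cathartes_aura in this post's results.
fake_lookup <- function(name) {
  known <- c(Cathartes_aura = "175265")
  if (!name %in% names(known)) stop("name not found")
  known[[name]]
}

safe_lookup("Cathartes_aura", fake_lookup)           # "175265"
safe_lookup("Agapornis_roseicapillis", fake_lookup)  # "no_results"
```

The upside is that the fallback lives in one place, so the rest of the function never has to inspect the class of a try() result.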
Nice. Now let’s run our species list through the function checkname using the llply function from the plyr package.
library(plyr)

ournames <- read.csv("birdlist_ten.csv")
itisout <- llply(ournames[, 1], checkname, .progress = "text") # query ITIS
  |======================================================================================| 100%

dfnames <- ldply(itisout, function(x) { # make a data frame of results
  out_ <- as.data.frame(x)
  names(out_) <- c("status", "name_old", "name_new", "TSN")
  out_
})
dfnames

           status                name_old                 name_new        TSN
 1 check_spelling Agapornis_roseicapillis           check_spelling no_results
 2       new_name  Catharacta_maccormicki Stercorarius maccormicki     660062
 3       new_name         Catharacta_skua        Stercorarius skua     660059
 4      good_name          Cathartes_aura           Cathartes_aura     175265
 5      good_name      Catharus_bicknelli       Catharus_bicknelli     554148
 6      good_name     Catharus_fuscescens      Catharus_fuscescens     179796
 7      good_name       Catharus_guttatus        Catharus_guttatus     179779
 8      good_name        Catharus_minimus         Catharus_minimus     179793
 9      good_name      Catharus_ustulatus       Catharus_ustulatus     179788
10       new_name      Ceratogymna_brevis        Bycanistes brevis     707796
It looks like we have one name spelled wrong (“check_spelling”), three name replacements (“new_name”), and the remainder checked out just fine with ITIS.
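If you would rather not depend on plyr for this step, the same list-to-data-frame conversion can be sketched in base R, and table() gives the status counts directly. The three results below are copied from the output above; storing every TSN as a character string is an assumption made here to keep the columns uniform.

```r
# Three results copied from the output above, in the list shape checkname returns.
# Assumption: TSNs are kept as character strings so all rows share one type.
itisout <- list(
  list("check_spelling", "Agapornis_roseicapillis", "check_spelling", "no_results"),
  list("new_name", "Catharacta_skua", "Stercorarius skua", "660059"),
  list("good_name", "Cathartes_aura", "Cathartes_aura", "175265")
)

# Turn each result into a one-row data frame, then stack the rows.
rows <- lapply(itisout, function(x) {
  as.data.frame(setNames(x, c("status", "name_old", "name_new", "TSN")),
                stringsAsFactors = FALSE)
})
dfnames <- do.call(rbind, rows)

# Count how many names fall into each status.
status_counts <- table(dfnames$status)
status_counts
```

On the full ten-species table, the same table() call is a quick way to confirm the counts read off by eye.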
Now we need to remove that one species with the spelling problem for now (although you would of course fix it if it were your project). Then we feed the new species list to GBIF queries.
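With 1200 species, the misspelled names will not all sit in row 1, so it may be safer to drop them by status rather than by position. A small sketch using a few rows from the results table (only the status and corrected-name columns are reproduced here; resolved is a name made up for this example):

```r
# Subset of the results table shown above (status and corrected name only).
dfnames <- data.frame(
  status = c("check_spelling", "new_name", "good_name", "new_name"),
  name_new = c("check_spelling", "Stercorarius maccormicki",
               "Cathartes_aura", "Bycanistes brevis"),
  stringsAsFactors = FALSE
)

# Keep every row that resolved, wherever the failures sit in the table.
resolved <- dfnames[dfnames$status != "check_spelling", ]

# Replace underscores with spaces to get names suitable for a search.
resolved$gbifname <- gsub("_", " ", resolved$name_new)
resolved$gbifname
```

This leaves three names, all with spaces instead of underscores, and keeps working no matter how many names fail or where they appear.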
p.s. The output from above spits out TSNs too, which you can use to query for more taxonomic information for species through the taxize package.
Second step: get lat/long data
library(rgbif)

dfnames$gbifname <- gsub("_", " ", dfnames[,3]) # create new name column
dfnames # we now have a column of names without the underscore for GBIF search

           status                name_old                 name_new        TSN                 gbifname
 1 check_spelling Agapornis_roseicapillis           check_spelling no_results           check spelling
 2       new_name  Catharacta_maccormicki Stercorarius maccormicki     660062 Stercorarius maccormicki
 3       new_name         Catharacta_skua        Stercorarius skua     660059        Stercorarius skua
 4      good_name          Cathartes_aura           Cathartes_aura     175265           Cathartes aura
 5      good_name      Catharus_bicknelli       Catharus_bicknelli     554148       Catharus bicknelli
 6      good_name     Catharus_fuscescens      Catharus_fuscescens     179796      Catharus fuscescens
 7      good_name       Catharus_guttatus        Catharus_guttatus     179779        Catharus guttatus
 8      good_name        Catharus_minimus         Catharus_minimus     179793         Catharus minimus
 9      good_name      Catharus_ustulatus       Catharus_ustulatus     179788       Catharus ustulatus
10       new_name      Ceratogymna_brevis        Bycanistes brevis     707796        Bycanistes brevis

dfnames <- dfnames[-1,] # remove row 1
gbiftestout <- llply(as.list(dfnames[,5]), function(x) {
  occurrencelist(x, coordinatestatus = TRUE, maxresults = 10, latlongdf = TRUE)
})
gbiftestout[[1]] # here's the data frame of results from one species

                    sciname latitude longitude
 1 Stercorarius maccormicki 36.65685 -121.9187
 2 Stercorarius maccormicki 36.85800 -122.0910
 3 Stercorarius maccormicki 46.89017 -125.0051
 4 Stercorarius maccormicki 36.85800 -122.0910
 5 Stercorarius maccormicki 36.65685 -121.9187
 6 Stercorarius maccormicki 40.76234 -124.2363
 7 Stercorarius maccormicki 36.85800 -122.0910
 8 Stercorarius maccormicki 36.85800 -122.0910
 9 Stercorarius maccormicki 36.85800 -122.0910
10 Stercorarius maccormicki 40.76234 -124.2363

gbiftestout_df <- ldply(gbiftestout, identity) # make a data frame of all results
rbind(head(gbiftestout_df), tail(gbiftestout_df)) # look at first and last 6 rows

                    sciname latitude longitude
 1 Stercorarius maccormicki 36.65685 -121.9187
 2 Stercorarius maccormicki 36.85800 -122.0910
 3 Stercorarius maccormicki 46.89017 -125.0051
 4 Stercorarius maccormicki 36.85800 -122.0910
 5 Stercorarius maccormicki 36.65685 -121.9187
 6 Stercorarius maccormicki 40.76234 -124.2363
85        Bycanistes brevis -0.16700   37.3170
86        Bycanistes brevis  0.31700   32.5830
87        Bycanistes brevis -0.16700   37.3170
88        Bycanistes brevis -0.16700   37.3170
89        Bycanistes brevis  0.05000   37.6500
90        Bycanistes brevis  0.05000   37.6500
Beauty! That just saved a lot of time I reckon.
Of course there are many more options within the functions to grab data from GBIF – I only show retrieval of latitude and longitude data for species here.
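You may have noticed that several of the returned records share the same coordinates. Whether those are true repeat observations or duplicates is something to judge for your own data, but here is a base R sketch of how you might collapse them, using the ten Stercorarius maccormicki records shown above (occ, occ_unique and counts are names made up for this example):

```r
# The ten Stercorarius maccormicki records shown above.
occ <- data.frame(
  sciname = "Stercorarius maccormicki",
  latitude = c(36.65685, 36.85800, 46.89017, 36.85800, 36.65685,
               40.76234, 36.85800, 36.85800, 36.85800, 40.76234),
  longitude = c(-121.9187, -122.0910, -125.0051, -122.0910, -121.9187,
                -124.2363, -122.0910, -122.0910, -122.0910, -124.2363),
  stringsAsFactors = FALSE
)

# Drop repeated rows so each locality appears once.
occ_unique <- unique(occ)
nrow(occ_unique)  # 4 distinct localities out of 10 records

# Count how many records share each locality.
counts <- aggregate(n ~ latitude + longitude, data = transform(occ, n = 1), FUN = sum)
counts
```

Keeping the counts alongside the unique localities means no information is lost if the repeats turn out to matter.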
Third step: make some maps
install.packages("maps")
require(ggplot2)
require(maps)
world <- map_data("world")
mexico <- subset(world, region == "Mexico")

# Make a plot for Stercorarius maccormicki
ggplot(world, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white", color = "gray40", size = .2) +
  geom_jitter(data = gbiftestout[[1]], aes(longitude, latitude), alpha = 0.6, size = 4, color = "blue") +
  ggtitle("Stercorarius maccormicki")

# Make a plot for Catharus guttatus, just in Mexico though
ggplot(mexico, aes(long, lat)) +
  geom_polygon(aes(group = group), fill = "white", color = "gray40", size = .2) +
  geom_jitter(data = gbiftestout[[6]], aes(longitude, latitude), alpha = 0.6, size = 4, color = "blue") +
  ggtitle("Catharus guttatus")
Here are the two maps, first for Stercorarius maccormicki, and then for Catharus guttatus.
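One thing to keep in mind for the Mexico map: the point layer still receives every record for the species, so any record outside Mexico gets drawn too and stretches the axes. If that happens, you could subset the points to a lat/long window first. Below is a rough base R sketch. The helper in_bbox is hypothetical, the window is arbitrary (it is not Mexico's actual extent), and the points are a few records from the output above.

```r
# Hypothetical helper: keep rows whose coordinates fall inside a lat/long box.
in_bbox <- function(df, lat_range, long_range) {
  keep <- df$latitude >= lat_range[1] & df$latitude <= lat_range[2] &
    df$longitude >= long_range[1] & df$longitude <= long_range[2]
  df[keep, ]
}

# A few records copied from the output above.
pts <- data.frame(
  sciname = c("Stercorarius maccormicki", "Stercorarius maccormicki",
              "Bycanistes brevis", "Bycanistes brevis"),
  latitude = c(36.65685, 46.89017, -0.16700, 0.31700),
  longitude = c(-121.9187, -125.0051, 37.3170, 32.5830),
  stringsAsFactors = FALSE
)

# Arbitrary illustrative window (not Mexico's actual extent).
inside <- in_bbox(pts, lat_range = c(30, 40), long_range = c(-130, -115))
inside
```

In the workflow above, one option for the window would be range(mexico$lat) and range(mexico$long), since the mexico data frame already carries those columns.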
Fourth step: smile and get back to us
Wasn’t that easy? So much better than checking names one by one manually, then retrieving data from GBIF manually, both through web interfaces.
Please tell us here, or on Twitter, what other use cases you can think of!
Again, if you want to run this code, the entire workflow is here, as a GitHub Gist. And the species list is below.
The species list:
genus_species
Agapornis_roseicapillis
Catharacta_maccormicki
Catharacta_skua
Cathartes_aura
Catharus_bicknelli
Catharus_fuscescens
Catharus_guttatus
Catharus_minimus
Catharus_ustulatus
Ceratogymna_brevis