Using taxa and metacoder to explore taxonomy of Vancouver’s trees
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
From the last post, Vancouver has several common genus in its collection, such as Prunus and Acer. So rather than analyzing the diversity of Vancouver tree species on a species level, we could, with the help of some R packages, visualize the variety of taxonomic groups.
First, we format the Vancouver tree dataset similar to before to get the proper scientific name. This is the same formatting done in the previous post.
library(dplyr) library(purrr) library(tidyr) # read in trees shp # read in trees df trees_df<-read.csv("StreetTrees_CityWide.csv", stringsAsFactors = F) trees_df<-trees_df[trees_df$LATITUDE > 40 & trees_df$LONGITUDE < -100,] trees_df$SCIENTIFIC_NAME <- paste(trees_df$GENUS_NAME, trees_df$SPECIES_NM) trees_df$GENUS_NAME <- as.character(trees_df$GENUS_NAME) trees_df$SPECIES_NAME <- as.character(trees_df$SPECIES_NAME) species_correction <- function(genus, species){ if(species == "SPECIES"){ return(genus) } else if (length(grep("\\sX$", species)) > 0){ return(paste(genus," x ", unlist(strsplit(species, " "))[[1]])) } else { return(paste(genus, species)) } } trees_df$SCIENTIFIC_NAME <- map2_chr(trees_df$GENUS_NAME, trees_df$SPECIES_NAME, species_correction) head(trees_df) ## TREE_ID CIVIC_NUMBER STD_STREET NEIGHBOURHOOD_NAME ## 1 589 4694 W 8TH AV WEST POINT GREY ## 2 592 1799 W KING EDWARD AV SHAUGHNESSY ## 3 594 2401 W 1ST AV KITSILANO ## 5 596 2428 W 1ST AV KITSILANO ## 6 598 2428 W 1ST AV KITSILANO ## 7 599 1725 BALSAM ST KITSILANO ## ON_STREET ON_STREET_BLOCK STREET_SIDE_NAME ASSIGNED ## 1 BLANCA ST 2400 EVEN N ## 2 W KING EDWARD AV 1700 MED N ## 3 W 1ST AV 2400 ODD N ## 5 W 1ST AV 2400 EVEN N ## 6 W 1ST AV 2400 EVEN N ## 7 W 1ST AV 2400 EVEN N ## HEIGHT_RANGE_ID DIAMETER DATE_PLANTED PLANT_AREA ROOT_BARRIER CURB ## 1 5 32.0 20180128 B N Y ## 2 1 3.0 20100104 30 N Y ## 3 4 14.0 20180128 B N Y ## 5 2 11.0 20180128 B N Y ## 6 2 3.0 20180128 B N Y ## 7 2 13.5 20180128 B N Y ## CULTIVAR_NAME GENUS_NAME SPECIES_NAME COMMON_NAME LATITUDE ## 1 ALNUS RUGOSA SPECKLED ALDER 49.26532 ## 2 ABIES GRANDIS GRAND FIR 49.24952 ## 3 PINUS NIGRA AUSTRIAN PINE 49.27096 ## 5 PRUNUS SARGENTII SARGENT FLOWERING CHERRY 49.27083 ## 6 PRUNUS SARGENTII SARGENT FLOWERING CHERRY 49.27084 ## 7 PICEA ABIES NORWAY SPRUCE 49.27083 ## LONGITUDE SCIENTIFIC_NAME ## 1 -123.2150 ALNUS RUGOSA ## 2 -123.1457 ABIES GRANDIS ## 3 -123.1599 PINUS NIGRA ## 5 -123.1602 PRUNUS SARGENTII ## 6 -123.1605 PRUNUS SARGENTII ## 7 -123.1600 PICEA ABIES
To get taxonomic information, we can take advantage of the rOpenSci package taxa
. This package provides a function lookup_tax_data
which will query NCBI taxonomy. We can then use the package metacoder
to visualize a “heat tree” for datasets of taxon. This package was originally developed for next generation sequencing and metabarcoding in identifying species from DNA. So the “heat tree” has been diverted from its original purpose but still works very well.
# install.packages("taxa", "metacoder") library(taxa) ## Warning: package 'taxa' was built under R version 3.5.1 library(metacoder) ## Warning: package 'metacoder' was built under R version 3.5.1 ## This is metacoder verison 0.3.0 (stable). If you use metacoder for published research, please cite our paper: ## ## Foster Z, Sharpton T and Grunwald N (2017). "Metacoder: An R package for visualization and manipulation of community taxonomic diversity data." PLOS Computational Biology, 13(2), pp. 1-15. doi: 10.1371/journal.pcbi.1005404 ## ## Enter `citation("metacoder")` for a BibTeX entry for this citation.
To save on space, we will make a table containing each species in our Vancouver dataset and the number of times it occurs.
species_df <- as.data.frame(table(trees_df$SCIENTIFIC_NAME)) names(species_df)<- c("SCIENTIFIC_NAME", "count") head(species_df) ## SCIENTIFIC_NAME count ## 1 ABIES 33 ## 2 ABIES BALSAMEA 7 ## 3 ABIES CONCOLOR 7 ## 4 ABIES FRASERI 1 ## 5 ABIES GRANDIS 65 ## 6 ABIES NORDMANNIANA 1
Querying the NCBI servers for information takes a long time depending on the number of request. If you sign up for a NCBI API key (instructions here), and add it to your .Renviron
file which is usually located in your home directory.
We are going to use the taxize
function classification
to obtain the taxonomy of each species. Then, we are inserting the taxonomy for each species back into the trees_df
so each tree has an associated taxonomy in column.
library(taxize) species_classifications <- classification(species_df$SCIENTIFIC_NAME, db = "ncbi", rows = 1)
This creates a list of data frames containing the taxonomy. We can then extract the taxonomy information by iterating over the classification list and drop all species which do not have taxonomy.
class_to_df <- function(x, ranking){ if(!is.na(x)){ return(x$name[x$rank == ranking]) } } # species_df <- species_df[species_df$SCIENTIFIC_NAME != "PINUS THUNBERGII", ] species_df$superkingdom <- as.character(sapply(species_classifications, class_to_df, ranking="superkingdom")) species_df$kingdom <- as.character(sapply(species_classifications, class_to_df, ranking="kingdom")) species_df$phylum <- as.character(sapply(species_classifications, class_to_df, ranking="phylum")) species_df$subphylum <- as.character(sapply(species_classifications, class_to_df, ranking="subphylum")) species_df$subclass <- as.character(sapply(species_classifications, class_to_df, ranking="subclass")) species_df$order <- as.character(sapply(species_classifications, class_to_df, ranking="order")) species_df$family <- as.character(sapply(species_classifications, class_to_df, ranking="family")) species_df$genus <- as.character(sapply(species_classifications, class_to_df, ranking="genus")) species_df$species <- as.character(sapply(species_classifications, class_to_df, ranking="species")) species_df <- species_df[species_df$superkingdom != "NULL", ]
We can then join the taxonomy back to the original data set so each tree has its own taxonomy. We also drop the other columns not related to the taxonomy.
trees_df_class <- left_join(trees_df, species_df) kept_columns <- c('SCIENTIFIC_NAME', 'phylum', 'subphylum', 'order', 'family', 'genus', 'species') trees_df_class <- trees_df_class[, kept_columns] ## write.csv(trees_df_class, "trees_taxa.csv", row.names=F)
Next, we use the parse_tax_data
function to … … parse the taxonomy columns in the data frame into a tax_map
object.
By making a ranks
vector, we tell parse_tax_data
which columns contain the taxonomy and the ranking order.
ranks <- c('phylum', 'subphylum', 'order', 'family', 'genus', 'species') obj <- parse_tax_data(trees_df_class, class_cols = ranks, named_by_rank = T) ## Warning: The following 7305 of 124349 input indexes have `NA` in their classifications: ## 29, 33, 37, 38, 64 ... 124328, 124329, 124342, 124347
The resulting object returned from parse_tax_data
is actually a special R6
object rather than the typical data.frame
or S4 object. But since metacoder
is based on using taxa
, it can directly interface with the R6
object produced and we can produce a “heat tree” without any further manipulation.
set.seed(123) tax_graph <- obj %>% filter_taxa(nchar(taxon_names)!=0) %>% filter_taxa(taxon_ranks == "genus", supertaxa = TRUE) %>% heat_tree(node_label = taxon_names, node_size = n_obs, node_label_size = n_obs, node_color = n_obs, node_size_range = c(0.01, .05), node_color_range = RColorBrewer::brewer.pal(5, "BuGn"), node_label_size_range = c(0.02, 0.05), node_color_axis_label = "Number of Trees") tax_graph
The tax_map
is filtered to only the genus level in order to produce a legible visualization compared to a species level heat tree.
Node size is related to number of observations. The most common genus in Vancouver are Prunus and Acer.
sessionInfo() ## R version 3.5.0 (2018-04-23) ## Platform: x86_64-w64-mingw32/x64 (64-bit) ## Running under: Windows 10 x64 (build 17134) ## ## Matrix products: default ## ## locale: ## [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 ## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C ## [5] LC_TIME=English_Canada.1252 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] taxize_0.9.3 metacoder_0.3.0 taxa_0.3.1 tidyr_0.8.1 ## [5] purrr_0.2.5 dplyr_0.7.5 ## ## loaded via a namespace (and not attached): ## [1] zoo_1.8-2 tidyselect_0.2.4 xfun_0.3 ## [4] reshape2_1.4.3 lattice_0.20-35 colorspace_1.3-2 ## [7] htmltools_0.3.6 yaml_2.1.19 rlang_0.2.2 ## [10] pillar_1.2.3 glue_1.2.0 RColorBrewer_1.1-2 ## [13] bindrcpp_0.2.2 foreach_1.4.4 bindr_0.1.1 ## [16] plyr_1.8.4 stringr_1.3.1 ggfittext_0.6.0 ## [19] munsell_0.5.0 blogdown_0.8 gtable_0.2.0 ## [22] codetools_0.2-15 evaluate_0.10.1 labeling_0.3 ## [25] knitr_1.20 parallel_3.5.0 curl_3.2 ## [28] Rcpp_0.12.17 backports_1.1.2 scales_0.5.0 ## [31] jsonlite_1.5 ggplot2_3.0.0 digest_0.6.15 ## [34] stringi_1.1.7 bookdown_0.7.17 grid_3.5.0 ## [37] rprojroot_1.3-2 tools_3.5.0 magrittr_1.5 ## [40] lazyeval_0.2.1 tibble_1.4.2 bold_0.5.0 ## [43] crul_0.5.2 crayon_1.3.4 ape_5.1 ## [46] pkgconfig_2.0.1 data.table_1.11.4 xml2_1.2.0 ## [49] assertthat_0.2.0 rmarkdown_1.10 reshape_0.8.7 ## [52] httr_1.3.1 iterators_1.0.9 R6_2.2.2 ## [55] igraph_1.2.1 nlme_3.1-137 compiler_3.5.0
Notes:
The taxa
package has a function lookup_tax_data
which queries NCBI for and creates a tax_map
object. However, I have problems using tree_df
, possibly due to the large amount of associated data which has to go into the tax_map
object. Additionally, the taxon_ranks
are not created directly when using lookup_tax_data
. The example code for using lookup_tax_data
is below. Also weirdly, lookup_tax_data
breaks when querying for Pinus thunbergii the last three times I tried.
trees_df <- trees_df[trees_df$SCIENTIFIC_NAME != "PINUS THUNBERGII", c("SCIENTIFIC_NAME", "GENUS_NAME")] tax_env <- lookup_tax_data(trees_df, type = "taxon_name", column = "SCIENTIFIC_NAME", database = "ncbi")
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.