Darwin to the Rescue: Using Phylogenetic Information to Overcome the Raunkiaeran Shortfall
Functional trait-based research has a unifying role in ecology, allowing the integration of ecological and evolutionary dynamics across different levels of biological organization and across spatial and temporal scales. The basic rationale underlying functional ecological studies is that the role an organism plays in an ecosystem is not determined by its taxonomic identity, but rather by its behavioral and ecological characteristics, which can differ within and between species. Thus, functional trait research offers a view complementary to the classic taxonomic approach, and is a crucial step toward revealing the mechanisms that drive biotic interactions, patterns of community assembly and disassembly, and ecosystem functioning.
Despite its promising role in ecology, trait-based research is still severely limited by data availability. This limitation, the so-called Raunkiaeran shortfall, is now recognized as one of the most important shortfalls hindering scientific progress in ecology and evolutionary biology. Sampling species traits is both time consuming and expensive. Thus, data imputation has become an inherent part of data processing and analysis in functional ecology. The ecologist's tool kit of data imputation techniques is quite broad, ranging from simply replacing missing values with average trait values to more elaborate statistical tools based on machine learning. Among these procedures, the use of random forests informed by phylogenetic information has been a promising one.
Just as in functional ecology, our limited knowledge of the phylogenetic relationships among all living species, aka the Darwinian shortfall, imposes another great constraint on our understanding of biodiversity. However, over the past decades, ecologists have witnessed a rapid rise in the availability of extensive phylogenies for some of the most important taxonomic groups. Although we are still far from resolving the Darwinian shortfall, this progress in phylogenetics has given us a powerful tool to impute missing data in trait databases until more refined information is collected.
In our recent paper “Using phylogenetic information to impute missing functional trait values in ecological databases”, we evaluated the performance of a Random Forest approach, the missForest algorithm, which is widely used to impute species trait data with the help of phylogenetic information. Originally, the missForest method was not conceived to include phylogenetic information in the imputation process. However, Penone et al. (2014) proposed an imputation framework that includes phylogenetic information in the form of phylogenetic eigenvectors. Under this framework, phylogenetic distance matrices are submitted to an ordination procedure and summarized as eigenvectors that represent the evolutionary relationships among species. The first eigenvectors correspond to larger distances among species, expressing divergences closer to the root of the phylogeny. The phylogenetic eigenvectors are then added to the missForest algorithm as predictor variables during the imputation procedure. Although this procedure has been widely used in ecological studies, its performance had yet to be evaluated. In our paper, we devised a simulation experiment to evaluate its performance under different scenarios of missing trait values, phylogenetic signal, and correlation among traits.
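To make the idea of phylogenetic eigenvectors more concrete, here is a minimal base R sketch of a principal coordinates analysis on a small distance matrix. The four species names and the distances are invented for illustration, and this is only an assumption about the general kind of ordination involved, not the exact routine implemented in the PVR package.

```r
# Toy phylogenetic distance matrix (hypothetical values):
# sp.A and sp.B are close relatives, sp.C and sp.D are close relatives,
# and the two pairs are separated by a deep split near the root.
sp <- c("sp.A", "sp.B", "sp.C", "sp.D")
phylo.dist <- matrix(c( 0,  2, 10, 10,
                        2,  0, 10, 10,
                       10, 10,  0,  2,
                       10, 10,  2,  0),
                     nrow = 4, byrow = TRUE, dimnames = list(sp, sp))

# Principal coordinates analysis: double-center the squared distances ...
n <- nrow(phylo.dist)
centering <- diag(n) - matrix(1 / n, n, n)
gower <- -0.5 * centering %*% phylo.dist^2 %*% centering

# ... and extract eigenvalues and eigenvectors
decomp <- eigen(gower, symmetric = TRUE)

# Keep only the axes with a positive eigenvalue
keep <- decomp$values > 1e-8
eigvecs <- decomp$vectors[, keep, drop = FALSE]
rownames(eigvecs) <- sp

# Eigenvalues are sorted in decreasing order; the first axis carries the deep split
decomp$values
eigvecs[, 1]
```

In this toy case the first eigenvector gives sp.A and sp.B one sign and sp.C and sp.D the other, which is the "divergence closer to the root" described above. The remaining axes separate species within each pair.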
In a nutshell, our results suggest that the importance of phylogenetic information to the imputation process depends on the proportion of missing entries, the degree of phylogenetic conservatism of the traits, and the level of correlation among traits. In general, this imputation procedure performs better with datasets comprising highly conserved traits. We also show that under high levels of trait correlation the imputation performs well, regardless of the level of phylogenetic signal and of whether phylogenetic information is included. Although the phylogeny-based missForest algorithm seems to be a robust method for trait imputation, it is worth stressing that no imputation method will ever replace the value of collected data. Still, data imputation will remain a crucial part of data treatment and analysis in functional ecology until we can overcome the Raunkiaeran shortfall.
Here I provide a short script to help functional ecologists who have incomplete trait datasets and need options for data imputation. For this simple example, I used a small hummingbird trait dataset from Vizentin-Bugoni et al. (2020), which is available on GitHub. The full code for our paper is available here.
```r
### Load packages
library(ape)
library(phytools)
library(PVR)
library(missForest)

### Import data from GitHub
# Data come from Vizentin-Bugoni et al. "Including rewiring in the estimation of
# the robustness of mutualistic networks." Methods in Ecology and Evolution 11.1 (2020): 106-116.
# If you copy a file URL from GitHub, make sure you select the "raw" option
trait.df = read.csv("https://raw.githubusercontent.com/vanderleidebastiani/rewiring/master/DataSetExamples/SantaVirginia/SantaVirginia_dataset_h_morph.csv",
                    header = TRUE, sep = ",", row.names = 1)

### As there are no missing data, here we create some missing values at random
# You can also use the function missForest::prodNA() from the missForest package for that
# Define the number of missing observations
miss.val = 3
while (sum(is.na(trait.df)) < miss.val) {
  trait.df[sample(nrow(trait.df), 1), sample(ncol(trait.df), 1)] = NA
}
# Look at the new dataset with missing values
trait.df

### Import phylogenetic information from the GitHub directory
# This is a nexus file containing 100 trees from: Jetz et al. "The global diversity
# of birds in space and time." Nature 491.7424 (2012): 444-448.
# Trees are subsampled and pruned from birdtree.org (on 2021-11-03)
tree = read.nexus("https://raw.githubusercontent.com/bastazini/geekcologist/main/birds_phylo.nex")

## Create a consensus phylogenetic tree
# p = 0.5 specifies that the tree must be a "majority rules consensus (MRC)"
hb.tree = consensus.edges(tree, consensus.tree = consensus(tree, p = 0.5))
# Ultrametric version of the consensus tree (not used in the imputation below)
consensus_ultra = chronos(hb.tree, lambda = 0) # lambda is the rate-smoothing parameter

## Plot phylogenetic tree
plotTree(hb.tree, type = "fan")

### Imputing trait data using the phylogenetic information
# This is based on our example available on GitHub
# (https://github.com/vanderleidebastiani/missForestImputation)

# First of all, decompose the phylogenetic distance matrix into a set of
# orthogonal vectors (PVRs)
phylo.vectors = PVR::PVRdecomp(hb.tree)
# Extract the PVRs
pvrs = phylo.vectors@Eigen$vectors
# Combine traits and PVRs in a data frame
# Note: cbind() matches rows by position, so the species in trait.df must be
# in the same order as in hb.tree$tip.label
trait.dfs.pvrs = cbind(trait.df, pvrs)
# Imputation using missForest (note that this function has other arguments,
# see details in Stekhoven and Buehlmann 2012)
RF.imp = missForest::missForest(trait.dfs.pvrs, maxiter = 15, ntree = 100, variablewise = FALSE)
# Here it is! Your imputed dataset!
trait.dfs.imp = RF.imp$ximp[, seq_len(ncol(trait.df)), drop = FALSE]
trait.dfs.imp

## You can access the estimated (out-of-bag) imputation error, expressed as the
## Normalized Root-Mean-Square Error (NRMSE)
# NRMSE ranges from 0 to ≈ 1. NRMSE values ≈ 1 occur when the estimations are
# poor or when the noise involved is too large
RF.imp$OOBerror
```
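To get a feeling for what an NRMSE close to 1 means, it can help to compute the error by hand for the simplest imputation method mentioned earlier, replacing missing entries with the trait average. The sketch below uses only base R; the trait values, column names, and masked cells are made up for illustration, and the `nrmse()` helper is a hypothetical function written here, assuming the usual definition (root mean squared error over the masked entries divided by the variance of their true values).

```r
# Hypothetical complete trait table (invented values)
trait.true <- data.frame(
  bill.length = c(15.2, 18.4, 20.1, 22.7, 30.5, 16.8),
  body.mass   = c(3.1, 4.0, 4.6, 5.2, 7.9, 3.5)
)

# Mask a few known values so that we can check the imputation against them
trait.mis <- trait.true
trait.mis[c(2, 5), "bill.length"] <- NA
trait.mis[4, "body.mass"] <- NA
mis <- is.na(trait.mis)

# Baseline imputation: replace each missing entry with the mean of the observed values
trait.imp <- trait.mis
for (j in seq_along(trait.imp)) {
  trait.imp[mis[, j], j] <- mean(trait.mis[, j], na.rm = TRUE)
}

# Hypothetical helper: NRMSE computed over the masked entries only
nrmse <- function(imp, true, mis) {
  imp <- as.matrix(imp)
  true <- as.matrix(true)
  sqrt(mean((imp[mis] - true[mis])^2) / var(true[mis]))
}

nrmse.mean <- nrmse(trait.imp, trait.true, mis)
nrmse.mean
```

Unlike this hand-made check, `RF.imp$OOBerror` is an out-of-bag estimate and does not require knowing the true values. Masking known values as above is only possible when you start from complete records, but it gives a simple baseline against which a phylogeny-informed imputation can be compared.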