finding meaningful clusters in phylogenetic trees or other hierarchical clusterings

Glenn

9 years ago

[This article was first published on Pirate Science » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Phylogenetic trees are a specialization of hierarchical clustering which elegantly capture relatedness between observations, grouping like with like. Yet hierarchical clusterings have one common complaint, as compared to density/distribution based clustering, the ability to classify the data into different types. Finding a meaningful cut of the tree into subtrees remains an open question in the field

The prolific Mattias Prosperi (9 publications in the first 9 months of 2012) has proposed a method for automatically partitioning phylogenetic trees of pathogens into transmission clusters. The intuition is that a group of pathogens represent a transmission cluster if their sequences are monophyletic and more closely related than those from two randomly selected individuals.

To capture this intuition, he proposes a depth first search of the phylogeny. A node is classified as a cluster if the median pairwise patristic distance (MPPD) between the nodes of its subtree is below a threshold. Since phylogenetic trees often come with some measurable uncertainty, he further suggests that the node should have high reliability (bootstrap support, SH test, …).

So let’s do it. Fast and dirty. In R. [The code is available on Figshare, along with sample data.]

We’ll need the ape and geiger packages to work with the phylogenies, and igraph makes it easy to get the nodes in depth first search order.

1
2
3

require(ape, quietly=TRUE)
require(geiger, quietly=TRUE)
require(igraph, quietly=TRUE)

Notice that the algorithm thresholds based on the MPPD, and that identifying a good threshold may require some tuning. We want the MPPD for the subtree of each internal node. Sapply over the vector of nodes? Hardly efficient, completely ignores the tree structure, but fast to code correctly. I did mention fast and dirty, didn’t I? Which also excuses dumping the results into the global environment rather than using encapsulation. If the tree has more than ~ 150 nodes, it is well worth precomputing this vector.

Prosperi suggests the MPPD of ALL nodes in the subtree, but one could alternately consider only leaf nodes.

## Given
##   a node, tree, and distance matrix
## Return
##   median pairwise patristic distance (MPPD) of its leaves
get.node.leaf.MPPD <- function(node,tree,distmat){
  nlist <- node.leaves(tree,node)
  foo <- distmat[nlist,nlist]
  return(median(foo[upper.tri(foo,diag=FALSE)]))
}
 
## Given
##   a node, tree, and distance matrix
## Return
##   median pairwise patristic distance (MPPD) of all of its decendants
get.node.full.MPPD <- function(node,tree,distmat){
  nlist <- node.leaves(tree,node)
  elist <- tree$edge[which.edge(tree,nlist),2]
  foo <- distmat[elist,elist]
  return(median(foo[upper.tri(foo,diag=FALSE)]))
}

and a helper function to call them:

## Given
##   a tree and (optionally) a distance matrix
## Return
##   a vector giving the median pairwise
##   patristic distance of the subtree under
##   each internal node
## SLOW!! May be a good idea to save/cache results
pdist.clusttree <- function(tree,distmat=NULL,mode=c('leaf','all')){
  mode <- match.arg(mode)
  if(is.null(distmat)){
    if(mode=='leaf'){ distmat <-  cophenetic.phylo(tree) }
    else{ distmat <-  dist.nodes(tree) }
  }
  ntips<- Ntip(tree)
  nint <- tree$Nnode ## number of internal nodes
  if(mode=='leaf'){
    return(sapply((ntips+1):(ntips+nint),get.node.leaf.MPPD,tree,distmat))
  }
  else{
    return(sapply((ntips+1):(ntips+nint),get.node.full.MPPD,tree,distmat))
  }
}

The actual clustering is done as follows:

## Given
##  a tree and a threshold, and (optionally)
##  a vector of the MPPD of the subtree at each internal node, 
##  the reliability threshold and vector,
##  and specify if you want the membership for all nodes or
##  only the tips
## Return
##  a vector indicating, for each internal node, which
##  cluster it belongs to
prosperi.cluster <- function(tree,thresh,distvec=NULL,
                             rthresh=NULL,reliabilityvec=NULL,
                             retval=c("tips","all")){
  ## take care of optional arguments
  retval <- match.arg(retval)
  if(is.null(distvec)){
    cat("Calculating MPPD for each node ...\n")
    distvec <- pdist.clusttree(tree,mode='all')
  }
  if(is.null(rthresh) || is.null(reliabilityvec)){
    reliabilityvec=NULL
    cat("No reliability thresholding.\n")
  }
  ## set up clustering
  ntips<-Ntip(tree)
  cnum <- 0 ## cluster number
  assign <- rep(0,ntips+tree$Nnode) ## cluster assignment
  igraph.tree <- graph.edgelist(tree$edge) ## tree in igraph form
  dfs <- graph.dfs(igraph.tree,root=ntips+1,neimode='out',
                   order=TRUE,dist=TRUE)
  ## travese the tree in depth first order
  for(i in 1:length(dfs$order)){
    node <- dfs$order[i]
    ## skip leaves
    if(node < ntips+1){ next }
    ## skip unreliable nodes (if reliability measure is available)
    if(! is.null(reliabilityvec) &&
       reliabilityvec[node-ntips] >= rthresh){ next }
    ## If the node's subtree is below the threshold, mark it and
    ## its subtree as members of a new cluster
    if(distvec[node-ntips]<=thresh && assign[node]<=0){
      cnum <- cnum+1
      subtree <- graph.dfs(igraph.tree,node,
                           neimode='out',unreachable=FALSE)$order
      subtree <- subtree[! is.nan(subtree)]
      assign[subtree] <- cnum
    }}
  ans <- list(membership=assign,allcsize=table(assign),
              leafclustsize=table(assign[1:ntips]),
              ntips=ntips,threshold=thresh)
  if(retval=="tips"){ans$membership <- ans$membership[1:ntips]}
  class(ans) <- c(class(ans),'p.cluster')
  return(ans)
}

And what is the point of it all without a colorful graphic?

testplot <- function(testtree,thresh,plotfile=NULL){
  graphics.off()
  clustering <- prosperi.cluster(testtree,thresh)$membership
  if(class(plotfile)=="character"){ png(plotfile) }
  plot(testtree,show.tip.label=FALSE)
  tiplabels(rep("  ",Ntip(testtree)),
            bg=clustering)  # better colors if you use RColorBrewer
  if(class(plotfile)=="character"){ dev.off() }
}

To test it out, we need some data. You can grab a nice tree from the figshare site if you don’t have your own handy.

108	samp <- read.tree('sampbalanced.tre')

where to threshold?

109
110
111

quantile(dist.nodes(samp),seq(0.1,0.5,by=0.05))
##      10%       15%       20%       25%       30%       35%       40%       45% 
##0.4643290 0.5712468 0.6980250 0.8462455 1.0000370 1.1277150 1.2418286 1.3255760

suggests that 0.46 would restrict clusters to tips which are more closely related than those from 90% of randomly selected individuals. If one wants to be generous, a threshold of 1 clusters individuals more similar than 70% of the population. Here is what it looks like:

112
113

testplot(samp,0.46)
testplot(samp,1.0)

Clustering phylogenetic trees based on median patristic distances less than 10% (left) or 30% (right) of the entire tree

In the example so far, the tree is small enough that we did not need to precompute the MPPD vector. For a larger tree, however, you will want to do so. Assume (for kicks) that the tree also has the bootstrap support value of each internal node stored as a node label for the tree.

bigtree <- read.tree("aBigTree.tre")
bigtree.mppd <- pdist.clusttree(bigtree,mode="all")
bigtree.support <- as.numeric(bigtree$node.label)
bigtree.support[is.na(bigtree.support)] <- 0 
clustering <- prosperi.cluster(bigtree,0.5,distvec=bigtree.mppd,
                               rthresh=0.90,reliabilityvec=bigtree.support)

Analyze the clustering as you like (perhaps by making a table of the membership values, perhaps by plotting) to see if the threshold is reasonable.

Source paper for the algorithm:
Prosperi MC, Ciccozzi M, Fanti I, Saladini F, Pecorari M, Borghi V, Di Giambenedetto S, Bruzzone B, Capetti A, Vivarelli A, Rusconi S, Re MC, Gismondo MR, Sighinolfi L, Gray RR, Salemi M, Zazzi M, De Luca A, & ARCA collaborative group (2011). A novel methodology for large-scale phylogeny partition. Nature communications, 2 PMID: 21610724

Abstract

Understanding the determinants of virus transmission is a fundamental step for effective design of screening and intervention strategies to control viral epidemics. Phylogenetic analysis can be a valid approach for the identification of transmission chains, and very-large data sets can be analysed through parallel computation. Here we propose and validate a new methodology for the partition of large-scale phylogenies and the inference of transmission clusters. This approach, on the basis of a depth-first search algorithm, conjugates the evaluation of node reliability, tree topology and patristic distance analysis. The method has been applied to identify transmission clusters of a phylogeny of 11,541 human immunodeficiency virus-1 subtype B pol gene sequences from a large Italian cohort. Molecular transmission chains were characterized by means of different clinical/demographic factors, such as the interaction between male homosexuals and male heterosexuals. Our method takes an advantage of a flexible notion of transmission cluster and can become a general framework to analyse other epidemics.

To leave a comment for the author, please follow the link and comment on their blog: Pirate Science » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.