
Phylogenies in R and Python

[This article was first published on Climate Change Ecology » R, and kindly contributed to R-bloggers].

One of the reasons I switched to Python from R is that Python’s phylogenetic capabilities are very well developed, though R is catching up. I’m moving into phylogenetic community ecology, which requires a lot of tree manipulation and calculation of metrics and not so much actual tree construction. Python is excellent at these things and has an excellent module called ETE2. R has a few excellent packages as well, including ape and picante.

I can’t compare and contrast all of the features of R and Python’s phylogenetic capabilities. But since I like making pretty pictures, I thought I’d demonstrate how to plot in both R and Python. Making a basic plot is pretty simple in both languages. More complex plots are, well, more complex. I find ETE2 more fully featured and flexible, but it has a pretty steep learning curve. Once you get the hang of it, though, there is nothing you can’t do. More or less.

R’s phylogenetic plotting capabilities are good, but limited when it comes to displaying quantitative data alongside the tree. For example, it’s relatively easy to make a phylogeny where native and introduced species have different colors:


require(picante)

# species cover
fullSpData <- read.csv("~/Documents/FIU/Research/Invasion_TraitPhylo/Data/master_sp_data.csv")
# phylogeny
SERCphylo <- read.tree('/Users/Nate/Documents/FIU/Research/SERC_Phylo/SERC_Nov1-2013.newick.tre')
# traits
plantTraits <- read.csv("~/Documents/FIU/Research/Invasion_TraitPhylo/Data/plantTraits.csv")

# Put an underscore in the species names to match with the phylogeny
plantTraits$species <- gsub(' ', '_', plantTraits$species)

#Isolate complete cases of traits
traits <- subset(plantTraits, select = c('species', 'woody', 'introduced', 'SLA', 'seedMass', 'toughness'))
traits <- traits[complete.cases(traits), ]

# Make a phylogeny of species for which traits are present
drops <- SERCphylo$tip.label[!(SERCphylo$tip.label %in% traits$species)]
cleanPhylo <- drop.tip(SERCphylo, drops)

# merge the species with the traits, in the order that they appear in the phylogeny
plotTips <- data.frame('species' = cleanPhylo$tip.label)
plotCols <- merge(plotTips, traits[, c('species', 'introduced', 'SLA', 'toughness')], sort = FALSE)
# make a black/red container
tCols <- c('black', 'red')
# plot the phylogeny, coloring the label black for natives, red for introduced
pT <- plot(cleanPhylo,
           show.tip.label = TRUE,
           cex = 1,
           no.margin = TRUE,
           tip.color = tCols[plotCols$introduced + 1],
           label.offset = 2)
# put a circle at the tip of each leaf
tiplabels(cex = 0.1, pie = plotCols$introduced, piecol = c('red', 'black'))

Basic R phylogeny

It’s also relatively easy to display trait data alongside the tree using two other packages, but then you lose the ability to color species differently and, in all honesty, to customize the phylogeny in any way.


require(adephylo)
require(phylobase)
sercDat <- phylo4d(cleanPhylo, plotCols)
table.phylo4d(sercDat)

 

 

Python, on the other hand, can do this all in the ETE2 module. The learning curve is a bit steeper, but in all honesty, once you get it down it’s easy and flexible. For example, here’s how to make the first graph above:


import ete2 as ete
import pandas as pd

# load data
traits = pd.read_csv('/Users/Nate/Documents/FIU/Research/Invasion_TraitPhylo/Data/plantTraits.csv')
SERCphylo = ete.Tree('/Users/Nate/Documents/FIU/Research/SERC_Phylo/SERC_Nov1-2013.newick.tre')

#### TRAIT CLEANUP ####
# put an underscore in trait species
traits['species'] = traits['species'].map(lambda x: x.replace(' ', '_'))
# pull out the relevant traits and only keep complete cases
traits = traits[['species', 'introduced', 'woody', 'SLA', 'seedMass', 'toughness']]
traits = traits.dropna()

# next, prune down the traits data
traitsPrune = traits[traits['species'].isin(SERCphylo.get_leaf_names())]

# prune the phylogeny so only species with traits are kept
SERCphylo.prune(traitsPrune['species'], preserve_branch_length = True)

# basic phylogenetic plot
SERCphylo.show()
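The matching step is the same idea in both languages: figure out which tips also appear in the trait table, and which don't. Here is a minimal sketch in plain Python. The species names are toy data I made up for illustration, not the SERC tree, and the variable names are my own.

```python
# Toy data: made-up tip labels and trait rows, for illustration only
tree_tips = ["Acer_rubrum", "Quercus_alba", "Rosa_multiflora", "Pinus_taeda"]
trait_species = ["Quercus_alba", "Rosa_multiflora", "Lonicera_japonica"]

trait_set = set(trait_species)

# Tips to keep: present in the tree AND in the trait table (tree order preserved)
keep = [tip for tip in tree_tips if tip in trait_set]

# Tips to drop: present in the tree but missing traits
# (this is the list that drop.tip receives in the R version)
drops = [tip for tip in tree_tips if tip not in trait_set]
```

In the R code you hand the `drops` list to drop.tip, while in ETE2 you hand the `keep` list to prune. Same result, opposite direction.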

You can use dictionaries to make a couple of guides that retain the trait info for each species:


# guide for color
cols = [['black', 'red'][x] for x in traitsPrune['introduced']]
colorGuide = dict(zip(traitsPrune['species'], cols))
# weights (scaled to 1)
slaGuide = dict(zip(traitsPrune['species'], traitsPrune['SLA']/traitsPrune['SLA'].max()))
toughGuide = dict(zip(traitsPrune['species'], traitsPrune['toughness']/traitsPrune['toughness'].max()))
seedGuide = dict(zip(traitsPrune['species'], traitsPrune['seedMass']/traitsPrune['seedMass'].max()))
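If the dictionary guides are hard to picture, here is a small sketch of the same idea without pandas. The helper `scaled_guide` and the toy values are hypothetical, just to show what ends up inside each guide.

```python
def scaled_guide(species, values):
    """Map each species to its value divided by the largest value (so the max is 1)."""
    biggest = max(values)
    return {sp: v / biggest for sp, v in zip(species, values)}

# Toy data: made-up species and trait values, for illustration only
species = ["Quercus_alba", "Rosa_multiflora"]
sla = [10.0, 25.0]
introduced = [0, 1]  # 0 = native, 1 = introduced

sla_guide = scaled_guide(species, sla)

# 0 picks 'black', 1 picks 'red', the same indexing trick as in the code above
color_guide = {sp: ["black", "red"][flag] for sp, flag in zip(species, introduced)}
```

Each guide is then a simple lookup by tip label, which is exactly what the layout function needs later on.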

Next, you can use node styles to set the basic tree appearance. For example, ETE2 uses thin lines and puts a circle at every node (i.e. split) by default. We can use the traverse function, which just goes through every single node, and set every node to the same style:


# set the base style of the phylogeny with thick lines
for n in SERCphylo.traverse():
    style = ete.NodeStyle()
    style['hz_line_width'] = 2
    style['vt_line_width'] = 2
    style['size'] = 0
    n.set_style(style)

This code just says “go through every node, make a default style, but change the line width to 2 and the circle size to 0”. The result is that every node has thicker lines and we’ve removed the circle.
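To make the "go through every node" part concrete, here is a toy tree class with its own traverse method. This is only a sketch of the concept, not ETE2's implementation. The `Node` class is hypothetical, and it visits nodes parent-first, which may not be the order ETE2 uses.

```python
class Node(object):
    """A bare-bones tree node, for illustration only (not ETE2's class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.style = {}

    def is_leaf(self):
        return not self.children

    def traverse(self):
        """Yield this node, then every node below it."""
        yield self
        for child in self.children:
            for node in child.traverse():
                yield node

# A tiny tree: root -> (A, inner -> (B, C))
tree = Node("root", [Node("A"), Node("inner", [Node("B"), Node("C")])])

# Same pattern as above: visit every node and give it the same style
for n in tree.traverse():
    n.style = {"hz_line_width": 2, "vt_line_width": 2, "size": 0}

visited = [n.name for n in tree.traverse()]
leaves = [n.name for n in tree.traverse() if n.is_leaf()]
```

The point is that internal nodes and leaves all get visited once, so a single loop can restyle the whole tree.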

We can go through only the final nodes (the leaves) and tell it to strip the underscore out of the species name, paste the name on the end of the branch in italics, and give it the color specified in the dictionary above (red if introduced, black if native):


def mylayout(node):
    # If node is a leaf, split the name and paste it back together to remove the underscore
    if node.is_leaf():
        temp = node.name.split('_')
        sp = temp[0] + ' ' + temp[1]
        temp2 = ete.faces.TextFace(sp, fgcolor = colorGuide[node.name], fsize = 18, fstyle = 'italic')
        # attach the text face to the leaf; without this call the name is never drawn
        ete.faces.add_face_to_node(temp2, node, column=0)
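One caveat on the name cleanup: `temp[0] + ' ' + temp[1]` assumes every tip label has the form Genus_species. A label with no underscore raises an IndexError, and anything after a second underscore is silently dropped. If your labels are messier than mine, a more forgiving version might look like this (`pretty_name` is a hypothetical helper, not part of ETE2):

```python
def pretty_name(tip_label):
    """Turn a tip label such as 'Acer_rubrum' into 'Acer rubrum'."""
    # replace() handles zero, one, or many underscores without indexing
    return tip_label.replace('_', ' ')

two_part = pretty_name("Acer_rubrum")
genus_only = pretty_name("Acer")
with_variety = pretty_name("Acer_rubrum_var_drummondii")
```

You could then pass `pretty_name(node.name)` to the TextFace instead of `sp`.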

Then, use the tree style to make a couple of stylistic changes, telling it to apply the layout function, add in some extra spacing between the tips so the phylogeny is readable, and save the image:


ts = ete.TreeStyle()
ts.mode = 'r'
ts.show_leaf_name = False
ts.layout_fn = mylayout
ts.branch_vertical_margin = 4
#ts.force_topology = True
ts.show_scale = False

SERCphylo.render("Python_base.png", w = 1500, units="px", tree_style = ts)

It took a bit more work than R to get this far, but now comes the awesome part. We’ve already got a function telling Python to paste a red species name at the end of the branches. We can add in more features (say, a circle that’s scaled by a trait value) by simply adding them to the function. Most of the work is already done. We change the function to:


def mylayout(node):
    # If node is a leaf, split the name and paste it back together to remove the underscore
    if node.is_leaf():
        # species name
        temp = node.name.split('_')
        sp = temp[0] + ' ' + temp[1]
        temp2 = ete.faces.TextFace(sp, fgcolor = colorGuide[node.name], fsize = 18, fstyle = 'italic')
        ete.faces.add_face_to_node(temp2, node, column=0)
        # make a circle for SLA, weighted by SLA values
        sla = ete.CircleFace(radius = slaGuide[node.name]*15, color = colorGuide[node.name], style = 'circle')
        sla.margin_left = 10
        sla.hz_align = 1
        ete.faces.add_face_to_node(sla, node, column = 0, position = 'aligned')
        # same with toughness
        toughness = ete.CircleFace(radius = toughGuide[node.name]*15, color = colorGuide[node.name], style = 'circle')
        toughness.margin_left = 40
        toughness.hz_align = 1
        ete.faces.add_face_to_node(toughness, node, column = 1, position = 'aligned')

The confusing part is that you first have to make a ‘face’ (ete.CircleFace), giving it a radius proportional to the species trait value and color based on its introduced status. Then, we use the margin property (sla.margin_left) to give it some space away from the other objects. Next, use the align property to make it centered (sla.hz_align = 1). The final call is just telling it to actually add the ‘face’, which column to put it in, and where to put it (see the ETE2 tutorial for a guide). Aligned tells it to put it offset from the branch tip so that all circles are in the same spot (rather than being directly at the end of the branch, which could vary). Column just tells it where to put it, once it’s in the aligned position. So now there’s a phylogeny with quantitative trait data, still colored properly. And this is a simple example. The graphs can get much better, depending on what you want to do.
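A quick note on the circle sizes. In the layout function the radius is the scaled trait value times 15, so a species with half the maximum SLA gets half the radius (and a quarter of the area). If you would rather have the area track the trait value, you can take a square root first. Here is a sketch of both options. The function names are my own, and the 15 is just the multiplier I picked above.

```python
import math

MAX_RADIUS = 15  # same multiplier used in mylayout above

def radius_linear(scaled_value):
    """Radius proportional to the scaled trait value (what mylayout does)."""
    return scaled_value * MAX_RADIUS

def radius_by_area(scaled_value):
    """Radius chosen so the circle's AREA is proportional to the scaled value."""
    return math.sqrt(scaled_value) * MAX_RADIUS

half_linear = radius_linear(0.5)      # half the trait value, half the radius
quarter_area = radius_by_area(0.25)   # quarter of the trait value, half the radius
```

Either one can be dropped into the `radius =` argument of CircleFace. Which is better depends on how you want readers to compare the circles.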

It took me several hours to get this far, because the language is pretty hard to wrap your head around at first. But once you get it, it opens up all kinds of possibilities.

 

 

