iRefR – PPI data access from R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I use a lot of protein-protein interaction (PPI) data as biological networks represent the systems within which our genes and proteins of interest function. There are many sources of PPI data, including BioGRID and IntAct etc. A recent effort has emerged that attempts to pull all of these databases together to provide a single standardised access point to PPI data; iRefIndex. Currently they integrate data from 13 different PPI databases. Why are there 13 in the first place? Because each has their own biological area of interest or specific criteria for curation (manual vs text-mined etc). This post is an opportunity for me to have a play with the R package for iRefIndex (iRefR).
The following code downloads the current version of iRefIndex for human (other species are available):
> library("iRefR")
> library(stringr)
> iref = get_irefindex(tax_id="9606",iref_version="12.0",data_folder="/path/to/save/iRefIndex")
The resulting object is a dirty great 250Mb data frame in MITAB format, or PSI-MITAB2.6 to be exact, the format specs of which can be found here. This seems a little unwieldily to hold in memory to me, so eventually i’ll subset it to something more manageable;
> print(object.size(iref),units="Mb")
248.8 Mb
First though, some summary stats on the info contained within. The data contains a total of 533,551 interactions, 248,215 of which are unique;
> dim(iref)
[1] 533551 54
Duplications in the data arise when an interaction is taken from multiple source databases or papers etc. The irigrid column contains a unique identifier where any given pair of interactants will always have the same identifier.
> length(unique(iref$irigid))
[1] 248215
iRefIndex counts as human any interaction where just one member is a human protein; e.g. if a viral protein interacts with a human protein this is counted as human. Here i’m just interested in human-human interactions, so;
> human_human_list = data.frame(iref$taxa,iref$taxb)
> tmp = do.call(`paste`, c(unname(human_human_list), list(sep=".")))
> iref_human = iref[tmp == "taxid:9606(Homo sapiens).taxid:9606(Homo sapiens)" | tmp == "-.taxid:9606(Homo sapiens)",]
> dim(iref_human)
[1] 489796 54
> length(unique(iref_human$irigid))
[1] 220435
To subset the data frame to a more memory friendly set of data i’m going to keep just a two column data frame of protein names for each unique interaction. As I’m going to plot some graphs later I also want to keep the biologist friendly HUGO name for each protein, which involves a bit of fiddling as this is not a field in its own right within MITAB, enter stringr:
> mA = str_locate(iref_human$aliasA, perl("hgnc:.*?\\|"))
> hugoA = str_sub(iref_human$aliasA,mA[,1]+5,mA[,2]-1)
> mB = str_locate(iref_human$aliasB, perl("hgnc:.*?\\|"))
> hugoB = str_sub(iref_human$aliasB,mB[,1]+5,mB[,2]-1)
> x = data.frame(iref_human$X.uidA,iref_human$uidB,hugoA,hugoB,iref_human$irigid)
> colnames(x) = c("uidA","uidB","hugoA","hugoB","irigid")
> dim(x)
[1] 489796 4
> head(x)
uidA uidB hugoA hugoB irigid
1 uniprotkb:O75554 uniprotkb:Q07666 WBP4 KHDRBS1 1139443
2 uniprotkb:Q13425 uniprotkb:B7Z6M3 SNTB2 DGKZ 1650951
3 uniprotkb:O75554 uniprotkb:Q8N684-3 WBP4 CPSF7 668917
4 uniprotkb:P60468 uniprotkb:Q9H0F7 SEC61B ARL6 658508
5 uniprotkb:O75554 uniprotkb:Q15233-2 WBP4 NONO 1338471
6 uniprotkb:O95816 uniprotkb:Q66LE6 BAG2 PPP2R2D 1478060
> length(unique(x$irigid))
[1] 220435
> print(object.size(x),units="Mb")
15.5 Mb
Now I want to build a PPI network from this data using iGraph. First, we hack together a data frame of node annotations, in this case the hugo name associated with each unique protein identifier. We then combine this with the interaction data into an igraph object:
> v = unique(data.frame(c(as.character(x$uidA),as.character(x$uidB)),c(as.character(x$hugoA),as.character(x$hugoB))))
> colnames(v) = c("uid","hugo")
> ppi.graph = graph.data.frame(x[,c(1:2,5)],vertices=v,directed=F)
> ppi.graph
IGRAPH UN-- 31476 489796 --
+ attr: name (v/c), hugo (v/c), irigid (e/n)
Our graph (or network) has 31,476 nodes (proteins) and 489,796 edges (or interactions). Each node has two annotations; the uid in “name” and the hugo symbol. Each edge has one annotation, the irigid.
Now, say we do an experiment and identify a bunch of genes/proteins and we want to see if they interact with each other at the protein level. For example this might be the output of a gene expression experiment or all genes genetically associated with a disease. Here, for simplicity, i’ll just use a random selection of proteins from the network.
> myGenes = sample(as.character(v$uid),10)
> myGenes = myGenes[!is.na(myGenes)]
We can now extract a subgraph containing our experimentally identified proteins and those that connect them together. In the following code we first get all of the neighbours (adjoining nodes) of our genes from the main graph. We define order = 1 to specify that we will allow a distance of 1 interaction from our genes. We then subset the edges from the main graph to those where both nodes are in our index of neighbours. This gives us just the interactions between our genes and their immediate neighbours. We then build a graph in the same manner as above, using the edge list and a vertice meta-data data-frame.
> order = 1
> edges = get.edges(ppi.graph, 1:(ecount(ppi.graph)))
> neighbours.vid = unique(unlist(neighborhood(ppi.graph,order,which(V(ppi.graph)$name %in% myGenes))))
> rel.vid = edges[intersect(which(edges[,1] %in% neighbours.vid), which(edges[,2] %in% neighbours.vid)),]
> neighbour.names = data.frame(V(ppi.graph)[neighbours.vid]$name,V(ppi.graph)[neighbours.vid]$hugo, stringsAsFactors=F)
> names(neighbour.names) = c("name","hugo")
> rel = as.data.frame(cbind(V(ppi.graph)[rel.vid[,1]]$name, V(ppi.graph)[rel.vid[,2]]$name), stringsAsFactors=F)
> names(rel) = c("from","to")
> subgraph = graph.data.frame(rel, directed=F, vertices=neighbour.names)
> subgraph = simplify(subgraph)
> subgraph
IGRAPH UN-- 81 239 --
+ attr: name (v/c), hugo (v/c)
And finally, we plot the network with red nodes to indicate proteins from our experimentally identified list of genes and yellow for the rest.
ind = which(V(subgraph)$name %in% myGenes)
cols = rep("yellow",vcount(subgraph))
cols[ind] = "red"
plot(subgraph,layout=layout.fruchterman.reingold,vertex.size=5,vertex.color=cols,vertex.label=NA)
It’s at this point that the fun really starts from a biology point of view – we can mine our network for key hubs, overlay our favourite experimental data or compare to it networks for other sets of input genes, etc. The possibilities are endless, and I for one feel that all of our experimental data should be interpreted within the context of the interaction network. Of course there are limitations; our networks will likely never be complete as we only have PPI data for proteins that have been studied. But, having a human interactome with nigh on 500,000 interactions is a good start…
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.