Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It is very common to cluster genes based on their expression profiles, and also very common to integrate Gene Ontology to observe the distribution of biological processes, molecular functions and cellular components for a given gene list. But, what if the two in combination? The Gene Ontology distributions across a variety of gene clusters may give us a new insight to find out which specific GO terms may related to our biological problem.
I was inspired by the MCP paper which was also mentioned in my previous post, and developed an R function to get this job done.
Here comes the codes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | ClusterProfiles <- function(geneClusters, onto="CC", level=3, orgPackage="org.Hs.eg.db") { require(goProfiles) require(plyr) require(ggplot2) clusterProfile <- llply(geneClusters, as.data.frame(basicProfile), onto=onto, level=level, orgPackage = orgPackage) clusterProfile.df <- ldply(clusterProfile, rbind) colnames(clusterProfile.df) <- c("Cluster", "Description", "GOID", "Frequency") clusterProfile.df <- clusterProfile.df[clusterProfile.df$Frequency !=0,] clusterProfile.df$Description <- as.character(clusterProfile.df$Description) ## un-factor clusterProfile.df <- ddply(clusterProfile.df, .(Description), transform, Percent = Frequency/sum(Frequency), Total = sum(Frequency)) x <- mdply(clusterProfile.df[, c("Description", "Total")], paste, sep=" (") y <- sapply(x[,3], paste, ")", sep="") clusterProfile.df$Description <- y ### label GO Description with gene counts. clusterProfile.df <- clusterProfile.df[, -6] ###drop the *Total* column## mtitle <- paste(onto, "Ontology Distribution", sep = " ") p <- ggplot(clusterProfile.df, aes(x = Cluster, y = Description, size = Percent)) p <- p + geom_point(colour="steelblue") + opts(title = mtitle) + xlab("") + ylab("") p <- p + opts(axis.text.x = theme_text(colour="black", size="11", vjust = 1)) p <- p + opts(axis.text.y = theme_text(colour="black", size="11", hjust = 1)) result <- list(data=clusterProfile.df, p=p) return(result) } |
The input *geneClusters* is a list of clusters which contain gene IDs.
Other parameters can refer to the reference manual of Bioconductor package goProfiles.
I post an example below to illustrate how to use it.
> names(geneClusters) [1] "A" "B" "C" "D" > geneClusters[1] $A [1] 3838 29766 51070 483 56667 5573 8971 10755 389 8531 55905 3024 7169 1595 387 > clust.go.prof = ClusterProfiles(geneClusters) > head(clust.go.prof$data) Cluster Description GOID Frequency Percent 1 B apical part of cell (4) GO:0045177 1 0.2500000 2 C apical part of cell (4) GO:0045177 1 0.2500000 3 D apical part of cell (4) GO:0045177 2 0.5000000 4 C cell body (3) GO:0044297 2 0.6666667 5 D cell body (3) GO:0044297 1 0.3333333 6 C cell division site (1) GO:0032153 1 1.0000000 > clust.go.prof$p
The function return a list which contain the cluster profiles annotated with GO which may be useful for further analysis, and a graph which can plot the GO distributions as shown below.
The dot sizes was based on the percentage of frequency in each row.
I plan to develop the version 2 of this function to let user specify which GO categories they are interested in looking at.
Any comments or suggestions are welcomed.
p.s: My thanks goes to Tal Galili for inviting me to joint R-bloggers. I am very happy to see my post appeared in R-bloggers.
Related Posts
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.