
KEGG Module Enrichment Analysis

[This article was first published on R on Guangchuang Yu, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

KEGG MODULE is a collection of manually defined functional units, called KEGG modules and identified by the M numbers, used for annotation and biological interpretation of sequenced genomes. There are four types of KEGG modules:

  • pathway modules – representing tight functional units in KEGG metabolic pathway maps, such as M00002 (Glycolysis, core module involving three-carbon compounds)
  • structural complexes – often forming molecular machineries, such as M00072 (Oligosaccharyltransferase)
  • functional sets – for other types of essential sets, such as M00360 (Aminoacyl-tRNA synthases, prokaryotes)
  • signature modules – as markers of phenotypes, such as M00363 (EHEC pathogenicity signature, Shiga toxin)

KEGG modules have a much more straightforward interpretation in many situations, and a clusterProfiler user submitted a feature request for an enrichment test on them. Both the hypergeometric test and GSEA of KEGG modules are now supported in clusterProfiler. Just like KEGG pathway analysis, clusterProfiler accesses the latest online data and supports the more than 2000 species listed at http://www.genome.jp/kegg/catalog/org_list.html.

To avoid confusing new users who may not be familiar with KEGG, I created two new functions, enrichMKEGG and gseMKEGG, for enrichment tests of KEGG modules, and kept the original functions, enrichKEGG and gseKEGG, for KEGG pathway analysis only.

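library(clusterProfiler)

# geneList ships with the DOSE package: a named numeric vector of fold changes keyed by Entrez gene ID
data(geneList, package = "DOSE")
# use the first 100 gene IDs as the gene set of interest
de <- names(geneList)[1:100]
# hypergeometric (over-representation) test against human KEGG modules
xx <- enrichMKEGG(de, organism = 'hsa', minGSSize = 1)
head(summary(xx))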

##            ID
## M00693 M00693
## M00286 M00286
## M00067 M00067
## M00691 M00691
##                                                                                    Description
## M00693                                                            Cell cycle - G2/M transition
## M00286                                                                            GINS complex
## M00067 Sulfoglycolipids biosynthesis, ceramide/1-alkyl-2-acylglycerol => sulfatide/seminolipid
## M00691                                               DNA damage-induced cell cycle checkpoints
##        GeneRatio BgRatio       pvalue     p.adjust       qvalue
## M00693       3/8 10/1528 0.0000111304 5.565199e-05 1.171621e-05
## M00286       2/8  4/1528 0.0001432508 3.581269e-04 7.539514e-05
## M00067       1/8  2/1528 0.0104472034 1.741201e-02 3.665685e-03
## M00691       1/8  7/1528 0.0361484900 4.518561e-02 9.512761e-03
##              geneID Count
## M00693 9133/890/983     3
## M00286   9837/51659     2
## M00067         7368     1
## M00691         1111     1
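
As a sanity check on how to read this table, the pvalue column can be recomputed by hand with base R. The sketch below assumes the usual over-representation setup: GeneRatio is k/n (query genes in the module over query genes annotated to any module) and BgRatio is M/N (module size over all annotated background genes), so the p-value is the upper tail of a hypergeometric distribution. The helper name enrich_p is made up for illustration, and the Benjamini-Hochberg step over 5 tested modules is an assumption inferred from the printed p.adjust values, not something stated in the post.

```r
# Values read off the first two rows of the enrichMKEGG output above.
N <- 1528  # background genes annotated to any module (denominator of BgRatio)
n <- 8     # query genes annotated to any module (denominator of GeneRatio)

# Hypothetical helper: P(X >= k) when drawing n genes from N, of which M are in the module
enrich_p <- function(k, M, N, n) {
  phyper(k - 1, M, N - M, n, lower.tail = FALSE)
}

p_M00693 <- enrich_p(k = 3, M = 10, N = N, n = n)
p_M00286 <- enrich_p(k = 2, M = 4,  N = N, n = n)
print(c(M00693 = p_M00693, M00286 = p_M00286))

# Assumption: Benjamini-Hochberg correction over 5 tested modules (inferred from the output)
p_all <- c(0.0000111304, 0.0001432508, 0.0104472034, 0.0361484900)
print(p.adjust(p_all, method = "BH", n = 5))
```

If these assumptions hold, the printed values should agree with the pvalue and p.adjust columns up to rounding.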

yy <- gseMKEGG(geneList)

## [1] "calculating observed enrichment scores..."
## [1] "calculating permutation scores..."
## [1] "calculating p values..."
## [1] "done..."

head(summary(yy))

##            ID                     Description setSize enrichmentScore
## M00337 M00337                Immunoproteasome      15       0.7583644
## M00340 M00340   Proteasome, 20S core particle      13       0.7935026
## M00354 M00354 Spliceosome, U4/U6.U5 tri-snRNP      29       0.6053503
##             NES      pvalue   p.adjust    qvalues
## M00337 2.063359 0.002298851 0.03675214 0.02968961
## M00340 2.047060 0.002409639 0.03675214 0.02968961
## M00354 1.913834 0.002564103 0.03675214 0.02968961
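
For readers new to GSEA, the enrichmentScore column is a running-sum statistic over the ranked gene list. The toy sketch below follows the standard weighted running-sum definition of GSEA and is only meant to show the idea; it is not the code clusterProfiler runs internally. The gene names, fold changes, and the function enrichment_score are all invented for illustration. NES and the p-values additionally require permutations, which are omitted here.

```r
# Toy ranked list (hypothetical gene IDs and fold changes), sorted in decreasing order
ranked <- c(g1 = 3.2, g2 = 2.5, g3 = 1.9, g4 = 0.7,
            g5 = -0.4, g6 = -1.1, g7 = -2.0, g8 = -2.8)
gene_set <- c("g1", "g2", "g4")  # hypothetical module members

# Illustrative enrichment score: walk down the ranked list, stepping up at
# set members (weighted by |fold change|^exponent) and down at non-members
enrichment_score <- function(ranked, gene_set, exponent = 1) {
  in_set    <- names(ranked) %in% gene_set
  weights   <- abs(ranked)^exponent
  step_hit  <- ifelse(in_set, weights / sum(weights[in_set]), 0)
  step_miss <- ifelse(in_set, 0, 1 / sum(!in_set))
  running   <- cumsum(step_hit - step_miss)
  # the score is the running sum at its maximum deviation from zero
  unname(running[which.max(abs(running))])
}

es <- enrichment_score(ranked, gene_set)
print(es)
```

A positive score, as in the three modules above, means the set members are concentrated toward the top of the ranked list.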

Please refer to the clusterProfiler vignette for more details.

To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang Yu.
