Graphical Display of R Package Dependencies
In some work I am currently involved in, we have to decide which GUI engine to use. As an obvious starting point, we decided to look at what other people are using in their packages. While CRAN helpfully lists all the available R packages, it doesn't (as far as I can tell) give a nice summary of the package dependencies. After clicking on a few dozen packages and examining their dependencies, I decided that a quick script was in order.
General idea
- Scrape the package names from the main CRAN packages page.
- For each package, scrape the associated web page and retrieve its dependencies.
For example, ADaCGH has a large number of packages under the “DEPENDS” section.
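To make the idea concrete, here is a minimal sketch of that second step for a single package. It assumes the CRAN page layout at the time of writing, where the dependencies sit in a table row labelled "Depends:"; the full script at the end of the post does this properly.

```r
## Minimal sketch: fetch the CRAN page for one package and pull out the
## raw HTML fragment listing its dependencies.
pkg = "ADaCGH"
url = paste("http://cran.r-project.org/web/packages", pkg, "index.html", sep = "/")
page = paste(readLines(url), collapse = "")

## Keep everything after the "Depends:" cell, then drop the rest of the row
depends_html = gsub(".*<td valign=top>Depends:</td><td>", "", page)
depends_html = gsub("</td></tr>.*", "", depends_html)
depends_html
```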
Pre-processing
To make life easier, I made a few simplifications to the data:
- any dependencies on R, MASS, stats, methods and utils were removed when plotting;
- I removed any Bioconductor and Omegahat packages;
- version numbers in the DEPENDS section were ignored.
It should be stressed that I'm only picking up what is listed in the DEPENDS section. For example, suppose a package depends on both "ggplot2" and "plyr". Since "ggplot2" itself depends on "plyr", the package author may only list "ggplot2".
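As a rough sketch, the pre-processing amounts to something like the following (the `deps` vector here is a made-up example, not taken from a real package):

```r
## Hypothetical dependency list for one package
deps = c("R (>= 2.10)", "MASS", "ggplot2 (>= 0.8.8)", "plyr", "stats")

## Strip version requirements such as "(>= 2.10)"
deps = gsub(" *\\(.*\\)", "", deps)

## Drop the packages excluded from the plots
ignored = c("R", "MASS", "stats", "methods", "utils")
deps[!deps %in% ignored]
#> [1] "ggplot2" "plyr"
```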
Results
The top six packages based on the DEPENDS section are:
- lattice – 165 times
- survival – 107
- mvtnorm – 103
- tcltk – 76
- graphics – 76
- grid – 60
You could argue that I should also remove "graphics" by the same arbitrary criterion I used when removing "MASS". The total number of packages referred to in the DEPENDS section is 782 (out of a possible 3000 packages). The following graph plots each package name against the number of times it appears in the DEPENDS section of another package. There is a clear exponential decay, highlighting a few key packages.
In fact, the top 40 packages account for 50% of all dependencies, and that's after the dependencies on R, utils, methods, etc. were removed.
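Given the edge list built by the script at the end of this post (a data frame `dep_df` with `from` and `to` columns), the counts and the cumulative share can be computed along these lines (a sketch, assuming that data frame):

```r
## Count how often each package appears as a dependency of another package
dep_counts = sort(table(dep_df$to), decreasing = TRUE)
head(dep_counts)   # lattice, survival, mvtnorm, ...

## Cumulative share of all dependencies covered by the most-used packages
cum_share = cumsum(dep_counts) / sum(dep_counts)
cum_share[40]      # the top 40 packages cover about half of all dependencies
```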
I also constructed a graphical network using Cytoscape. However, it's quite large (~2MB), so you can download the network separately. To construct this network, I only used packages that had three or more dependencies, and there were a dozen or so smaller disconnected graphs that I pruned.
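As a sketch, that filtering amounts to keeping packages that declare three or more dependencies. The tab-delimited edge file written below is just one of the formats Cytoscape can import, and the pruning of the small disconnected graphs isn't shown.

```r
## Keep only packages that declare three or more dependencies
n_deps = table(dep_df$from)
keep   = names(n_deps)[n_deps >= 3]
edges  = dep_df[dep_df$from %in% keep, ]

## Write a tab-delimited edge file for import into Cytoscape
write.table(edges, "cran_depends_edges.txt", sep = "\t",
            quote = FALSE, row.names = FALSE)
```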
R Details
- To scrape the web pages I used regular expressions. Yes, I know you shouldn't use regular expressions to parse HTML and should use a proper HTML parser, but
- the web pages were all well formed, since they were generated from each package's DESCRIPTION file;
- I needed practice with regular expressions.
- The R code is at the end of this post.
- You can download a csv file of the edge list from here.
require("stringr") #################### ## Get dependencies #################### getDependencies = function(pkg_name) { url_st = "http://cran.r-project.org/web/packages" url_end = "index.html" url = paste(url_st, pkg_name, url_end, sep="/") cran_web = paste(readLines(url), collapse="") if(regexpr("<td valign=top>Depends:</td><td>", cran_web) == -1) return() ## Get the table hrefs = gsub('(.*<td valign=top>Depends:</td><td>)',"", cran_web) ## Clean the td & tr tags hrefs = gsub('</td></tr>.*',"", hrefs) ## Remove R from dependencies hrefs = gsub('R .*?<',"<", hrefs) ## Remove versions hrefs = gsub("\\(&[ge; 0-9\\.\\-]*)", "", hrefs) ## Remove Bioconductor hrefs = gsub("<a href=\"http://www.bioconductor.org/packages/release/[a-z]*/html/([a-zA-Z0-9\\.]*)\"><span class=\"BioC\">([A-Za-z0-9\\.]*)</span></a>", "", hrefs) ## Remove Omegahat hrefs = gsub("<a href=\"http://www.omegahat.org/[A-Za-z0-9]*\"><span class=\"Ohat\">[A-Za-z0-9]*</span></a>", "", hrefs) ## Get dependencies depends_on = gsub("<a href=\"../([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a>", "\\1", hrefs) ##Unlist and remove white space depends_on = strsplit(depends_on, ",")[[1]] depends_on = as.vector(sapply(depends_on, str_trim)) depends_on = depends_on[sapply(depends_on, nchar)>0] return(depends_on) } ########### #Main Page url = "http://cran.r-project.org/web/packages/" cran_web_page = paste(readLines(url), collapse="") main_table = gsub('.*<table summary="Available CRAN packages.">(.*)</table>.*', "\\1", cran_web_page) main_table = gsub('<tr id="available-packages-[A-Z]"/>', "", main_table) depends_on = gsub('<tr valign="top"><td><a href=\"../../web/packages/([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a></td><td>.*?</td></tr>', "\\1 ", main_table) cran_packages = unlist(strsplit(depends_on, " ")) from = vector("character", 10000) to = vector("character", 10000) j = 1 for(i in 1:length(cran_packages)) { dependencies = getDependencies(cran_packages[i]) cat(i, ":", dependencies, "\n") if(!is.null(dependencies) && length(dependencies) > 0) { l = length(dependencies) - 1 from[j:(j+l)] = cran_packages[i] to[j:(j+l)] = dependencies j = j + l + 1 } } dep_df = data.frame(from=from, to=to) dep_df = dep_df[1:j,]