
Comparing dependencies of popular machine learning packages with `pkgnet`

[This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers.]

When looking through the CRAN list of packages, I stumbled upon this little gem:

pkgnet is an R library designed for the analysis of R libraries! The goal of the package is to build a graph representation of a package and its dependencies.

And I thought it would be fun to play around with it. The little analysis I ended up doing was to compare dependencies of popular machine learning packages.


library(pkgnet)
library(tidygraph)
## 
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
## 
##     filter
library(ggraph)
## Loading required package: ggplot2
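
To build the graphs, I wrote a small function that does two things:

  1. create the package report with pkgnet::CreatePackageReport
  2. convert the edge (report$DependencyReporter$edges) and node (report$DependencyReporter$nodes) data into a graph object with tidygraph::as_tbl_graph (or, optionally, use the report$FunctionReporter data instead)
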
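# Build a tidygraph object for one package from its pkgnet report.
# DependencyReporter = TRUE uses the package dependency network,
# FALSE uses the function network of the package instead.
create_pkg_graph <- function(package_name, DependencyReporter = TRUE) {
  
  # run pkgnet on the package
  report <- CreatePackageReport(pkg_name = package_name)
  
  if (DependencyReporter) {
    # edges and nodes of the dependency network
    graph <- as_tbl_graph(report$DependencyReporter$edges,
                      directed = TRUE,
                      nodes = as.data.frame(report$DependencyReporter$nodes))
  } else {
    # edges and nodes of the function network
    graph <- as_tbl_graph(report$FunctionReporter$edges,
                      directed = TRUE,
                      nodes = as.data.frame(report$FunctionReporter$nodes))
  }
  
  return(graph)
}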
pkg_list <- c("caret", "h2o", "e1071", "mlr")

Note: I wanted to include other packages, like tensorflow, randomForest, gbm, etc., but for those, pkgnet threw an error:

Error in data.table::data.table(node = names(igraph::V(self$pkg_graph)), : column or argument 1 is NULL
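
One way to keep a longer package list from stopping at the first failure is to wrap the graph-building call in tryCatch. The following is only a sketch: safe_build and toy_builder are hypothetical names, and toy_builder is a stand-in that fails on purpose for a made-up package name, so the example runs without pkgnet. In practice, create_pkg_graph would be passed as the builder. This only skips failing packages; it does not fix the underlying error.

```r
# Hypothetical helper: call builder(pkg) and return NULL (with a message)
# instead of stopping when the builder throws an error.
safe_build <- function(pkg, builder) {
  tryCatch(
    builder(pkg),
    error = function(e) {
      message("skipping ", pkg, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for create_pkg_graph(); "brokenpkg" is a made-up name that
# fails on purpose to mimic the error above.
toy_builder <- function(pkg) {
  if (pkg == "brokenpkg") stop("column or argument 1 is NULL")
  paste("graph for", pkg)
}

candidates <- c("caret", "brokenpkg", "mlr")

# Named list with one entry per package; failed builds are NULL
graphs <- lapply(setNames(candidates, candidates), safe_build,
                 builder = toy_builder)

# Keep only the packages that could be built
graphs <- Filter(Negate(is.null), graphs)
names(graphs)
```

Collecting the results in a named list also avoids creating one variable per package; the list could then be combined with something like Reduce() and graph_join.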

for (pkg in pkg_list) {
  graph <- create_pkg_graph(pkg)
  assign(paste0("graph_", pkg), graph)
}
graph <- graph_caret %>% 
  graph_join(graph_h2o, by = "name") %>%
  graph_join(graph_e1071, by = "name") %>%
  graph_join(graph_mlr, by = "name") %>%
  mutate(color = ifelse(name %in% pkg_list, "a", "b"),
         centrality = centrality_degree(mode = "out"))
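
To make the join and the centrality measure a bit more concrete, here is a minimal sketch on two made-up edge lists (pkgA and pkgB are hypothetical names, not real pkgnet output). It shows that graph_join merges nodes with the same name and that centrality_degree simply counts outgoing or incoming edges per node. In this toy data the edges point from the dependent package to its dependency; which way the edges point in a pkgnet report is worth checking for the pkgnet version in use, since that decides whether mode = "out" or mode = "in" highlights the shared dependencies.

```r
library(tidygraph)

# Two toy edge lists: pkgA and pkgB both point at "utils",
# only pkgA points at "stats".
g1 <- as_tbl_graph(data.frame(from = c("pkgA", "pkgA"),
                              to   = c("utils", "stats"),
                              stringsAsFactors = FALSE),
                   directed = TRUE)
g2 <- as_tbl_graph(data.frame(from = "pkgB", to = "utils",
                              stringsAsFactors = FALSE),
                   directed = TRUE)

# Join on the node name, then count outgoing and incoming edges per node
joined <- g1 %>%
  graph_join(g2, by = "name") %>%
  mutate(out_degree = centrality_degree(mode = "out"),
         in_degree  = centrality_degree(mode = "in"))

# Node table of the joined graph: "utils" appears once, with two incoming edges
node_tbl <- as_tibble(joined, active = "nodes")
node_tbl
```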

The bigger the node labels (package names), the higher their centrality (here: the out-degree of the node). Seems like the more basic utilitarian packages have the highest centrality (not really a surprise…).

graph %>%
  ggraph(layout = 'nicely') + 
    geom_edge_link(arrow = arrow()) + 
    geom_node_point() +
    geom_node_label(aes(label = name, fill = color, size = centrality), show.legend = FALSE, repel = TRUE) +
    theme_graph() +
    scale_fill_brewer(palette = "Set1")

For example, methods and stats are dependencies of caret, mlr and e1071 but not h2o, while utils is a dependency of all four.

graph %>%
  filter(centrality > 1 | color == "a") %>%
  ggraph(layout = 'nicely') + 
    geom_edge_link(arrow = arrow()) + 
    geom_node_point() +
    geom_node_label(aes(label = name, fill = color, size = centrality), show.legend = FALSE, repel = TRUE) +
    theme_graph() +
    scale_fill_brewer(palette = "Set1")

It would of course be interesting to analyse a bigger network with more packages. Maybe someone knows how to get these other packages to work with pkgnet?

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.4
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2  ggraph_1.0.1    ggplot2_2.2.1   tidygraph_1.1.0
## [5] pkgnet_0.2.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16         RColorBrewer_1.1-2   plyr_1.8.4          
##  [4] compiler_3.5.0       pillar_1.2.2         formatR_1.5         
##  [7] futile.logger_1.4.3  bindr_0.1.1          viridis_0.5.1       
## [10] futile.options_1.0.1 tools_3.5.0          digest_0.6.15       
## [13] viridisLite_0.3.0    gtable_0.2.0         jsonlite_1.5        
## [16] evaluate_0.10.1      tibble_1.4.2         pkgconfig_2.0.1     
## [19] rlang_0.2.0          igraph_1.2.1         ggrepel_0.7.0       
## [22] yaml_2.1.18          blogdown_0.6         xfun_0.1            
## [25] gridExtra_2.3        stringr_1.3.0        dplyr_0.7.4         
## [28] knitr_1.20           htmlwidgets_1.2      grid_3.5.0          
## [31] rprojroot_1.3-2      glue_1.2.0           data.table_1.10.4-3 
## [34] R6_2.2.2             rmarkdown_1.9        bookdown_0.7        
## [37] udunits2_0.13        tweenr_0.1.5         tidyr_0.8.0         
## [40] purrr_0.2.4          lambda.r_1.2.2       magrittr_1.5        
## [43] units_0.5-1          MASS_7.3-49          scales_0.5.0        
## [46] backports_1.1.2      mvbutils_2.7.4.1     htmltools_0.3.6     
## [49] assertthat_0.2.0     ggforce_0.1.1        colorspace_1.3-2    
## [52] labeling_0.3         stringi_1.1.7        visNetwork_2.0.3    
## [55] lazyeval_0.2.1       munsell_0.4.3
