Six degrees of Hadley Wickham: The CRAN co-authorship network

[This article was first published on schochastics - all things R, and kindly contributed to R-bloggers].

Once upon a time I was a dedicated network scientist. These days, networks are more peripheral to my work and I just like to toy around with interesting datasets. One of those is the CRAN co-authorship network. In a co-authorship network, two individuals (in our case R developers) are connected if they authored a piece of work together. Here, a “piece of work” is an R package. This network can be assembled quite easily from the authors field in the DESCRIPTION files of all packages available on CRAN. I have done a low-level analysis on GitHub, which was also featured in tidytuesday and introduced the Hadley number, but I always wanted to do a longer write-up. And voilà, this is said write-up.

library(tidyverse)
library(igraph)
library(netUtils)

Getting the Data

It is actually quite easy to get all the metadata (and more!) of the DESCRIPTION files from CRAN. It takes a single function call:

db <- tools::CRAN_package_db()
str(db)
'data.frame':   20296 obs. of  67 variables:
 $ Package                : chr  "A3" "AalenJohansen" "AATtools" "ABACUS" ...
 $ Version                : chr  "1.0.0" "1.0" "0.0.2" "1.0.0" ...
 $ Priority               : chr  NA NA NA NA ...
 $ Depends                : chr  "R (>= 2.15.0), xtable, pbapply" NA "R (>= 3.6.0)" "R (>= 3.1.0)" ...
 $ Imports                : chr  NA NA "magrittr, dplyr, doParallel, foreach" "ggplot2 (>= 3.1.0), shiny (>= 1.3.1)," ...
 $ LinkingTo              : chr  NA NA NA NA ...
 $ Suggests               : chr  "randomForest, e1071" "knitr, rmarkdown" NA "rmarkdown (>= 1.13), knitr (>= 1.22)" ...
 $ Enhances               : chr  NA NA NA NA ...
 $ License                : chr  "GPL (>= 2)" "GPL (>= 2)" "GPL-3" "GPL-3" ...
 $ License_is_FOSS        : chr  NA NA NA NA ...
 $ License_restricts_use  : chr  NA NA NA NA ...
 $ OS_type                : chr  NA NA NA NA ...
 $ Archs                  : chr  NA NA NA NA ...
 $ MD5sum                 : chr  "027ebdd8affce8f0effaecfcd5f5ade2" "d7eb2a6275daa6af43bf8a980398b312" "bc59207786e9bc49167fd7d8af246b1c" "50c54c4da09307cb95a70aaaa54b9fbd" ...
 $ NeedsCompilation       : chr  "no" "no" "no" "no" ...
 $ Additional_repositories: chr  NA NA NA NA ...
 $ Author                 : chr  "Scott Fortmann-Roe" "Martin Bladt [aut, cre],\n  Christian Furrer [aut]" "Sercan Kahveci [aut, cre]" "Mintu Nath [aut, cre]" ...
 $ Authors@R              : chr  NA "c(person(\"Martin\", \"Bladt\", email = \"martinbladt@math.ku.dk\", role = c(\"aut\", \"cre\")),\n             "| __truncated__ "person(\"Sercan\", \"Kahveci\", email = \"sercan.kahveci@sbg.ac.at\", role = c(\"aut\", \"cre\"))" NA ...
 $ Biarch                 : chr  NA NA NA NA ...
 $ BugReports             : chr  NA NA "https://github.com/Spiritspeak/AATtools/issues" NA ...
 $ BuildKeepEmpty         : chr  NA NA NA NA ...
 $ BuildManual            : chr  NA NA NA NA ...
 $ BuildResaveData        : chr  NA NA NA NA ...
 $ BuildVignettes         : chr  NA NA NA NA ...
 $ Built                  : chr  NA NA NA NA ...
 $ ByteCompile            : chr  NA NA "true" NA ...
 $ Classification/ACM     : chr  NA NA NA NA ...
 $ Classification/ACM-2012: chr  NA NA NA NA ...
 $ Classification/JEL     : chr  NA NA NA NA ...
 $ Classification/MSC     : chr  NA NA NA NA ...
 $ Classification/MSC-2010: chr  NA NA NA NA ...
 $ Collate                : chr  NA NA NA NA ...
 $ Collate.unix           : chr  NA NA NA NA ...
 $ Collate.windows        : chr  NA NA NA NA ...
 $ Contact                : chr  NA NA NA NA ...
 $ Copyright              : chr  NA NA NA NA ...
 $ Date                   : chr  "2015-08-15" NA NA NA ...
 $ Date/Publication       : chr  "2015-08-16 23:05:52" "2023-03-01 10:42:09 UTC" "2022-08-12 13:40:09 UTC" "2019-09-20 07:40:06 UTC" ...
 $ Description            : chr  "Supplies tools for tabulating and analyzing the results of predictive models. The methods employed are applicab"| __truncated__ "Provides the conditional Nelson-Aalen and Aalen-Johansen estimators. The methods are based on Bladt & Furrer (2"| __truncated__ "Compute approach bias scores using different scoring algorithms,\n    compute bootstrapped and exact split-half"| __truncated__ "A set of Shiny apps for effective communication and understanding in statistics. The current version includes p"| __truncated__ ...
 $ Encoding               : chr  NA "UTF-8" "UTF-8" "UTF-8" ...
 $ KeepSource             : chr  NA NA NA NA ...
 $ Language               : chr  NA NA NA NA ...
 $ LazyData               : chr  NA NA "true" "true" ...
 $ LazyDataCompression    : chr  NA NA NA NA ...
 $ LazyLoad               : chr  NA NA NA NA ...
 $ MailingList            : chr  NA NA NA NA ...
 $ Maintainer             : chr  "Scott Fortmann-Roe <scottfr@berkeley.edu>" "Martin Bladt <martinbladt@math.ku.dk>" "Sercan Kahveci <sercan.kahveci@sbg.ac.at>" "Mintu Nath <dr.m.nath@gmail.com>" ...
 $ Note                   : chr  NA NA NA NA ...
 $ Packaged               : chr  "2015-08-16 14:17:33 UTC; scott" "2023-02-28 18:01:12 UTC; martinbladt" "2022-08-12 13:12:35 UTC; b1066151" "2019-09-12 14:16:35 UTC; s02mn9" ...
 $ RdMacros               : chr  NA NA NA NA ...
 $ StagedInstall          : chr  NA NA NA NA ...
 $ SysDataCompression     : chr  NA NA NA NA ...
 $ SystemRequirements     : chr  NA NA NA NA ...
 $ Title                  : chr  "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels" "Conditional Aalen-Johansen Estimation" "Reliability and Scoring Routines for the Approach-Avoidance Task" "Apps Based Activities for Communicating and Understanding\nStatistics" ...
 $ Type                   : chr  "Package" "Package" "Package" NA ...
 $ URL                    : chr  NA NA NA "https://shiny.abdn.ac.uk/Stats/apps/" ...
 $ UseLTO                 : chr  NA NA NA NA ...
 $ VignetteBuilder        : chr  NA "knitr" NA "knitr" ...
 $ ZipData                : chr  NA NA NA NA ...
 $ Path                   : chr  NA NA NA NA ...
 $ X-CRAN-Comment         : chr  NA NA NA NA ...
 $ Published              : chr  "2015-08-16" "2023-03-01" "2022-08-12" "2019-09-20" ...
 $ Reverse depends        : chr  NA NA NA NA ...
 $ Reverse imports        : chr  NA NA NA NA ...
 $ Reverse linking to     : chr  NA NA NA NA ...
 $ Reverse suggests       : chr  NA NA NA NA ...
 $ Reverse enhances       : chr  NA NA NA NA ...

That is a lot of data one can do a lot of things with, but we only need two fields: the package name and the authors.

The really hard part is cleaning up the authors field. While there exist some standardized ways of entering author names into the DESCRIPTION file, it is still a wild-west free-text field. I tried to do the cleaning semi-automatically with a script, which was very tedious, and I am sure the result is not perfect.

author_pkg_cran <- author_cleaner(db) |>
    dplyr::filter(!authorsR %in% c("Posit Software", "R Core Team", "R Foundation", "Rstudio", "Company"))
str(author_pkg_cran)
tibble [52,260 × 2] (S3: tbl_df/tbl/data.frame)
 $ Package : chr [1:52260] "A3" "AalenJohansen" "AalenJohansen" "AATtools" ...
 $ authorsR: chr [1:52260] "Scott Fortmann-Roe" "Martin Bladt" "Christian Furrer" "Sercan Kahveci" ...
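
The author_cleaner() script itself is not shown in this post. To give a flavour of what such cleaning involves, here is a deliberately crude sketch in base R. The helper clean_authors(), the toy data frame and the regular expressions are assumptions made up for illustration, not the script used above, and real Author fields need many more special cases.

```r
# Hypothetical helper, NOT the author_cleaner() used in the post.
clean_authors <- function(author) {
  x <- gsub("\\[[^]]*\\]", "", author, perl = TRUE)  # drop role tags like [aut, cre]
  x <- gsub("\\([^)]*\\)", "", x, perl = TRUE)       # drop parenthetical notes, e.g. ORCID links
  x <- gsub("<[^>]*>", "", x, perl = TRUE)           # drop e-mail addresses
  # split on commas and on the standalone word "and"
  parts <- strsplit(x, ",|\\band\\b", perl = TRUE)[[1]]
  # collapse newlines and repeated spaces, then trim
  parts <- trimws(gsub("\\s+", " ", parts, perl = TRUE))
  parts[nzchar(parts)]
}

# Toy stand-in for the two columns of `db` that matter here.
toy_db <- data.frame(
  Package = c("A3", "AalenJohansen"),
  Author  = c("Scott Fortmann-Roe",
              "Martin Bladt [aut, cre],\n  Christian Furrer [aut]")
)

# One row per (package, author) pair, mirroring the shape of author_pkg_cran.
authors <- lapply(toy_db$Author, clean_authors)
toy_long <- data.frame(
  Package  = rep(toy_db$Package, lengths(authors)),
  authorsR = unlist(authors)
)
toy_long
```

Even this small version shows why the task is tedious: every new way of writing roles, affiliations or name separators needs its own rule.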

The co-authorship network

The code below is used to build the co-authorship network as a weighted network. The weight shows how many packages two developers have authored together.

author_pkg_cran_net <- netUtils::bipartite_from_data_frame(author_pkg_cran, "authorsR", "Package")
A <- as_biadjacency_matrix(author_pkg_cran_net, sparse = TRUE)
A <- as(A, "sparseMatrix")
B <- Matrix::t(A) %*% A
auth_auth_net <- graph_from_adjacency_matrix(B, "undirected", diag = FALSE, weighted = TRUE)
auth_auth_net
IGRAPH 9519baa UNW- 28836 145114 -- 
+ attr: name (v/c), weight (e/n)
+ edges from 9519baa (vertex names):
 [1] Scott Fortmann-Roe--Clement Calenge    
 [2] Martin Bladt      --Christian Furrer   
 [3] Martin Bladt      --Alexander Mcneil   
 [4] Martin Bladt      --Jorge Yslas        
 [5] Martin Bladt      --Alaric Muller      
 [6] Sigbert Klinke    --Jaroslav Myslivec  
 [7] Sigbert Klinke    --Robert King        
 [8] Sigbert Klinke    --Benjamin Dean      
+ ... omitted several edges
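
To see what the matrix product does, here is a tiny made-up example. The names and the matrix are invented, and I assume rows are packages and columns are authors, which is the orientation under which t(A) %*% A yields an author-by-author matrix.

```r
library(igraph)

# Toy package-by-author matrix: a 1 means "this person is an author of this package".
A_toy <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("pkgA", "pkgB", "pkgC"), c("Ann", "Bob", "Cat"))
)

# One-mode projection: entry [i, j] counts the packages authors i and j share.
B_toy <- t(A_toy) %*% A_toy
B_toy["Ann", "Bob"]  # 2 shared packages (pkgA and pkgB)
B_toy["Ann", "Cat"]  # 1 shared package (pkgB)
diag(B_toy)          # packages per author, dropped below via diag = FALSE

# Same call as above: off-diagonal counts become edge weights.
g_toy <- graph_from_adjacency_matrix(B_toy, mode = "undirected",
                                     diag = FALSE, weighted = TRUE)
E(g_toy)$weight
```

The diagonal holds each author's own package count, which is why the post's call sets diag = FALSE to avoid self-loops.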

To check whether the network is connected (that is, whether there is a path between every pair of developers), we use the igraph::components() function.

comps_cran <- components(auth_auth_net)
comps_cran$no
[1] 5869

That's quite a big number of components, but it is not really surprising. Many package authors (or teams of authors) have only ever worked on one package (actually, more than 40% of all packages are single-authored) and thus never interacted with the broader R developer community on any other package.

The biggest component can be extracted with igraph::largest_component().

auth_auth_net_largest <- largest_component(auth_auth_net)
auth_auth_net_largest
IGRAPH 958b5ed UNW- 15722 126756 -- 
+ attr: name (v/c), weight (e/n)
+ edges from 958b5ed (vertex names):
 [1] Scott Fortmann-Roe--Clement Calenge    
 [2] Martin Bladt      --Christian Furrer   
 [3] Martin Bladt      --Alexander Mcneil   
 [4] Martin Bladt      --Jorge Yslas        
 [5] Martin Bladt      --Alaric Muller      
 [6] Sigbert Klinke    --Jaroslav Myslivec  
 [7] Sigbert Klinke    --Robert King        
 [8] Sigbert Klinke    --Benjamin Dean      
+ ... omitted several edges

Of the 28,836 recorded package authors, 15,722 (54.52%) are part of the largest connected component. All subsequent analyses will be done with this network.
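
The same two calls can be tried on a toy graph where the answer is obvious. The graph below is invented for illustration: one trio, one pair and one lone author.

```r
library(igraph)

# Three components: Ann-Bob-Cat, Dan-Eve, and the isolated Fay.
g <- graph_from_literal(Ann - Bob - Cat, Dan - Eve, Fay)

comps <- components(g)
comps$no      # number of components
comps$csize   # size of each component

# Keep only the largest component and compute its share of all vertices.
giant <- largest_component(g)
vcount(giant) / vcount(g)
```

Here half of the six toy authors end up in the largest component, the analogue of the 54.52% reported above.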

[Figure: plot of the biggest component of the CRAN co-authorship network]

On average, every developer in the largest component has 16.12 co-authors. The median is 6. The two individuals who co-authored the most packages together (21) are Hadley Wickham and Jim Hester. The person with the most co-authors (757) is Hadley Wickham. What a great transition to the next section.
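
Numbers like these come from the vertex degrees and the edge weights. The post does not show the code for them, so the following is only a sketch of one way to compute them, on an invented toy edge list.

```r
library(igraph)

# Toy weighted co-authorship edges; weight = number of shared packages.
edges <- data.frame(
  from   = c("Ann", "Ann", "Ann", "Bob"),
  to     = c("Bob", "Cat", "Dan", "Cat"),
  weight = c(3, 1, 1, 2)
)
g <- graph_from_data_frame(edges, directed = FALSE)

deg <- degree(g)        # number of distinct co-authors per person
mean(deg)
median(deg)
names(which.max(deg))   # person with the most co-authors

# Pair that shares the most packages: the edge with the largest weight.
top_edge <- which.max(E(g)$weight)
ends(g, top_edge)
E(g)$weight[top_edge]
```

Note that degree() counts co-authors regardless of weight; the weights only matter for the "most packages together" question.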


Six Degrees of Hadley Wickham

If you are familiar with the Erdős number and/or the Bacon number, then you know where this is going. Paul Erdős was an incredibly prolific mathematician who travelled the world and published more than 1500 papers with a large number of co-authors. In honor of his prolific (and eccentric) life, the “Erdős number” was created. This number describes the “collaboration distance” (or the degree of separation) between Paul Erdős and other mathematicians, measured by the authorship of papers. Authors who have written a paper with Erdős have an Erdős number of 1. Mathematicians who have co-authored with those but not with Erdős himself have an Erdős number of 2, and so on. The same principle has been employed in other domains, most prominently in the movie industry with the “Six Degrees of Kevin Bacon”. The Bacon number shows how far away an actor is from appearing in a movie with Kevin Bacon.

The “Hadley number” can similarly be defined as the distance of R developers to Hadley Wickham in the co-authorship network. Someone (“A”) who developed a package that Hadley is also a developer of has a Hadley number of 1. Someone who developed a package together with A, but none with Hadley, has a Hadley number of 2, and so on. Hadley himself is the only person with Hadley number 0. Below is the distribution of the Hadley number for all developers in the largest connected component.

The maximum Hadley number is 10 and the average is 3.
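
In igraph terms, a Hadley number is just an unweighted shortest-path distance from one vertex. The post does not show this step, so here is a sketch on an invented graph in which "Hub" stands in for Hadley. One detail matters: when a graph has a weight edge attribute, distances() uses it by default, so weights = NA is needed to count hops rather than sum the shared-package counts.

```r
library(igraph)

# Toy network: Hub plays the role of Hadley.
g <- graph_from_literal(Hub - A1, Hub - A2, A1 - B1, B1 - C1)
E(g)$weight <- c(5, 1, 2, 4)  # arbitrary weights, ignored below on purpose

# Hop distance from Hub to every vertex; weights = NA switches weights off.
hub_number <- distances(g, v = "Hub", weights = NA)[1, ]
hub_number

table(hub_number)   # distribution of the numbers
max(hub_number)
mean(hub_number)
```

On the real network, the vertex name would be "Hadley Wickham" and the graph would be auth_auth_net_largest.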

To check your own Hadley number (if you are in the largest connected component, and my cleaning script didn’t butcher your name), scroll to the end of this post.


The center of the collaboration network

Another interesting question in network analytic terms is who the center of the network is. The center is defined as the person who has the smallest average distance to all other developers. The top ten developers in that regard are shown below. The full list can again be explored at the end of this post.

name               centrality
Hadley Wickham     3.00331
Ben Bolker         3.12498
Dirk Eddelbuettel  3.15679
Martin Maechler    3.20017
Romain Francois    3.20710
Michael Friendly   3.22478
Jim Hester         3.24889
Kevin Ushey        3.25041
Duncan Murdoch     3.28190
Yihui Xie          3.29506

Surprise, surprise, it is Hadley again!
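
The ranking above rests on the average distance from each developer to everyone else. The post does not show how it was computed, so the following is a sketch under my own assumptions (hop distances, averaged over the other n - 1 vertices) on a toy path graph where the middle vertex is obviously the center.

```r
library(igraph)

# A simple chain: C sits in the middle.
g <- graph_from_literal(A - B - C - D - E)

# All-pairs hop distances; the diagonal is 0, so dividing the row sums
# by (n - 1) gives the mean distance to all *other* vertices.
D <- distances(g, weights = NA)
avg_dist <- rowSums(D) / (vcount(g) - 1)

sort(avg_dist)              # smallest value first, i.e. most central first
names(which.min(avg_dist))  # the center

# Cross-check: on a connected graph this is the inverse of normalized closeness.
all.equal(unname(avg_dist),
          unname(1 / closeness(g, weights = NA, normalized = TRUE)))
```

On a network with tens of thousands of vertices the full distance matrix is large, so computing closeness directly is likely the more practical route.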


Full results

In the table below, you can search for your own Hadley number and see where you rank in terms of centrality. If you find any mistakes, please do let me know in the comments.
