Six degrees of Hadley Wickham: The CRAN co-authorship network
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Once upon a time I was a dedicated network scientist. These days, though, networks are more peripheral to my work and I just like to toy around with interesting datasets. One of those is the CRAN co-authorship network. In a co-authorship network, two individuals (in our case R developers) are connected if they authored a piece of work together. Here, a “piece of work” is an R package. This network can be assembled quite easily from the authors field in the DESCRIPTION files of all packages available on CRAN. I have done a low-level analysis on GitHub, also featured in tidytuesday, including the introduction of the Hadley number, but I always wanted to do a longer write-up. And voilà, this is said write-up.
library(tidyverse)
library(igraph)
library(netUtils)
Getting the Data
It is actually quite easy to get all metadata (and more!) of the DESCRIPTION files from CRAN. A single function call does it:
db <- tools::CRAN_package_db()
str(db)
'data.frame': 20296 obs. of 67 variables:
 $ Package : chr "A3" "AalenJohansen" "AATtools" "ABACUS" ...
 $ Version : chr "1.0.0" "1.0" "0.0.2" "1.0.0" ...
 $ Priority : chr NA NA NA NA ...
 $ Depends : chr "R (>= 2.15.0), xtable, pbapply" NA "R (>= 3.6.0)" "R (>= 3.1.0)" ...
 $ Imports : chr NA NA "magrittr, dplyr, doParallel, foreach" "ggplot2 (>= 3.1.0), shiny (>= 1.3.1)," ...
 $ LinkingTo : chr NA NA NA NA ...
 $ Suggests : chr "randomForest, e1071" "knitr, rmarkdown" NA "rmarkdown (>= 1.13), knitr (>= 1.22)" ...
 $ Enhances : chr NA NA NA NA ...
 $ License : chr "GPL (>= 2)" "GPL (>= 2)" "GPL-3" "GPL-3" ...
 $ License_is_FOSS : chr NA NA NA NA ...
 $ License_restricts_use : chr NA NA NA NA ...
 $ OS_type : chr NA NA NA NA ...
 $ Archs : chr NA NA NA NA ...
 $ MD5sum : chr "027ebdd8affce8f0effaecfcd5f5ade2" "d7eb2a6275daa6af43bf8a980398b312" "bc59207786e9bc49167fd7d8af246b1c" "50c54c4da09307cb95a70aaaa54b9fbd" ...
 $ NeedsCompilation : chr "no" "no" "no" "no" ...
 $ Additional_repositories: chr NA NA NA NA ...
 $ Author : chr "Scott Fortmann-Roe" "Martin Bladt [aut, cre],\n Christian Furrer [aut]" "Sercan Kahveci [aut, cre]" "Mintu Nath [aut, cre]" ...
 $ Authors@R : chr NA "c(person(\"Martin\", \"Bladt\", email = \"[email protected]\", role = c(\"aut\", \"cre\")),\n "| __truncated__ "person(\"Sercan\", \"Kahveci\", email = \"[email protected]\", role = c(\"aut\", \"cre\"))" NA ...
 $ Biarch : chr NA NA NA NA ...
 $ BugReports : chr NA NA "https://github.com/Spiritspeak/AATtools/issues" NA ...
 $ BuildKeepEmpty : chr NA NA NA NA ...
 $ BuildManual : chr NA NA NA NA ...
 $ BuildResaveData : chr NA NA NA NA ...
 $ BuildVignettes : chr NA NA NA NA ...
 $ Built : chr NA NA NA NA ...
 $ ByteCompile : chr NA NA "true" NA ...
 $ Classification/ACM : chr NA NA NA NA ...
 $ Classification/ACM-2012: chr NA NA NA NA ...
 $ Classification/JEL : chr NA NA NA NA ...
 $ Classification/MSC : chr NA NA NA NA ...
 $ Classification/MSC-2010: chr NA NA NA NA ...
 $ Collate : chr NA NA NA NA ...
 $ Collate.unix : chr NA NA NA NA ...
 $ Collate.windows : chr NA NA NA NA ...
 $ Contact : chr NA NA NA NA ...
 $ Copyright : chr NA NA NA NA ...
 $ Date : chr "2015-08-15" NA NA NA ...
 $ Date/Publication : chr "2015-08-16 23:05:52" "2023-03-01 10:42:09 UTC" "2022-08-12 13:40:09 UTC" "2019-09-20 07:40:06 UTC" ...
 $ Description : chr "Supplies tools for tabulating and analyzing the results of predictive models. The methods employed are applicab"| __truncated__ "Provides the conditional Nelson-Aalen and Aalen-Johansen estimators. The methods are based on Bladt & Furrer (2"| __truncated__ "Compute approach bias scores using different scoring algorithms,\n compute bootstrapped and exact split-half"| __truncated__ "A set of Shiny apps for effective communication and understanding in statistics. The current version includes p"| __truncated__ ...
 $ Encoding : chr NA "UTF-8" "UTF-8" "UTF-8" ...
 $ KeepSource : chr NA NA NA NA ...
 $ Language : chr NA NA NA NA ...
 $ LazyData : chr NA NA "true" "true" ...
 $ LazyDataCompression : chr NA NA NA NA ...
 $ LazyLoad : chr NA NA NA NA ...
 $ MailingList : chr NA NA NA NA ...
 $ Maintainer : chr "Scott Fortmann-Roe <[email protected]>" "Martin Bladt <[email protected]>" "Sercan Kahveci <[email protected]>" "Mintu Nath <[email protected]>" ...
 $ Note : chr NA NA NA NA ...
 $ Packaged : chr "2015-08-16 14:17:33 UTC; scott" "2023-02-28 18:01:12 UTC; martinbladt" "2022-08-12 13:12:35 UTC; b1066151" "2019-09-12 14:16:35 UTC; s02mn9" ...
 $ RdMacros : chr NA NA NA NA ...
 $ StagedInstall : chr NA NA NA NA ...
 $ SysDataCompression : chr NA NA NA NA ...
 $ SystemRequirements : chr NA NA NA NA ...
 $ Title : chr "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModels" "Conditional Aalen-Johansen Estimation" "Reliability and Scoring Routines for the Approach-Avoidance Task" "Apps Based Activities for Communicating and Understanding\nStatistics" ...
 $ Type : chr "Package" "Package" "Package" NA ...
 $ URL : chr NA NA NA "https://shiny.abdn.ac.uk/Stats/apps/" ...
 $ UseLTO : chr NA NA NA NA ...
 $ VignetteBuilder : chr NA "knitr" NA "knitr" ...
 $ ZipData : chr NA NA NA NA ...
 $ Path : chr NA NA NA NA ...
 $ X-CRAN-Comment : chr NA NA NA NA ...
 $ Published : chr "2015-08-16" "2023-03-01" "2022-08-12" "2019-09-20" ...
 $ Reverse depends : chr NA NA NA NA ...
 $ Reverse imports : chr NA NA NA NA ...
 $ Reverse linking to : chr NA NA NA NA ...
 $ Reverse suggests : chr NA NA NA NA ...
 $ Reverse enhances : chr NA NA NA NA ...
That is a lot of data one can do a lot of things with, but we only need two fields: the package name and the authors.
The really hard part is cleaning up the authors field. While there are some standardized ways of entering author names into the DESCRIPTION file, it is still a wild-west free-text field. I tried to do the cleaning semi-automatically with a script, which was very tedious, and I am sure the result is not perfect.
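To give a feeling for what such a cleaning step involves, here is a minimal sketch of a first pass over a single Author string. The helper `clean_author_field()` is hypothetical and is not the `author_cleaner()` script used in this post. It only strips role tags, parenthetical notes and e-mail addresses, then splits on commas, "and", "&" and line breaks. Real entries need many more special cases.

```r
# Hypothetical helper for illustration only (NOT the author's author_cleaner()).
# Takes one free-text Author string and returns a character vector of names.
clean_author_field <- function(x) {
  x <- gsub("\\[[^]]*\\]", "", x, perl = TRUE)  # drop role tags such as [aut, cre]
  x <- gsub("\\([^)]*\\)", "", x, perl = TRUE)  # drop parenthetical notes (ORCID, affiliation)
  x <- gsub("<[^>]*>", "", x, perl = TRUE)      # drop e-mail addresses in angle brackets
  # split on commas, the word "and", ampersands and line breaks
  parts <- strsplit(x, ",|\\band\\b|&|\n", perl = TRUE)[[1]]
  parts <- trimws(gsub("\\s+", " ", parts, perl = TRUE))
  parts[nzchar(parts)]                          # discard empty fragments
}

clean_author_field("Martin Bladt [aut, cre],\n Christian Furrer [aut]")
```

Even this small example shows why the field is hard to clean: the same person can appear with or without middle initials, and organizations are listed alongside individuals.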
author_pkg_cran <- author_cleaner(db) |>
  dplyr::filter(!authorsR %in% c("Posit Software", "R Core Team", "R Foundation", "Rstudio", "Company"))
str(author_pkg_cran)
tibble [52,260 × 2] (S3: tbl_df/tbl/data.frame)
 $ Package : chr [1:52260] "A3" "AalenJohansen" "AalenJohansen" "AATtools" ...
 $ authorsR: chr [1:52260] "Scott Fortmann-Roe" "Martin Bladt" "Christian Furrer" "Sercan Kahveci" ...
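A package/author table like this is a two-mode (bipartite) edge list, and the co-authorship network is its projection onto the authors. The following is a sketch of one way to do that with igraph. It runs on made-up toy data rather than `author_pkg_cran`, and it is not necessarily how the network in this post was built.

```r
library(igraph)

# Toy stand-in for author_pkg_cran (made-up packages and names)
toy <- data.frame(
  Package  = c("pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  authorsR = c("Ann", "Bob", "Bob", "Cem", "Dee")
)

# Two-mode graph: packages on one side, authors on the other.
# Assumes no package shares its name with an author.
g2 <- graph_from_data_frame(toy, directed = FALSE)
V(g2)$type <- V(g2)$name %in% toy$authorsR  # TRUE = author, FALSE = package

# One-mode projection onto authors: an edge means at least one shared package.
# multiplicity = FALSE keeps the result unweighted.
coauth <- bipartite_projection(g2, which = "true", multiplicity = FALSE)
coauth
```

Authors who only ever wrote solo packages (here "Dee") end up as isolated vertices, which is one reason the later analysis is restricted to the largest connected component.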
Six Degrees of Hadley Wickham
If you are familiar with the Erdős number and/or the Bacon number, then you know where this is going. Erdős was an incredibly prolific mathematician who travelled the world and published more than 1500 papers with a large number of co-authors. In honor of his prolific (and eccentric) life, the “Erdős number” was created. This number describes the “collaboration distance” (or the degree of separation) between Paul Erdős and other mathematicians, measured by the authorship of papers. Authors who have written a paper with Erdős have an Erdős number of 1. Mathematicians who have co-authored with those but not with Erdős himself have an Erdős number of 2, and so on. The same principle has been employed in other domains, most prominently in the movie industry with the “Six degrees of Kevin Bacon”. The Bacon number shows how far away an actor is from appearing in a movie with Kevin Bacon.
The “Hadley number” can similarly be defined as the distance of R developers to Hadley Wickham in the co-authorship network. Someone (“A”) who developed a package that Hadley is also a developer of has a Hadley number of 1. Someone who developed a package together with A, but none with Hadley, has a Hadley number of 2, and so on. Hadley himself is the only person with Hadley number 0. Below is the distribution of the Hadley number for all developers in the largest connected component.
The maximum Hadley number is 10 and the average is about 3.
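In graph terms, the Hadley number is simply the shortest-path distance from one fixed vertex. The sketch below illustrates the computation with igraph on a made-up toy network, where a vertex called "Hub" plays the role of Hadley Wickham. The names and edges are invented for illustration.

```r
library(igraph)

# Made-up co-authorship edges; "Hub" stands in for the reference developer
edges <- data.frame(
  from = c("Hub", "Hub", "Ann", "Bob", "Eve"),
  to   = c("Ann", "Bob", "Cem", "Cem", "Fay")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Keep only the largest connected component (Eve and Fay drop out)
comp  <- components(g)
giant <- induced_subgraph(g, which(comp$membership == which.max(comp$csize)))

# Shortest-path distance from the reference developer to everyone in it
hub_number <- distances(giant, v = "Hub")[1, ]

table(hub_number)                                  # distribution of the numbers
mean(hub_number[names(hub_number) != "Hub"])       # average over all others
```

Developers outside the largest component have no path to the reference vertex at all, so their number would be infinite, which is why they are left out.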
To check your own Hadley number (if you are in the largest connected component, and my cleaning script didn’t butcher your name), scroll to the end of this post.
The center of the collaboration network
Another interesting question in network-analytic terms is who the center of the network is. The center is defined as the person who has the smallest average distance to all other developers, so a lower value means a more central position. The top ten developers in that regard are shown below. The full list can again be explored at the end of this post.
name | centrality |
---|---|
Hadley Wickham | 3.00331 |
Ben Bolker | 3.12498 |
Dirk Eddelbuettel | 3.15679 |
Martin Maechler | 3.20017 |
Romain Francois | 3.20710 |
Michael Friendly | 3.22478 |
Jim Hester | 3.24889 |
Kevin Ushey | 3.25041 |
Duncan Murdoch | 3.28190 |
Yihui Xie | 3.29506 |
Surprise, surprise, it is Hadley again!
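The centrality used here, the average distance to all other developers, can be sketched in a few lines of igraph. The example runs on a made-up five-vertex path graph, and dividing by n - 1 (every vertex except the one itself) is an assumption about the exact normalization.

```r
library(igraph)

# Toy path graph a - b - c - d - e (made up for illustration)
edges <- data.frame(
  from = c("a", "b", "c", "d"),
  to   = c("b", "c", "d", "e")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# All-pairs shortest-path distances; the graph must be connected,
# otherwise unreachable pairs show up as Inf
d <- distances(g)

# Average distance from each vertex to the n - 1 others (lower = more central)
avg_dist <- rowSums(d) / (vcount(g) - 1)
sort(avg_dist)
```

The middle vertex "c" comes out on top, as expected for a path. This quantity is the reciprocal of normalized closeness centrality, so ranking by it in ascending order gives the same order as ranking by closeness in descending order.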
Full results
In the table below, you can search for your own Hadley number and where you rank in terms of centrality. If you find any mistakes, please do let me know in the comments.