Most Starred R Packages on GitHub
It seems like all the best R packages proudly use GitHub and have a README adorned with badges across the top. The recent Microsoft acquisition of GitHub got me wondering: What proportion of current R packages use GitHub, or at least refer to it in the URL of the package description? And what is the relationship between the number of CRAN downloads and the number of stars on a repository? My curiosity got the best of me, so I hastily wrote a script to pull the data. The full script and data are included at the bottom of this post. I acknowledge there are more elegant ways to have coded this, but let's press on.
Pulling List of Packages & their Details
CORRECTION: I originally posted this article with code that scraped CRAN to get the list of packages and their metadata. This could violate the CRAN Terms and Conditions and generally isn't responsible. Thanks to Maëlle Salmon (@ma_salmon), I was reminded of how to scrape responsibly and that there is a function, tools::CRAN_package_db(), that returns all the metadata for the current packages on CRAN. I have written another blog post entitled Scraping Responsibly with R that details how to use the robotstxt package from rOpenSci to check a domain's directives for bots. Without scraping CRAN, here is how to get the package metadata:
```r
pkgs <- tools::CRAN_package_db()

# remove duplicate MD5sum column since tibbles can't handle duplicate column names
pkgs <- pkgs[, unique(names(pkgs))]

pkgs %>%
  select(Package, Version, Author, BugReports, URL) %>%
  as.tibble()
#> # A tibble: 12,662 x 5
#>    Package     Version Author                   BugReports     URL         
#>    <chr>       <chr>   <chr>                    <chr>          <chr>       
#>  1 A3          1.0.0   Scott Fortmann-Roe       <NA>           <NA>        
#>  2 abbyyR      0.5.4   Gaurav Sood [aut, cre]   http://github… http://gith…
#>  3 abc         2.1     "Csillery Katalin [aut…  <NA>           <NA>        
#>  4 abc.data    1.0     "Csillery Katalin [aut…  <NA>           <NA>        
#>  5 ABC.RAP     0.9.0   Abdulmonem Alsaleh [cr…  <NA>           <NA>        
#>  6 ABCanalysis 1.2.1   Michael Thrun, Jorn Lo…  <NA>           https://www…
#>  7 abcdeFBA    0.4     Abhilash Gangadharan, …  <NA>           <NA>        
#>  8 ABCoptim    0.15.0  "George Vega Yon [aut,…  <NA>           http://gith…
#>  9 ABCp2       1.2     "M. Catherine Duryea, …  <NA>           <NA>        
#> 10 abcrf       1.7     "Jean-Michel Marin [au…  <NA>           <NA>        
#> # ... with 12,652 more rows
```
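As an aside, the robots.txt check described in the Scraping Responsibly post looks roughly like this with the robotstxt package (a minimal sketch; the path checked here is just an example):

```r
# minimal sketch: check whether a path is allowed for bots (robotstxt package)
library(robotstxt)
paths_allowed("https://cran.r-project.org/web/packages/")
```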
From the package metadata, I looked at the description fields "URL" and "BugReports" to see if either contained "github.com". It turns out that 3,718 of the packages (29.3% of the total) referenced GitHub.
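The check itself is just a pattern match on those two fields; a minimal sketch (assuming pkgs comes from tools::CRAN_package_db() as above):

```r
# sketch: flag packages whose URL or BugReports field mentions github.com
on_github <- grepl("github\\.com",
                   paste(pkgs$URL, pkgs$BugReports),
                   ignore.case = TRUE)
sum(on_github)   # number of packages referencing GitHub
mean(on_github)  # proportion of all CRAN packages
```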
After retrieving the package metadata, I pinged the GitHub API to see whether I could get the number of stars for each repository. GitHub currently allows 5,000 authenticated requests per hour, and since only 3,718 packages referenced GitHub, I could make all of the requests in one pass. Here is the function I used to take a cleaned-up version of a package's URL and form a request to the GitHub API for its star count:
```r
# get the star count from a cleaned-up version of the package's URL
gh_star_count <- function(url){
  stars <- tryCatch({
    this_url <- gsub("https://github.com/", "https://api.github.com/repos/", url)
    req <- GET(this_url, gtoken)
    stop_for_status(req)
    cont <- content(req)
    cont$stargazers_count
  }, error = function(e){
    return(NA_integer_)
  })
  return(stars)
}
```
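The gtoken object referenced in gh_star_count() is an httr OAuth configuration for the GitHub API. The full script at the bottom creates it; roughly, it looks like this (the client key and secret shown here are placeholders):

```r
# authenticate to the GitHub API so requests count against the
# 5,000/hour authenticated rate limit (placeholder credentials)
library(httr)
myapp <- oauth_app("github",
                   key = "YOUR_CLIENT_ID",        # placeholder
                   secret = "YOUR_CLIENT_SECRET") # placeholder
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
gtoken <- config(token = github_token)
```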
Analyzing the Data
Once I had all the package detail data, I found that R packages have, on average, 35.7 GitHub stars, but the median number of stars is only 6! ggplot2 has the most stars with 3,174. In my analysis I removed the xgboost, h2o, and feather packages, whose URLs point to repositories containing implementations in many languages, not just R, so their star counts are not representative of the R packages alone.
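For reference, these figures fall straight out of the collected package details; a sketch of the calculation (assuming the all_pkg_details tibble built in the full script at the end):

```r
# sketch: star summary stats after dropping the multi-language repositories
library(dplyr)
all_pkg_details_clean <- all_pkg_details %>%
  filter(!(name %in% c("xgboost", "h2o", "feather")))

mean(all_pkg_details_clean$stars, na.rm = TRUE)    # average stars (~35.7)
median(all_pkg_details_clean$stars, na.rm = TRUE)  # median stars (6)
max(all_pkg_details_clean$stars, na.rm = TRUE)     # most-starred package (ggplot2)
```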
What I really found interesting was comparing CRAN downloads to GitHub repo stars. Using the cranlogs package, I was able to get total package downloads dating back to January 1, 2014. In contrast with the low star counts, the median download count for R packages is 8,975. Combining the stars and downloads data, I found that the median R package has 903 downloads per star. Only 38.7% of packages had more than 10 stars, which shows how hard stars are to get even if you've written a great package. I'm not sure what proportion of R users frequently reference and contribute to GitHub, but it would be interesting to compare that with these high download-to-star ratios.
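The download totals come from cranlogs; the per-package query looks roughly like this (a sketch using ggplot2 as an example and the same date range as in this post):

```r
# sketch: total CRAN downloads for one package since 2014-01-01
library(cranlogs)
dl <- cran_downloads("ggplot2", from = "2014-01-01", to = "2018-06-15")
sum(dl$count)  # total downloads over the period
```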
There are some real outliers in the data. For example, the Rcpp package, perhaps the most downloaded package of all time, has 15.8M downloads and only 377 stars. Similarly, Hadley's scales package has 9.4M downloads and only 115 stars. These support/helper packages just don't get the same star love as headliners like ggplot2, shiny, and dplyr.
Of course, I could not help but check out the stats for some of the most prolific package authors. After parsing out individuals with the ["aut", "cre"] roles, I came to the not-so-surprising conclusion that Hadley has the most stars of any author with 12,408. In contrast, Dirk Eddelbuettel had one of the lowest star-to-download ratios: his repositories receive roughly one star for every ~38K downloads. Pay no attention to the man behind the curtain, though, since his Rcpp package underpins a whole host of packages without all the GitHub fanfare. Here is a list of popular R package authors and their stats (the per-author aggregation behind this table is sketched just after it):
Author | Notable Packages | Downloads | Stars | Downloads Per Star |
---|---|---|---|---|
Hadley Wickham | ggplot2, dplyr, httr | 113,160,314 | 12,408 | 9,119.9 |
Dirk Eddelbuettel | Rcpp, BH | 28,433,586 | 745 | 38,165.9 |
Yihui Xie | knitr, rmarkdown, bookdown | 42,472,860 | 6,315 | 6,725.7 |
Winston Chang | R6, shiny | 17,161,005 | 4,027 | 4,261.5 |
Jennifer Bryan | readxl, gapminder, googlesheets | 6,055,774 | 1,714 | 3,533.1 |
JJ Allaire | rstudioapi, reticulate, tensorflow | 8,882,553 | 2,798 | 3,174.6 |
Jeroen Ooms | jsonlite, curl, openssl | 25,907,868 | 1,483 | 17,469.9 |
Scott Chamberlain | geojsonio, taxize | 1,770,664 | 2,528 | 700.4 |
Jim Hester | devtools, memoise, readr | 22,867,071 | 4,332 | 5,278.6 |
Kirill Müller | tibble, DBI | 36,159,009 | 1,077 | 33,573.8 |
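The table above was produced by summing downloads and stars per maintainer; a sketch of the aggregation (assuming the all_pkg_details_clean tibble from the full script):

```r
# sketch: per-author totals and downloads-per-star ratio
library(dplyr)
authors_detail <- all_pkg_details_clean %>%
  group_by(author) %>%
  summarize(downloads = sum(downloads, na.rm = TRUE),
            stars = sum(stars, na.rm = TRUE)) %>%
  mutate(downloads_per_star = downloads / stars) %>%
  arrange(desc(downloads))
```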
I'm sure you could create mixed models to determine the unique download-to-star relationship for individual authors, or use other package attributes to predict stars or downloads, but I'll leave that to another curious soul. The appendix below includes tables of the top 10 most starred, most downloaded, and most and least frequently starred (by downloads per star) packages.
Appendix
Top 10 Most Starred Packages
Name | Author | Downloads | Stars | Downloads Per Star |
---|---|---|---|---|
ggplot2 | Hadley Wickham | 13,001,703 | 3,174 | 4,096.3 |
shiny | Winston Chang | 4,571,794 | 2,902 | 1,575.4 |
dplyr | Hadley Wickham | 8,276,844 | 2,408 | 3,437.2 |
devtools | Jim Hester | 5,536,730 | 1,645 | 3,365.8 |
knitr | Yihui Xie | 7,131,564 | 1,581 | 4,510.8 |
data.table | Matt Dowle | 6,005,795 | 1,457 | 4,122.0 |
plotly | Carson Sievert | 1,195,880 | 1,255 | 952.9 |
rmarkdown | Yihui Xie | 5,432,495 | 1,160 | 4,683.2 |
tensorflow | JJ Allaire | 94,856 | 1,033 | 91.8 |
bookdown | Yihui Xie | 126,586 | 1,009 | 125.5 |
Top 10 Most Downloaded Packages with Stars
Name | Author | Downloads | Stars | Downloads Per Star |
---|---|---|---|---|
Rcpp | Dirk Eddelbuettel | 15,824,781 | 377 | 41,975.5 |
ggplot2 | Hadley Wickham | 13,001,703 | 3,174 | 4,096.3 |
stringr | Hadley Wickham | 11,547,828 | 268 | 43,088.9 |
stringi | Marek Gagolewski | 11,310,113 | 122 | 92,705.8 |
digest | Dirk Eddelbuettel with contributions by Antoine Lucas | 11,233,244 | 42 | 267,458.2 |
plyr | Hadley Wickham | 10,340,396 | 470 | 22,000.8 |
R6 | Winston Chang | 9,993,128 | 212 | 47,137.4 |
reshape2 | Hadley Wickham | 9,582,245 | 173 | 55,388.7 |
scales | Hadley Wickham | 9,380,757 | 115 | 81,571.8 |
jsonlite | Jeroen Ooms | 9,112,790 | 176 | 51,777.2 |
Top 10 Packages by Stars per Download (frequently starred)
Name | Author | Downloads | Stars | Downloads Per Star |
---|---|---|---|---|
r2d3 | Javier Luraschi | 416 | 235 | 1.77 |
workflowr | John Blischak | 448 | 169 | 2.65 |
goodpractice | Hannah Frick | 523 | 192 | 2.72 |
xtensor | Johan Mabille | 2,057 | 664 | 3.10 |
scico | Thomas Lin Pedersen | 185 | 59 | 3.14 |
shinytest | Winston Chang | 418 | 113 | 3.70 |
furrr | Davis Vaughan | 724 | 171 | 4.23 |
pkgdown | Hadley Wickham | 1,589 | 332 | 4.79 |
rtika | Sasha Goodman | 168 | 32 | 5.25 |
mindr | Peng Zhao | 2,051 | 368 | 5.57 |
Bottom 10 Packages by Stars per Download (infrequently starred)
Name | Author | Downloads | Stars | Downloads Per Star |
---|---|---|---|---|
mime | Yihui Xie | 7,398,765 | 12 | 616,563.8 |
pkgmaker | Renaud Gaujoux | 1,228,173 | 2 | 614,086.5 |
rngtools | Renaud Gaujoux | 1,224,959 | 2 | 612,479.5 |
magic | Robin K. S. Hankin | 344,741 | 1 | 344,741.0 |
gsubfn | G. Grothendieck | 675,056 | 2 | 337,528.0 |
bindrcpp | Kirill Müller | 2,996,452 | 10 | 299,645.2 |
plogr | Kirill Müller | 3,343,099 | 12 | 278,591.6 |
digest | Dirk Eddelbuettel with contributions by Antoine Lucas | 11,233,244 | 42 | 267,458.2 |
munsell | Charlotte Wickham | 7,778,712 | 31 | 250,926.2 |
proto | Hadley Wickham | 2,593,246 | 11 | 235,749.6 |
Full Script
The full script is below and is also available, along with the data, as a gist: https://gist.github.com/StevenMMortimer/1b4b626d3d91240a77f969ae04b37114
```r
# load packages & custom functions ---------------------------------------------
library(tidyverse)
library(httr)
library(cranlogs)
library(ggrepel)
library(scales)
library(knitr)
library(stringr)

gh_from_url <- function(x){
  if(!grepl(',', x)){
    x <- strsplit(x, " ")[[1]]
    x <- trimws(x[min(which(grepl(pattern='http://github.com|https://github.com|http://www.github.com', x, ignore.case=TRUE)))])
  } else {
    x <- strsplit(x, ",")[[1]]
    x <- trimws(x[min(which(grepl(pattern='http://github.com|https://github.com|http://www.github.com', x, ignore.case=TRUE)))])
  }
  x <- gsub("http://", "https://", tolower(x))
  x <- gsub("www\\.github\\.com", "github.com", x)
  x <- gsub("/$", "", x)
  x <- gsub("^github.com", "https://github.com", x)
  x <- gsub("/issues", "", x)
  x <- gsub("\\.git", "", x)
  return(x)
}

aut_maintainer_from_details <- function(x){
  x <- gsub("'|\"", "", x)
  if(grepl(',', x)){
    x <- strsplit(x, "\\],")[[1]]
    aut_cre_ind <- grepl(pattern='\\[aut, cre|\\[cre, aut|\\[cre', x, ignore.case=TRUE)
    if(any(aut_cre_ind)){
      x <- x[min(which(aut_cre_ind))]
      x <- gsub("\\[aut, cre|\\[cre, aut|\\[cre", "", x)
    }
    x <- strsplit(x, ",")[[1]][1]
    x <- trimws(gsub("\\]", "", x))
    x <- trimws(gsub(" \\[aut", "", x))
  }
  return(x)
}

gh_star_count <- function(url){
  stars <- tryCatch({
    this_url <- gsub("https://github.com/", "https://api.github.com/repos/", url)
    req <- GET(this_url, gtoken)
    stop_for_status(req)
    cont <- content(req)
    cont$stargazers_count
  }, error = function(e){
    return(NA_integer_)
  })
  return(stars)
}

# authenticate to github -------------------------------------------------------
# use Hadley's key and secret
myapp <- oauth_app("github",
                   key = "56b637a5baffac62cad9",
                   secret = "8e107541ae1791259e9987d544ca568633da2ebf")
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
gtoken <- config(token = github_token)

# pull list of packages --------------------------------------------------------
# get list of currently available packages on CRAN
pkgs <- tools::CRAN_package_db()

# remove duplicate MD5sum column since tibbles can't handle duplicate column names
pkgs <- pkgs[, unique(names(pkgs))]

# filter out any duplicates
pkgs <- pkgs %>%
  rename(Name = Package) %>%
  distinct(Name, .keep_all = TRUE)

# get details for each package -------------------------------------------------
all_pkg_details <- NULL

# old fashioned looping!
# WARNING: This takes a while to complete
for(i in 1:nrow(pkgs)){
  if(i %% 100 == 0){
    message(sprintf("Processing package #%s out of %s", i, nrow(pkgs)))
  }
  this_url <- pkgs[i,]$URL
  on_github <- FALSE
  this_github_url <- NA_character_
  gh_stars <- NA_integer_
  if(!is.null(this_url)){
    on_github <- grepl('http://github.com|https://github.com|http://www.github.com', this_url)
    if(on_github){
      this_github_url <- gh_from_url(this_url)
      gh_stars <- gh_star_count(this_github_url)
    } else {
      # check the BugReports URL as a backup (e.g. shiny package references GitHub this way)
      issues_on_github <- grepl('http://github.com|https://github.com|http://www.github.com', pkgs[i,]$BugReports)
      if(length(issues_on_github) == 0 || !issues_on_github){
        this_github_url <- NA_character_
      } else {
        this_github_url <- gh_from_url(pkgs[i,]$BugReports)
        gh_stars <- gh_star_count(this_github_url)
        on_github <- TRUE
      }
    }
  } else {
    this_url <- NA_character_
  }
  downloads <- cran_downloads(pkgs[i,]$Name, from = "2014-01-01", to = "2018-06-15")
  all_pkg_details <- rbind(all_pkg_details,
                           tibble(name = pkgs[i,]$Name,
                                  published = pkgs[i,]$Published,
                                  author = aut_maintainer_from_details(pkgs[i,]$Author),
                                  url = this_url,
                                  github_ind = on_github,
                                  github_url = this_github_url,
                                  downloads = sum(downloads$count),
                                  stars = gh_stars))
}

# basic summary stats ----------------------------------------------------------
# remove observations where the GitHub URL refers to a repository that
# is not specific to R and therefore might have an inflated star count
all_pkg_details_clean <- all_pkg_details %>%
  filter(!(name %in% c('xgboost', 'h2o', 'feather'))) %>%
  mutate(downloads_per_star = downloads / stars,
         downloads_per_star = ifelse(!is.finite(downloads_per_star), NA_real_, downloads_per_star))

# proportion of all packages listing github
sum(all_pkg_details$github_ind)
mean(all_pkg_details$github_ind)

# proportion of packages with stars
mean(!is.na(all_pkg_details$stars))

# typical number of stars per package
mean(all_pkg_details_clean$stars, na.rm=TRUE)
median(all_pkg_details_clean$stars, na.rm=TRUE)
max(all_pkg_details_clean$stars, na.rm=TRUE)

# typical number of downloads per package
mean(all_pkg_details_clean$downloads, na.rm=TRUE)
median(all_pkg_details_clean$downloads, na.rm=TRUE)

# percent of packages over 10 stars
mean(all_pkg_details_clean$stars > 10, na.rm=TRUE)
mean(all_pkg_details_clean$downloads_per_star, na.rm=TRUE)
median(all_pkg_details_clean$downloads_per_star, na.rm=TRUE)

# stars histogram --------------------------------------------------------------
ggplot(data=all_pkg_details_clean, mapping=aes(stars)) +
  geom_histogram(aes(fill=..count..), bins=60) +
  scale_x_continuous(trans = "log1p", breaks=c(0,1,2,3,10,100,1000,3000)) +
  labs(x = "Stars", y = "Count", fill = "Count",
       caption = "Source: api.github.com as of 6/16/18") +
  ggtitle("Distribution of GitHub Stars on R Packages") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(),
        plot.caption = element_text(hjust = 0))

# stars to downloads scatterplot -----------------------------------------------
plot_dat <- all_pkg_details_clean
idx_label <- which(with(plot_dat, downloads > 10000000 | stars > 1000))
plot_dat$name2 <- plot_dat$name
plot_dat$name <- ""
plot_dat$name[idx_label] <- plot_dat$name2[idx_label]

ggplot(data=plot_dat, aes(stars, downloads, label = name)) +
  geom_point(color = ifelse(plot_dat$name == "", "grey50", "red")) +
  geom_text_repel(box.padding = .5) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma) +
  labs(x = "GitHub Stars", y = "CRAN Downloads",
       caption = "Sources:\napi.github.com as of 6/16/18\ncranlogs as of 1/1/14 - 6/15/18") +
  ggtitle("Relationship Between CRAN Downloads and GitHub Stars") +
  theme_bw() +
  theme(plot.caption = element_text(hjust = 0))

# author stats -----------------------------------------------------------------
# summary by author
authors_detail <- all_pkg_details_clean %>%
  group_by(author) %>%
  summarize(downloads = sum(downloads, na.rm=TRUE),
            stars = sum(stars, na.rm=TRUE)) %>%
  mutate(downloads_per_star = downloads / stars,
         downloads_per_star = ifelse(!is.finite(downloads_per_star), NA_real_, downloads_per_star)) %>%
  arrange(desc(downloads))

# popular authors
pop_authors <- tibble(author = c('Hadley Wickham', 'Dirk Eddelbuettel', 'Yihui Xie',
                                 'Winston Chang', 'Jennifer Bryan', 'JJ Allaire',
                                 'Jeroen Ooms', 'Scott Chamberlain', 'Jim Hester',
                                 'Kirill Müller'),
                      notable_packages = c('ggplot2, dplyr, httr', 'Rcpp, BH',
                                           'knitr, rmarkdown, bookdown', 'R6, shiny',
                                           'readxl, gapminder, googlesheets',
                                           'rstudioapi, reticulate, tensorflow',
                                           'jsonlite, curl, openssl', 'geojsonio, taxize',
                                           'devtools, memoise, readr', 'tibble, DBI'))
author_stats <- pop_authors %>%
  inner_join(., authors_detail, by='author') %>%
  select(author, notable_packages, downloads, stars, downloads_per_star) %>%
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# single author
#all_pkg_details_clean %>% filter(author == 'Dirk Eddelbuettel') %>% arrange(desc(downloads))

# top 10 lists -----------------------------------------------------------------
# Top 10 Most Starred Packages
top_starred <- all_pkg_details_clean %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(stars)) %>%
  slice(1:10) %>%
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Top 10 Most Downloaded Packages with stars
top_downloaded <- all_pkg_details_clean %>%
  filter(!is.na(stars)) %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(downloads)) %>%
  slice(1:10) %>%
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Bottom 10 Packages by Downloads per Star (frequently starred)
frequently_starred <- all_pkg_details_clean %>%
  filter(downloads > 100) %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(downloads_per_star) %>%
  slice(1:10) %>%
  mutate(downloads_per_star = round(downloads_per_star, 2)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Top 10 Packages by Downloads per Star (infrequently starred)
infrequently_starred <- all_pkg_details_clean %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(downloads_per_star)) %>%
  slice(1:10) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)
```