R trends in 2015 (based on cranlogs)
[This article was first published on R – G-Forge, and kindly contributed to R-bloggers.]
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
It is always fun to look back and reflect on the past year. Inspired by Christoph Safferling’s post on the top packages published in 2015, I decided to have my own go at the top R trends of 2015. Unlike Safferling’s post, I’ll also (1) look at packages from previous years that hit the big league, (2) look at which top R coders we have in the community, and then (3) round up with my own 2015 R experience.
Everything in this post is based on the CRANberries reports. To harvest the information I’ve borrowed shamelessly from Safferling’s post, with some modifications. He used the number of downloads as a proxy for the package release date, while I decided to use the actual release date; if that wasn’t available, I scraped it off the CRAN servers. The script now also retrieves the package author(s) and description (see code below for details).
```r
library(rvest)
library(dplyr)
# devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(magrittr)
library(lubridate)

# Extract a single "Name: value" element (possibly spanning several
# lines) from the text of a CRANberries entry
getCranberriesElmnt <- function(txt, elmnt_name){
  desc <- grep(sprintf("^%s:", elmnt_name), txt)
  if (length(desc) == 1){
    txt <- txt[desc:length(txt)]
    end <- grep("^[A-Za-z/@]{2,}:", txt[-1])
    if (length(end) == 0)
      end <- length(txt)
    else
      end <- end[1]

    desc <-
      txt[1:end] %>%
      gsub(sprintf("^%s: (.+)", elmnt_name), "\\1", .) %>%
      paste(collapse = " ") %>%
      gsub("[ ]{2,}", " ", .) %>%
      gsub(" , ", ", ", .)
  }else if (length(desc) == 0){
    desc <- paste("No", tolower(elmnt_name))
  }else{
    stop("Could not find ", elmnt_name, " in text: \n",
         paste(txt, collapse = "\n"))
  }
  return(desc)
}

convertCharset <- function(txt){
  if (grepl("Windows", Sys.info()["sysname"]))
    txt <- iconv(txt, from = "UTF-8", to = "cp1252")
  return(txt)
}

getAuthor <- function(txt, package){
  author <- getCranberriesElmnt(txt, "Author")
  if (grepl("No author|See AUTHORS file", author)){
    author <- getCranberriesElmnt(txt, "Maintainer")
  }

  if (grepl("(No m|M)aintainer|(No a|A)uthor|^See AUTHORS file", author) ||
      is.null(author) ||
      nchar(author) <= 2){
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    author <- cran_txt %>%
      html_nodes("tr") %>%
      html_text %>%
      convertCharset %>%
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>%
      .[grep("^Author", .)] %>%
      gsub(".*\n", "", .)

    # If not found then the package has probably been
    # removed from the repository
    if (length(author) == 1)
      author <- author
    else
      author <- "No author"
  }

  # Remove stuff such as:
  # [cre, auth]
  # (worked on the...)
  # <e-mail address>
  # "John Doe"
  author %<>%
    gsub("^Author: (.+)", "\\1", .) %>%
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>%
    gsub("\\([^)]+\\)", " ", .) %>%
    gsub("([ ]*<[^>]+>)", " ", .) %>%
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>%
    gsub("[ ]{2,}", " ", .) %>%
    gsub("(^[ '\"]+|[ '\"]+$)", "", .) %>%
    gsub(" , ", ", ", .)
  return(author)
}

getDate <- function(txt, package){
  date <- grep("^Date/Publication", txt)
  if (length(date) == 1){
    date <- txt[date] %>%
      gsub("Date/Publication: ([0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2}).*", "\\1", .)
  }else{
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    date <-
      cran_txt %>%
      html_nodes("tr") %>%
      html_text %>%
      convertCharset %>%
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>%
      .[grep("^Published", .)] %>%
      gsub(".*\n", "", .)

    # The main page doesn't contain the original date if
    # new packages have been submitted, we therefore need
    # to check first entry in the archives
    if(cran_txt %>%
       html_nodes("tr") %>%
       html_text %>%
       gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>%
       grepl("^Old.{1,4}sources", .) %>%
       any){
      archive_txt <- read_html(sprintf("http://cran.r-project.org/src/contrib/Archive/%s/",
                                       package))
      pkg_date <-
        archive_txt %>%
        html_nodes("tr") %>%
        lapply(function(x) {
          nodes <- html_nodes(x, "td")
          if (length(nodes) == 5){
            return(nodes[3] %>%
                     html_text %>%
                     as.Date(format = "%d-%b-%Y"))
          }
        }) %>%
        .[sapply(., length) > 0] %>%
        .[!sapply(., is.na)] %>%
        head(1)

      if (length(pkg_date) == 1)
        date <- pkg_date[[1]]
    }
  }
  date <- tryCatch({
    as.Date(date)
  }, error = function(e){
    "Date missing"
  })
  return(date)
}

getNewPkgStats <- function(published_in){
  # The parallel is only for making cranlogs requests
  # we can therefore have more cores than actual cores
  # as this isn't processor intensive while there is
  # considerable wait for each http-request
  cl <- create_cluster(parallel::detectCores() * 4)
  parallel::clusterEvalQ(cl, {
    library(cranlogs)
  })
  set_default_cluster(cl)
  on.exit(stop_cluster())

  berries <- read_html(paste0("http://dirk.eddelbuettel.com/cranberries/",
                              published_in, "/"))
  pkgs <-
    # Select the divs of the package class
    html_nodes(berries, ".package") %>%
    # Extract the text
    html_text %>%
    # Split the lines
    strsplit("[\n]+") %>%
    # Now clean the lines
    lapply(., function(pkg_txt) {
      pkg_txt[sapply(pkg_txt,
                     function(x) { nchar(gsub("^[ \t]+", "", x)) > 0},
                     USE.NAMES = FALSE)] %>%
        gsub("^[ \t]+", "", .)
    })

  # Now we select the new packages
  new_packages <-
    pkgs %>%
    # The first line is key as it contains the text "New package"
    sapply(., function(x) x[1], USE.NAMES = FALSE) %>%
    grep("^New package", .) %>%
    pkgs[.] %>%
    # Now we extract the package name and the date that it was published
    # and merge everything into one table
    lapply(function(txt){
      txt <- convertCharset(txt)
      ret <- data.frame(
        name = gsub("^New package ([^ ]+) with initial .*", "\\1", txt[1]),
        stringsAsFactors = FALSE
      )

      ret$desc <- getCranberriesElmnt(txt, "Description")
      ret$author <- getAuthor(txt, ret$name)
      ret$date <- getDate(txt, ret$name)

      return(ret)
    }) %>%
    rbind_all %>%
    # Get the download data in parallel
    partition(name) %>%
    do({
      down <- cran_downloads(.$name[1],
                             from = max(as.Date("2015-01-01"), .$date[1]),
                             to = "2015-12-31")$count
      cbind(.[1,],
            data.frame(sum = sum(down),
                       avg = mean(down))
      )
    }) %>%
    collect %>%
    ungroup %>%
    arrange(desc(avg))

  return(new_packages)
}

pkg_list <- lapply(2010:2015, getNewPkgStats)

pkgs <-
  rbind_all(pkg_list) %>%
  mutate(time = as.numeric(as.Date("2016-01-01") - date),
         year = format(date, "%Y"))
```
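To see what the author clean-up at the end of `getAuthor()` does, the substitution chain can be tried in isolation. The following is a minimal base R sketch; `cleanAuthor` is a hypothetical helper name and the input string is made up for illustration, neither is part of the original script.

```r
# Hypothetical helper that mirrors the gsub() chain in getAuthor()
cleanAuthor <- function(author) {
  author <- gsub("^Author: (.+)", "\\1", author)          # drop the field name
  author <- gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", author)   # roles such as [aut, cre]
  author <- gsub("\\([^)]+\\)", " ", author)              # parenthesised remarks
  author <- gsub("([ ]*<[^>]+>)", " ", author)            # e-mail addresses
  author <- gsub("[ ]{2,}", " ", author)                  # collapse repeated spaces
  author <- gsub("(^[ '\"]+|[ '\"]+$)", "", author)       # trim spaces and quotes
  gsub(" , ", ", ", author)                               # tidy comma spacing
}

# Made-up example input
cleaned <- cleanAuthor(
  "Author: Jane Doe [aut, cre] <jane@example.org>, John Roe (worked on the docs)"
)
print(cleaned)
# [1] "Jane Doe, John Roe"
```

The order matters: spaces are collapsed only after the bracketed, parenthesised and e-mail parts have been replaced, since each replacement can leave a double space behind.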
Downloads and time on CRAN
The longer a package has been on CRAN, the more it gets downloaded. We can illustrate this using simple linear regression; slightly surprisingly, the relationship is mostly linear:
```r
pkgs %<>%
  mutate(time_yrs = time/365.25)
fit <- lm(avg ~ time_yrs, data = pkgs)

# Test for non-linearity
library(splines)
anova(fit,
      update(fit, .~.-time_yrs+ns(time_yrs, 2)))
```
```
Analysis of Variance Table

Model 1: avg ~ time
Model 2: avg ~ ns(time, 2)
  Res.Df       RSS Df Sum of Sq      F Pr(>F)
1   7348 189661922
2   7347 189656567  1    5355.1 0.2075 0.6488
```

The average number of daily downloads increases by about 5 for each year on CRAN. It can easily be argued that the average isn’t that interesting since the data is skewed; we can therefore also look at the upper quantiles using quantile regression:
```r
library(quantreg)
library(htmlTable)
lapply(c(.5, .75, .95, .99),
       function(tau){
         rq_fit <- rq(avg ~ time_yrs, data = pkgs, tau = tau)
         rq_sum <- summary(rq_fit)
         # Note: assuming column 2 of the coefficient matrix is the
         # standard error, the interval below is the estimate +/- one
         # standard error with the upper bound first, which is narrower
         # than a normal-approximation 95 % interval (+/- 1.96 SE)
         c(Estimate = txtRound(rq_sum$coefficients[2, 1], 1),
           `95 % CI` = txtRound(rq_sum$coefficients[2, 1] +
                                  c(1,-1) * rq_sum$coefficients[2, 2], 1) %>%
             paste(collapse = " to "))
       }) %>%
  do.call(rbind, .) %>%
  htmlTable(rnames = c("Median",
                       "Upper quartile",
                       "Top 5%",
                       "Top 1%"))
```
Quantile | Estimate | 95 % CI |
---|---|---|
Median | 0.6 | 0.6 to 0.6 |
Upper quartile | 1.2 | 1.2 to 1.1 |
Top 5% | 9.7 | 11.9 to 7.6 |
Top 1% | 182.5 | 228.2 to 136.9 |
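The interval column above is computed in the script as the estimate plus and minus one times the second coefficient column, listed upper bound first. Assuming that column is a standard error (an assumption; it depends on the `se` method that `summary()` picks for the quantile regression fit), a normal-approximation 95 % interval would be wider. A small sketch using only the numbers from the "Top 5%" row:

```r
# Values read off the "Top 5%" row of the table above
estimate <- 9.7
upper <- 11.9
lower <- 7.6

# Half the width of the reported interval; under the assumption above
# this is one standard error
se <- (upper - lower) / 2

# Normal-approximation 95 % interval: estimate +/- qnorm(0.975) * SE
z <- qnorm(0.975)
ci95 <- estimate + c(-1, 1) * z * se
print(round(ci95, 1))
# [1]  5.5 13.9
```

Under this assumption the qualitative picture does not change: the yearly increase in daily downloads is small for the median package and much larger in the top percentiles.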
Top downloaded packages
In order to investigate which packages R users have been using during 2015, I’ve looked at all new packages since the turn of the decade. Since each year of CRAN presence increases the download rate, I’ve split the table by package release year. The results are available for browsing below (yes, it is the brand new interactive htmlTable that allows you to collapse cells; note that it may not work if you are reading this on R-bloggers, and the link is lost under certain circumstances).

Name | Author | Total downloads | Average downloads/day | Description |
---|---|---|---|---|
Top 10 packages published in 2015 | | | | |
xml2 | Hadley Wickham, Jeroen Ooms, RStudio, R Foundation | 348,222 | 1635 | Work with XML files … |
rversions | Gabor Csardi | 386,996 | 1524 | Query the main R SVN… |
git2r | Stefan Widgren | 411,709 | 1303 | Interface to the lib… |
praise | Gabor Csardi, Sindre Sorhus | 96,187 | 673 | Build friendly R pac… |
readxl | David Hoerl | 99,386 | 379 | Import excel files i… |
readr | Hadley Wickham, Romain Francois, R Core Team, RStudio | 90,022 | 337 | Read flat/tabular te… |
DiagrammeR | Richard Iannone | 84,259 | 236 | Create diagrams and … |
visNetwork | Almende B.V. (vis.js library in htmlwidgets/lib, | 41,185 | 233 | Provides an R interf… |
plotly | Carson Sievert, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy | 9,745 | 217 | Easily translate ggp… |
DT | Yihui Xie, Joe Cheng, jQuery contributors, SpryMedia Limited, Brian Reavis, Leon Gersen, Bartek Szopka, RStudio Inc | 24,806 | 120 | Data objects in R ca… |
Top 10 packages published in 2014 | | | | |
stringi | Marek Gagolewski and Bartek Tartanus ; IBM and other contributors ; Unicode, Inc. | 1,316,900 | 3608 | stringi allows for v… |
magrittr | Stefan Milton Bache and Hadley Wickham | 1,245,662 | 3413 | Provides a mechanism… |
mime | Yihui Xie | 1,038,591 | 2845 | This package guesses… |
R6 | Winston Chang | 920,147 | 2521 | The R6 package allow… |
dplyr | Hadley Wickham, Romain Francois | 778,311 | 2132 | A fast, consistent t… |
manipulate | JJ Allaire, RStudio | 626,191 | 1716 | Interactive plotting… |
htmltools | RStudio, Inc. | 619,171 | 1696 | Tools for HTML gener… |
curl | Jeroen Ooms | 599,704 | 1643 | The curl() function … |
lazyeval | Hadley Wickham, RStudio | 572,546 | 1569 | A disciplined approa… |
rstudioapi | RStudio | 515,665 | 1413 | This package provide… |
Top 10 packages published in 2013 | | | | |
jsonlite | Jeroen Ooms, Duncan Temple Lang | 906,421 | 2483 | This package is a fo… |
BH | John W. Emerson, Michael J. Kane, Dirk Eddelbuettel, JJ Allaire, and Romain Francois | 691,280 | 1894 | Boost provides free … |
highr | Yihui Xie and Yixuan Qiu | 641,052 | 1756 | This package provide… |
assertthat | Hadley Wickham | 527,961 | 1446 | assertthat is an ext… |
httpuv | RStudio, Inc. | 310,699 | 851 | httpuv provides low-… |
NLP | Kurt Hornik | 270,682 | 742 | Basic classes and me… |
TH.data | Torsten Hothorn | 242,060 | 663 | Contains data sets u… |
NMF | Renaud Gaujoux, Cathal Seoighe | 228,807 | 627 | This package provide… |
stringdist | Mark van der Loo | 123,138 | 337 | Implements the Hammi… |
SnowballC | Milan Bouchet-Valat | 104,411 | 286 | An R interface to th… |
Top 10 packages published in 2012 | | | | |
gtable | Hadley Wickham | 1,091,440 | 2990 | Tools to make it eas… |
knitr | Yihui Xie | 792,876 | 2172 | This package provide… |
httr | Hadley Wickham | 785,568 | 2152 | Provides useful tool… |
markdown | JJ Allaire, Jeffrey Horner, Vicent Marti, and Natacha Porte | 636,888 | 1745 | Markdown is a plain-… |
Matrix | Douglas Bates and Martin Maechler | 470,468 | 1289 | Classes and methods … |
shiny | RStudio, Inc. | 427,995 | 1173 | Shiny makes it incre… |
lattice | Deepayan Sarkar | 414,716 | 1136 | Lattice is a powerfu… |
pkgmaker | Renaud Gaujoux | 225,796 | 619 | This package provide… |
rngtools | Renaud Gaujoux | 225,125 | 617 | This package contain… |
base64enc | Simon Urbanek | 223,120 | 611 | This package provide… |
Top 10 packages published in 2011 | | | | |
scales | Hadley Wickham | 1,305,000 | 3575 | Scales map data to a… |
devtools | Hadley Wickham | 738,724 | 2024 | Collection of packag… |
RcppEigen | Douglas Bates, Romain Francois and Dirk Eddelbuettel | 634,224 | 1738 | R and Eigen integrat… |
fpp | Rob J Hyndman | 583,505 | 1599 | All data sets requir… |
nloptr | Jelmer Ypma | 583,230 | 1598 | nloptr is an R inter… |
pbkrtest | Ulrich Halekoh Søren Højsgaard | 536,409 | 1470 | Test in linear mixed… |
roxygen2 | Hadley Wickham, Peter Danenberg, Manuel Eugster | 478,765 | 1312 | A Doxygen-like in-so… |
whisker | Edwin de Jonge | 413,068 | 1132 | logicless templating… |
doParallel | Revolution Analytics | 299,717 | 821 | Provides a parallel … |
abind | Tony Plate and Richard Heiberger | 255,151 | 699 | Combine multi-dimens… |
Top 10 packages published in 2010 | | | | |
reshape2 | Hadley Wickham | 1,395,099 | 3822 | Reshape lets you fle… |
labeling | Justin Talbot | 1,104,986 | 3027 | Provides a range of … |
evaluate | Hadley Wickham | 862,082 | 2362 | Parsing and evaluati… |
formatR | Yihui Xie | 640,386 | 1754 | This package provide… |
minqa | Katharine M. Mullen, John C. Nash, Ravi Varadhan | 600,527 | 1645 | Derivative-free opti… |
gridExtra | Baptiste Auguie | 581,140 | 1592 | misc. functions |
memoise | Hadley Wickham | 552,383 | 1513 | Cache the results of… |
RJSONIO | Duncan Temple Lang | 414,373 | 1135 | This is a package th… |
RcppArmadillo | Romain Francois and Dirk Eddelbuettel | 410,368 | 1124 | R and Armadillo inte… |
xlsx | Adrian A. Dragulescu | 401,991 | 1101 | Provide R functions … |
R-star authors
Just for fun I decided to look at who has the most downloads. By splitting multi-author packages into one entry per author and also splitting their downloads, we find that the top R coders in 2015 were:
```r
top_coders <- list(
  "2015" =
    pkgs %>%
    filter(format(date, "%Y") == "2015") %>%
    partition(author) %>%
    do({
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # If multiple authors the statistic is split among
        # them but with an added 20% for the extra collaboration
        # effort that a multi-author environment calls for.
        # Note: as the condition is >= 1, the factor 1.2 is also
        # applied to single-author packages
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>%
    collect() %>%
    group_by(author) %>%
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>%
    select(author, download_ave, no_packages, packages) %>%
    collect() %>%
    arrange(desc(download_ave)) %>%
    head(10),
  "all" =
    pkgs %>%
    partition(author) %>%
    do({
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # Same split as above, including the factor 1.2
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>%
    collect() %>%
    group_by(author) %>%
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>%
    select(author, download_ave, no_packages, packages) %>%
    collect() %>%
    arrange(desc(download_ave)) %>%
    head(30))

interactiveTable(
  do.call(rbind, top_coders) %>%
    mutate(download_ave = txtInt(download_ave)),
  align = "lrr",
  header = c("Coder", "Total ave. downloads per day", "No. of packages", "Packages"),
  tspanner = c("Top coders 2015",
               "Top coders 2010-2015"),
  n.tspanner = sapply(top_coders, nrow),
  minimized.columns = 4,
  rnames = FALSE,
  col.rgroup = c("white", "#F0F0FF"))
```
Coder | Total ave. downloads per day | No. of packages | Packages |
---|---|---|---|
Top coders 2015 | |||
Gabor Csardi | 2,312 | 11 | sankey, franc, rvers… |
Stefan Widgren | 1,563 | 1 | git2r |
RStudio | 781 | 16 | shinydashboard, with… |
Hadley Wickham | 695 | 12 | withr, cellranger, c… |
Jeroen Ooms | 541 | 10 | rjade, js, sodium, w… |
Richard Cotton | 501 | 22 | assertive.base, asse… |
R Foundation | 490 | 1 | xml2 |
David Hoerl | 455 | 1 | readxl |
Sindre Sorhus | 409 | 2 | praise, clisymbols |
Richard Iannone | 294 | 2 | DiagrammeR, stationa… |
Top coders 2010-2015 | |||
Hadley Wickham | 32,115 | 55 | swirl, lazyeval, ggp… |
Yihui Xie | 9,739 | 18 | DT, Rd2roxygen, high… |
RStudio | 9,123 | 25 | shinydashboard, lazy… |
Jeroen Ooms | 4,221 | 25 | JJcorr, gdtools, bro… |
Justin Talbot | 3,633 | 1 | labeling |
Winston Chang | 3,531 | 17 | shinydashboard, font… |
Gabor Csardi | 3,437 | 26 | praise, clisymbols, … |
Romain Francois | 2,934 | 20 | int64, LSD, RcppExam… |
Duncan Temple Lang | 2,854 | 6 | RMendeley, jsonlite,… |
Adrian A. Dragulescu | 2,456 | 2 | xlsx, xlsxjars |
JJ Allaire | 2,453 | 7 | manipulate, htmlwidg… |
Simon Urbanek | 2,369 | 15 | png, fastmatch, jpeg… |
Dirk Eddelbuettel | 2,094 | 33 | Rblpapi, RcppSMC, RA… |
Stefan Milton Bache | 2,069 | 3 | import, blatr, magri… |
Douglas Bates | 1,966 | 5 | PKPDmodels, RcppEige… |
Renaud Gaujoux | 1,962 | 6 | NMF, doRNG, pkgmaker… |
Jelmer Ypma | 1,933 | 2 | nloptr, SparseGrid |
Rob J Hyndman | 1,933 | 3 | hts, fpp, demography |
Baptiste Auguie | 1,924 | 2 | gridExtra, dielectri… |
Ulrich Halekoh Søren Højsgaard | 1,764 | 1 | pbkrtest |
Martin Maechler | 1,682 | 11 | DescTools, stabledis… |
Mirai Solutions GmbH | 1,603 | 3 | XLConnect, XLConnect… |
Stefan Widgren | 1,563 | 1 | git2r |
Edwin de Jonge | 1,513 | 10 | tabplot, tabplotGTK,… |
Kurt Hornik | 1,476 | 12 | movMF, ROI, qrmtools… |
Deepayan Sarkar | 1,369 | 4 | qtbase, qtpaint, lat… |
Tyler Rinker | 1,203 | 9 | cowsay, wakefield, q… |
Yixuan Qiu | 1,131 | 12 | gdtools, svglite, hi… |
Revolution Analytics | 1,011 | 4 | doParallel, doSMP, r… |
Torsten Hothorn | 948 | 7 | MVA, HSAUR3, TH.data… |
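The way downloads are shared between co-authors can be illustrated without the cluster machinery. The sketch below reuses the two regular expressions and the factor 1.2 from the script; `splitDownloads` is a hypothetical helper name, and the inputs are the dplyr and git2r rows from the package table.

```r
# Hypothetical helper: one row per author, each with an equal share of
# the average daily downloads multiplied by the collaboration factor
splitDownloads <- function(author, avg, bonus = 1.2) {
  authors <- strsplit(author, "[ ]*([,;]| and )[ ]*")[[1]]
  # Drop fragments such as "Inc." that are not author names
  authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
  data.frame(author = authors,
             avg = avg / length(authors) * bonus,
             stringsAsFactors = FALSE)
}

# dplyr: two authors share 2132 downloads/day
two_authors <- splitDownloads("Hadley Wickham, Romain Francois", 2132)
print(two_authors)

# git2r: a single author still receives the factor 1.2
one_author <- splitDownloads("Stefan Widgren", 1303)
print(one_author$avg)

# "Inc." is split off by the comma and then filtered out
print(splitDownloads("RStudio, Inc.", 1696)$author)
```

Each dplyr author ends up with 2132 / 2 * 1.2 = 1279.2, and the single-author case gives 1303 * 1.2 = 1563.6, which is in line with the 1,563 listed for Stefan Widgren above (the table values are based on unrounded averages).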
My own 2015-R-experience
My own personal R experience has been dominated by magrittr and dplyr, as seen in the code above. Like most, I find that magrittr makes things a little easier to read, and unless I have some really large dataset the overhead is small. It does have some downsides related to debugging, but these are negligible. When I originally tried dplyr out I came from the plyr environment and was disappointed by the lack of parallelization, and I found the concepts a little odd when thinking the plyr way. I had been using sqldf a lot in my data munging and merging; when I found left_join, inner_join, and the brilliant anti_join I was completely sold. Combined with RStudio, I find the dplyr workflow both more intuitive and more productive than my previous one. When looking at those packages (including more than just the top 10 here) I did find some additional gems that I intend to look into when I have the time:
- DiagrammeR An interesting new way of producing diagrams. I’ve used it for Gantt charts but it allows for much more.
- checkmate A neat package for checking function arguments.
- covr An excellent package for testing how much of a package’s code is tested.
- rex A package for making regular expressions easier.
- openxlsx I wish I didn’t have to but I still get a lot of things in Excel-format – perhaps this package solves the Excel-import inferno…
- R6 The successor to reference classes – after working with the Gmisc::Transition-class I appreciate the need for a better system.
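For readers coming from sqldf, here is a minimal sketch of the three dplyr joins mentioned earlier in this section. The two small data frames are made up for the illustration (the download figures are taken from the package table, the `maintainer` column is a hypothetical field).

```r
library(dplyr)

# Toy data for illustration only
downloads <- data.frame(name = c("xml2", "readr", "praise"),
                        avg = c(1635, 337, 673),
                        stringsAsFactors = FALSE)
maintainers <- data.frame(name = c("xml2", "readr", "DT"),
                          maintainer = c("A", "B", "C"),
                          stringsAsFactors = FALSE)

# Keep every row of downloads, filling in NA where no match exists
left <- left_join(downloads, maintainers, by = "name")

# Keep only the rows present in both tables
inner <- inner_join(downloads, maintainers, by = "name")

# Keep the rows of downloads that have no match in maintainers
anti <- anti_join(downloads, maintainers, by = "name")

print(left)
print(inner)
print(anti)
```

Here `anti_join()` returns only the praise row, which is the kind of "what is missing from the other table" question that otherwise needs a `NOT IN` or `LEFT JOIN ... IS NULL` query in SQL.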