R trends in 2015 (based on cranlogs)

Posted on January 20, 2016 by Max Gordon in R bloggers

[This article was first published on R – G-Forge, and kindly contributed to R-bloggers.]


What are the current tRends? The image is CC from coco + kelly.

It is always fun to look back and reflect on the past year. Inspired by Christoph Safferling’s post on the top packages published in 2015, I decided to have my own go at the top R trends of 2015. In contrast to Safferling’s post, I’ll also try to (1) look at packages from previous years that hit the big league, (2) see what top R coders we have in the community, and then (3) round up with my own 2015-R-experience. Everything in this post is based on the CRANberries reports. To harvest the information I’ve borrowed shamelessly from Safferling’s post, with some modifications. He used the number of downloads as a proxy for the package release date, while I decided to use the release date itself; if that wasn’t available I scraped it off the CRAN servers. The script now also retrieves the package author(s) and description (see code below for details).


library(rvest)
library(dplyr)
# devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(magrittr)
library(lubridate)
 
getCranberriesElmnt <- function(txt, elmnt_name){
  desc <- grep(sprintf("^%s:", elmnt_name), txt)
  if (length(desc) == 1){
    txt <- txt[desc:length(txt)]
    end <- grep("^[A-Za-z/@]{2,}:", txt[-1])
    if (length(end) == 0)
      end <- length(txt)
    else
      end <- end[1]
 
    desc <-
      txt[1:end] %>% 
      gsub(sprintf("^%s: (.+)", elmnt_name),
           "\\1", .) %>% 
      paste(collapse = " ") %>% 
      gsub("[ ]{2,}", " ", .) %>% 
      gsub(" , ", ", ", .)
  }else if (length(desc) == 0){
    desc <- paste("No", tolower(elmnt_name))
  }else{
    stop("Found more than one ", elmnt_name, " field in text:\n",
         paste(txt, collapse = "\n"))
  }
  return(desc)
}
 
convertCharset <- function(txt){
  if (grepl("Windows", Sys.info()["sysname"]))
    txt <- iconv(txt, from = "UTF-8", to = "cp1252")
  return(txt)
}
 
getAuthor <- function(txt, package){
  author <- getCranberriesElmnt(txt, "Author")
  if (grepl("No author|See AUTHORS file", author)){
    author <- getCranberriesElmnt(txt, "Maintainer")
  }
 
  if (grepl("(No m|M)aintainer|(No a|A)uthor|^See AUTHORS file", author) || 
      is.null(author) ||
      nchar(author)  <= 2){
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    author <- cran_txt %>% 
      html_nodes("tr") %>% 
      html_text %>% 
      convertCharset %>% 
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
      .[grep("^Author", .)] %>% 
      gsub(".*\n", "", .)
 
    # If not found then the package has probably been
    # removed from the repository
    if (length(author) == 1)
      author <- author
    else
      author <- "No author"
  }
 
  # Remove stuff such as:
  # [cre, auth]
  # (worked on the...)
  # <e-mail address>
  # "John Doe"
  author %<>% 
    gsub("^Author: (.+)", 
         "\\1", .) %>% 
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>% 
    gsub("\\([^)]+\\)", " ", .) %>% 
    gsub("([ ]*<[^>]+>)", " ", .) %>% 
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>% 
    gsub("[ ]{2,}", " ", .) %>% 
    gsub("(^[ '\"]+|[ '\"]+$)", "", .) %>% 
    gsub(" , ", ", ", .)
  return(author)
}
 
getDate <- function(txt, package){
  date <- 
    grep("^Date/Publication", txt)
  if (length(date) == 1){
    date <- txt[date] %>% 
      gsub("Date/Publication: ([0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2}).*",
           "\\1", .)
  }else{
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    date <- 
      cran_txt %>% 
      html_nodes("tr") %>% 
      html_text %>% 
      convertCharset %>% 
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
      .[grep("^Published", .)] %>% 
      gsub(".*\n", "", .)
 
 
    # The main page doesn't contain the original date if 
    # new packages have been submitted, we therefore need
    # to check first entry in the archives
    if(cran_txt %>% 
       html_nodes("tr") %>% 
       html_text %>% 
       gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
       grepl("^Old.{1,4}sources", .) %>% 
       any){
      archive_txt <- read_html(sprintf("http://cran.r-project.org/src/contrib/Archive/%s/",
                                       package))
      pkg_date <- 
        archive_txt %>% 
        html_nodes("tr") %>% 
        lapply(function(x) {
          nodes <- html_nodes(x, "td")
          if (length(nodes) == 5){
            return(nodes[3] %>% 
                     html_text %>% 
                     as.Date(format = "%d-%b-%Y"))
          }
        }) %>% 
        .[sapply(., length) > 0] %>% 
        .[!sapply(., is.na)] %>% 
        head(1)
 
      if (length(pkg_date) == 1)
        date <- pkg_date[[1]]
    }
  }
  date <- tryCatch({
    as.Date(date)
  }, error = function(e){
    "Date missing"
  })
  return(date)
}
 
getNewPkgStats <- function(published_in){
  # The parallel is only for making cranlogs requests
  # we can therefore have more cores than actual cores
  # as this isn't processor intensive while there is
  # considerable wait for each http-request
  cl <- create_cluster(parallel::detectCores() * 4)
  parallel::clusterEvalQ(cl, {
    library(cranlogs)
  })
  set_default_cluster(cl)
  on.exit(stop_cluster())
 
  berries <- read_html(paste0("http://dirk.eddelbuettel.com/cranberries/", published_in, "/"))
  pkgs <- 
    # Select the divs of the package class
    html_nodes(berries, ".package") %>% 
    # Extract the text
    html_text %>% 
    # Split the lines
    strsplit("[\n]+") %>% 
    # Now clean the lines
    lapply(.,
           function(pkg_txt) {
             pkg_txt[sapply(pkg_txt, function(x) { nchar(gsub("^[ \t]+", "", x)) > 0}, 
                            USE.NAMES = FALSE)] %>% 
               gsub("^[ \t]+", "", .) 
           })
 
  # Now we select the new packages
  new_packages <- 
    pkgs %>% 
    # The first line is key as it contains the text "New package"
    sapply(., function(x) x[1], USE.NAMES = FALSE) %>% 
    grep("^New package", .) %>% 
    pkgs[.] %>% 
    # Now we extract the package name and the date that it was published
    # and merge everything into one table
    lapply(function(txt){
      txt <- convertCharset(txt)
      ret <- data.frame(
        name = gsub("^New package ([^ ]+) with initial .*", 
                     "\\1", txt[1]),
        stringsAsFactors = FALSE
      )
 
      ret$desc <- getCranberriesElmnt(txt, "Description")
      ret$author <- getAuthor(txt, ret$name)
      ret$date <- getDate(txt, ret$name)
 
      return(ret)
    }) %>% 
    rbind_all %>% 
    # Get the download data in parallel
    partition(name) %>% 
    do({
      down <- cran_downloads(.$name[1], 
                             from = max(as.Date("2015-01-01"), .$date[1]), 
                             to = "2015-12-31")$count 
      cbind(.[1,],
            data.frame(sum = sum(down), 
                       avg = mean(down))
      )
    }) %>% 
    collect %>% 
    ungroup %>% 
    arrange(desc(avg))
 
  return(new_packages)
}
 
pkg_list <- 
  lapply(2010:2015,
         getNewPkgStats)
 
pkgs <- 
  rbind_all(pkg_list) %>% 
  mutate(time = as.numeric(as.Date("2016-01-01") - date),
         year = format(date, "%Y"))
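
The field-extraction idea behind getCranberriesElmnt and getDate can be tried offline. Below is a minimal sketch in base R, assuming a hypothetical entry (`demoPkg` and its fields are made up for illustration) and a simplified helper named `extract_field` that is not part of the script above. It shows how continuation lines are joined up to the next "Name:" line and how the `"\\1"` backreference keeps only the captured value.

```r
# Hypothetical lines resembling one cleaned CRANberries entry
entry <- c(
  "New package demoPkg with initial version 0.1.0",
  "Package: demoPkg",
  "Title: A Demo",
  "Description: Tools for doing",
  "many useful things.",
  "Author: Jane Doe [aut, cre]",
  "Date/Publication: 2015-06-01 12:00:00"
)

# Return the value of one field, joining continuation lines
# up to the next line that looks like "Name:"
extract_field <- function(lines, field) {
  start <- grep(sprintf("^%s:", field), lines)
  if (length(start) != 1) return(NA_character_)
  rest <- lines[start:length(lines)]
  nxt <- grep("^[A-Za-z/@]{2,}:", rest[-1])
  end <- if (length(nxt) == 0) length(rest) else nxt[1]
  value <- paste(rest[1:end], collapse = " ")
  # "\\1" is the backreference to the first capture group
  sub(sprintf("^%s: (.+)", field), "\\1", value)
}

desc <- extract_field(entry, "Description")

# Keep only the yyyy-mm-dd part of the publication stamp
published <- as.Date(sub("^([0-9]{4}-[0-9]{2}-[0-9]{2}).*", "\\1",
                         extract_field(entry, "Date/Publication")))
```

Unlike the script above, this helper returns NA for a missing or duplicated field instead of a placeholder string or an error, which is a simplification for the sake of the example.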

Downloads and time on CRAN

The longer a package has been on CRAN, the more downloads it gets. We can illustrate this using simple linear regression; slightly surprisingly, the relationship is mostly linear:


pkgs %<>% 
  mutate(time_yrs = time/365.25)
fit <- lm(avg ~ time_yrs, data = pkgs)
 
# Test for non-linearity
library(splines)
anova(fit,
      update(fit, .~.-time_yrs+ns(time_yrs, 2)))

Analysis of Variance Table

Model 1: avg ~ time
Model 2: avg ~ ns(time, 2)
  Res.Df       RSS Df Sum of Sq      F Pr(>F)
1   7348 189661922                           
2   7347 189656567  1    5355.1 0.2075 0.6488

The average number of daily downloads increases by about 5 for each year on CRAN. It can easily be argued that the average number of downloads isn’t that interesting since the data is skewed; we can therefore also look at the upper quantiles using quantile regression:


library(quantreg)
library(htmlTable)
lapply(c(.5, .75, .95, .99),
       function(tau){
         rq_fit <- rq(avg ~ time_yrs, data = pkgs, tau = tau)
         rq_sum <- summary(rq_fit)
         c(Estimate = txtRound(rq_sum$coefficients[2, 1], 1), 
           `95 % CI` = txtRound(rq_sum$coefficients[2, 1] + 
                                        c(1,-1) * rq_sum$coefficients[2, 2], 1) %>% 
             paste(collapse = " to "))
       }) %>% 
  do.call(rbind, .) %>% 
  htmlTable(rnames = c("Median",
                       "Upper quartile",
                       "Top 5%",
                       "Top 1%"))

	Estimate	95 % CI
Median	0.6	0.6 to 0.6
Upper quartile	1.2	1.2 to 1.1
Top 5%	9.7	11.9 to 7.6
Top 1%	182.5	228.2 to 136.9

The above table conveys a slightly more interesting picture. Most packages don’t get that much attention while the top 1% truly reach the masses.
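
A note on the intervals: the code above builds each one as the estimate plus and minus one times `rq_sum$coefficients[2, 2]`, upper bound first, which is why they read from high to low. Assuming that column holds the standard error (worth confirming with `colnames(rq_sum$coefficients)`), a conventional 95 % interval would use roughly 1.96 standard errors and list the lower bound first. A minimal sketch, with a hypothetical helper `wald_ci` and made-up numbers that are not taken from the fits above:

```r
# Wald-type interval: estimate +/- z * standard error, lower bound first
wald_ci <- function(estimate, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = estimate - z * se, upper = estimate + z * se)
}

# Hypothetical slope and standard error, for illustration only
ci <- wald_ci(estimate = 10, se = 2)
round(ci, 1)
```

With these illustrative inputs the interval is about 6.1 to 13.9, noticeably wider than the 8 to 12 that plus and minus one standard error would give.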

Top downloaded packages

In order to investigate what packages R users have been using during 2015, I’ve looked at all new packages since the turn of the decade. Since each year of CRAN presence increases the download rates, I’ve split the table by package release year. The results are available for browsing below (yes, it is the brand new interactive htmlTable that allows you to collapse cells; note that it may not work if you are reading this on R-bloggers, as the link is lost under certain circumstances).

		Downloads
Name	Author	Total	Average/day	Description
Top 10 packages published in 2015
xml2	Hadley Wickham, Jeroen Ooms, RStudio, R Foundation	348,222	1635	Work with XML files …
rversions	Gabor Csardi	386,996	1524	Query the main R SVN…
git2r	Stefan Widgren	411,709	1303	Interface to the lib…
praise	Gabor Csardi, Sindre Sorhus	96,187	673	Build friendly R pac…
readxl	David Hoerl	99,386	379	Import excel files i…
readr	Hadley Wickham, Romain Francois, R Core Team, RStudio	90,022	337	Read flat/tabular text files from disk.
DiagrammeR	Richard Iannone	84,259	236	Create diagrams and flowcharts using R.
visNetwork	Almende B.V. (vis.js library in htmlwidgets/lib,	41,185	233	Provides an R interf…
plotly	Carson Sievert, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy	9,745	217	Easily translate ggp…
DT	Yihui Xie, Joe Cheng, jQuery contributors, SpryMedia Limited, Brian Reavis, Leon Gersen, Bartek Szopka, RStudio Inc	24,806	120	Data objects in R ca…
Top 10 packages published in 2014
stringi	Marek Gagolewski and Bartek Tartanus ; IBM and other contributors ; Unicode, Inc.	1,316,900	3608	stringi allows for v…
magrittr	Stefan Milton Bache and Hadley Wickham	1,245,662	3413	Provides a mechanism…
mime	Yihui Xie	1,038,591	2845	This package guesses…
R6	Winston Chang	920,147	2521	The R6 package allow…
dplyr	Hadley Wickham, Romain Francois	778,311	2132	A fast, consistent t…
manipulate	JJ Allaire, RStudio	626,191	1716	Interactive plotting…
htmltools	RStudio, Inc.	619,171	1696	Tools for HTML generation and output
curl	Jeroen Ooms	599,704	1643	The curl() function …
lazyeval	Hadley Wickham, RStudio	572,546	1569	A disciplined approa…
rstudioapi	RStudio	515,665	1413	This package provide…
Top 10 packages published in 2013
jsonlite	Jeroen Ooms, Duncan Temple Lang	906,421	2483	This package is a fo…
BH	John W. Emerson, Michael J. Kane, Dirk Eddelbuettel, JJ Allaire, and Romain Francois	691,280	1894	Boost provides free …
highr	Yihui Xie and Yixuan Qiu	641,052	1756	This package provide…
assertthat	Hadley Wickham	527,961	1446	assertthat is an ext…
httpuv	RStudio, Inc.	310,699	851	httpuv provides low-…
NLP	Kurt Hornik	270,682	742	Basic classes and me…
TH.data	Torsten Hothorn	242,060	663	Contains data sets u…
NMF	Renaud Gaujoux, Cathal Seoighe	228,807	627	This package provide…
stringdist	Mark van der Loo	123,138	337	Implements the Hammi…
SnowballC	Milan Bouchet-Valat	104,411	286	An R interface to th…
Top 10 packages published in 2012
gtable	Hadley Wickham	1,091,440	2990	Tools to make it eas…
knitr	Yihui Xie	792,876	2172	This package provide…
httr	Hadley Wickham	785,568	2152	Provides useful tool…
markdown	JJ Allaire, Jeffrey Horner, Vicent Marti, and Natacha Porte	636,888	1745	Markdown is a plain-…
Matrix	Douglas Bates and Martin Maechler	470,468	1289	Classes and methods …
shiny	RStudio, Inc.	427,995	1173	Shiny makes it incre…
lattice	Deepayan Sarkar	414,716	1136	Lattice is a powerfu…
pkgmaker	Renaud Gaujoux	225,796	619	This package provide…
rngtools	Renaud Gaujoux	225,125	617	This package contain…
base64enc	Simon Urbanek	223,120	611	This package provide…
Top 10 packages published in 2011
scales	Hadley Wickham	1,305,000	3575	Scales map data to a…
devtools	Hadley Wickham	738,724	2024	Collection of package development tools
RcppEigen	Douglas Bates, Romain Francois and Dirk Eddelbuettel	634,224	1738	R and Eigen integrat…
fpp	Rob J Hyndman	583,505	1599	All data sets requir…
nloptr	Jelmer Ypma	583,230	1598	nloptr is an R inter…
pbkrtest	Ulrich Halekoh Søren Højsgaard	536,409	1470	Test in linear mixed…
roxygen2	Hadley Wickham, Peter Danenberg, Manuel Eugster	478,765	1312	A Doxygen-like in-so…
whisker	Edwin de Jonge	413,068	1132	logicless templating…
doParallel	Revolution Analytics	299,717	821	Provides a parallel …
abind	Tony Plate and Richard Heiberger	255,151	699	Combine multi-dimens…
Top 10 packages published in 2010
reshape2	Hadley Wickham	1,395,099	3822	Reshape lets you fle…
labeling	Justin Talbot	1,104,986	3027	Provides a range of …
evaluate	Hadley Wickham	862,082	2362	Parsing and evaluati…
formatR	Yihui Xie	640,386	1754	This package provide…
minqa	Katharine M. Mullen, John C. Nash, Ravi Varadhan	600,527	1645	Derivative-free opti…
gridExtra	Baptiste Auguie	581,140	1592	misc. functions
memoise	Hadley Wickham	552,383	1513	Cache the results of…
RJSONIO	Duncan Temple Lang	414,373	1135	This is a package th…
RcppArmadillo	Romain Francois and Dirk Eddelbuettel	410,368	1124	R and Armadillo inte…
xlsx	Adrian A. Dragulescu	401,991	1101	Provide R functions …

Just as Safferling et al. noted, there is a dominance of technical packages. This is hardly surprising since the majority of the work is data munging. Among these technical packages there are quite a few that are used for developing other packages, e.g. roxygen2, pkgmaker, devtools, and more.

R-star authors

Just for fun I decided to look at who has the most downloads. By splitting multi-author packages into one entry per author, and splitting their downloads accordingly, we find that in 2015 the top R coders were:


top_coders <- list(
  "2015" = 
    pkgs %>% 
    filter(format(date, "%Y") == 2015) %>% 
    partition(author) %>% 
    do({
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # If multiple authors the statistic is split among
        # them but with an added 20% for the extra collaboration
        # effort that a multi-author environment calls for
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>% 
    collect() %>% 
    group_by(author) %>% 
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>% 
    select(author, download_ave, no_packages, packages) %>% 
    collect() %>% 
    arrange(desc(download_ave)) %>% 
    head(10),
  "all" =
    pkgs %>% 
    partition(author) %>% 
    do({
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # If multiple authors the statistic is split among
        # them but with an added 20% for the extra collaboration
        # effort that a multi-author environment calls for
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>% 
    collect() %>% 
    group_by(author) %>% 
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>% 
    select(author, download_ave, no_packages, packages) %>% 
    collect() %>% 
    arrange(desc(download_ave)) %>% 
    head(30))
 
interactiveTable(
  do.call(rbind, top_coders) %>% 
    mutate(download_ave = txtInt(download_ave)),
  align = "lrr",
  header = c("Coder", "Total ave. downloads per day", "No. of packages", "Packages"),
  tspanner = c("Top coders 2015",
               "Top coders 2010-2015"),
  n.tspanner = sapply(top_coders, nrow),
  minimized.columns = 4, 
  rnames = FALSE, 
  col.rgroup = c("white", "#F0F0FF"))

Coder	Total ave. downloads	No. of packages	Packages
Top coders 2015
Gabor Csardi	2,312	11	sankey, franc, rvers…
Stefan Widgren	1,563	1	git2r
RStudio	781	16	shinydashboard, with…
Hadley Wickham	695	12	withr, cellranger, c…
Jeroen Ooms	541	10	rjade, js, sodium, w…
Richard Cotton	501	22	assertive.base, asse…
R Foundation	490	1	xml2
David Hoerl	455	1	readxl
Sindre Sorhus	409	2	praise, clisymbols
Richard Iannone	294	2	DiagrammeR, stationaRy
Top coders 2010-2015
Hadley Wickham	32,115	55	swirl, lazyeval, ggp…
Yihui Xie	9,739	18	DT, Rd2roxygen, high…
RStudio	9,123	25	shinydashboard, lazy…
Jeroen Ooms	4,221	25	JJcorr, gdtools, bro…
Justin Talbot	3,633	1	labeling
Winston Chang	3,531	17	shinydashboard, font…
Gabor Csardi	3,437	26	praise, clisymbols, …
Romain Francois	2,934	20	int64, LSD, RcppExam…
Duncan Temple Lang	2,854	6	RMendeley, jsonlite,…
Adrian A. Dragulescu	2,456	2	xlsx, xlsxjars
JJ Allaire	2,453	7	manipulate, htmlwidg…
Simon Urbanek	2,369	15	png, fastmatch, jpeg…
Dirk Eddelbuettel	2,094	33	Rblpapi, RcppSMC, RA…
Stefan Milton Bache	2,069	3	import, blatr, magrittr
Douglas Bates	1,966	5	PKPDmodels, RcppEige…
Renaud Gaujoux	1,962	6	NMF, doRNG, pkgmaker…
Jelmer Ypma	1,933	2	nloptr, SparseGrid
Rob J Hyndman	1,933	3	hts, fpp, demography
Baptiste Auguie	1,924	2	gridExtra, dielectric
Ulrich Halekoh Søren Højsgaard	1,764	1	pbkrtest
Martin Maechler	1,682	11	DescTools, stabledis…
Mirai Solutions GmbH	1,603	3	XLConnect, XLConnectJars, XLConnectJars
Stefan Widgren	1,563	1	git2r
Edwin de Jonge	1,513	10	tabplot, tabplotGTK,…
Kurt Hornik	1,476	12	movMF, ROI, qrmtools…
Deepayan Sarkar	1,369	4	qtbase, qtpaint, lattice, qtutils
Tyler Rinker	1,203	9	cowsay, wakefield, q…
Yixuan Qiu	1,131	12	gdtools, svglite, hi…
Revolution Analytics	1,011	4	doParallel, doSMP, revoIPC, checkpoint
Torsten Hothorn	948	7	MVA, HSAUR3, TH.data…
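
The splitting rule behind these tables can be sketched on a single row in base R. The helper `split_authors` and the toy package below are hypothetical; the regular expressions are the ones used in the code above. Note that, as written, the condition `length(authors) >= 1` also applies the extra 20 % to single-author packages, which is consistent with git2r's roughly 1,303 downloads per day showing up as 1,563 for Stefan Widgren.

```r
# Split one package row among its authors, mirroring the rule above:
# divide the downloads by the number of authors and add 20 %
split_authors <- function(row) {
  authors <- strsplit(row$author, "[ ]*([,;]| and )[ ]*")[[1]]
  # Drop fragments that are titles or affiliations rather than names
  authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
  if (length(authors) == 0) return(row)
  data.frame(name = row$name,
             author = authors,
             avg = row$avg / length(authors) * 1.2,
             stringsAsFactors = FALSE)
}

# Hypothetical package with two authors and 100 downloads per day
toy <- data.frame(name = "demoPkg",
                  author = "Jane Doe and John Smith, PhD",
                  avg = 100,
                  stringsAsFactors = FALSE)
res <- split_authors(toy)
```

Here each of the two authors is credited with 60 downloads per day, so the credited total (120) exceeds the package's actual 100; totals per author are therefore not directly comparable with the per-package figures.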

It is worth mentioning that two of the top coders are companies, RStudio and Revolution Analytics. While I like the fact that R is free and open-source, I doubt that the community would have grown as quickly as it has without these companies. It is also symptomatic of 2015 that companies are taking R into account; it will be interesting to see what the R Consortium will bring to the community. I think the r-hub is incredibly interesting and will hopefully make my life as an R-package developer easier.

My own 2015-R-experience

My own personal R experience has been dominated by magrittr and dplyr, as seen in the code above. Like most, I find that magrittr makes things a little easier to read, and unless I have some really large dataset the overhead is small. It does have some downsides related to debugging, but these are negligible. When I originally tried dplyr out I came from the plyr environment and was disappointed by the lack of parallelization, and I found the concepts a little odd when thinking the plyr way. I had been using sqldf a lot in my data munging and merging; when I found left_join, inner_join, and the brilliant anti_join I was completely sold. Combined with RStudio I find the dplyr workflow both more intuitive and more productive than my previous one. When looking at those packages (including more than just the top 10 here) I did find some additional gems that I intend to look into when I have the time:

DiagrammeR An interesting new way of producing diagrams. I’ve used it for gantt charts but it allows for much more.
checkmate A neat package for checking function arguments.
covr An excellent package for measuring how much of a package’s code is covered by tests.
rex A package for making regular expressions easier.
openxlsx I wish I didn’t have to but I still get a lot of things in Excel-format – perhaps this package solves the Excel-import inferno…
R6 The successor to reference classes – after working with the Gmisc::Transition-class I appreciate the need for a better system.
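
For readers who have not used the dplyr joins mentioned in this section, here is a minimal sketch on two small hypothetical tables (the package names and numbers are made up). It assumes dplyr is installed and only uses `left_join`, `inner_join`, and `anti_join` with a `by` argument.

```r
library(dplyr)

# Hypothetical tables: packages and their download counts
packages <- data.frame(name = c("pkgA", "pkgB", "pkgC"),
                       year = c(2014, 2015, 2015),
                       stringsAsFactors = FALSE)
downloads <- data.frame(name = c("pkgA", "pkgB"),
                        avg = c(120, 45),
                        stringsAsFactors = FALSE)

# Keep every package, with NA where no downloads were recorded
all_pkgs <- left_join(packages, downloads, by = "name")

# Keep only packages present in both tables
matched <- inner_join(packages, downloads, by = "name")

# Keep packages that have no match in the download table
missing <- anti_join(packages, downloads, by = "name")
```

The anti join is the handy one for data cleaning: it answers "which rows failed to match?" directly, here returning only pkgC.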
