Searching for R Packages

[This article was first published on R Views, and kindly contributed to R-bloggers.]

    Searching for R packages is a vexing problem for both new and experienced R users. With over 13,000 packages already on CRAN, and new packages arriving at a rate of almost 200 per month, it is impossible to keep up. Package names can be almost anything, and they are rarely informative, so searching by name is of little help. I make it a point to look at all of the new packages arriving on CRAN each month, but after a month or so, when asked about packages related to some particular topic, more often than not, I have little more to offer than a vague memory that I saw something that might be useful.

    Fortunately, package developers have provided some very useful tools, if you know where to look. 🙂 This post presents a search strategy based on some relatively new packages I have come across in my monthly review.

    library(tidyverse)
    ## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
    ## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
    ## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
    ## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
    ## ✔ readr   1.1.1     ✔ forcats 0.3.0
    ## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()
    library(packagefinder)
    library(dlstats)
    library(cranly)

    packagefinder v0.0.7, which appeared on CRAN this past July, goes right to the heart of the problem and shows great promise. The function findPackage() allows you to do a keyword search through the metadata of all CRAN packages. Since I am researching a possible post on Permutation Tests, I thought I would give packagefinder::findPackage() the most straightforward search text I could think of. (Note that the link for Permutation Tests above goes to an example by Thomas Leeper that references the coin package. This is a pretty strong hint that I expect to find coin prominently listed among the results.)

    Also note that converting the output to a tibble is not just obsessive-compulsive tidy behavior: the default print method sends the output to the Viewer in the RStudio IDE, whereas a tibble prints in the console.

    pt_pkg <- as_tibble(findPackage("permutation test"))
    ## 
    ##  59  out of  13256  CRAN packages found in 6 seconds.
    pt_pkg
    ## # A tibble: 59 x 5
    ##    SCORE NAME       DESC_SHORT                          DOWNL_TOTAL  GO   
    ##    <dbl> <chr>      <chr>                               <S3: format> <fct>
    ##  1   100 permutes   Permutation Tests for Time Series …  NA          8300 
    ##  2    75 AUtests    Approximate Unconditional and Perm…  NA          502  
    ##  3    75 jmuOutlier Permutation Tests for Nonparametri…  NA          5564 
    ##  4    75 lmPerm     Permutation Tests for Linear Models  NA          6083 
    ##  5    75 NetRep     Permutation Testing Network Module…  NA          7453 
    ##  6    75 perm       Exact or Asymptotic permutation te…  NA          8289 
    ##  7    75 permDep    Permutation Tests for General Depe…  NA          8292 
    ##  8    75 permuco    "Permutation Tests for Regression,…  NA          8297 
    ##  9    75 RATest     Randomization Tests                  NA          9287 
    ## 10    75 treeperm   Exact and Asymptotic K Sample Perm…  NA          12442
    ## # ... with 49 more rows

    Unfortunately, the package is very new and not well-documented. It is not clear how SCORE is computed, and DOWNL_TOTAL is replete with NAs. Nevertheless, the function does seem to find packages. I can’t vouch for its completeness, but when I tried it out on some topics with which I am familiar, it did a pretty thorough job. Note that findPackage() allows a user to set a weights parameter that affects how the search scores “hits in the package’s title, short description and long description”. So far, I have not found this to be particularly useful, but I have not spent a lot of time with it, either.
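
    To make the idea of weighted keyword hits concrete, here is a minimal base R sketch. It is not packagefinder's algorithm (which, as noted, is undocumented); the toy metadata, the helper names count_hits() and score_packages(), and the weights are all assumptions made up for illustration.

```r
# Toy metadata: package names and text are made up for illustration only.
pkgs <- data.frame(
  name  = c("pkgA", "pkgB", "pkgC"),
  title = c("Permutation Tests for Toy Data",
            "Plotting Helpers",
            "Resampling Utilities"),
  description = c("Exact permutation test procedures.",
                  "Themes and palettes for charts.",
                  "Bootstrap and permutation test helpers."),
  stringsAsFactors = FALSE
)

# Count, for each text, how many of the keywords it contains (case-insensitive).
count_hits <- function(text, keywords) {
  hits <- lapply(keywords, function(k) grepl(tolower(k), tolower(text), fixed = TRUE))
  as.integer(Reduce(`+`, hits))
}

# A hit in the title counts more than a hit in the description.
# These weights are arbitrary assumptions, not packagefinder's defaults.
score_packages <- function(db, keywords, weights = c(title = 2, description = 1)) {
  raw <- weights[["title"]] * count_hits(db$title, keywords) +
    weights[["description"]] * count_hits(db$description, keywords)
  out <- data.frame(name = db$name, score = raw, stringsAsFactors = FALSE)
  out <- out[out$score > 0, ]   # keep only packages with at least one hit
  out[order(-out$score), ]      # best matches first
}

result <- score_packages(pkgs, c("permutation", "test"))
result
```

    Raising the title weight relative to the description weight pushes packages whose titles mention the keywords further up the list, which is presumably the kind of control the weights parameter is meant to offer.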

    The next line of code just selects the columns we will be using.

    pt_pkg <- select(pt_pkg, NAME, DESC_SHORT)

    Now that we have a list of packages of interest, it would be nice to have an indication of the quality and usefulness of the packages selected. A natural measure of usefulness is the number of times the package has been downloaded. For this, we turn to the cran_stats() function from the dlstats package. This function takes a vector of package names as input, queries the RStudio download logs, and returns a data frame listing the number of downloads by month for each package.

    pt_downloads <- cran_stats(pt_pkg$NAME)
    dim(pt_downloads)
    ## [1] 2784    4
    head(pt_downloads)
    ##           start        end downloads  package
    ## 4485 2018-05-01 2018-05-31        52 permutes
    ## 4544 2018-06-01 2018-06-30        89 permutes
    ## 4603 2018-07-01 2018-07-31        92 permutes
    ## 4662 2018-08-01 2018-08-31        74 permutes
    ## 4721 2018-09-01 2018-09-30       227 permutes
    ## 4780 2018-10-01 2018-10-22       142 permutes

    Just a little wrangling yields a data frame that lists total downloads for each package over its lifespan.

    top_downloads <- pt_downloads %>% group_by(package) %>% 
                     summarize(downloads = sum(downloads)) %>%
                     arrange(desc(downloads))
    head(top_downloads,10)
    ## # A tibble: 10 x 2
    ##    package        downloads
    ##    <fct>              <int>
    ##  1 coin             1103426
    ##  2 exactRankTests    137674
    ##  3 RVAideMemoire     108837
    ##  4 perm               97071
    ##  5 logcondens         83033
    ##  6 HardyWeinberg      55735
    ##  7 biotools           47694
    ##  8 smacof             45257
    ##  9 SNPassoc           38920
    ## 10 broman             30956

    As expected, coin has flipped to the head of the list. Plotting the downloads over time shows that the package has increased in popularity over the past five years, and it looks like people have been doing a crazy amount of permutation testing over the past year or so.

    top_pkgs <- pt_downloads %>% filter(package %in% top_downloads$package[1:3])
    ggplot(top_pkgs, aes(end, downloads, group=package, color=package)) +
      geom_line() + geom_point(aes(shape=package))
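
    A plot is the quickest way to see a trend, but yearly totals make it easy to put a number on the growth. The sketch below assumes a data frame shaped like the cran_stats() output shown earlier (end, downloads, package); the counts and the package name pkgA are made up for illustration.

```r
# Made-up monthly counts in the same shape as the cran_stats() output above.
toy <- data.frame(
  end = as.Date(c("2016-11-30", "2016-12-31", "2017-11-30", "2017-12-31")),
  downloads = c(100L, 120L, 300L, 340L),
  package = "pkgA",
  stringsAsFactors = FALSE
)

# Derive the calendar year from the end date of each monthly window.
toy$year <- format(toy$end, "%Y")

# Sum the monthly downloads within each package and year.
yearly <- aggregate(downloads ~ package + year, data = toy, FUN = sum)
yearly
```

    Note that the most recent month in the real data is partial (the last window above ends on 2018-10-22), so the current year's total will understate a full year of downloads.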

    One way to gauge the quality and reliability of a package is to see how many other packages depend on it. These would be the packages listed as “Reverse depends” and “Reverse imports” on the CRAN page for a package. Using the canonical link, https://cran.r-project.org/package=coin, we see that 24 packages are listed in these fields on the coin page.
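
    The same count can be obtained programmatically rather than by reading the CRAN page. On live CRAN data, tools::package_dependencies() with reverse = TRUE is the usual route. The sketch below shows the underlying idea on a toy dependency table so that it runs offline; the package names and the helpers parse_deps() and reverse_deps() are hypothetical.

```r
# Toy dependency table; the package names are hypothetical.
db <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Depends = c("R (>= 3.1.0)", "R (>= 3.4.0), pkgA (>= 1.0)", NA, "pkgB"),
  Imports = c("stats", "utils", "pkgA, stats", NA),
  stringsAsFactors = FALSE
)

# Split a comma-separated dependency field and drop version constraints.
parse_deps <- function(field) {
  if (is.na(field)) return(character(0))
  deps <- strsplit(field, ",", fixed = TRUE)[[1]]
  trimws(sub("\\(.*\\)", "", deps))
}

# Return the packages that list `target` in any of the given fields.
reverse_deps <- function(db, target, fields = c("Depends", "Imports")) {
  uses_target <- vapply(seq_len(nrow(db)), function(i) {
    deps <- unlist(lapply(fields, function(f) parse_deps(db[[f]][i])))
    target %in% deps
  }, logical(1))
  db$Package[uses_target]
}

rev_a <- reverse_deps(db, "pkgA")
rev_a
length(rev_a)   # number of direct reverse dependencies
```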

    Likewise, knowing something of an author’s background, his or her experience writing other R packages, and prominent R developers he or she may have collaborated with is also helpful in assessing whether a newly found package is worth a try. The same link above also shows the package’s authors. Checking the Contributors page for the R Project, we see that two authors are members of R Core and the lead author, Torsten Hothorn, is listed with the contributors who have provided “invaluable help”. The background and collaborators couldn’t be better.

    In most cases, background checks aren’t so easy. However, with the help of the build_network() function from the cranly package, it is simple to track down an author’s collaboration network. Here, we see that Torsten has an extensive network of collaborators.

    p_db <- tools::CRAN_package_db()
    clean_p_db <- clean_CRAN_db(p_db)
    author_net <- build_network(object = clean_p_db, perspective = "author")
    plot(author_net, author = "Torsten Hothorn", exact = FALSE)
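
    Conceptually, an author network links two people whenever they appear together on a package. The following base R sketch illustrates that idea on a toy package-to-author list; it is not how cranly builds its network, and the author names and the helper collaborators() are hypothetical.

```r
# Toy package-to-author list; all names are hypothetical.
authors_by_pkg <- list(
  pkgA = c("Author One", "Author Two"),
  pkgB = c("Author One", "Author Three"),
  pkgC = c("Author Four")
)

# Everyone who shares at least one package with `author`.
collaborators <- function(pkg_authors, author) {
  shared <- Filter(function(a) author %in% a, pkg_authors)
  sort(setdiff(unique(unlist(shared, use.names = FALSE)), author))
}

collaborators(authors_by_pkg, "Author One")
```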

    It is also helpful to know who the most prolific CRAN package authors are. You can generally count on packages from this crew being top-shelf.

    author_summary <- summary(author_net)
    ## Warning in closeness(cranly_graph, normalized = FALSE): At centrality.c:
    ## 2784 :closeness centrality is not well-defined for disconnected graphs
    plot(author_summary)
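
    One simple statistic behind a "most prolific" ranking is the number of packages per author. The sketch below computes it from a toy table with one row per package and author; whether cranly's summary uses exactly this measure is an assumption, and the names are hypothetical.

```r
# Toy table with one row per (package, author) pair; names are hypothetical.
pkg_authors <- data.frame(
  package = c("pkgA", "pkgA", "pkgB", "pkgB", "pkgC"),
  author  = c("Author One", "Author Two", "Author One", "Author Three", "Author One"),
  stringsAsFactors = FALSE
)

# Count packages per author and list the most prolific first.
counts <- sort(table(pkg_authors$author), decreasing = TRUE)
head(counts, 3)
```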

    I am not claiming that the path I have taken here is the best, or even unique. I have by no means exhausted the possibilities with the packages I have highlighted. Previous posts explore cranly and the tools::CRAN_package_db() function in a little more depth, but there is much more to explore.

    Finally, it would be remiss of me not to mention that the first thing anyone, novice or expert, should do when looking for a package to solve some new problem, or even to get an indication of the quality of a package, is to examine the CRAN Task Views. These are lists of packages curated by experts and organized into functional areas. With just a little searching, you will see that coin shows up in multiple task views.

