Is CRAN Holding R Back?

[This article was first published on R – Ari Lamstein, and kindly contributed to R-bloggers.]

The R package acs was recently “archived” from CRAN. This led to the choroplethr package (which I maintain) also being “archived”. I write “archived” in quotes because CRAN stands for “Comprehensive R Archive Network”: everything on it is part of an archive and it appears that nothing is ever deleted.

You can still install both packages from CRAN. However, typing install.packages("choroplethr") will no longer work. The code below first installs binary versions of the acs package’s dependencies from CRAN. Then it installs the most recent “archived” version of acs from its source tarball in the CRAN archive. Then it repeats the process for choroplethr:

# Install binary versions of acs' dependencies, then install acs from source
acs_imports = c("stringr", "XML", "plyr", "httr")
install.packages(acs_imports)
# repos = NULL and type = "source" make it explicit that this is a source tarball, not a repository package
install.packages("https://cran.r-project.org/src/contrib/Archive/acs/acs_2.1.4.tar.gz", repos = NULL, type = "source")

# Install binary versions of choroplethr's dependencies, then install choroplethr from source
choroplethr_imports = c("Hmisc", "ggplot2", "dplyr", "R6", "WDI", "ggmap", "RgoogleMaps", "tigris", "gridExtra", "xml2", "tidyr", "tidycensus", "testthat", "choroplethrMaps", "choroplethrAdmin1")
install.packages(choroplethr_imports)
install.packages("https://cran.r-project.org/src/contrib/Archive/choroplethr/choroplethr_3.7.3.tar.gz", repos = NULL, type = "source")

There are other ways to install packages that CRAN has archived (such as using the devtools or remotes packages). But this method, which installs binary versions of as many packages as possible, appears to be the fastest.
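For comparison, here is a minimal sketch of the remotes route mentioned above. It is not the method used in this post. The wrapper name install_archived and the mirror URL are assumptions chosen for illustration; remotes::install_version() itself is the package's function for installing a specific version from a repository's archive.

```r
# Sketch: install an archived CRAN release with the remotes package.
# install_archived() is a hypothetical helper, not part of any package.
install_archived <- function(pkg, version,
                             repos = "https://cloud.r-project.org") {
  # Install remotes on demand if it is not already available.
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes", repos = repos)
  }
  # Fetch the requested version from the repository's archive and install it.
  remotes::install_version(pkg, version = version, repos = repos)
}

# Usage (not run here; needs network access). acs goes first because
# choroplethr depends on it:
# install_archived("acs", "2.1.4")
# install_archived("choroplethr", "3.7.3")
```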

In total, choroplethr spent 11 years on CRAN and was downloaded 289k times from RStudio’s CRAN mirror (link). I’d like to thank everyone who used the package and helped me develop it. I’d especially like to thank everyone who purchased the courses I created on it, and hired me for training or consulting. I would not have been able to take the project as far as I did without your support.

Now that choroplethr has been archived from CRAN I will no longer be maintaining it. If you would like to take over as maintainer please contact me.

Why was the acs package archived?

As I mentioned in my last post, the acs package was archived because it generates this NOTE when you run R CMD check on it:

NOTE
    ‘configure’: /bin/bash is not portable 
    ‘cleanup’: /bin/bash is not portable
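The NOTE points at the interpreter line at the top of the two scripts. Below is a rough sketch of how one might spot this locally. It assumes the NOTE is triggered by a first line of the form #!/bin/bash; the helper names uses_bash_shebang and check_scripts are hypothetical and are not part of R CMD check.

```r
# Sketch: report which top-level scripts start with a bash shebang.
# Assumption: the NOTE is driven by the first line of each script.
uses_bash_shebang <- function(path) {
  if (!file.exists(path)) {
    return(FALSE)
  }
  first_line <- readLines(path, n = 1L, warn = FALSE)
  # An empty file has no first line, so guard on length before matching.
  length(first_line) == 1L && grepl("^#! */bin/bash", first_line)
}

check_scripts <- function(pkg_dir, scripts = c("configure", "cleanup")) {
  paths <- file.path(pkg_dir, scripts)
  flags <- vapply(paths, uses_bash_shebang, logical(1), USE.NAMES = FALSE)
  stats::setNames(flags, scripts)
}

# Usage, assuming the package source is unpacked in ./acs:
# check_scripts("acs")
```

The usual remedy is to start such scripts with #!/bin/sh, which only works if they avoid bash-specific syntax.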

I was surprised that this led to the package being archived because:

  1. The configure and cleanup files were introduced in acs v2.0, which was released in 2016 (9 years ago).
  2. The primary reason that people download acs is that choroplethr depends on it. Of the hundreds of questions I’ve fielded on choroplethr in the last 9 years, none of them have related to these files.

In short: the issue triggering this NOTE appears to not be causing a problem for any users. While CRAN can archive a package for any reason, I am surprised that they found the cost-benefit analysis here favored archival. Archiving acs caused 3 additional packages to be archived: choroplethr, noaastormevents and synthACS. In 2024 these packages were downloaded 43,037 times from RStudio’s CRAN mirror (data from cranlogs).
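A download total like the one above can be computed with the cranlogs package. The sketch below assumes a calendar-year date range and uses a hypothetical helper name, total_downloads; the exact query behind the figure in this post may have differed.

```r
# Sketch: sum daily download counts from the RStudio CRAN mirror logs.
# Requires the cranlogs package and network access.
# total_downloads() is a hypothetical helper, not part of cranlogs.
total_downloads <- function(packages, from, to) {
  # cran_downloads() returns one row per package per day,
  # with columns date, count and package.
  logs <- cranlogs::cran_downloads(packages = packages, from = from, to = to)
  sum(logs$count)
}

# Usage (not run here):
# total_downloads(
#   c("choroplethr", "noaastormevents", "synthACS"),
#   from = "2024-01-01", to = "2024-12-31"
# )
```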

The archival of choroplethr caused me to reflect on the role CRAN plays in the R ecosystem, and whether it is preventing R from being used as widely as possible.

CRAN’s Impact on R Packages

The R language itself is very limited, and it is hard to imagine doing anything interesting in it without access to user-contributed packages. CRAN does a good job of providing that access, and I think that all R users benefit from it. That being said, CRAN also has very specific requirements for packages it hosts. I think that many of these requirements limit the size and complexity of projects in the R ecosystem. Here are some examples from my own experience:

  1. When I started choroplethr I released new versions of it on a weekly basis. At my day job we released software on a weekly basis, so it seemed natural to do the same thing for my side project. CRAN required that I slow down the rate of releases. They pointed out that the CRAN Repository Policy states: ‘Submitting updates should be done responsibly and with respect for the volunteers’ time. Once a package is established (which may take several rounds), “no more than every 1–2 months” seems appropriate.’ (link)
  2. I view choroplethr as a single package. But CRAN required me to split it into several different packages (choroplethrMaps, choroplethrAdmin1 and zctaCrosswalk are still on CRAN). This made it difficult to update features that spanned multiple packages: updates to CRAN happen sequentially, and updating package A is not supposed to break package B.
  3. I wrote a lot of vignettes (i.e. long-form documentation) that showed how to use the various functions in choroplethr. Because these functions generate images, and I was using knitr and rmarkdown to generate the vignettes, they took up a lot of space. CRAN has a size limit for vignettes. I wound up removing all of them.

Cumulatively, it feels like CRAN wants R packages to be small and not updated frequently. That is their choice. But a lot of exciting projects in the data space today are being created by teams of software engineers. These projects have codebases that are much larger than the typical R package. And they are also updated more frequently than CRAN allows. Because CRAN is the de facto method of distributing R packages, my concern is that their policies are preventing these projects from using R.

Package Distribution in Python

I began learning how to develop packages in Python last year. I took a course on it, and then started contributing to an established package. (I wrote a blog post about this in January.)

Python’s equivalent of CRAN is PyPI (the Python Package Index). During the course on package development I kept asking: “Who do I have to get approval from to push to PyPI? What are common reasons for them to reject a package?” People looked at me funny, because the entire process of package publication in Python is automated. The back and forth that R package authors have with CRAN appears to simply not exist on PyPI.

The Popularity of CRAN vs. PyPI

I began to wonder whether this issue – the restrictions that indexes place on their packages – might be affecting the quantity and complexity of the packages they host. This is probably unknowable, as there are many factors that impact which language people use for a project. But I thought to at least look into it.

As a first step I looked at the number of packages on both CRAN and PyPI. As of today:

  • CRAN lists “22035 [sic] available packages” (link)
  • PyPI lists “608,118 projects” (link)

So PyPI contains an order of magnitude more packages / projects than CRAN (roughly 28 times as many). But this comparison isn’t completely fair, because R is used primarily for data analysis and Python is used in multiple domains.

I then became curious about the relative popularity of R vs. Python for data science. This is also likely unknowable. But we can compare the number of downloads of popular packages in each language:

  • The most well-known R package is probably tidyverse. According to cranlogs, tidyverse was downloaded 1.2 million times last month from RStudio’s CRAN mirror (link).
  • The most well-known data science package in Python is probably pandas. According to PyPI Stats, pandas was downloaded 292 million times last month (link).

So the most well-known data science package in Python has two orders of magnitude more downloads than the most well-known data science package in R. I did not expect that!
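The orders of magnitude quoted in this section follow directly from the figures above. Here is a quick check in R; the variable names are mine, and the comparison is rough because the tidyverse figure covers a single CRAN mirror.

```r
# Figures quoted in this post.
cran_packages       <- 22035
pypi_projects       <- 608118
tidyverse_downloads <- 1.2e6   # last month, RStudio CRAN mirror
pandas_downloads    <- 292e6   # last month, PyPI Stats

# How many times larger is the Python figure in each case?
package_ratio  <- pypi_projects / cran_packages           # about 27.6
download_ratio <- pandas_downloads / tidyverse_downloads  # about 243.3

# Orders of magnitude are the base-10 logarithms of the ratios.
round(log10(package_ratio), 1)   # 1.4
round(log10(download_ratio), 1)  # 2.4
```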

Finally, I was interested in comparing the size and complexity of R packages vs. Python packages. I’m currently taking a data engineering course (link) and we’ve been introduced to a variety of modern data engineering tools. Most of these tools have been written in Python, and none have been written in R. Given how good R is at working with data, I’ve wondered why this is. I suspect that it’s at least partially due to the restrictions that CRAN places on its packages.

As an example, we just had a workshop on dlt, a Python package that was downloaded 1.2 million times last month. As I look through the project’s release history on PyPI (link) I see that it is typically updated several times a month. So even if they had chosen to write it in R, they would not have been able to publish it on CRAN at that pace, given the policy of “no more than every 1–2 months” quoted above.

While comments on my blog are disabled, feel free to contact me about this post.
