Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Dependencies are invitations for other people to break your package.
— Josh Ulrich, private communication
Welcome to the seventeenth post in the relentlessly random R ravings series of posts, or R4 for short.
Dependencies. A truly loaded topic.
As R users, we are spoiled. Early in the history of R, Kurt Hornik and Friedrich Leisch built support for packages right into R, and started the Comprehensive R Archive Network (CRAN). And R and CRAN have had a fantastic run with it. Roughly twenty years later, we are looking at over 12,000 packages which can (generally) be installed with absolute ease and no surprises. No other (relevant) open source language has anything of comparable rigour and quality. This is a big deal.
And coding practices evolved and changed to play to this advantage. Packages are a near-unanimous recommendation, use of the install.packages() and update.packages() tooling is nearly universal, and most R users learned to their advantage to group code into interdependent packages. Obvious advantages are versioning and snap-shotting, attached documentation in the form of help pages and vignettes, unit testing, and of course continuous integration as a side effect of the package build system.
But the notion of ‘oh, let me just build another package and add it to the pool of packages’ can get carried away. A recent example I had was the work on the prrd package for parallel recursive dependency testing — coincidentally, created entirely to allow for easier voluntary tests I do on reverse dependencies for the packages I maintain. It uses a job queue for which I relied on the liteq package by Gabor which does the job: enqueue jobs, and reliably dequeue them (also in a parallel fashion) and more. It looks light enough:
R> tools::package_dependencies(package="liteq", recursive=FALSE, db=AP)$liteq
[1] "assertthat" "DBI"        "rappdirs"   "RSQLite"
R>
Two of the four dependencies (DBI and RSQLite) are there because it uses an internal SQLite database, one (assertthat) is for internal tooling and one (rappdirs) is for configuration.
All good then? Not so fast. The devil here is the very innocuous and versatile RSQLite package because when we look at fully recursive dependencies all hell breaks loose:
R> tools::package_dependencies(package="liteq", recursive=TRUE, db=AP)$liteq
 [1] "assertthat" "DBI"        "rappdirs"   "RSQLite"    "tools"
 [6] "methods"    "bit64"      "blob"       "memoise"    "pkgconfig"
[11] "Rcpp"       "BH"         "plogr"      "bit"        "utils"
[16] "stats"      "tibble"     "digest"     "cli"        "crayon"
[21] "pillar"     "rlang"      "grDevices"  "utf8"
R>
R> tools::package_dependencies(package="RSQLite", recursive=TRUE, db=AP)$RSQLite
 [1] "bit64"      "blob"       "DBI"        "memoise"    "methods"
 [6] "pkgconfig"  "Rcpp"       "BH"         "plogr"      "bit"
[11] "utils"      "stats"      "tibble"     "digest"     "cli"
[16] "crayon"     "pillar"     "rlang"      "assertthat" "grDevices"
[21] "utf8"       "tools"
R>
Now we went from four to twenty-four, due to the twenty-two dependencies pulled in by RSQLite.
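To make the direct-versus-recursive arithmetic concrete, here is a minimal sketch on a made-up package database. The package names (queue, sqlite, fancy and so on) and the `toy_db` matrix are hypothetical and merely mimic the column layout of `available.packages()`, which is presumably where the `AP` object in the sessions above came from (an assumption, as the post does not show its definition).

```r
# Toy package database: hypothetical packages, NOT real CRAN metadata.
# Only the columns read by tools::package_dependencies() are included.
fields <- c("Depends", "Imports", "LinkingTo")
toy_db <- matrix(
  c("queue",  NA, "check, dbi, sqlite", NA,
    "check",  NA, NA,                   NA,
    "dbi",    NA, "methods",            NA,
    "sqlite", NA, "dbi, fancy, memo",   "cpp",
    "fancy",  NA, "colour, glue",       NA,
    "memo",   NA, "hash",               NA),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("Package", fields))
)

# Small helper (hypothetical name) returning the dependencies of one package.
deps <- function(pkg, recursive) {
  tools::package_dependencies(pkg, db = toy_db, which = fields,
                              recursive = recursive)[[pkg]]
}

direct   <- deps("queue", recursive = FALSE)  # what the package itself declares
all_deps <- deps("queue", recursive = TRUE)   # everything that gets installed

# How many packages does each direct dependency drag in on its own?
pulled_in <- vapply(direct,
                    function(p) length(deps(p, recursive = TRUE)),
                    integer(1))

print(length(direct))
print(length(all_deps))
print(pulled_in)
```

On this toy data one direct dependency, `sqlite`, accounts for most of the recursive total. Swapping `toy_db` for a real `available.packages()` matrix should give the same kind of breakdown for an actual package.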
There, my dear friend, lies madness. The moment one of these packages breaks we get potential side effects. And this is no laughing matter. Here is a tweet from Kieran posted days before a book deadline of his when he was forced to roll a CRAN package back because it broke his entire setup. (The original tweet has by now been deleted; why people do that to their entire tweet histories is something I fail to comprehend too; in any case the screenshot is from a private discussion I had with a few like-minded folks over Slack.)
That illustrates the quote by Josh at the top. As I too have "production code" (well, CRANberries for one relies on it), I was interested to see if we could easily amend RSQLite. And yes, we can. A quick fork and a few commits later, we have something we could call ‘RSQLighter’ as it reduces the dependencies quite a bit:
R> IP <- installed.packages()   # using my installed mod'ed version
R> tools::package_dependencies(package="RSQLite", recursive=TRUE, db=IP)$RSQLite
 [1] "bit64"     "DBI"       "methods"   "Rcpp"      "BH"        "bit"
 [7] "utils"     "stats"     "grDevices" "graphics"
R>
That is less than half. I have not proceeded with the fork because I do not believe in needlessly splitting codebases. But this could be a viable candidate for an alternate or shadow repository with more minimal and hence more robust dependencies. Or, as Josh calls it, the tinyverse.
Another maddening aspect of dependencies is the ruthless application of what we could jokingly call Metcalfe’s Law: the likelihood of breakage does of course increase with the number of edges in the dependency graph. A nice illustration is this post by Jenny trying to rationalize why one of the 87 (as of today) tidyverse packages now has the state "ORPHANED" at CRAN:
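The edge count mentioned here can be approximated with base R tooling. The following is a rough sketch only: `count_edges()` is a hypothetical helper, and `toy_db` is a made-up database with the column layout of `available.packages()`, not real CRAN data. It counts every declared dependency link among a package and its recursive dependencies.

```r
# Toy package database: hypothetical packages, NOT real CRAN metadata.
fields <- c("Depends", "Imports", "LinkingTo")
toy_db <- matrix(
  c("queue",  NA, "check, dbi, sqlite", NA,
    "check",  NA, NA,                   NA,
    "dbi",    NA, "methods",            NA,
    "sqlite", NA, "dbi, fancy, memo",   "cpp",
    "fancy",  NA, "colour, glue",       NA,
    "memo",   NA, "hash",               NA),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("Package", fields))
)

# Count the edges of the dependency graph rooted at `pkg`: one edge per
# declared dependency of the package and of each of its recursive
# dependencies that is described in `db`.
count_edges <- function(pkg, db, which = c("Depends", "Imports", "LinkingTo")) {
  closure <- tools::package_dependencies(pkg, db = db, which = which,
                                         recursive = TRUE)[[pkg]]
  # Packages missing from db (e.g. base packages) contribute no outgoing edges.
  nodes <- intersect(c(pkg, closure), db[, "Package"])
  direct <- tools::package_dependencies(nodes, db = db, which = which,
                                        recursive = FALSE)
  sum(lengths(direct))
}

edges_queue <- count_edges("queue", toy_db)  # the heavy package
edges_dbi   <- count_edges("dbi", toy_db)    # a light one
print(c(queue = edges_queue, dbi = edges_dbi))
```

Each of those edges is a place where an upstream change can reach you, which is why the light package in the toy example is so much less exposed than the heavy one.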
An invitation for other people to break your code. Well put indeed. Or to put rocks up your path.
But things are not all that dire. Most folks appear to understand the issue, and some even do something about it. The DBI and RMySQL packages have saner strict dependencies; maybe one day things will improve for RMariaDB and RSQLite too:
R> tools::package_dependencies(package=c("DBI", "RMySQL", "RMariaDB"), recursive=TRUE, db=AP)
$DBI
[1] "methods"

$RMySQL
[1] "DBI"     "methods"

$RMariaDB
 [1] "bit64"     "DBI"       "hms"       "methods"   "Rcpp"      "BH"
 [7] "plogr"     "bit"       "utils"     "stats"     "pkgconfig" "rlang"

R>
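The same comparison can be run across a whole library. Here is a small sketch, assuming nothing beyond base R, which ranks whatever happens to be installed locally by its number of recursive strict dependencies. The variable names are mine, and the counts naturally depend on your own library, so no output is shown.

```r
# Use the locally installed packages as the database (works offline).
IP <- utils::installed.packages()
# A package installed in several libraries shows up more than once; keep one row.
IP <- IP[!duplicated(IP[, "Package"]), , drop = FALSE]

# Recursive Depends/Imports/LinkingTo for every installed package.
dep_lists <- tools::package_dependencies(
  unname(IP[, "Package"]),
  db        = IP,
  which     = c("Depends", "Imports", "LinkingTo"),
  recursive = TRUE
)

# Number of recursive dependencies per package, heaviest first.
counts <- sort(lengths(dep_lists), decreasing = TRUE)
print(head(counts, 10))
```

Packages near the top of that list are the ones with the most upstream packages whose changes can break them.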
And to be clear, I do not believe in giving up and using everything via docker, or virtualenvs, or packrat, or … A well-honed dependency system is wonderful and the right resource to get code deployed and updated. But it requires buy-in from everyone involved, and an understanding of the possible trade-offs. I think we can, and will, do better going forward.
Or else, there will always be the tinyverse …
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.