One of the recurring debates in some corners of the R community is about dependencies. After a few posts on Mastodon I wanted to capture my opinions on the subject, both to help me understand them better and because long-form articles are much better suited to contentious topics than short-burst posts.
Dependencies are invitations for other people to collaborate with you
Many thinkers have marvelled at the magic inside books. Carl Sagan put it in his style:
What an astonishing thing a book is. It’s a flat object made from a tree with flexible parts on which are imprinted lots of funny dark squiggles. But one glance at it and you’re inside the mind of another person, maybe somebody dead for thousands of years. Across the millennia, an author is speaking clearly and silently inside your head, directly to you. Writing is perhaps the greatest of human inventions, binding together people who never knew each other, citizens of distant epochs. Books break the shackles of time. A book is proof that humans are capable of working magic.
And it's true. Writing is a fantastic technology that allows strangers to learn from each other, to collaborate, and to build upon each other's work.
Software packages are kind of like books.
By typing library(ggplot2), I get to effortlessly use and build upon the thousands of person-hours devoted to developing and polishing a great, solid package.
When I write code that uses ggplot2, in some sense all of its developers (awesome people who write high-quality code) become collaborators on my project without even knowing me.
And that’s simply awesome.
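As a tiny illustration of how little it takes to stand on all that work, here is a minimal sketch using the mpg dataset that ships with ggplot2:

```r
library(ggplot2)

# One library() call, and thousands of hours of other people's work
# on the grammar of graphics are at my disposal.
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()
```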
Yes, sometimes collaboration is hard, especially when it’s not intended. The next version of ggplot2 might break my code and I will need to update it to fix it. But like a romcom protagonist who needs to learn that opening themselves to being loved is worth the risk of heartbreak, I think the occasional breaking change is a small price to pay for getting to collaborate with the community.
Dependencies make better programmers
As Miles McBain said on Mastodon, taking dependencies exposes you to other people's code, different ideas and different design patterns.
Just using other people's code will introduce you to various naming conventions, argument types and return classes. If you take a step further and read the internals, you also get exposed to many writing styles, levels of function abstraction, and so on.
This also works from the package maintainer's perspective. Athanasia Mowinckel describes in her 2021 talk how developing and publishing a package made her aware of many good programming practices, thanks to people who opened issues and pull requests. And it's true; how can a developer improve their package if nobody uses and depends on it?
A dependency-rich project is better than no project
Even if I were to grant that a project with many dependencies is bad, a brittle project is still better than no project. All kinds of people do amazing things, and not every project needs to be (or can be) a “10X project” that follows all the best practices.
R users come from very different backgrounds and have vastly different levels of expertise in software development. A lot of us are scientists who want to do our research and collaborate with the community. We probably wouldn’t be able to do our work or publish our first packages without depending on code written by others.
And I’d rather see a thousand imperfect, dependency-heavy projects bloom than have just a few monolithic packages.
Good enough is infinitely better than nothing.
Dependency counts are barely a proxy for what’s important
The tidyverse package has dozens of dependencies, but each of those packages is relatively focused on a small set of functionalities. The tidyverse team could easily decide to merge the code from all the different packages into one single monolithic package. The whole tidyverse could then theoretically claim to be “dependency free”, but it would consist of (roughly) the same code, written by the same developers, and with the same probability of breakage.
Conversely, the data.table package has no dependencies, but its codebase is massive. It’s got functions for data manipulation (~dplyr), data reshaping (~tidyr), reading and writing files (~readr), some date manipulation (~lubridate), rolling functions and even functions to format columns (~pillar). All that functionality could conceivably be broken up into several packages, and then data.table might have half a dozen or so dependencies but, again, would consist of (roughly) the same code, written by the same developers, and with the same probability of breakage.
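If you’re curious, you can look at these numbers yourself with base R’s tools::package_dependencies(); this is just a sketch, and the exact counts will depend on the CRAN snapshot you query:

```r
# Compare direct vs. recursive dependency counts for two packages.
# Counts will vary with the CRAN snapshot returned by available.packages().
db <- available.packages(repos = "https://cloud.r-project.org")

direct <- tools::package_dependencies(
  c("tidyverse", "data.table"),
  db = db,
  which = c("Depends", "Imports", "LinkingTo")
)
recursive <- tools::package_dependencies(
  c("tidyverse", "data.table"),
  db = db,
  which = c("Depends", "Imports", "LinkingTo"),
  recursive = TRUE
)

lengths(direct)     # direct dependencies only
lengths(recursive)  # everything that ends up getting installed
```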
Moreover, code depending on a single package that is experimental or in active development might break a lot, whereas code depending on multiple stable and slowly-updating packages might work for ages.
So counting dependencies doesn’t really give you a good sense of dependency complexity or the probability of breakage. At best it’s a rough and indirect proxy that, in my opinion, doesn’t warrant being obsessed with.
Dependency counts fall victim to Goodhart’s Law
At some point CRAN put a soft limit on the number of packages a package can list in “Imports”. Those are the packages that need to be installed for your package to function; what we would colloquially call dependencies. Use more than 20 and CRAN will shout at you (issue a “Note”) and make your life a bit harder.
So let me tell you about the SCpubr package. This is an R package that “aims to provide a streamlined way of generating publication ready plots for known Single-Cell visualizations”. By the looks of it, it’s a rather complex package that wraps a lot of functionality from other packages. However, it hasn’t got a single dependency listed under Imports. Indeed, all of its 61 dependencies are listed under “Suggests”, the field reserved for packages that “are not necessarily needed” and are instead “used only in examples, tests or vignettes”. Yet almost every single function in SCpubr uses some of those suggested packages.
There might be perfectly good reasons for this package to list all its many dependencies as Suggests, but the example shows how a package can have a massive number of dependencies and still come out clean in the dependency count.
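You can inspect what a package declares in each field without installing it; here’s a minimal sketch, again relying on tools::package_dependencies() and whatever CRAN snapshot you happen to query:

```r
# Compare what SCpubr declares under Imports vs. Suggests.
# The counts depend on the CRAN snapshot you query.
db <- available.packages(repos = "https://cloud.r-project.org")

lengths(tools::package_dependencies("SCpubr", db = db, which = "Imports"))
lengths(tools::package_dependencies("SCpubr", db = db, which = "Suggests"))
```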
There are other options. Because CRAN counts only direct dependencies, one could easily create two dummy packages that each depend on 15 other packages, and then have the main package depend on those two. Now the package appears to have only 2 dependencies, but in reality it depends on more than 30 packages.
So imposing a limit on the number of dependencies can change how developers list their dependencies in order to avoid that limit, and the dependency count turns into an even worse indicator of dependency complexity. This is a clear example of how, when a proxy measure becomes a target, it ceases to be a good proxy measure.
Conclusions
So, taking dependencies is a great way of leveraging the power of the community and getting to collaborate with great developers. It makes both you and the maintainers of those packages better developers. It empowers people to create tools and write code they couldn’t otherwise create. And even if you want to increase the stability and reproducibility of your code, just counting the number of dependencies is not a very meaningful exercise.
When writing a package, do be mindful of the dependencies you take. I do agree that it’s not good practice to depend on a gigantic package just to use a single function that can be easily replicated using base R.
Remember that when taking a dependency you are essentially forcing all your users to install another package, so I think it’s polite to do so only when necessary, particularly for packages with compiled code and/or system dependencies, which can fail to install. This is less important for very popular packages that most R users probably already have, or for packages that are already dependencies of your dependencies (i.e. if your package imports ggplot2, there’s no point in avoiding a dependency on tibble, which ggplot2 itself imports).
Much more important than the number of dependencies is who develops the packages you depend on. Are they trusted developers who write good software, or someone who might write even worse code than you? Can they be trusted to continue development and not orphan the package?
When writing code for yourself, go nuts: do whatever you want. It’s your project and your machine, so install whatever packages you want and be happy. You should still be mindful about installing external code, especially from random GitHub repositories. And if you want your code to be reproducible in the long term, use renv to pin the package versions for each project (you can check other strategies in our online course Reproducibility with R).
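In case you haven’t used it, the core renv workflow is short; a minimal sketch (assuming the renv package is installed) looks roughly like this:

```r
# install.packages("renv")  # once, if you don't have it yet

renv::init()      # set up a project-local library and an renv.lock file
# ...install and use whatever packages the project needs...
renv::snapshot()  # record the exact package versions in renv.lock
renv::restore()   # later, or on another machine, reinstall those versions
```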
At the end of the day, take the dependencies you need to get the job done and create useful and fun projects with clean code that you can understand, extend, debug and maintain easily. Don’t obsess over the number of dependencies in your or anyone else’s package.