This post is part of a series called The Missing Semester of Your DS Education.
Introduction
When I was first learning to program, I’d face problems that required library code, or were at least made easier by it. For instance, I learned about the Tidyverse the hard way during my first summer in college, when I was interning at a FinTech firm doing lots of data wrangling: I had just spent the summer very hackily implementing group_by and summarize by hand, using layer upon layer of nested for loops. That experience taught me an important lesson: thinking “Someone else must have solved this problem before” and then seeing what they’ve done is a very practical way to solve problems.
And with that realization (that I could use libraries for data wrangling), I started writing lots of lines of code like this one:
install.packages("tidyverse")
or the not-quite-equivalent Python:
pip install pandas
And as soon as I began doing that, I started running into one of the most common sources of headaches for new programmers, if not the most common: dependency conflicts. I was getting errors like this one:
Error: package or namespace load failed for ‘foo’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
 namespace ‘bar’ 0.6-1 is being loaded, but >= 0.8 is required
What’s the Problem?
At the time, I saw these types of errors and was frustrated. I didn’t understand them – why is a package requiring a specific version of another package? And I also thought there was a quick fix: I’d just install the required version of the dependency and everything would work again.
Until I inevitably ran into the same problem again soon after, while trying to install some other dependency.
At its core, I was experiencing my first foray into what might be considered a semi-pro version of dependency hell. I thought I could “solve” my problem of incompatible dependencies with duct tape – installing the correct version in order to fix the immediate error – but that doesn’t actually work in practice, since the same issue is bound to come up again with another version of a different library, or in a different project, or on another day.
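To make that error a little less mysterious, here is a minimal Python sketch of the kind of check that fails when ‘bar’ 0.6-1 is loaded but >= 0.8 is required. The function names are my own, and the parsing is a simplification that only handles purely numeric versions; it is not R's actual implementation.

```python
import re


def parse_version(text):
    """Split a version such as '0.6-1' into a tuple of integers: (0, 6, 1)."""
    return tuple(int(part) for part in re.split(r"[.-]", text))


def satisfies(installed, minimum):
    """Return True if `installed` meets a '>= minimum' requirement."""
    return parse_version(installed) >= parse_version(minimum)


# The situation from the error message: 0.6-1 is installed, >= 0.8 is required.
print(satisfies("0.6-1", "0.8"))  # False
print(satisfies("0.9", "0.8"))    # True
```

Installing a newer ‘bar’ makes this particular check pass, which is exactly why the duct-tape fix feels like it works until some other package wants the old version.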
Dependency Resolution
In practice, this is a dependency management problem. And the basic story goes as follows.
You start on your programming journey, and you want to use some great library – take dplyr, for example – for some of your work. So you install.packages("dplyr") and it installs the most recent version. Then you decide you’re interested in working with spatial data, so you install sf with install.packages("sf"). Next, you realize you need to work with Census data (what else are you going to be making maps of, after all?), so you install.packages("tidycensus") and with those three packages, you do your project.
But then next week you have a homework assignment, which, entirely hypothetically, requires you to install some niche package like rethinking, so, of course, you install.packages("rethinking") and you do your homework.
And then later that night, you get curious about seeing what the Elon Jet Tracker has been up to on Twitter, and so you install rtweet: install.packages("rtweet"), but this time, you get an error like the one above about a namespace version clash.
And so now, you’ve hit a roadblock. You can either upgrade or downgrade the dependency that’s clashing and risk one of your other projects breaking, or you can try to get around the problem another way, such as using an old version of rtweet or not using rtweet at all.
This is the point at which former you – and former me – should have been thinking “someone must have solved this problem.”
Declaring Your Dependencies
The root cause of the errors in situations like these is that we have a dependency leakage problem. In the example I gave above, there were three separate “projects”: Your Census data analysis, your homework, and your keeping tabs on Elon’s jet. Those three projects don’t need to know about each other, and it’s actually a problem that they do. The issue is that in using install.packages("...") (or pip install ...) everywhere, you’re installing all of these packages globally on your machine. This means that every project you’re working on needs to use the same dependencies, even when the projects are separate.
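Here is a toy Python model of that leakage, just to make it concrete. The project names, the shared package bar, and its versions are all hypothetical, and I'm pretending each project needs one exact version, which is stricter than real version constraints usually are.

```python
# Hypothetical projects and the exact version of a shared package each one needs.
requirements = {
    "census-analysis": ("bar", "0.6"),
    "jet-tracker": ("bar", "0.8"),
}

# One shared library for the whole machine: package name -> the single installed version.
global_library = {}


def install(library, package, version):
    """Installing into a library replaces whatever version was there before."""
    library[package] = version


def broken_projects(library):
    """List the projects whose required version is not the one installed."""
    return sorted(
        project
        for project, (package, version) in requirements.items()
        if library.get(package) != version
    )


# Set up the first project, then the second one, both in the global library.
install(global_library, *requirements["census-analysis"])
install(global_library, *requirements["jet-tracker"])

print(broken_projects(global_library))  # ['census-analysis']
```

Because the global library can only hold one version of bar, getting the second project working quietly breaks the first.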
Above, I proposed the solution to this problem that I used to use, which was to just install the version of the package that was needed by the package I was trying to install, and continue on until I was inevitably frustrated by the issue once again, generally sooner rather than later. But there’s a battle-hardened way of solving this problem: Declaring your dependencies, and using a dependency isolation tool.
In R, there’s the great renv for this. In Python, I personally like Poetry, but Conda, pipenv, or just a plain old virtualenv would work just fine too. The key is that you want to be declaring the dependencies that your project needs in some kind of file (such as a DESCRIPTION file in R, or a pyproject.toml or a requirements.txt in Python, or a Gemfile in Ruby, and so on), and then preferably having a dependency manager like renv or poetry resolve those dependencies and save the result into a lockfile, like a poetry.lock or an renv.lock. Then, when you want to work on your project, or homework, or whatever, you restore the dependencies as they’re recorded in that lockfile. This means that whenever you want to run a project, you know exactly what versions of every dependency need to be installed. And at the same time, if you want to add a new dependency, your dependency manager can do its best to resolve conflicts between that new dependency and all of the other libraries in your lockfile.
Dependency Isolation
The other key piece of the puzzle is dependency isolation, which is the fix for the leakage problem from before. In an ideal world, the dependencies for your project should be “isolated”, meaning that they’re only installed in your project environment (the R project, the virtualenv, etc.) and not globally on your machine.
Let’s take an example in R to see how dependency isolation works. Start by initializing a very simple R project by running the command below in a terminal.
mkdir renv-test && \
  cd renv-test && \
  echo "library(example)" >> test.R && \
  Rscript -e "install.packages('renv', repos='http://cran.us.r-project.org'); renv::init()" && \
  Rscript -e "renv::install('mrkaye97/mrkaye97.github.io:posts/series/doing-data-science/2023-05-27-renv-dependency-management/example'); renv::snapshot()" && \
  cat renv.lock
You should see some R output about installing packages, and then you should see something like this:
{
  "R": {
    "Version": "4.1.2",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "example": {
      "Package": "example",
      "Version": "0.1.0",
      "Source": "GitHub",
      "RemoteType": "github",
      "RemoteHost": "api.github.com",
      "RemoteUsername": "mrkaye97",
      "RemoteRepo": "mrkaye97.github.io",
      "RemoteSubdir": "posts/series/doing-data-science/2023-05-27-renv-dependency-management/example",
      "RemoteRef": "master",
      "RemoteSha": "a72bb805e175c77e7bb8a2f4fb11780b76807d4d",
      "Hash": "22d8e981dd94e2fab693636781631008"
    },
    "renv": {
      "Package": "renv",
      "Version": "0.17.3",
      "Source": "Repository",
      "Repository": "CRAN",
      "Requirements": [
        "utils"
      ],
      "Hash": "4543b8cd233ae25c6aba8548be9e747e"
    }
  }
}
This last bit is the renv lockfile. It’s where all of the dependencies of your project are enumerated. Here, the project is trivial. We’re just installing example, which has no dependencies, so it’s the only thing (in addition to renv itself) that’s recorded in the lockfile. But this also generalizes to arbitrarily complicated projects and sets of dependencies. As you add more packages, the lockfile grows, but you can simply run renv::restore() to restore all of the dependencies for your project.
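If it helps to think of the lockfile as plain data, here is a rough Python sketch that reads a trimmed copy of the lockfile shown earlier and works out which packages would need installing. The helper names and the "installed" dictionary are hypothetical, and this only illustrates the idea of restoring from a lockfile; it is not how renv::restore() is implemented.

```python
import json

# A trimmed copy of the renv.lock above; only the fields used here are kept.
LOCKFILE = """
{
  "R": {"Version": "4.1.2"},
  "Packages": {
    "example": {"Package": "example", "Version": "0.1.0", "Source": "GitHub"},
    "renv": {"Package": "renv", "Version": "0.17.3", "Source": "Repository"}
  }
}
"""


def pinned_versions(lockfile_text):
    """Map each package recorded in the lockfile to its pinned version."""
    lock = json.loads(lockfile_text)
    return {name: record["Version"] for name, record in lock["Packages"].items()}


def restore_plan(pinned, installed):
    """Return the packages that are missing or installed at a different version."""
    return {
        name: version
        for name, version in pinned.items()
        if installed.get(name) != version
    }


pinned = pinned_versions(LOCKFILE)
installed = {"renv": "0.17.3"}  # hypothetical state of a fresh project library

print(pinned)                           # {'example': '0.1.0', 'renv': '0.17.3'}
print(restore_plan(pinned, installed))  # {'example': '0.1.0'}
```

The point is that the lockfile is the single source of truth: anything that differs from it gets brought back in line.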
Next, if you run the following, you should be able to use the package just fine:
Rscript -e "library(example); hello()"
And you’ll see:
[1] "Hello, world!"
But then, try this:
cd .. && \
  Rscript -e "library(example); hello()"
Unless you have another package installed globally called example that does the same thing as my example package, you’ll see:
Error in library(example) : there is no package called ‘example’
Execution halted
But this time, this error is a feature, not a bug! This is an example of dependency isolation. When you’re inside of your renv-test project, you install the example package, record it in the lockfile, and can use it just fine. But as soon as you’re no longer inside of that project, this package is no longer available. This means that if you, for instance, start a new project that needs a different version of example, you don’t need to worry about this renv-test project being corrupted. The two projects are isolated from each other, so they can rely on separate sets of dependencies with no leakage from one to the other.
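For the Python-minded, here is a loose analogy of the same experiment, sketched with temporary directories standing in for per-project libraries. Everything here (the directory layout, load_from_library, and the tiny example module) is made up for illustration; a real virtualenv or renv library does far more than this.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_from_library(library_dir, name):
    """Import `name` only from the given project library, or raise if it is absent."""
    module_path = Path(library_dir) / f"{name}.py"
    if not module_path.exists():
        raise ModuleNotFoundError(f"there is no package called '{name}'")
    spec = importlib.util.spec_from_file_location(name, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo():
    """Install `example` into one project's library and try to use it from two."""
    with tempfile.TemporaryDirectory() as root:
        # Two hypothetical projects, each with its own private library directory.
        lib_a = Path(root) / "renv-test" / "library"
        lib_b = Path(root) / "other-project" / "library"
        lib_a.mkdir(parents=True)
        lib_b.mkdir(parents=True)

        # Only the first project "installs" the example package.
        (lib_a / "example.py").write_text(
            "def hello():\n    return 'Hello, world!'\n"
        )

        greeting = load_from_library(lib_a, "example").hello()
        try:
            load_from_library(lib_b, "example")
            error = None
        except ModuleNotFoundError as err:
            error = str(err)
        return greeting, error


greeting, error = demo()
print(greeting)  # Hello, world!
print(error)     # there is no package called 'example'
```

The second lookup fails for the same reason the Rscript call outside renv-test did: the package lives in one project's library and nowhere else.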
As an aside, tools like renv and pip are also usually good at dependency resolution, meaning they can figure out which versions of the dependencies of the packages you want to use you should install in order to make sure that everything is compatible. If you’re interested, you can read about how pip does dependency resolution on their site.
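To give a flavor of what "resolution" means, here is a deliberately naive Python sketch that tries every combination of versions in a tiny, made-up package index, newest first, and returns the first combination where every requirement is satisfied. The index, package names, and version ranges are all hypothetical, and real resolvers are much smarter than brute force; this is not pip's or renv's algorithm.

```python
from itertools import product

# Hypothetical index: package -> version -> {dependency: (lowest, highest) allowed}.
INDEX = {
    "foo": {
        (1, 0): {"bar": ((0, 6), (0, 7))},
        (2, 0): {"bar": ((0, 8), (0, 9))},
    },
    "baz": {
        (1, 0): {"bar": ((0, 6), (0, 7))},
    },
    "bar": {(0, 6): {}, (0, 7): {}, (0, 8): {}},
}


def compatible(index, selection):
    """Check that every selected version's requirements are met by the selection."""
    for package, version in selection.items():
        for dep, (low, high) in index[package][version].items():
            if dep not in selection or not (low <= selection[dep] <= high):
                return False
    return True


def resolve(index):
    """Try version combinations, newest first; return the first compatible one."""
    packages = sorted(index)
    candidates = [sorted(index[p], reverse=True) for p in packages]
    for combo in product(*candidates):
        selection = dict(zip(packages, combo))
        if compatible(index, selection):
            return selection
    return None  # no combination satisfies every requirement


print(resolve(INDEX))  # {'bar': (0, 7), 'baz': (1, 0), 'foo': (1, 0)}
```

Notice that the newest foo loses out: baz caps bar at 0.7, so the resolver falls back to the older foo that can live with it. That is the same trade-off as "use an old version of rtweet", except the tool works it out for you.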
Other Resources
There are tons of amazing resources on managing dependencies. This is a problem that every software team needs to manage, so virtually every programming language has at least one, if not many, widely used dependency management tools. Since this is such a common problem, there’s also a lot written (and spoken!) about it. A few resources I particularly like are David Aja’s RStudio::conf 2022 talk about renv (which I was enthusiastically in the audience for!) and something he mentions, The 12-Factor App, which our team’s Head of Engineering recommended to me very soon after I started at CollegeVine.
Wrapping Up
David Aja’s talk on this topic was called “You Should Be Using renv,” and you should be. In the short run, it might feel like setting up the scaffolding for renv or poetry or whatever your tool of choice is adds too much overhead. But if you’re feeling that way, keep in mind that you will inevitably run into dependency issues, and it’ll be much harder to untangle them once you’re already in deep than it will be to build good habits from the get-go.
You should be using renv.