This post is part of a series called The Missing Semester of Your DS Education.
Introduction
If you’re doing data science work, it’s likely you’ll eventually come across a situation where you need to run your code somewhere else. Whether that “somewhere” is the machine of a teammate, an EC2 box, a pod in a Kubernetes cluster, a runner in your team’s CI/CD rig, or a Spark cluster depends greatly on the problem you’re solving. But the ultimate point is the same: eventually, you’ll need to be able to package your code up, put it somewhere in the world other than your local machine, and have it run just like it has been running for you.
Enter: Docker
Docker seems very complicated at first glance. And there’s a lot of jargon: images, containers, volumes, and more, and that doesn’t even begin to cover the world of container orchestration: Docker Compose, Kubernetes, and so on. But at its core, you can think of Docker as a little environment – not too unlike your local machine – that has a file system, configuration, etc. that’s packaged up into a magical box that you can run on any computer, anywhere. Or at least on computers that have Docker installed.
It might be simplest to consider a small example.
Note that to run the example that will follow, you’ll need to have Docker installed on your machine. All of the files necessary to run this example can be found in my blog’s Github repo.
I’ll use R for the example I provide in this post, but note that the same principles apply if you’re doing your work in Python, or in any other programming language.
Let’s imagine we want to print “Hello from Docker!” from R. First, make a new directory called docker-example (or whatever you want to call it):
mkdir docker-example && cd docker-example
And then we might do something like the following:
## Dockerfile
FROM rocker/r-ver:4.2.0
CMD ["Rscript", "-e", "'Hello from Docker!'"]
If you paste that into a file called Dockerfile, you can then run:
docker build --tag example .
This will build the Docker image by running each command you’ve specified. Going line by line, those commands are:
- Use the rocker/r-ver:4.2.0 image as the base image. In Docker, base images are useful because they come with things (such as the R language) pre-installed, so you don’t need to install them yourself. rocker/r-ver:4.2.0 ships with R version 4.2.0 pre-installed, which means you can run R just as you would on your local machine.
- After declaring the base image, we specify a command to run when docker run is invoked. This command is simple – it just prints “Hello from Docker!”.
Once the build has completed, you can:
docker run example
and you should see:
[1] "Hello from Docker!"
Tada 🎉! You just ran R in a Docker container. And since you have your code running in Docker, you could now run the same code on any other machine that supports Docker.
More Complicated Builds
Of course, this example was trivial. In the real world, our projects are much more complex. They have dependencies, they rely on environment variables, they have scripts that need to be run, and so on.
Copying Files
Let’s start with running a script instead of running R from the command line as we have been.
Create an R script called example.R that looks like this:
## example.R
print("Hello from Docker!")
And then you can update the Dockerfile by adding a COPY command to copy the script into your image, as follows:
## Dockerfile
FROM rocker/r-ver:4.2.0
COPY example.R example.R
CMD ["Rscript", "example.R"]
The COPY command tells Docker that you want to take example.R and put it into your image at /example.R. You can also specify a different destination path inside the image, but I’m just putting the files I copy in at the root.
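For instance, if you’d rather keep project files out of the image root, a minimal sketch might look like this (the /app directory is just an illustration, not part of this post’s example):

## Dockerfile
FROM rocker/r-ver:4.2.0
## Create /app and make it the working directory for subsequent steps
WORKDIR /app
## Copies example.R to /app/example.R
COPY example.R example.R
## Runs relative to /app
CMD ["Rscript", "example.R"]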
Finally, let’s build and run our Docker image again:
docker build --tag example .
docker run example
Amazing! You can see in the build logs that the example.R script was copied into the image:
=> [2/3] COPY example.R example.R
and then running the image gives the same result as before:
[1] "Hello from Docker!"< section id="installing-dependencies" class="level3">
Installing Dependencies
You’ll generally also need to install dependencies, which you can do using the RUN command. Let’s update the Dockerfile to install glue.
## Dockerfile
FROM rocker/r-ver:4.2.0
COPY example.R example.R
RUN Rscript -e "install.packages('glue')"
CMD ["Rscript", "example.R"]
Now, the third step in the build installs glue. And to show that it works, we’ll use glue to do a bit of string interpolation, printing the R version that’s running from the R_VERSION environment variable. Update example.R as follows:
## example.R
library(glue)
print(glue('Hello from Docker! I am running R version {Sys.getenv("R_VERSION")}'))
Building and running again should give you some new output. First, you should see glue installing in the build logs:
=> [3/3] RUN Rscript -e "install.packages('glue')"
And once you run the image, you should see:
Hello from Docker! I am running R version 4.2.0
Woohoo! 🥳🥳
Using renv
But as I wrote about in my last post, having global dependency installs is usually a bad idea. So we probably don’t want an install.packages() call as a RUN step in the Dockerfile. Instead, let’s use renv to manage our dependencies.
From the command line, run:
Rscript -e "renv::init()"
Since you’re already using glue in your project, this will generate a lockfile that looks something like this:
{ "R": { "Version": "4.2.0", "Repositories": [ { "Name": "CRAN", "URL": "https://cloud.r-project.org" } ] }, "Packages": { "glue": { "Package": "glue", "Version": "1.6.2", "Source": "Repository", "Repository": "CRAN", "Requirements": [ "R", "methods" ], "Hash": "4f2596dfb05dac67b9dc558e5c6fba2e" }, "renv": { "Package": "renv", "Version": "0.17.3", "Source": "Repository", "Repository": "CRAN", "Requirements": [ "utils" ], "Hash": "4543b8cd233ae25c6aba8548be9e747e" } } }
It’s important to keep the version of R you’re running in your Docker containers in sync with what you have locally. I’m using 4.2.0 in my Docker image, which I defined with FROM rocker/r-ver:4.2.0, and that’s the same version that’s recorded in my renv.lock file. In Python, you might use a tool like pyenv for managing Python versions.
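If you want to guard against drift, one option is a small sanity check that compares your local R version against the lockfile. This helper isn’t part of the post’s example project, and it assumes you have jsonlite installed:

## check_r_version.R -- a hypothetical helper, not part of the example project
lock <- jsonlite::read_json("renv.lock")
locked_version <- lock$R$Version
## getRversion() returns the version of the currently running R
if (!identical(as.character(getRversion()), locked_version)) {
  stop("Local R is ", getRversion(),
       ", but renv.lock expects ", locked_version)
}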
Now that we have renv set up, we can update the Dockerfile a bit more:
## Dockerfile
FROM rocker/r-ver:4.2.0
COPY example.R example.R
COPY renv /renv
COPY .Rprofile .Rprofile
COPY renv.lock renv.lock
RUN Rscript -e "renv::restore()"
CMD ["Rscript", "example.R"]
Now, we’re copying all of the renv scaffolding into the image. And instead of running install.packages(...), we’ve replaced that line with renv::restore(), which will look at the lockfile and install the packages as they’re defined there. Rebuilding and running the image again will give you the same result as before.
Now that we have a script running in Docker and are using renv to declare and install dependencies, let’s move on to…
Environment Variables
Sometimes we need environment variables like a Github token or a database URL, either to install our dependencies or to run our code. Depending on when the variable will be used, we can either specify it at build time (as a build arg) or at runtime. Generally, it’s a good idea to only specify build args that you really need at build time.
Build Time Config
For instance, if your build requires downloading a package from a private Github repository (for which you need to have a GITHUB_PAT set), then you would specify your GITHUB_PAT as a build arg. Let’s try that:
## Dockerfile
FROM rocker/r-ver:4.2.0
ARG GITHUB_PAT
## Note: don't actually do this
## It's just for the sake of example
RUN echo "$GITHUB_PAT"
COPY example.R example.R
COPY renv /renv
COPY .Rprofile .Rprofile
COPY renv.lock renv.lock
RUN Rscript -e "renv::restore()"
CMD ["Rscript", "example.R"]
Now, the second line adds a build arg using ARG. Next, run the build:
docker build --tag example --build-arg GITHUB_PAT=foobar .
You should see the following in the logs:
=> [2/7] RUN echo "foobar"
This means that your variable GITHUB_PAT has been successfully set and can be used at build time for whatever it’s needed for.
This is just an example, but it’s important that you don’t expose secrets in your build like I’ve done here. If you’re using (e.g.) a token as a build arg, make sure it’s not printed in plain text to your logs.
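If you do need a token during a build, one safer option (assuming a reasonably recent Docker with BuildKit enabled) is a secret mount, which exposes the value to a single RUN step without recording it in the image layers or build logs. A sketch, not a drop-in replacement for the example above:

# syntax=docker/dockerfile:1
FROM rocker/r-ver:4.2.0
## The secret is only available at /run/secrets/github_pat during this step
RUN --mount=type=secret,id=github_pat \
    GITHUB_PAT="$(cat /run/secrets/github_pat)" Rscript -e "renv::restore()"

And the corresponding build command, where .github_pat is a hypothetical local file containing the token:

docker build --secret id=github_pat,src=.github_pat --tag example .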
Runtime Config
Other times, you want config to be available at container runtime. For instance, if you’re running a web app, you might not need to be able to connect to your production database when you’re building the image that houses your app. But you need to be able to connect when the container boots up. To achieve this, use --env (or --env-file). We’ll update our example.R a bit to show how this works.
## example.R
library(glue)
print(glue('Github PAT: {Sys.getenv("GITHUB_PAT")}'))
print(glue('Database URL: {Sys.getenv("DATABASE_URL")}'))
print(glue('Hello from Docker! I am running R version {Sys.getenv("R_VERSION")}.'))
And then, let’s rebuild:
docker build --tag example --build-arg GITHUB_PAT=foobar .
and now we’ll run our image, but this time with the --env flag:
docker run --env DATABASE_URL=loremipsum example
This tells Docker that you want to pass the environment variable DATABASE_URL=loremipsum into the container running your example image when the container boots up.
And after running, you’ll see something like this:
Github PAT:
Database URL: loremipsum
Hello from Docker! I am running R version 4.2.0.
There are a few things to note here.
- The GITHUB_PAT that you set as a build arg is no longer accessible at runtime. It’s only accessible at build time.
- The DATABASE_URL we provided with the --env flag is now accessible as an environment variable named DATABASE_URL.
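As an aside, if you have more than one or two runtime variables, --env-file keeps the run command shorter. A small sketch (the .env filename is just a convention, not something from this post’s repo):

## .env
DATABASE_URL=loremipsum
SOME_OTHER_SECRET=foobar

docker run --env-file .env example

Each KEY=VALUE line in the file is injected into the container, just as if you’d passed it with --env.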
Very often, container orchestration platforms like Heroku, Digital Ocean, AWS Batch, etc. will allow you to specify environment variables via their CLI or UI, which they will then inject into your container for you when it boots up.
Advanced Topics
This post is intended to be a gentle introduction to Docker, but there’s a lot that’s missing. I’d like to quickly address a couple more topics that have been helpful to me and our team as we’ve relied more and more heavily on Docker.
Custom Base Images
You might have noticed that your builds take longer as they do more. For instance, the glue install, even though glue is extremely lightweight, takes a few seconds. If you have many dependencies (R dependencies, C dependencies, etc.), building an image for your project can get prohibitively slow. In the past, our builds have taken over an hour just restoring the dependencies recorded in the lockfile.
A convenient way around this is to install “base” dependencies (ones you update rarely and use often) into a base image, which you then push to a registry like Docker Hub and reference in the FROM ... line of the Dockerfile for any particular project. This prevents you from needing to install the same, unchanged dependencies over and over again.
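As a rough sketch, a base image Dockerfile might look something like this (the image name yourorg/r-base is a placeholder, and the package list is purely illustrative):

## Dockerfile.base -- a hypothetical base image
FROM rocker/r-ver:4.2.0
## Linux dependencies that rarely change
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl unzip && \
    rm -rf /var/lib/apt/lists/*
## R packages you use in most projects
RUN Rscript -e "install.packages(c('glue', 'jsonlite'))"

You’d build and push it once:

docker build --tag yourorg/r-base:4.2.0 --file Dockerfile.base .
docker push yourorg/r-base:4.2.0

and then each project’s Dockerfile would start with FROM yourorg/r-base:4.2.0 instead of FROM rocker/r-ver:4.2.0.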
We’ve had a lot of success using this strategy on a few particular fronts:
- Making sure we’re using the same version of R everywhere is simple if we define the R version in one place with a FROM rocker/r-ver:4.2.0 in our base image (which is called collegevine/r-prod-base), and then use FROM collegevine/r-prod-base as the base image for all of our other Docker builds.
- Installing Linux dependencies, such as curl, unzip, etc., which we’re happy keeping on a single version, can happen once in the base image, and then every downstream image can rely on those same dependencies.
- Installing CLIs like the AWS CLI, which, again, really doesn’t need to happen on every build.
CI / CD
The other time-saving strategy we’ve greatly benefited from is aggressively caching R packages in our CI / CD process. renv has great docs on using it within a CI / CD rig, which I would highly recommend.
At a high level, what we do is run renv::restore() in the CI itself (before running docker build ...), which installs all of the packages our project needs. Then we COPY the cache of packages into our image, so that they’re available inside of the image. This means we don’t need to reinstall every dependency on every build, and it has probably sped up our image build times by 100x.
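The exact mechanics depend on your CI provider, but the Dockerfile half of the approach might look roughly like this. This is a hedged sketch: the /renv-cache path is arbitrary, and RENV_PATHS_CACHE is the environment variable renv uses to locate its global package cache:

## Dockerfile -- a sketch of the caching approach, not our exact setup
FROM rocker/r-ver:4.2.0
## Tell renv to look for its cache at /renv-cache
ENV RENV_PATHS_CACHE=/renv-cache
## Assumes the CI job already ran renv::restore(), populating ./renv-cache
COPY renv-cache /renv-cache
COPY renv /renv
COPY .Rprofile .Rprofile
COPY renv.lock renv.lock
## With a warm cache, restore mostly links packages instead of reinstalling them
RUN Rscript -e "renv::restore()"
COPY example.R example.R
CMD ["Rscript", "example.R"]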
Wrapping Up
I hope this post has demystified Docker a bit and clarified some of the basics of how it’s used. At the highest level, Docker lets you package up your code so that it can be run anywhere, whether that’s on your machine, on the machine of a coworker, in a CI/CD tool, on a cloud server like an EC2 box, or anywhere else. All you need to do is build and push your image!
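If you haven’t pushed an image before, the flow is roughly this (yourusername is a placeholder for your Docker Hub account; other registries work similarly):

docker tag example yourusername/example:latest
docker push yourusername/example:latest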