Rolling Your Own Jupyter and RStudio Data Analysis Environment Around Apache Drill Using docker-compose
I had a bit of a play last night trying to hook a Jupyter notebook container up to an Apache Drill container using docker-compose. The idea was to have a shared data volume between the two of them, but I couldn’t for the life of me get that to work using the docker-compose version 2 or 3 (services/volumes) syntax – for some reason, none of the Apache Drill containers I tried would fire up properly.
So I eventually (3am… :-() went for a simpler approach, syncing data through a local directory on the host.
The result is something that looks like this:

The Apache Drill container, and an Apache Zookeeper container to keep it in check, I found via Dockerhub. I also reused an official RStudio container. The Jupyter container is one I rolled for TM351.
The Jupyter and RStudio containers can both talk to the Apache Drill container, and both analysis apps have access to their own data folder, mounted from an application folder in the current directory on the host. The data folders are mounted into separate directories in the Apache Drill container, and both applications can query data files in either data directory, since both directories are visible to Apache Drill.
This is far from ideal, but it works. (The structure is set up this way so that both RStudio and Jupyter scripts can download data into a data directory that is viewable from the Apache Drill container. Another approach would be to mount a single shared ./data directory and provide some means of populating it with data files. Alternatively, if the files already exist on the host, mounting the host data directory onto a /data volume in the Apache Drill container would work too.)
Here’s the docker-compose.yaml file I’ve ended up with:
drill:
  image: dialonce/drill
  ports:
    - 8047:8047
  links:
    - zookeeper
  volumes:
    - ./notebooks/data:/nbdata
    - ./R/data:/rdata

zookeeper:
  image: jplock/zookeeper

notebook:
  container_name: notebook-apache-drill-test
  image: psychemedia/ou-tm351-jupyter-custom-pystack-test
  ports:
    - 35200:8888
  volumes:
    - ./notebooks:/notebooks/
  links:
    - drill:drill

rstudio:
  container_name: rstudio-apache-drill-test
  image: rocker/tidyverse
  environment:
    - PASSWORD=letmein
    #default user is: rstudio
  volumes:
    - ./R:/home/rstudio
  ports:
    - 8787:8787
  links:
    - drill:drill
If you have docker installed and running, running docker-compose up -d in the folder containing the docker-compose.yaml file will launch four linked containers: a Jupyter notebook server on localhost port 35200, RStudio on port 8787, Apache Drill on port 8047, and the supporting Zookeeper container. If the ./notebooks, ./notebooks/data, ./R and ./R/data subfolders don’t exist, they will be created.
We can use either client to download data files and run Apache Drill queries against them. In the Jupyter notebook, I used the pydrill package to connect to Drill. Note that the hostname used is the linked container name (in this case, drill).
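For example, a minimal connection check from a notebook might look something like the sketch below (this isn’t a copy of my actual notebook cells; the host name drill is just the link alias defined in the docker-compose file):

from pydrill.client import PyDrill

# 'drill' is the linked container name defined in docker-compose.yaml
drill = PyDrill(host='drill', port=8047)

if not drill.is_active():
    raise RuntimeError('Apache Drill is not reachable on drill:8047')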

If we download data to the ./notebooks/data folder, which is mounted inside the Apache Drill container as /nbdata, we can query against it.
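As a rough sketch, suppose we’ve saved a file as ./notebooks/data/example.csv (example.csv is a made-up filename for illustration); the query from the notebook might then look something like this:

# ./notebooks/data on the host is mounted as /nbdata inside the Drill container,
# so Drill sees the file at /nbdata/example.csv via its dfs storage plugin
result = drill.query("SELECT * FROM dfs.`/nbdata/example.csv` LIMIT 5")

df = result.to_dataframe()  # pydrill can return the result as a pandas dataframe
print(df.head())

(With Drill’s default CSV settings, each row comes back as a single columns array rather than as named columns, which is one reason the header tweak mentioned below is worth making.)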
(Note – it probably would make sense to use a modified Apache Drill container configured to use CSV headers, as per Querying Large CSV Files With Apache Drill.)
We can also query against that same data file from the RStudio container. In this case I used the DrillR package. (I had hoped to use the sergeant package (“drill sergeant”, I assume?! Sigh… ;-), but it uses the RJDBC package, which expects to find Java installed, rather than DBI, and Java isn’t installed in the rocker/tidyverse container I used.) UPDATE: sergeant now works without the Java dependency… Thanks, Bob :-)
I’m not sure if DrillR is being actively developed, but it would be handy if it could return the data from the query as a dataframe.

So, getting up and running with Apache Drill and a data analysis environment is not that hard at all, if you have docker installed :-)