Rolling Your Own Jupyter and RStudio Data Analysis Environment Around Apache Drill Using docker-compose
I had a bit of a play last night trying to hook a Jupyter notebook container up to an Apache Drill container using docker-compose. The idea was to have a shared data volume between the two of them, but I couldn’t for the life of me get that to work using the docker-compose version 2 or 3 (services/volumes) syntax – for some reason, none of the Apache Drill containers I tried would fire up properly.
So I eventually (3am… :-( ) went for a simpler approach, syncing data through a local directory on the host.
The result is something that looks like this:
I found the Apache Drill container, along with an Apache Zookeeper container to keep it in check, via Docker Hub. I also reused an off-the-shelf RStudio container (rocker/tidyverse); the Jupyter container is one I rolled for TM351.
The Jupyter and RStudio containers can both talk to the Apache Drill container, and each analysis app has access to its own data folder, mounted from an application folder in the current directory on the host. The data folders are also mounted into separate directories inside the Apache Drill container, so both applications can run Drill queries against data files contained in either data directory.
This is far from ideal, but it works. (The structure is arranged this way so that RStudio and Jupyter scripts can both be used to download data into a data directory that is visible to the Apache Drill container. Another approach would be to mount a separate ./data directory and provide some means of populating it with data files. Alternatively, if the files already exist on the host, mounting the host data directory onto a /data volume in the Apache Drill container would work too.)
Here’s the docker-compose.yaml file I’ve ended up with:
drill:
  image: dialonce/drill
  ports:
    - 8047:8047
  links:
    - zookeeper
  volumes:
    - ./notebooks/data:/nbdata
    - ./R/data:/rdata

zookeeper:
  image: jplock/zookeeper

notebook:
  container_name: notebook-apache-drill-test
  image: psychemedia/ou-tm351-jupyter-custom-pystack-test
  ports:
    - 35200:8888
  volumes:
    - ./notebooks:/notebooks/
  links:
    - drill:drill

rstudio:
  container_name: rstudio-apache-drill-test
  image: rocker/tidyverse
  environment:
    - PASSWORD=letmein # default user is: rstudio
  volumes:
    - ./R:/home/rstudio
  ports:
    - 8787:8787
  links:
    - drill:drill
If you have docker installed and running, running docker-compose up -d in the folder containing the docker-compose.yaml file will launch the linked containers: a Jupyter notebook server on localhost port 35200, RStudio on port 8787, and Apache Drill on port 8047 (along with the Zookeeper container that Drill needs). If the ./notebooks, ./notebooks/data, ./R and ./R/data subfolders don’t exist, they will be created.
We can use the clients to variously download data files and run Apache Drill queries against them. In Jupyter notebooks, I used the pydrill package to connect. Note that the hostname used is the linked container name (in this case, drill).
If we download data to the ./notebooks/data folder, which is mounted inside the Apache Drill container as /nbdata, we can query against it.
(Note – it probably would make sense to use a modified Apache Drill container configured to use CSV headers, as per Querying Large CSV Files With Apache Drill.)
We can also query against that same data file from the RStudio container. In this case I used the DrillR package. (I had hoped to use the sergeant package (“drill sergeant”, I assume?! Sigh… ;-) but it used the RJDBC package, which expects to find java installed, rather than DBI, and java isn’t installed in the rocker/tidyverse container I used.) UPDATE: sergeant now works without the Java dependency… Thanks, Bob :-)
I’m not sure if DrillR is being actively developed, but it would be handy if it could return the data from the query as a dataframe.
So, getting up and running with Apache Drill and a data analysis environment is not that hard at all, if you have docker installed :-)