I recently got back from Strata West 2017 (where I ran a very well received workshop on R and Spark). One thing that really stood out for me at the exhibition hall was Bokeh plus datashader from Continuum Analytics. I had the privilege of having Peter Wang himself demonstrate datashader for me and answer a few of my questions.
I am so excited about datashader's capabilities that I literally will not wait for the functionality to be exposed in R through rbokeh. I am going to leave my usual knitr/rmarkdown world and dust off Jupyter Notebook just to use datashader plotting. This is worth trying, even for diehard R users.
datashader
Every plotting system has two important ends: the grammar where you specify the plot, and the rendering pipeline that executes the presentation. Switching plotting systems means switching how you specify plots, which can be unpleasant (this is one of the reasons we wrap our most re-used plots in WVPlots: to hide or decouple how the plots are specified from the results you get). Given the convenience of the ggplot2 grammar, I am always reluctant to move to other plotting systems unless they bring me something big (and even then sometimes you don't have to leave: for example, the absolutely amazing adapter plotly::ggplotly).
Currently, to use datashader you must talk directly to Python and Bokeh (i.e., learn a different language). But what that buys you is massive: in-pixel analytics. Let me clarify that.
datashader makes points and pixels first-class entities in the graphics rendering pipeline. It admits they exist (many plotting systems render to an imaginary, infinite-resolution abstract plane) and allows the user to specify scale-dependent calculations and re-calculations over them. It is easiest to show by example.
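To make that concrete, here is a minimal sketch of the core datashader pipeline as I understand it: aggregate points into an explicit pixel grid, then map the per-pixel aggregates to colors. The data frame and ranges are invented for illustration:

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# Invented example data: one million (x, y) points.
df = pd.DataFrame({'x': np.random.standard_normal(1000000),
                   'y': np.random.standard_normal(1000000)})

# A Canvas is an explicit pixel grid: width, height, and data ranges.
canvas = ds.Canvas(plot_width=400, plot_height=400,
                   x_range=(-4, 4), y_range=(-4, 4))

# Aggregate: each pixel accumulates the count of points landing in it.
agg = canvas.points(df, 'x', 'y', agg=ds.count())

# Shade: map per-pixel counts to colors, histogram-equalized for contrast.
img = tf.shade(agg, cmap=['lightblue', 'darkblue'], how='eq_hist')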
Please take a look at these stills from the datashader US Census example. We can ask pixels to be colored by the majority race in the region of Lake Michigan:
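In datashader terms that is a categorical aggregate; roughly as follows (the coordinate columns, the 'race' column, and the color key follow my reading of the published census example, so treat them as assumptions):

import datashader as ds
import datashader.transfer_functions as tf

# df is assumed to hold the census points, one row per person.
canvas = ds.Canvas(plot_width=900, plot_height=525)
agg = canvas.points(df, 'meterswest', 'metersnorth',
                    agg=ds.count_cat('race'))

# Pixels blend the category colors in proportion to per-pixel counts,
# so a pixel dominated by one category takes on that category's color.
color_key = {'w': 'aqua', 'b': 'lime', 'a': 'red', 'h': 'fuchsia'}
img = tf.shade(agg, color_key=color_key, how='eq_hist')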
If we were to use the interactive version of this graph, we could zoom in on Chicago, and the majorities would be re-calculated based on the new scale:
What is important to understand is that this is vastly more powerful than zooming in on a low-resolution rendering:
and even more powerful than zooming out on a static high-resolution rendering:
datashader can redo aggregations and analytics on the fly. It can recompute histograms and renormalize them relative to what is visible to maintain contrast. It can find patterns that emerge as we change scale: think of zooming in on a grey pixel that resolves into a black and white checkerboard.
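At the time of writing, the way to get that live re-aggregation in a notebook was a callback that datashader's bokeh_ext.InteractiveImage helper invokes with the new visible ranges on every zoom or pan. A sketch, reusing the df from the earlier snippet:

from bokeh.plotting import figure
from datashader.bokeh_ext import InteractiveImage

def image_callback(x_range, y_range, w, h):
    # Re-aggregate over only the visible region at the current pixel
    # size, then re-shade; eq_hist renormalizes contrast to the counts
    # that are actually on screen.
    cvs = ds.Canvas(plot_width=w, plot_height=h,
                    x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'x', 'y', agg=ds.count())
    return tf.shade(agg, how='eq_hist')

p = figure(x_range=(-4, 4), y_range=(-4, 4))
InteractiveImage(p, image_callback)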
You need to run datashader to really see the effect. The HTML exports, while interactive, sometimes do not perform correctly in all web browsers.
An R example
I am going to share a simple datashader example here. Again, to see the full effect you would have to copy it into a Jupyter notebook and run it. But I will use it to show my point.
After going through the steps to install Anaconda and Jupyter Notebook (plus some more conda install steps to include the necessary packages), we can make a plot of the ggplot2 example data diamonds.

ggplot2 renderings of diamonds typically look like the following (and show off the power and convenience of the grammar):
A datashader rendering looks like the following:
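For reference, here is a minimal sketch of the notebook cell that could produce such a rendering. I am assuming diamonds has been exported from R to a csv file; the file name is my invention:

import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# diamonds exported from R as a csv (file name is an assumption).
diamonds = pd.read_csv('diamonds.csv')

# Ranges default to the data extent when not given explicitly.
canvas = ds.Canvas(plot_width=400, plot_height=400)
agg = canvas.points(diamonds, 'carat', 'price', agg=ds.count())
tf.shade(agg, how='eq_hist')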
If we use the interactive rectangle selector to zoom in on the apparently isolated point around $18300 and 3.025 carats we get the following dynamic re-render:
Notice the points shrank (and didn't subdivide), and there are some extremely faint points. There is something wrong with that as a presentation; but it isn't because of datashader! It is something unexpected in the data which is now jumping out at us.
datashader is shading proportional to the aggregated count. So the small point staying very dark (and being so dark it causes other points to render near transparent) means there are multiple observations in this tiny neighborhood. Going back to R we can look directly at the data:
> library("dplyr") > diamonds %>% filter(carat>=3, carat<=3.05, price>=18200, price<=18400) # A tibble: 5 × 10 carat cut color clarity depth table price x y z <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> 1 3.01 Premium I SI2 60.2 59 18242 9.36 9.31 5.62 2 3.01 Fair I SI2 65.8 56 18242 8.99 8.94 5.90 3 3.01 Fair I SI2 65.8 56 18242 8.99 8.94 5.90 4 3.01 Good I SI2 63.9 60 18242 9.06 9.01 5.77 5 3.01 Good I SI2 63.9 60 18242 9.06 9.01 5.77
There are actually 5 rows with the exact carat and price indicated by the chosen point. The point stood out at fine scale because it indicated something subtle in the data (repetitions) that the analyst may not have known about or expected. The "ugly" presentation was an important warning. This is hands-on work with the data: the quickest path to correct results.
For some web browsers, you don’t always see proper scaling, yielding artifacts like the following:
The Jupyter notebooks always work, and web browsers usually work (I am assuming it is security or ad-blocking that is causing the effect, not a datashader issue).
Conclusion
datashader brings resolution-dependent per-pixel analytics to production. This is a very powerful style of interaction that is going to appear in more and more places. It is something the Continuum Analytics team has written about before, and it requires some interesting cross-compiling (via Numba) to implement at scale. Now that analysts have seen this in action, they are going to want it and ask for it.