Deep Down Below – Using in-database analytics from within Tableau (with MADlib)
Introduction
Using Tableau for visualizing all kinds of data is quite a joy, but it’s not that strong on built-in analytics or predictive features. Tableau’s integration of R was a huge step in the right direction (and I love it very much – see here, here and here), but it still has some limitations (e.g. no RAWSQL-like functions to work on a row level; dimension restrictions regarding input and output of R table calculations) that make it more difficult to work with the results. That’s why I’m looking for additional software for my own analytical toolbox that offers an easy way to integrate different types of statistical modeling into Tableau.
That’s why MADlib, a “plug-in” for PostgreSQL and Pivotal databases, attracted me. It offers a lot of machine learning and statistical functionality inside the database. Despite the fact that MADlib is limited to those two databases (but PostgreSQL is free and open source), it comes with a strong advantage: there is no need to move the data used for modeling – everything is done directly inside the database. The picture below shows the difference in the data flow when using MADlib instead of R for running data mining algorithms.
This blog post shows how we can execute machine learning algorithms in MADlib from Tableau, retrieve the modeling results and create some advanced visualizations to support the transformation of statistical results into real-world insights. To do so, we
- start with simple k-means clustering to show how to connect to PostgreSQL/MADlib,
- move on to decision trees, where stored procedures are applied to deploy small analytical workflows, and
- end with association rule mining, demonstrating how Tableau can be used to make the mining results visible and understandable for the analyst (with a “little” help from R).
Prerequisites
Because PostgreSQL and MADlib are not shipped as part of Tableau, both need to be installed separately. This paragraph can’t give a step-by-step walkthrough, but instead offers some tips that hopefully help you get started as quickly as possible:
Whereas the installation of PostgreSQL is well documented for all types of OS, the installation of MADlib is more difficult (you can find the official installation documentation here). Installation packages for CentOS and Red Hat (RPM) exist, but not for other Linux/Unix distributions or Windows. First I tried the MADlib installation on Ubuntu, but failed with some cryptic error messages (I’m not a Unix expert, which is what “cryptic” means here). Therefore, I decided to follow this easier approach, which is sufficient for testing purposes:
- Install some VM software (for example Virtual Box).
- Install and configure the correct CentOS distribution (for Virtual Box you can get an image here) – you need a 64-bit version.
- Install PostgreSQL according to MADlib installation document.
- Install MADlib according to MADlib installation document.
- Configure MADlib according to MADlib installation document (I configured PostgreSQL to allow only local connections).
- To connect from my Windows 8 host I use PuTTY with SSH port forwarding configured as seen in the screenshot. Additionally, you have to set the VM network configuration to “Host-only Adapter”.
- Try it – start the VM -> connect with PuTTY -> open Tableau and test if you can connect.
Next we need to fill PostgreSQL with data for our examples. Throughout this blog post I use three different datasets (all included with R). The following three code snippets show how to load them into PostgreSQL.
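As a rough orientation for these upload snippets, the table structures they end up with could look like the following sketch (all table and column names as well as the types are assumptions of mine, not necessarily those of the original scripts):

```sql
-- Hypothetical target tables for the three example datasets.

-- iris: MADlib's k-means expects the feature vector of each individual as an array
CREATE TABLE iris (
    id       serial PRIMARY KEY,
    features float8[],   -- {sepal_length, sepal_width, petal_length, petal_width}
    species  text
);

-- car90: one column per feature plus the class label used for the decision tree
CREATE TABLE car90 (
    id    serial PRIMARY KEY,
    price float8,
    -- ... further features from the rpart::car90 data ...
    type  text            -- class label to predict
);

-- groceries: key-value list, one row per (transaction, item) pair
CREATE TABLE groceries (
    tid  integer,         -- transaction id (key)
    item text             -- product name (value)
);
```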
The first dataset is the well-known iris dataset. It is used to demonstrate k-means clustering. The interesting point in the upload script is that we have to use an array as the data structure to store the feature values per individual, as required by MADlib’s k-means implementation. Everything else is pretty straightforward.

The second dataset is named car90 (more than 30 features describing about 111 cars) and comes from the rpart package; it is used to show classification with decision trees.

Finally, there is a dataset about real-world point-of-sale transactions that is ideal for testing association rule mining. The dataset is part of the arules package. To be able to use MADlib’s implementation of association rules, we need to transform the dataset into a key-value list, where the key identifies the transaction and the value is the name of a product being part of this transaction.

After all these preparations we are now ready for the first Tableau visualization regarding k-means.
Clustering using k-means
The k-means implementation of MADlib takes the source table as input plus additional parameters such as the number of clusters and the distance function. It returns the identified centroids. Those centroids can be used directly to assign a set of existing points to clusters with the function ‘madlib.closest_column’. To control the k-means algorithm we use a Custom SQL connection to pass parameters from Tableau.
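A minimal sketch of such a Custom SQL query, assuming the iris table from above (the argument list of madlib.kmeanspp may differ between MADlib versions – check the documentation for yours; <Parameters.k> stands for a Tableau parameter):

```sql
-- Run k-means on the iris features and assign every row to its closest centroid.
SELECT i.*,
       (madlib.closest_column(km.centroids, i.features)).column_id AS cluster_id
FROM   iris AS i,
       (SELECT (madlib.kmeanspp(
                   'iris',                       -- source table
                   'features',                   -- column holding the point array
                   <Parameters.k>,               -- number of clusters (Tableau parameter)
                   'madlib.squared_dist_norm2',  -- distance function
                   'madlib.avg',                 -- centroid aggregate
                   20                            -- max. number of iterations
               )).centroids AS centroids
       ) AS km;
```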
From the database query, Tableau automatically converts the float array containing the iris flower features into a string. To extract petal length and width as well as sepal length and width, we need to split the string based on the delimiter “;”. Extracting single elements from a string is a common pattern, especially when using R table calculations (put all return values for one entry of the result set into one string separated by a common delimiter to avoid multiple computations). A concise and easy way of splitting a string is to build a calculated field for every position of the delimiter and use those fields afterwards to extract a specific feature depending on its position inside the string.
Dragging every feature that was used for clustering onto the row and column shelves gives us a scatter plot for visualizing the clustering error. We use color to show the true class and shape to show the predicted one. On the resulting dashboard one can interactively explore how changing the given parameters influences the clustering.
Classification using C4.5 (Decision trees)
The second example with MADlib is a little bit more complex. We use decision trees to predict the class labels of a hold-out set, whereas the model is built on a separate training dataset. This can be useful in two ways: either 1) to validate performance when tweaking model parameters (building a model), or 2) to use a model with known performance, built on the available production data, to predict the class of new, unlabeled data (using a model in production).
As a first step, a short SQL snippet prepares the stage by splitting the data into a training and a test set (a minimal sketch of this split follows after the workflow list below). We can then use MADlib’s built-in functions for decision trees to work directly with those two tables. The overall workflow is as follows:
- clean up old results
- create model on training data
- classify data from the test set with the developed model
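Before this workflow can run, the data has to be split. The original snippet is not shown here, but a minimal sketch, assuming the car90 table from above and an illustrative 70/30 random split, could look like this:

```sql
-- Hypothetical 70/30 random split of car90 into a training and a test table.
DROP TABLE IF EXISTS car90_split, car90_train, car90_test;

CREATE TABLE car90_split AS
    SELECT *, (random() < 0.7) AS is_train FROM car90;

CREATE TABLE car90_train AS SELECT * FROM car90_split WHERE is_train;
CREATE TABLE car90_test  AS SELECT * FROM car90_split WHERE NOT is_train;

DROP TABLE car90_split;
```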
Because we want to use the same logic as in the clustering example to control the modeling process and retrieve the results, we wrap the whole MADlib decision tree code into a stored procedure. This stored procedure can then be called as part of a Custom SQL connection. Everything relevant for controlling the algorithm is part of the procedure header and can be set from Tableau by using parameters to pass values to MADlib. Also note that the stored procedure can easily be extended to incorporate additional parameters – for example table and attribute names. That would allow changing the data source for training and testing as well as the relevant feature set on the fly, without modifying the procedure itself! The stored procedure looks as follows:
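The original procedure is not reproduced here; the PL/pgSQL skeleton below only sketches the pattern. The wrapper name, its parameters and especially the MADlib decision tree calls inside are assumptions and depend on the MADlib version you use:

```sql
-- Hypothetical wrapper around the MADlib decision tree workflow.
CREATE OR REPLACE FUNCTION classify_cars(
    p_split_criterion  text,      -- e.g. 'gainratio'
    p_max_tree_depth   integer,
    p_confidence_level float8
)
RETURNS TABLE (car_id integer, predicted_class text) AS
$$
BEGIN
    -- 1) clean up the results of a previous run
    DROP TABLE IF EXISTS car_tree, car_tree_pred;

    -- 2) build the model on the training table
    --    (placeholder: the MADlib C4.5 module exposes a training function,
    --     e.g. madlib.c45_train(...); the argument list is version dependent)

    -- 3) classify the hold-out set with the freshly built model
    --    (placeholder: e.g. madlib.c45_classify(...), writing into car_tree_pred)

    -- 4) hand the predictions back to the caller (i.e. Tableau's Custom SQL)
    RETURN QUERY SELECT p.id, p.class AS predicted_class
                 FROM   car_tree_pred AS p;
END;
$$ LANGUAGE plpgsql;
```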
The data connection joins the result of the classification with the raw data of the individuals to be classified (containing the true class label in this case):
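A sketch of this connection, reusing the hypothetical classify_cars() wrapper from above and passing Tableau parameters into it (parameter names are illustrative):

```sql
-- Join the predictions from the wrapper with the hold-out data (incl. true label).
SELECT t.*,
       p.predicted_class
FROM   car90_test AS t
JOIN   classify_cars(<Parameters.Split Criterion>,
                     <Parameters.Max Tree Depth>,
                     <Parameters.Confidence Level>) AS p
  ON   p.car_id = t.id;
```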
Now we can start building our dashboard showing a confusion matrix (true class labels versus predicted class labels). Additionally, we add a table calculation that computes the accuracy of our classifier and is visible when hovering over one of the cells. Changing one of the parameters on the right side of the dashboard now instantly restarts the modeling and classification task and updates the matrix accordingly.
The way we access the predicted label offers us more opportunities to use it as part of a visualization (sorting, hierarchies, …) than the results of R table calculations do. One example of this can be found by comparing this solution for displaying a confusion matrix with the “hack” from my first blog post (a simple clustering example with Tableau’s R interface – here). The resulting clean and simple structure of the worksheet is what I like about the MADlib solution.
Visualizing the results for association rule mining
Our last example for MADlib is about association rule mining. It differs from the previous examples because rule mining is more of a sequential process, where the first step is to calculate the set of rules that fulfill certain requirements and the second step is to use visualization (or other tools) to detect relevant patterns. In most cases you will not step back after looking at the results of the first step and change your modeling parameters. Instead, you will lower the thresholds right from the beginning and filter out irrelevant rules afterwards. Therefore, the focus is on exploring the results instead of controlling the algorithm.
Creating the association rules for the grocery dataset in MADlib is straightforward: the apriori algorithm is started to generate all rules with a minimum support of 0.001 and a minimum confidence of 0.5; the third and fourth parameters specify the names of the columns containing the transaction id and the item.
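Based on the parameters just described, the call could look roughly like this (argument order per MADlib’s assoc_rules documentation; table and column names as assumed earlier):

```sql
-- Generate association rules for the groceries table;
-- the result is written to the table assoc_rules in the given schema.
SELECT * FROM madlib.assoc_rules(
    0.001,        -- minimum support
    0.5,          -- minimum confidence
    'tid',        -- column containing the transaction id
    'item',       -- column containing the item name
    'groceries',  -- input table
    'public',     -- schema for the output table
    false         -- verbose output
);
```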
Using our transaction data, the result is a table containing more than 5,800 rules together with four different quality metrics, making it difficult to spot anything meaningful. The question now is: how can visualization help us search the result for interesting patterns? Thanks to ongoing research we are not the first to ask that question, and so you quickly come across material from Michael Hahsler and Sudheer Chelluboina, the creators of the R package “arulesViz”. The documentation of the package contains a well-written and comprehensive vignette describing a couple of static and interactive graphs, all presenting different views on the result set. Wouldn’t it be nice to have some of them in Tableau, enriched with additional “interactive features”? Let’s try to implement the scatter plot (p. 5), the matrix-based visualization (p. 9) including the grouped matrix-based visualization (p. 12), and the graph-based visualization (p. 16).
We first should think about a good data structure to move the data from PostgreSQL/MADlib to Tableau, because we want to avoid ending up with a different data connection for every single visualization.

The scatter plot is the simplest one: loading one row per rule would be enough (the quality metrics of the rule serve as dimensions in the scatter plot). This is already the structure of MADlib’s output table for association rules. The matrix-based visualization is more complex. Because we want to split the right-hand side (RHS) of a rule into single products, it is better to have one row for every item on the RHS of a rule (“… => butter & bread” is then mapped to one row for “butter” and one for “bread”). The grouped matrix-based visualization is similar. But this is not enough for the graph-based view. Here we need one row per possible combination of one item from the left-hand side (LHS) with one from the RHS (“beer & toast & wine => butter & bread” is then represented by 3 x 2 = 6 rows).

For this it makes sense to think about how we can get Tableau to display graph-like structures. I took over Bora’s approach of combining a point layer (the nodes) and a line layer (the edges). We need to duplicate every row and add two additional attributes – a binary one to filter out the duplicates afterwards (also used for the path order of the line layer) and a new label that contains the LHS item in one case and the RHS item in the duplicated case. Taking all this into account, we end up with the following SQL statement:
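The original statement is not reproduced here; the sketch below shows the idea, assuming MADlib’s output table assoc_rules with the array columns pre (LHS) and post (RHS) – the ranking and labelling details are illustrative:

```sql
-- One row per (LHS item, RHS item) combination, duplicated for the two graph layers.
WITH expanded AS (
    SELECT r.ruleid,
           lhs.item AS lhs_item,
           rhs.item AS rhs_item,
           r.support, r.confidence, r.lift, r.conviction
    FROM   assoc_rules AS r,
           unnest(r.pre)  AS lhs(item),
           unnest(r.post) AS rhs(item)
),
ranked AS (
    SELECT e.*,
           rank() OVER (ORDER BY support    DESC) AS rank_support,
           rank() OVER (ORDER BY confidence DESC) AS rank_confidence,
           rank() OVER (ORDER BY lift       DESC) AS rank_lift,
           rank() OVER (ORDER BY conviction DESC) AS rank_conviction
    FROM   expanded AS e
)
-- duplicate every row: path order 0 carries the LHS item as label, 1 the RHS item
SELECT ranked.*, 0 AS path_order, lhs_item AS node_label FROM ranked
UNION ALL
SELECT ranked.*, 1 AS path_order, rhs_item AS node_label FROM ranked;
```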
The query contains some “extras” like “ranking features” (one for each of the quality metrics) that can be used later in the dashboard to reduce the amount of data to be visualized.
As before, we choose a Custom SQL connection to grab the data. But because of the sequential process described above, we may change it to a data extract instead of a live connection. In this case we get some extra speed, but if the analyst repeats the mining process the extract needs to be refreshed.
Let’s start with the scatter plot visualization, showing every rule by its position with regard to support and confidence. We therefore place both measures on the column and row shelves and drag the rule id to the detail shelf. For color and size we use two calculated fields, each representing a quality metric that can be chosen by the user. The underlying design pattern is simple: for a dynamic measure you need a calculated field and one parameter having one value per possible metric. In the calculated field, use a case statement to switch to the chosen metric.
A second sheet showing detailed information about selected rules will support the visual inspection of the rules. Here we use a simple crosstab format. At the end both sheets are arranged vertically on the dashboard. The last thing missing is a dashboard action to filter the list view to the rules that got selected in the scatter plot, or vice versa. The screenshots below show the action definition for the scatter plot followed by the final dashboard.
The next visualization is the matrix-based plot, which is straightforward: take the items from the RHS to columns and the unique antecedent item sets to rows. Color is based on confidence, as is the sorting of rows and columns. Additional filters allow excluding rules below a certain threshold. As aggregation method I used the average, because this allows using an extract as well as a live connection to PostgreSQL. The final dashboard looks as follows:
The matrix-like visualization has some obvious difficulties representing details within this large amount of information. For example, one may be interested in expanding the rules at the top, all pointing towards “whole milk” and “other vegetables”. There is also a slight indication of some smaller and bigger clusters. A visualization technique to overcome this limitation is to cluster the unique item sets from the LHS, yielding the so-called grouped matrix-based visualization. The idea is to group antecedent item sets into one cluster if they share similar values for a chosen quality metric over the set of common consequents. This approach supports both directions – understanding the “whole picture” as well as drilling down and inspecting individual clusters.
To be honest, I didn’t succeed in rebuilding the complete visualization in Tableau, but I think the intermediate results are worth sharing: the workhorse is an R calculation. It transforms all data about LHS item sets, the items from the RHS of the rules and the lift metric into a simple table structure suitable for clustering. Every row in this table represents one unique item set from the LHS, whereas the columns represent the different items from the RHS. The cell values correspond to the lift values of the corresponding rules.
We then use the simple k-means clustering algorithm to compute the cluster assignment. Before we write the results back from R to Tableau, we compute some additional details characterizing every cluster (most common item, number of other items, number of occurrences of the most common item).
Back in Tableau we keep the matrix layout, but now visualize the LHS item sets grouped by the corresponding cluster labels.
At this point I had to stop, because I didn’t find a way to aggregate the details at the level of clusters (changing the visualization from LHS item sets vs. RHS items to clusters vs. RHS items). As soon as I drop the LHS item sets from the visualization, the R calculation receives wrong input. Putting the LHS item sets on the detail shelf is not a solution either. I’d be happy if any reader has an idea how to achieve something that is closer to the R implementation in arulesViz.
But I think the visualization has its value nevertheless, showing an interesting small “unhealthy food” cluster right at the beginning. And right after that one finds a sort of “fitness” cluster where, for example, yogurt and milk are closely related to vegetables and different kinds of fruit.
I parameterized the visualization in a way that allows the user to choose the number of clusters as well as to filter out rules before they enter the R calculation.
The last visualization to implement in Tableau is the graph-based one. This visualization is especially useful for inspecting small sets of rules or specific items. It gives an easily understandable picture of the relationships between items.
I’m very thankful that a couple of months ago Bora explained how graphs can easily be visualized in Tableau. I will not repeat his explanations, but jump directly to the relevant R table calculation:
The most interesting part here is the calculation of summary statistics for every item. We use those statistics afterwards in Tableau for the color and size of our nodes. It’s quite nice that, because these calculations are computed dynamically on the data set handed to R, they always reflect the filtering of the current view.
The Tableau visualization of the graph consists of two layers – a line layer for the edges and a point layer for the nodes. Symbols are used to clearly separate nodes (aka items) that form the consequents (RHS) of the rules (circle) from nodes that only serve as LHS items (box). A ranking filter helps us reduce the amount of data used for computing the graph layout in R. This is an extremely important point, as the response time of Tableau (including the R calculation) increases if too many rules get selected, and the visualization becomes cluttered. For the ranking filter we use one of the pre-computed features from the database. The final dashboard looks as follows:
Summary
As using advanced analytical models is becoming more and more important, I need to extend the “toolbox” I currently have in Tableau. Using R and its libraries was the first step; MADlib adds additional capabilities. It is of course limited by the number of algorithms available at the moment and by the need for a PostgreSQL installation, but the integration via Custom SQL connections runs smoothly and offers a much easier and more direct way to extract predicted labels or cluster assignments compared to R. Not mentioned before, but also very important: from my point of view MADlib supports a stronger separation between the statistical engineer and the visual analyst using Tableau, and makes it easier for the latter to work with the results (for example by using stored procedures).
As the last example about graph visualization shows, combining the different worlds (Tableau, MADlib and R) allows solving even more complex scenarios using statistical models at different stages of your “visualization workflow” – clearly true Visual Analytics.
Please also find my workbook with all the examples from this blog post attached:
Workbook connecting Tableau and MADlib (Extract).twbx.
(Note: as the workbook contains data extracts, you can’t use the MADlib functions out of the box. You have to install MADlib first and change the connection settings accordingly.)