Agglomerative Hierarchical Clustering with Euclidean Metric
Let’s assume eight points:
> pts <- data.frame(x=c(1,1,2,3,2,3,4,4), y=c(1,2,4,4,5,5,1,2))
> plot(pts, pch=16, cex=2, xaxt="n", yaxt="n", xlim=c(0,5), ylim=c(0,6))
> axis(1, at=0:5)
> axis(2, at=0:6)
> d <- dist(pts, method="euclidean", upper=TRUE)
> round(d,2)
     1    2    3    4    5    6    7    8
1       1.00 3.16 3.61 4.12 4.47 3.00 3.16
2 1.00      2.24 2.83 3.16 3.61 3.16 3.00
3 3.16 2.24      1.00 1.00 1.41 3.61 2.83
4 3.61 2.83 1.00      1.41 1.00 3.16 2.24
5 4.12 3.16 1.00 1.41      1.00 4.47 3.61
6 4.47 3.61 1.41 1.00 1.00      4.12 3.16
7 3.00 3.16 3.61 3.16 4.47 4.12      1.00
8 3.16 3.00 2.83 2.24 3.61 3.16 1.00
Now agglomerative hierarchical clustering will combine the two closest clusters at each step until there is nothing left to be combined. In the beginning every single point is considered a cluster by itself. So in the above case, any of six possible cluster combinations with the minimal available distance of 1 may be chosen for the first step. Of course Euclidean distance is only one option next to several others like squared Euclidean or Manhattan distance. Another decision you have to make is how to choose the two points in two clusters between which to calculate the distance – the so-called linkage.
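To make that decision tangible, here is a minimal sketch of how single and complete linkage would measure the distance between the clusters {1,2} and {7,8}, reusing the distance matrix d from above:

> m <- as.matrix(d)
> min(m[c(1,2), c(7,8)])  # single linkage: closest pair of points
[1] 3
> max(m[c(1,2), c(7,8)])  # complete linkage: farthest pair of points
[1] 3.162278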
This is why the initial example leads to different trees for different linkage methods:
pts <- data.frame(x=c(1,1,2,3,2,3,4,4), y=c(1,2,4,4,5,5,1,2))
d <- dist(pts, method="euclidean")
par(mfrow=c(1,2))
plot(hclust(d, method="single"))
plot(hclust(d, method="complete"))
Inferring Clusters from a Tree
For complete linkage you can get either 8, 4, 3, 2 or 1 clusters depending on where you cut the tree. If you cut at height 4 you get two clusters – if you cut at 2 you'll get three clusters. A human being would most likely see three clusters in the above example. Why? Because the pattern invites the viewer to sort the observable distances into two categories – relatively close and relatively distant. Actually this reasoning can be very easily inferred from the trees above. The branching height is the distance between the two clusters the branching refers to. With complete linkage the clusters {3,4} and {5,6} have a distance of about 1.5 (actually it is sqrt(2)), and {1,2} and {7,8} have a distance of about 3 (actually it is sqrt(1² + 3²)). And if you think about where to cut off the tree, then the most sensible height lies within the largest gap between two consecutive branching heights. Above it the “relatively distant” distances are found and below it the “relatively short” ones.
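A quick check of those numbers and cluster counts, reusing d from above:

> h <- hclust(d, method="complete")
> max(as.matrix(d)[c(3,4), c(5,6)])  # complete-linkage distance, sqrt(2)
[1] 1.414214
> cutree(h, h=4)  # cutting at height 4: two clusters
[1] 1 1 2 2 2 2 1 1
> cutree(h, h=2)  # cutting at height 2: three clusters
[1] 1 1 2 2 2 2 3 3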
A way to visualize this idea is to draw a histogram or a density of the branching heights present in the tree. This density is displayed under “density of branching heights” in the screenshot above. Something around the local minimum between the two modes would be a reasonable choice for a cutting height.
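In code this boils down to a one-liner – a sketch, reusing h from above; the labeling is a matter of taste:

> plot(density(h$height), main="density of branching heights")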
This heuristic might already be sufficient, but of course more complex techniques are conceivable. One might differentiate between forkings at a high level or take the branching density into account. The latter idea is depicted under “num of clusters (y) when cutting at height (x)”. You would want to cut the tree at a point below which suddenly a lot of branchings take place.
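That curve can be sketched just as easily by counting the clusters resulting from a range of cutting heights, again reusing h:

> heights <- seq(0, max(h$height), length.out=100)
> n <- sapply(heights, function(x) length(unique(cutree(h, h=x))))
> plot(heights, n, type="s", xlab="cutting at height (x)", ylab="num of clusters (y)")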
Implementation of Maximum Gap Heuristic
And this is how a tree cut placed somewhere within the maximum gap between branching heights might be implemented in R.
> pts <- data.frame(x=c(1,1,2,3,2,3,4,4), y=c(1,2,4,4,5,5,1,2))
> d <- dist(pts, method="euclidean")
> h <- hclust(d, method="complete")
> h$height
[1] 1.000000 1.000000 1.000000 1.000000 1.414214 3.162278 4.472136
>
> i <- which.max(diff(h$height))
> cut_height <- (h$height[i] + h$height[i+1])/2
> cut_height
[1] 2.288246
>
> clusters <- stats::cutree(h, h=cut_height)
> clusters
[1] 1 1 2 2 2 2 3 3
>
> par(mfrow=c(1,2))
> plot(h)
> abline(h=cut_height, col="red", lty=2)
> plot(pts, col=c("red","green","blue")[clusters], pch=16)
And this is what it would look like:
BTW – in case you want to decorate your dendrogram with those fancy dashed boxes to indicate the clustering, you should look into Tal Galili’s dendextend package. It’s like a beauty parlor for dendrograms.
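If you just need plain boxes without an extra package, base R’s stats::rect.hclust() will do – a quick sketch, reusing h and the three clusters from above:

> plot(h)
> rect.hclust(h, k=3, border="red")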
Shiny and D3.js
If shinyapps.io is too slow for your liking you can also launch the app right from its GitHub repository by running:
shiny::runGitHub("hclust-shiny", "joyofdata")
I really have to say Shiny is an awesome product in pretty much every regard. It is super-intuitive and easy to get started with – the reactivity framework feels natural and allows you to implement client-server interaction in a very straightforward way. Definitely noteworthy is the awesome documentation, providing bite-sized introductions to the various aspects of Shiny, featuring great examples.
I realized the integration of Shiny with D3.js by having D3.js write its state changes into a hidden text input and then trigger a change event. Less straightforward seems to be influencing client-side JavaScript code from server-side Shiny. It seems this is done by injecting JavaScript into an htmlOutput() section. But maybe there is a better method – for this app I only had to inject CSS to color the clusters, so I didn’t put too much effort into solving this problem.
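For illustration, here is a minimal sketch of that hidden-input pattern – the id "d3_state" and the reportState() helper are hypothetical, not taken from the actual app:

library(shiny)

ui <- fluidPage(
  # hidden text input that D3.js writes its state into
  tags$input(id = "d3_state", type = "text", style = "display: none;"),
  tags$script(HTML("
    // called from the D3.js event handlers (hypothetical helper)
    function reportState(state) {
      $('#d3_state').val(state).trigger('change');
    }
  ")),
  verbatimTextOutput("state")
)

server <- function(input, output) {
  # the hidden input arrives like any regular reactive input
  output$state <- renderText(input$d3_state)
}

shinyApp(ui, server)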
What I do not like so much is the layout framework. Maybe I just didn’t use it optimally, but it seems to use screen real estate rather lavishly. Given that the perceived coolness of an app is proportional to the number of plots and charts placed on the screen – you know, cockpit-dashboard-style – effective layouting is something to consider. But Shiny also lets you tailor-make the app interface by writing the HTML, CSS and JavaScript yourself. So it’s not really a problem anyway.
Something else to keep in mind is that the reactivity framework makes it easy to program the client-server communication but also makes it very easy to end up with spaghetti-like entanglements of reactive inputs and outputs. I guess for a larger app I would have to put some thought into design patterns to keep the mess at bay.
(original article published on www.joyofdata.de)