Introduction
This post takes a bit of an unexpected diversion. As I was experimenting with hierarchical clustering I ran into the question of how many clusters to assume. From there I went deep down the rabbit hole and found some really useful material that I wish I’d known about when I wrote my previous post.
I’ve discovered that choosing the number of clusters is a whole topic in itself, and there are, in general, two ways of validating a choice of cluster number (a quick sketch of both follows the list below):
Internal validation indices – these use the properties of the data itself to determine the optimal number of clusters. There are literally dozens of these metrics, which essentially replace the problem of ‘how many clusters do I choose?’ with ‘which metric do I use?’ – more on this later!
External validation – this includes using ‘known’ cluster labels (as I had in my last post)…but half the fun is seeing what the data itself says!
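Both flavours can be sketched in a couple of lines. The snippet below is just an illustration rather than part of my analysis: x and truth are placeholder names for a numeric matrix and its known labels, and the average silhouette width from the cluster package stands in for the many internal indices.

library(cluster)   # for silhouette()

km <- kmeans(scale(x), centers = 3, nstart = 25)

# Internal validation: the average silhouette width uses only the data itself
mean(silhouette(km$cluster, dist(scale(x)))[, "sil_width"])

# External validation: cross-tabulate the clusters against the known labels
table(cluster = km$cluster, truth = truth)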
So, I decided to focus more deeply on internal validation and discovered a 2001 paper by Halkidi and Vazirgiannis proposing an index called S_Dbw (Scattering and Density Between). The index seemed to perform well, and another paper by Liu et al went even further by comparing it with many other widely used indices. This comparison tested five aspects of clustering, and S_Dbw was the only index to correctly identify the number of clusters in all cases!
Could I have found my silver bullet?
Eager to try it out for myself, I discovered the NbClust package and read its documentation. While reading, I noticed that the S_Dbw metric suggested that the number of clusters in the iris dataset was 10 (!)
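For the curious, this is roughly what that sanity check looks like (assuming the package defaults of 2 to 15 candidate clusters); I’m only sketching the call here, not its output.

library(NbClust)

NbClust(scale(iris[, -5]), distance = "euclidean",
        method = "kmeans", index = "sdbw")$Best.nc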
However, the disappointment was short-lived, as the NbClust package looks really useful. It implements many validation metrics, and best of all, it can calculate them all in one function call and recommend a number of clusters based on a majority vote, which in my mind seems a bit more robust.
So I’m going to first go back and revisit some of the analysis in my previous post, before moving onto hierarchical clustering. I’ll be using the following packages:
library(tidyverse)
library(NbClust)

theme_set(theme_bw())
The NbClust view
Here is the data I had previously scraped.
swgoh_stats <- read_csv("content/post/data/unsupervised-learning/swgoh_stats.csv")
swgoh_scaled <- scale(swgoh_stats[,-1])

swgoh_roles <- read_csv("content/post/data/unsupervised-learning/swgoh_roles.csv") %>%
  mutate(role = case_when(!is.na(attacker) ~ "Attacker",
                          !is.na(support) ~ "Support",
                          !is.na(tank) ~ "Tank",
                          !is.na(healer) ~ "Healer")) %>%
  select(`Character Name`, role)
Before doing anything else I’m going to see how many clusters the NbClust()
function recommends:
NbClust(swgoh_scaled, distance = "euclidean", method = "kmeans")

## Warning in pf(beale, pp, df2): NaNs produced
## Warning in pf(beale, pp, df2): NaNs produced
## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 15 proposed 3 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 2 proposed 12 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
##          KL      CH Hartigan    CCC     Scott      Marriot    TrCovW
## 2    0.9879 47.8476  46.8668 0.0343  276.0360 3.140149e+31 40710.600
## 3    6.9271 53.4919  11.8479 2.4116  614.6586 1.031701e+31 17024.154
## 4    1.9541 41.8097   7.6923 1.7436  740.6484 8.964808e+30 14615.346
## 5    0.3752 34.4811  11.7263 0.8740  775.2750 1.150580e+31 13070.155
## 6    3.2719 31.6357   5.6969 1.7136 1046.5952 3.546293e+30 11728.891
## 7    0.3635 28.0302   9.6189 1.1360 1062.9247 4.399201e+30 10690.368
## 8    8.0983 26.6091   1.6658 2.0726 1248.5694 2.001079e+30 10132.264
## 9    0.1237 23.5808   9.1109 0.4677 1240.2348 2.655434e+30  9754.703
## 10   2.8166 22.9782   4.4561 1.5187 1420.0331 1.180276e+30  9082.395
## 11   0.2924 21.5505  10.6233 1.0529 1452.0060 1.190893e+30  8159.294
## 12  20.9265 21.6862   1.2317 3.0845 1587.4642 6.564413e+29  7360.076
## 13   0.6067 20.0082   2.2031 1.9743 1601.9958 7.093524e+29  7093.830
## 14   0.2623 18.7724   4.0431 1.2979 1604.9163 8.091423e+29  6960.353
## 15   0.4895 18.0432   6.8441 1.3062 1688.4084 5.779998e+29  6311.849
##      TraceW Friedman  Rubin Cindex     DB Silhouette    Duda Pseudot2
## 2  2196.102  18.4178 1.2750 0.3351 1.8944     0.2096  1.4831 -40.7199
## 3  1730.101  34.7660 1.6184 0.3492 1.4922     0.2595  1.1740 -16.8940
## 4  1619.209  37.5966 1.7292 0.3341 1.8933     0.1847  1.0343  -2.9838
## 5  1549.894  36.0928 1.8066 0.3597 2.0448     0.1360  2.1336 -31.3467
## 6  1450.431  45.7343 1.9305 0.3401 2.0724     0.1403  1.7784 -22.7608
## 7  1403.401  44.6696 1.9952 0.3359 2.1658     0.1220  0.9013   1.3142
## 8  1327.826  50.0555 2.1087 0.3320 1.8608     0.1430 14.6467 -37.2690
## 9  1314.789  47.3518 2.1296 0.2918 2.2077     0.1063  0.8099   1.1737
## 10 1246.770  54.2760 2.2458 0.3291 1.8834     0.1254 10.0472 -20.7108
## 11 1214.176  52.9269 2.3061 0.3202 1.9492     0.1073  1.2512  -5.0186
## 12 1140.732  56.3496 2.4546 0.3040 1.6838     0.1363 11.3341 -32.8237
## 13 1132.229  55.9126 2.4730 0.3437 1.8059     0.1271 14.3967 -29.7773
## 14 1117.129  54.8679 2.5064 0.3150 1.8031     0.1080  1.4219  -5.9344
## 15 1089.928  58.2568 2.5690 0.3177 1.7510     0.1159  1.9705 -15.7602
##      Beale Ratkowsky      Ball Ptbiserial     Frey McClain   Dunn Hubert
## 2  -3.5597    0.2819 1098.0509     0.4127  -0.2789  0.7838 0.1911 0.0009
## 3  -1.6203    0.3240  576.7002     0.6014   1.5881  0.9888 0.2159 0.0014
## 4  -0.3618    0.3010  404.8023     0.5429   3.7032  1.5169 0.1672 0.0014
## 5  -5.6606    0.2878  309.9788     0.4741   0.4185  2.1510 0.1953 0.0014
## 6  -4.6850    0.2706  241.7385     0.4602   1.1971  2.8002 0.1818 0.0016
## 7   1.1389    0.2570  200.4858     0.4403  -0.1664  3.1801 0.1703 0.0016
## 8   0.0000    0.2496  165.9782     0.4735  -4.9422  3.1096 0.2039 0.0017
## 9   2.3578    0.2367  146.0876     0.3781  -0.2168  4.8688 0.1566 0.0017
## 10  0.0000    0.2308  124.6770     0.4050   0.6958  4.5901 0.1915 0.0018
## 11 -2.0794    0.2236  110.3796     0.3739  -0.2538  5.7208 0.1794 0.0019
## 12  0.0000    0.2201   95.0610     0.4121   2.9398  5.1222 0.1915 0.0021
## 13 -5.1407    0.2119   87.0945     0.3860 -10.0280  5.9149 0.1996 0.0019
## 14 -2.9505    0.2054   79.7950     0.3518   0.4371  7.1449 0.1853 0.0020
## 15 -5.1015    0.2001   72.6619     0.3444   0.0494  7.5748 0.1882 0.0020
##    SDindex Dindex   SDbw
## 2   1.2137 3.3660 0.8692
## 3   0.9673 3.0084 0.7205
## 4   1.1434 2.9042 0.6820
## 5   1.2753 2.8395 0.6512
## 6   1.2659 2.7410 0.6485
## 7   1.2423 2.7026 0.6095
## 8   1.3424 2.6465 0.6509
## 9   1.3082 2.6129 0.5780
## 10  1.4906 2.5608 0.6229
## 11  1.2901 2.5154 0.5744
## 12  1.3782 2.4612 0.5859
## 13  1.3520 2.4377 0.5456
## 14  1.1702 2.4169 0.5264
## 15  1.1710 2.3802 0.5092
##
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.8453            22.8792       1.0000
## 3          0.8483            20.3887       1.0000
## 4          0.8385            17.3373       1.0000
## 5          0.7555            19.0928       1.0000
## 6          0.7683            15.6803       1.0000
## 7          0.7004             5.1334       0.3193
## 8          0.1807           181.3138          NaN
## 9          0.6420             2.7882       0.0036
## 10         0.1807           104.2554          NaN
## 11         0.6929            11.0826       1.0000
## 12         0.1807           163.1824          NaN
## 13         0.3238            66.8342       1.0000
## 14         0.6278            11.8551       1.0000
## 15         0.6929            14.1857       1.0000
##
## $Best.nc
##                      KL      CH Hartigan     CCC    Scott      Marriot
## Number_clusters 12.0000  3.0000   3.0000 12.0000   3.0000 3.000000e+00
## Value_Index     20.9265 53.4919  35.0189  3.0845 338.6226 1.973229e+31
##                   TrCovW   TraceW Friedman   Rubin Cindex     DB
## Number_clusters     3.00   3.0000   3.0000  3.0000 9.0000 3.0000
## Value_Index     23686.45 355.1098  16.3482 -0.2326 0.2918 1.4922
##                 Silhouette   Duda PseudoT2   Beale Ratkowsky     Ball
## Number_clusters     3.0000 2.0000   2.0000  2.0000     3.000   3.0000
## Value_Index         0.2595 1.4831 -40.7199 -3.5597     0.324 521.3507
##                 PtBiserial Frey McClain   Dunn Hubert SDindex Dindex
## Number_clusters     3.0000    1  2.0000 3.0000      0  3.0000      0
## Value_Index         0.6014   NA  0.7838 0.2159      0  0.9673      0
##                    SDbw
## Number_clusters 15.0000
## Value_Index      0.5092
##
## $Best.partition
##   [1] 1 2 1 1 2 1 1 1 2 2 2 3 1 2 1 1 1 2 2 2 1 1 2 1 1 3 3 1 1 2 1 2 1 2 1
##  [36] 1 1 3 3 2 1 2 1 3 3 1 2 3 1 3 1 1 3 1 3 1 3 1 1 1 1 2 1 1 3 2 1 2 1 1
##  [71] 2 2 3 2 1 1 3 3 3 1 1 2 2 1 1 2 2 1 1 1 2 1 1 3 2 1 3 1 1 3 2 2 1 2 2
## [106] 1 1 2 2 1 2 1 1 1 3 1 1 1 3 3 1 1 2 1 2 2 1 1 2 3 2 3 1 1 3 3 1 1 1 1
## [141] 1 1 1 2 1 1 2 1 2 1 1 2 1 2 2 2 3 3 1 1 1 1 3 1 1 1 1 1 1 2 1 1 1 1 2
## [176] 1
Interestingly, the number of clusters recommended most often, by far, is 3, which roughly tallies with the results I got when I tried 4 clusters. Let’s see what this looks like. It’s worth noting that broom does not currently work with NbClust objects, so I’ll use the same code as before, using the kmeans() function from base R.
set.seed(125)

kmeans(swgoh_scaled, centers = 3, nstart = 50) %>%
  broom::augment(swgoh_scaled) %>%
  bind_cols(swgoh_stats[,1]) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x = x, y = y, label = `Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

## Joining, by = "Character Name"
We can see that this clusters slightly better, but there does still seem to be some overlap.
Hierarchical Clustering
The two forms of hierarchical clustering essentially describe whether they are going “bottom up” or “top down”. The former is given the name Hierarchical Agglomerative Clustering (HAC) or Agglomerative Nesting (AGNES) and is the most common. The latter is called Divisive Analysis Clustering (DIANA).
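DIANA isn’t covered in the rest of this post, but for reference it is available through the diana() function in the cluster package. This is only a sketch, reusing the scaled matrix from earlier.

library(cluster)

diana_model <- diana(swgoh_scaled, metric = "euclidean")
pltree(diana_model, main = "DIANA dendrogram")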
Here, we’ll focus on HAC, or “T-1000 clustering” as I like to call it.
The algorithm is accessed through the hclust() function in base R. The good news is that we don’t have to (initially) supply a number of clusters, nor a number of repeats: the algorithm is deterministic rather than stochastic, and it builds the full hierarchy, so every possible number of clusters is available afterwards. The bad news, however, is that we must supply two other inputs:
- A distance metric
- A clustering method
These essentially answer the two key questions the algorithm has in performing HAC:
- How would you like to measure distance?
- How would you like to define distance between two clusters? (linkage criterion)
The most intuitive distance metric is the straight-line distance between two points (“euclidean”), and it is probably the most commonly used, but other possibilities include “manhattan” (summing the absolute differences of the components) and “maximum” (taking the largest of the component differences).
There are also many ways of calculating the distance between two clusters (the linkage), including: taking the distance between the two closest points (“single”), taking the distance between the two furthest points (“complete”), taking the average or median distance between all points in the two clusters (“average”/“median”), or minimising the increase in within-cluster variance at each merge (Ward’s method). If you wanted to cluster by correlation rather than distance, you could pass hclust() the value of 1 - abs(cor(x)), wrapped in as.dist() so it is treated as a dissimilarity matrix.
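To make those options concrete, here is a small sketch of my own (not part of the analysis below): dist() covers the standard metrics, and a correlation-based dissimilarity needs wrapping in as.dist() before it goes into hclust(). Note that cor() works column-wise, so the matrix is transposed to compare characters rather than stats.

d_euclidean   <- dist(swgoh_scaled, method = "euclidean")
d_manhattan   <- dist(swgoh_scaled, method = "manhattan")
d_maximum     <- dist(swgoh_scaled, method = "maximum")

# Correlation-based dissimilarity between characters
d_correlation <- as.dist(1 - abs(cor(t(swgoh_scaled))))

hclust(d_correlation, method = "average")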
The first thing I’ll do here is untidy my data by putting the character names as row names, as it will prove useful when I visualise later.
swgoh_stats_rn <- read_csv("content/post/data/unsupervised-learning/swgoh_stats.csv") %>%
  column_to_rownames("Character Name")

swgoh_scaled <- scale(swgoh_stats_rn)
The hclust()
function takes a distance matrix as input, so a simple model call would look something like this:
hclust_model <- swgoh_scaled %>%
  dist("euclidean") %>%
  hclust("complete")

hclust_model

##
## Call:
## hclust(d = ., method = "complete")
##
## Cluster method   : complete
## Distance         : euclidean
## Number of objects: 176
Unlike kmeans()
, this model doesn’t really give anything useful when it’s printed out. Also, the broom
package doesn’t seem to work on hclust
objects, but this isn’t such a big deal as half the point of hierarchical clustering is being able to see the clustering evolution visually. On that note…
Dendrograms
Hierarchical clustering is usually visualised with a dendrogram. Note that the character names are shown because they were stored in the row names; without this, the plot would just show row numbers. The branches tend to get a bit shorter when the height is below 8, suggesting the best number of clusters is less than 8.
plot(hclust_model)
We can align the labels a bit more nicely by setting the hang
parameter to -1:
plot(hclust_model, hang = -1)
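One extra trick that isn’t used in the rest of this post: rect.hclust() can outline a candidate cut directly on the dendrogram, here for an assumed cut at three clusters.

plot(hclust_model, hang = -1)
rect.hclust(hclust_model, k = 3, border = "red")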
I wanted to find out which combinations of distance metric and clustering method provided the best results. I tried all combinations and found that the Ward methods seemed to provide the clearest clustering. However, during the course of my research I found a goldmine of information on Stack Overflow from a user called ttnphns, who seems very knowledgeable on the subject of clustering and offers some great guidance:
- Do not choose your method by visually comparing dendrograms;
- Do not decide on the number of clusters by looking at dendrograms created by the Ward method;
- Decide on the distance metric consciously, according to what makes sense for your problem, rather than blindly trying different ones;
- Ensure you use distance metrics required by certain methods, e.g. Ward/centroid require euclidean distance;
- Hierarchical clustering is generally not recommended for problems with thousands of observations.
All that said, from what I’ve read, the single and centroid methods are used much less often than complete/average/Ward. For my analysis, I ran the NbClust() function with the Ward, complete, and average linkage methods (a sketch of those calls follows the list below). The recommended numbers of clusters were:
- Ward – 3
- Complete – 4
- Average – 2 or 6
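These recommendations come from calls along the following lines (my reconstruction, keeping the default range of 2 to 15 clusters); only the method argument changes between runs.

NbClust(swgoh_scaled, distance = "euclidean", method = "ward.D2")$Best.nc
NbClust(swgoh_scaled, distance = "euclidean", method = "complete")$Best.nc
NbClust(swgoh_scaled, distance = "euclidean", method = "average")$Best.nc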
I’m interested to see the dendrogram for each of these methods, as well as the wordcloud, which I’ll generate by using the cutree()
function to cut the tree at the recommended number of clusters to get the cluster assignments for each character.
First up, Ward’s method:
hclust_model <- swgoh_scaled %>%
  dist("euclidean") %>%
  hclust("ward.D2")

plot(hclust_model, hang = -1)
swgoh_stats %>%
  bind_cols(.cluster = cutree(hclust_model, k = 3)) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x = x, y = y, label = `Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

## Joining, by = "Character Name"
With Ward’s method we get similar results to k-means.
hclust_model <- swgoh_scaled %>%
  dist("euclidean") %>%
  hclust("complete")

plot(hclust_model, hang = -1)
swgoh_stats %>%
  bind_cols(.cluster = cutree(hclust_model, k = 4)) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x = x, y = y, label = `Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

## Joining, by = "Character Name"
This wordcloud is very similar to the one using Ward’s method; however, two characters have been separated into their own cluster – these characters were released very recently.
hclust_model <- swgoh_scaled %>%
  dist("euclidean") %>%
  hclust("average")

plot(hclust_model, hang = -1)
swgoh_stats %>%
  bind_cols(.cluster = cutree(hclust_model, k = 2)) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x = x, y = y, label = `Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

## Joining, by = "Character Name"
Using average linkage with two clusters is not very helpful at all.
hclust_model <- swgoh_scaled %>%
  dist("euclidean") %>%
  hclust("average")

plot(hclust_model, hang = -1)
swgoh_stats %>%
  bind_cols(.cluster = cutree(hclust_model, k = 6)) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x = x, y = y, label = `Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

## Joining, by = "Character Name"
Despite this clustering looking quite awful and imbalanced, the characters in the sparse clusters almost all have something in common – they are among the most recently released or reworked characters, which definitely supports the idea that the developers are trying out differently balanced characters.
In terms of clustering performance, I think Ward’s method works best in this case; however, the average method has brought out distinctions among the newer characters.
In my next post I’ll be experimenting with some different visualisation packages for hierarchical clustering output.