Introduction
This post takes a bit of an unexpected diversion. As I was experimenting with hierarchical clustering I ran into the question of how many clusters to assume. From there I went deep down the rabbit hole and found some really useful material that I wish I’d known about when I wrote my previous post.
I’ve discovered that choosing the number of clusters is a whole topic in itself, and there are, in general, two ways of validating a choice of cluster number (a quick sketch of both follows the list below):
Internal validation indices – these use the properties of the data itself to determine the optimal number of clusters. There are literally dozens of these metrics, which essentially replace the problem of ‘how many clusters do I choose?’ with ‘which metric do I use?’ – more on this later!
External validation – this includes using ‘known’ cluster labels (as I had in my last post)…but half the fun is seeing what the data itself says!
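Both flavours can be sketched in a couple of lines. The snippet below is just an illustration rather than part of my analysis: x and truth are placeholder names for a numeric matrix and its known labels, and the average silhouette width from the cluster package stands in for the many internal indices.

library(cluster)   # for silhouette()

km <- kmeans(scale(x), centers = 3, nstart = 25)

# Internal validation: the average silhouette width uses only the data itself
mean(silhouette(km$cluster, dist(scale(x)))[, "sil_width"])

# External validation: cross-tabulate the clusters against the known labels
table(cluster = km$cluster, truth = truth)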
So, I decided to focus more deeply on internal validation and discovered a 2001 paper by Halkidi and Vazirgiannis proposing an index called S_Dbw (Scattering and Density Between). The index seemed to perform well, and another paper by Liu et al went even further by comparing it with many other widely used indices. This comparison tested five aspects of clustering, and S_Dbw was the only index to correctly identify the number of clusters in all cases!
Could I have found my silver bullet?
Eager to try it out for myself, I discovered the NbClust package and read its documentation. While reading, I noticed that the S_Dbw metric suggested that the number of clusters in the iris dataset was 10 (!)
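For the curious, this is roughly what that sanity check looks like (assuming the package defaults of 2 to 15 candidate clusters); I’m only sketching the call here, not its output.

library(NbClust)

NbClust(scale(iris[, -5]), distance = "euclidean",
        method = "kmeans", index = "sdbw")$Best.nc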
However, the disappointment was short-lived, as the NbClust package looks really useful. It implements many validation metrics, and best of all, it can calculate them all in one function call and recommend a number of clusters based on a majority vote, which in my mind seems a bit more robust.
So I’m going to first go back and revisit some of the analysis in my previous post, before moving onto hierarchical clustering. I’ll be using the following packages:
library(tidyverse)
library(NbClust)

theme_set(theme_bw())
The NbClust view
Here is the data I had previously scraped.
swgoh_stats <- read_csv("content/post/data/unsupervised-learning/swgoh_stats.csv")
swgoh_scaled <- scale(swgoh_stats[,-1])

swgoh_roles <- read_csv("content/post/data/unsupervised-learning/swgoh_roles.csv") %>%
  mutate(role = case_when(!is.na(attacker) ~ "Attacker",
                          !is.na(support) ~ "Support",
                          !is.na(tank) ~ "Tank",
                          !is.na(healer) ~ "Healer")) %>%
  select(`Character Name`, role)
Before doing anything else I’m going to see how many clusters the NbClust()
function recommends:
NbClust(swgoh_scaled, distance = "euclidean", method = "kmeans")

## Warning in pf(beale, pp, df2): NaNs produced
## Warning in pf(beale, pp, df2): NaNs produced
## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 15 proposed 3 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 2 proposed 12 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
##          KL      CH Hartigan    CCC     Scott      Marriot    TrCovW
## 2    0.9879 47.8476  46.8668 0.0343  276.0360 3.140149e+31 40710.600
## 3    6.9271 53.4919  11.8479 2.4116  614.6586 1.031701e+31 17024.154
## 4    1.9541 41.8097   7.6923 1.7436  740.6484 8.964808e+30 14615.346
## 5    0.3752 34.4811  11.7263 0.8740  775.2750 1.150580e+31 13070.155
## 6    3.2719 31.6357   5.6969 1.7136 1046.5952 3.546293e+30 11728.891
## 7    0.3635 28.0302   9.6189 1.1360 1062.9247 4.399201e+30 10690.368
## 8    8.0983 26.6091   1.6658 2.0726 1248.5694 2.001079e+30 10132.264
## 9    0.1237 23.5808   9.1109 0.4677 1240.2348 2.655434e+30  9754.703
## 10   2.8166 22.9782   4.4561 1.5187 1420.0331 1.180276e+30  9082.395
## 11   0.2924 21.5505  10.6233 1.0529 1452.0060 1.190893e+30  8159.294
## 12  20.9265 21.6862   1.2317 3.0845 1587.4642 6.564413e+29  7360.076
## 13   0.6067 20.0082   2.2031 1.9743 1601.9958 7.093524e+29  7093.830
## 14   0.2623 18.7724   4.0431 1.2979 1604.9163 8.091423e+29  6960.353
## 15   0.4895 18.0432   6.8441 1.3062 1688.4084 5.779998e+29  6311.849
##      TraceW Friedman  Rubin Cindex     DB Silhouette    Duda Pseudot2
## 2  2196.102  18.4178 1.2750 0.3351 1.8944     0.2096  1.4831 -40.7199
## 3  1730.101  34.7660 1.6184 0.3492 1.4922     0.2595  1.1740 -16.8940
## 4  1619.209  37.5966 1.7292 0.3341 1.8933     0.1847  1.0343  -2.9838
## 5  1549.894  36.0928 1.8066 0.3597 2.0448     0.1360  2.1336 -31.3467
## 6  1450.431  45.7343 1.9305 0.3401 2.0724     0.1403  1.7784 -22.7608
## 7  1403.401  44.6696 1.9952 0.3359 2.1658     0.1220  0.9013   1.3142
## 8  1327.826  50.0555 2.1087 0.3320 1.8608     0.1430 14.6467 -37.2690
## 9  1314.789  47.3518 2.1296 0.2918 2.2077     0.1063  0.8099   1.1737
## 10 1246.770  54.2760 2.2458 0.3291 1.8834     0.1254 10.0472 -20.7108
## 11 1214.176  52.9269 2.3061 0.3202 1.9492     0.1073  1.2512  -5.0186
## 12 1140.732  56.3496 2.4546 0.3040 1.6838     0.1363 11.3341 -32.8237
## 13 1132.229  55.9126 2.4730 0.3437 1.8059     0.1271 14.3967 -29.7773
## 14 1117.129  54.8679 2.5064 0.3150 1.8031     0.1080  1.4219  -5.9344
## 15 1089.928  58.2568 2.5690 0.3177 1.7510     0.1159  1.9705 -15.7602
##      Beale Ratkowsky      Ball Ptbiserial     Frey McClain   Dunn Hubert
## 2  -3.5597    0.2819 1098.0509     0.4127  -0.2789  0.7838 0.1911 0.0009
## 3  -1.6203    0.3240  576.7002     0.6014   1.5881  0.9888 0.2159 0.0014
## 4  -0.3618    0.3010  404.8023     0.5429   3.7032  1.5169 0.1672 0.0014
## 5  -5.6606    0.2878  309.9788     0.4741   0.4185  2.1510 0.1953 0.0014
## 6  -4.6850    0.2706  241.7385     0.4602   1.1971  2.8002 0.1818 0.0016
## 7   1.1389    0.2570  200.4858     0.4403  -0.1664  3.1801 0.1703 0.0016
## 8   0.0000    0.2496  165.9782     0.4735  -4.9422  3.1096 0.2039 0.0017
## 9   2.3578    0.2367  146.0876     0.3781  -0.2168  4.8688 0.1566 0.0017
## 10  0.0000    0.2308  124.6770     0.4050   0.6958  4.5901 0.1915 0.0018
## 11 -2.0794    0.2236  110.3796     0.3739  -0.2538  5.7208 0.1794 0.0019
## 12  0.0000    0.2201   95.0610     0.4121   2.9398  5.1222 0.1915 0.0021
## 13 -5.1407    0.2119   87.0945     0.3860 -10.0280  5.9149 0.1996 0.0019
## 14 -2.9505    0.2054   79.7950     0.3518   0.4371  7.1449 0.1853 0.0020
## 15 -5.1015    0.2001   72.6619     0.3444   0.0494  7.5748 0.1882 0.0020
##    SDindex Dindex   SDbw
## 2   1.2137 3.3660 0.8692
## 3   0.9673 3.0084 0.7205
## 4   1.1434 2.9042 0.6820
## 5   1.2753 2.8395 0.6512
## 6   1.2659 2.7410 0.6485
## 7   1.2423 2.7026 0.6095
## 8   1.3424 2.6465 0.6509
## 9   1.3082 2.6129 0.5780
## 10  1.4906 2.5608 0.6229
## 11  1.2901 2.5154 0.5744
## 12  1.3782 2.4612 0.5859
## 13  1.3520 2.4377 0.5456
## 14  1.1702 2.4169 0.5264
## 15  1.1710 2.3802 0.5092
##
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.8453            22.8792       1.0000
## 3          0.8483            20.3887       1.0000
## 4          0.8385            17.3373       1.0000
## 5          0.7555            19.0928       1.0000
## 6          0.7683            15.6803       1.0000
## 7          0.7004             5.1334       0.3193
## 8          0.1807           181.3138          NaN
## 9          0.6420             2.7882       0.0036
## 10         0.1807           104.2554          NaN
## 11         0.6929            11.0826       1.0000
## 12         0.1807           163.1824          NaN
## 13         0.3238            66.8342       1.0000
## 14         0.6278            11.8551       1.0000
## 15         0.6929            14.1857       1.0000
##
## $Best.nc
##                      KL      CH Hartigan     CCC    Scott      Marriot
## Number_clusters 12.0000  3.0000   3.0000 12.0000   3.0000 3.000000e+00
## Value_Index     20.9265 53.4919  35.0189  3.0845 338.6226 1.973229e+31
##                   TrCovW   TraceW Friedman   Rubin Cindex     DB
## Number_clusters     3.00   3.0000   3.0000  3.0000 9.0000 3.0000
## Value_Index     23686.45 355.1098  16.3482 -0.2326 0.2918 1.4922
##                 Silhouette   Duda PseudoT2   Beale Ratkowsky     Ball
## Number_clusters     3.0000 2.0000   2.0000  2.0000     3.000   3.0000
## Value_Index         0.2595 1.4831 -40.7199 -3.5597     0.324 521.3507
##                 PtBiserial Frey McClain   Dunn Hubert SDindex Dindex
## Number_clusters     3.0000    1  2.0000 3.0000      0  3.0000      0
## Value_Index         0.6014   NA  0.7838 0.2159      0  0.9673      0
##                    SDbw
## Number_clusters 15.0000
## Value_Index      0.5092
##
## $Best.partition
##   [1] 1 2 1 1 2 1 1 1 2 2 2 3 1 2 1 1 1 2 2 2 1 1 2 1 1 3 3 1 1 2 1 2 1 2 1
##  [36] 1 1 3 3 2 1 2 1 3 3 1 2 3 1 3 1 1 3 1 3 1 3 1 1 1 1 2 1 1 3 2 1 2 1 1
##  [71] 2 2 3 2 1 1 3 3 3 1 1 2 2 1 1 2 2 1 1 1 2 1 1 3 2 1 3 1 1 3 2 2 1 2 2
## [106] 1 1 2 2 1 2 1 1 1 3 1 1 1 3 3 1 1 2 1 2 2 1 1 2 3 2 3 1 1 3 3 1 1 1 1
## [141] 1 1 1 2 1 1 2 1 2 1 1 2 1 2 2 2 3 3 1 1 1 1 3 1 1 1 1 1 1 2 1 1 1 1 2
## [176] 1
Interestingly, the number of clusters recommended most often, by far, is 3, which roughly tallies with the results I got when I tried 4 clusters. Let’s see what this looks like. It’s worth noting that broom does not currently work with NbClust objects, so I’ll use the same code as before, using the kmeans() function from base R.
set.seed(125)

kmeans(swgoh_scaled, centers = 3, nstart = 50) %>%
  broom::augment(swgoh_scaled) %>%
  bind_cols(swgoh_stats[,1]) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x = x, y = y, label = `Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

## Joining, by = "Character Name"
We can see that this clusters slightly better, but there does still seem to be some overlap.
Hierarchical Clustering
The two forms of hierarchical clustering essentially describe whether they are going “bottom up” or “top down”. The former is given the name Hierarchical Agglomerative Clustering (HAC) or Agglomerative Nesting (AGNES) and is the most common. The latter is called Divisive Analysis Clustering (DIANA).
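DIANA isn’t covered in the rest of this post, but for reference it is available through the diana() function in the cluster package. This is only a sketch, reusing the scaled matrix from earlier.

library(cluster)

diana_model <- diana(swgoh_scaled, metric = "euclidean")
pltree(diana_model, main = "DIANA dendrogram")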
Here, we’ll focus on HAC, or “T-1000 clustering” as I like to call it.
The algorithm is accessed through the hclust() function in base R. The good news is that we don’t have to (initially) supply a number of clusters, nor a number of repeats: the algorithm is deterministic rather than stochastic, and it builds the full hierarchy, so every possible number of clusters is available afterwards. The bad news, however, is that we must supply two other inputs:
- A distance metric
- A clustering method
These essentially answer the two key questions the algorithm has in performing HAC:
- How would you like to measure distance?
- How would you like to define distance between two clusters? (linkage criterion)
The most intuitive distance metric is the straight-line distance between two points (“euclidean”), and it is probably the most commonly used, but other possibilities include “manhattan” (summing the absolute differences of the components) and “maximum” (taking the largest of the component differences).
There are also many ways of calculating the distance between two clusters (the linkage), including: taking the distance between the two closest points (“single”), taking the distance between the two furthest points (“complete”), taking the average or median distance between all points in the two clusters (“average”/“median”), or minimising the increase in within-cluster variance at each merge (Ward’s method). If you wanted to cluster by correlation rather than distance, you could pass hclust() the value of 1 - abs(cor(x)), wrapped in as.dist() so it is treated as a dissimilarity matrix.
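To make those options concrete, here is a small sketch of my own (not part of the analysis below): dist() covers the standard metrics, and a correlation-based dissimilarity needs wrapping in as.dist() before it goes into hclust(). Note that cor() works column-wise, so the matrix is transposed to compare characters rather than stats.

d_euclidean   <- dist(swgoh_scaled, method = "euclidean")
d_manhattan   <- dist(swgoh_scaled, method = "manhattan")
d_maximum     <- dist(swgoh_scaled, method = "maximum")

# Correlation-based dissimilarity between characters
d_correlation <- as.dist(1 - abs(cor(t(swgoh_scaled))))

hclust(d_correlation, method = "average")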
The first thing I’ll do here is untidy my data by putting the character names as row names, as it will prove useful when I visualise later.
swgoh_stats_rn <- read_csv("content/post/data/unsupervised-learning/swgoh_stats.csv") %>%
  column_to_rownames("Character Name")

swgoh_scaled <- scale(swgoh_stats_rn)
The hclust()
function takes a distance matrix as input, so a simple model call would look something like this:
hclust_model <- swgoh_scaled %>%
  dist("euclidean") %>%
  hclust("complete")

hclust_model

##
## Call:
## hclust(d = ., method = "complete")
##
## Cluster method   : complete
## Distance         : euclidean
## Number of objects: 176
Unlike kmeans()
, this model doesn’t really give anything useful when it’s printed out. Also, the broom
package doesn’t seem to work on hclust
objects, but this isn’t such a big deal as half the point of hierarchical clustering is being able to see the clustering evolution visually. On that note…
Dendrograms
Hierarchical clustering is usually visualised with a dendrogram. Note that the character names are shown because they were stored in the row names; without this, the plot would just show row numbers. The branches tend to get a bit shorter when the height is below 8, suggesting the best number of clusters is less than 8.
plot(hclust_model)
We can align the labels a bit more nicely by setting the hang
parameter to -1:
plot(hclust_model, hang = -1)
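One extra trick that isn’t used in the rest of this post: rect.hclust() can outline a candidate cut directly on the dendrogram, here for an assumed cut at three clusters.

plot(hclust_model, hang = -1)
rect.hclust(hclust_model, k = 3, border = "red")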
I wanted to find out which combinations of distance metric and clustering method provided the best results. I tried all combinations and found that the Ward methods seemed to provide the clearest clustering. However, during the course of my research I found a goldmine of information on Stack Overflow from a user called ttnphns, who seems very knowledgeable on the subject of clustering and offers some great guidance:
- Do not choose your method by visually comparing dendrograms;
- Do not decide on the number of clusters by looking at dendrograms created by the Ward method;
- Decide on the distance metric consciously, according to what makes sense for your problem, rather than blindly trying different ones;
- Ensure you use distance metrics required by certain methods, e.g. Ward/centroid require euclidean distance;
- Hierarchical clustering is generally not recommended for problems with thousands of observations.
All that said, from what I’ve read, the single and centroid methods are used much less often than complete/average/Ward. For my analysis, I ran the NbClust() function with the Ward, complete, and average linkage methods (a sketch of those calls follows the list below). The recommended numbers of clusters were:
- Ward – 3
- Complete – 4
- Average – 2 or 6
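These recommendations come from calls along the following lines (my reconstruction, keeping the default range of 2 to 15 clusters); only the method argument changes between runs.

NbClust(swgoh_scaled, distance = "euclidean", method = "ward.D2")$Best.nc
NbClust(swgoh_scaled, distance = "euclidean", method = "complete")$Best.nc
NbClust(swgoh_scaled, distance = "euclidean", method = "average")$Best.nc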
I’m interested to see the dendrogram for each of these methods, as well as the wordcloud, which I’ll generate by using the cutree()
function to cut the tree at the recommended number of clusters to get the cluster assignments for each character.
First up, Ward’s method:
hclust_model <- swgoh_scaled %>%
  dist("euclidean") %>%
  hclust("ward.D2")

plot(hclust_model, hang = -1)
swgoh_stats %>%
  bind_cols(.cluster = cutree(hclust_model, k = 3)) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x = x, y = y, label = `Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

## Joining, by = "Character Name"
With Ward’s method we get similar results to k-means.
hclust_model <- swgoh_scaled %>%
  dist("euclidean") %>%
  hclust("complete")

plot(hclust_model, hang = -1)
swgoh_stats %>%
  bind_cols(.cluster = cutree(hclust_model, k = 4)) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x = x, y = y, label = `Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

## Joining, by = "Character Name"
This wordcloud is very similar to the one using Ward’s method; however, two characters have been separated into their own cluster – these characters were released very recently.
hclust_model <- swgoh_scaled %>%
  dist("euclidean") %>%
  hclust("average")

plot(hclust_model, hang = -1)
swgoh_stats %>%
  bind_cols(.cluster = cutree(hclust_model, k = 2)) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x = x, y = y, label = `Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

## Joining, by = "Character Name"
Using average linkage with two clusters is not very helpful at all.
hclust_model <- swgoh_scaled %>%
  dist("euclidean") %>%
  hclust("average")

plot(hclust_model, hang = -1)
swgoh_stats %>%
  bind_cols(.cluster = cutree(hclust_model, k = 6)) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x = x, y = y, label = `Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

## Joining, by = "Character Name"
Despite this clustering looking quite awful and imbalanced, the characters in the sparse clusters almost all have something in common – they are among the most recently released or reworked characters, which definitely supports the idea that the developers are trying out differently balanced characters.
In terms of clustering performance, I think Ward’s method works best in this case; however, the average method has brought out distinctions among the newer characters.
In my next post I’ll be experimenting with some different visualisation packages for hierarchical clustering output.