The number of clusters in Hierarchical Clustering

Posted on January 22, 2014 by chenangen in R bloggers

[This article was first published on Chen-ang Statistics » R, and kindly contributed to R-bloggers].

Cluster analysis is widely applied in data analysis, and hierarchical clustering is one of the simplest and most important clustering methods. In brief, hierarchical clustering methods use the elements of a proximity matrix to generate a tree diagram, or dendrogram. From the tree diagram we can draw our own conclusions about the clustering. However, once a cluster analysis solution is given, the question is how to determine the number of clusters k. For a given value of k, we want to know whether the clusters are sufficiently separated, that is, whether they show minimal overlap. We can choose an appropriate threshold value or inspect a scatter diagram to decide. In addition, several test statistics and pseudo test statistics are useful for choosing k; they are described below. I also provide a corresponding R function that computes them.

1 $R_k^2$ statistic

The $R_k^2$ for k clusters is defined as

$R_k^2=\frac{B_k}{T}=1-\frac{P_k}{T}$

Here $T$ and $P_k$ denote the total sum of squares and the within-cluster sum of squares for k clusters, respectively, and $B_k=T-P_k$ is the between-cluster sum of squares. For n clusters (each observation in its own cluster), $P_n=0$, so $R_n^2=1$. As the number of clusters decreases from n to 1, the remaining clusters should become more widely separated. A large decrease in $R_k^2$ indicates that two distinct clusters have been joined. We can also use the semipartial $R^2$ statistic for the same purpose.
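As a rough illustration of the definition, the following sketch computes $R_k^2$ for k = 1, ..., 10 on the built-in USArrests data. It is not the author's function: the helper names `ss` and `within_ss`, the standardisation step, and the choice of average linkage are all assumptions made for this example.

```r
# Sketch: R^2_k for a hierarchical clustering of USArrests.
# Helper names and the linkage method are illustrative assumptions.
x  <- scale(USArrests)                     # standardise the four variables
hc <- hclust(dist(x), method = "average")  # any linkage could be used here

# Sum of squared deviations of the rows of m from their column means
ss <- function(m) sum(scale(m, center = TRUE, scale = FALSE)^2)

# P_k: within-cluster sum of squares after cutting the tree into k clusters
within_ss <- function(k) {
  g <- cutree(hc, k = k)
  sum(sapply(split(seq_len(nrow(x)), g),
             function(idx) ss(x[idx, , drop = FALSE])))
}

total_ss <- ss(x)                          # T
k  <- 1:10
r2 <- 1 - sapply(k, within_ss) / total_ss  # R^2_k = 1 - P_k / T
print(data.frame(k = k, R2 = round(r2, 3)))
```

Reading the table from large k to small k, a sharp drop in the R2 column marks a join of two well-separated clusters.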

2 semipartial $R_k^2$ statistic

The semipartial $R_k^2$ for k clusters is defined as

$SR_k^2=\frac{B_{KL}^2}{T}=R_{k+1}^2-R_k^2$

Here clusters $G_K$ and $G_L$ are joined to form $G_M$ when moving from k+1 to k clusters, $B_{KL}^2$ is equal to $W_M-(W_K+W_L)$, and $W_t$ denotes the sum of squares within cluster $G_t$.
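The semipartial statistic can be sketched directly from the within-cluster sums of squares, since $R_{k+1}^2-R_k^2=(P_k-P_{k+1})/T$. As before, this is only an illustration under assumed choices (USArrests, standardised variables, average linkage, and hypothetical helper names), not the author's code.

```r
# Sketch: semipartial R^2 for k = 1, ..., 10 on USArrests.
# Helper names and the linkage method are illustrative assumptions.
x  <- scale(USArrests)
hc <- hclust(dist(x), method = "average")

# Sum of squared deviations of the rows of m from their column means
ss <- function(m) sum(scale(m, center = TRUE, scale = FALSE)^2)

# P_k: within-cluster sum of squares for the k-cluster cut of the tree
within_ss <- function(k) {
  g <- cutree(hc, k = k)
  sum(sapply(split(seq_len(nrow(x)), g),
             function(idx) ss(x[idx, , drop = FALSE])))
}

total_ss <- ss(x)                         # T
p   <- sapply(1:11, within_ss)            # P_1, ..., P_11
sr2 <- (p[1:10] - p[2:11]) / total_ss     # SR^2_k = (P_k - P_{k+1}) / T
print(data.frame(k = 1:10, SPRSQ = round(sr2, 4)))
```

Each value is the loss in $R^2$ caused by the join that reduces k+1 clusters to k, so a comparatively large entry flags a join of distinct clusters.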

3 pseudo $F_k$ statistic

The pseudo $F_k$ statistic for k clusters is defined as

pseudo $F_k=\frac{(T-P_k)/(k-1)}{P_k/(n-k)}=\frac{B_k(n-k)}{P_k(k-1)}$

If pseudo $F_k$ reaches a maximum as k varies, the value of k at the maximum, or the value immediately prior to that point, may be a candidate for the number of clusters.
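A minimal sketch of the pseudo $F_k$ computation, using the ratio $\frac{(T-P_k)/(k-1)}{P_k/(n-k)}$, might look as follows. The data set, the standardisation, the average linkage, and the helper names are assumptions for illustration only; k = 1 is skipped because the statistic divides by k - 1.

```r
# Sketch: pseudo F_k for k = 2, ..., 10 on USArrests.
# Helper names and the linkage method are illustrative assumptions.
x  <- scale(USArrests)
hc <- hclust(dist(x), method = "average")
n  <- nrow(x)

# Sum of squared deviations of the rows of m from their column means
ss <- function(m) sum(scale(m, center = TRUE, scale = FALSE)^2)

# P_k: within-cluster sum of squares for the k-cluster cut of the tree
within_ss <- function(k) {
  g <- cutree(hc, k = k)
  sum(sapply(split(seq_len(n), g),
             function(idx) ss(x[idx, , drop = FALSE])))
}

total_ss <- ss(x)                         # T
k   <- 2:10                               # undefined at k = 1
p_k <- sapply(k, within_ss)
pseudo_f <- ((total_ss - p_k) / (k - 1)) / (p_k / (n - k))
print(data.frame(k = k, pseudoF = round(pseudo_f, 2)))

# Value of k with the largest pseudo F in the range examined
print(k[which.max(pseudo_f)])
```

The maximum found this way is only a candidate; it depends on the range of k examined and on the linkage method.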

4 pseudo $t^2$ statistic

The pseudo $t^2$ is defined as

pseudo $t^2=\frac{B_{KL}^2}{(W_L+W_K)/(n_K+n_L-2)}$

for joining cluster $G_L$ with $G_K$ each having $n_L$ and $n_K$ elements.
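The pseudo $t^2$ needs the two clusters that were joined at each step. One way to find them, sketched below, is to compare the cuts of the tree at k+1 and k clusters. The function name `merge_stats`, the data set, the standardisation, and the average linkage are assumptions for this example; when two single observations are joined the statistic is 0/0, so the sketch returns NA in that case.

```r
# Sketch: pseudo t^2 for the join that reduces k+1 clusters to k.
# Helper names and the linkage method are illustrative assumptions.
x  <- scale(USArrests)
hc <- hclust(dist(x), method = "average")

# Sum of squared deviations of the rows of m from their column means
ss <- function(m) sum(scale(m, center = TRUE, scale = FALSE)^2)

merge_stats <- function(k) {
  fine   <- cutree(hc, k = k + 1)
  coarse <- cutree(hc, k = k)
  tab <- table(coarse, fine)
  # G_M is the only coarse cluster that contains two of the finer clusters
  m     <- which(rowSums(tab > 0) == 2)
  parts <- as.integer(colnames(tab)[tab[m, ] > 0])  # labels of G_K and G_L
  idx_k <- which(fine == parts[1])
  idx_l <- which(fine == parts[2])
  w_k <- ss(x[idx_k, , drop = FALSE])
  w_l <- ss(x[idx_l, , drop = FALSE])
  w_m <- ss(x[c(idx_k, idx_l), , drop = FALSE])
  b_kl <- w_m - (w_k + w_l)                         # B_KL
  df   <- length(idx_k) + length(idx_l) - 2
  t2   <- if (df == 0) NA_real_ else b_kl / ((w_k + w_l) / df)
  c(B_KL = b_kl, pseudo_t2 = t2)
}

res <- t(sapply(1:10, merge_stats))
print(data.frame(k = 1:10, round(res, 3)))
```

A large pseudo $t^2$ at k means the two clusters joined to reach k clusters were well separated, which suggests looking at k+1 clusters.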

Implementation

In SAS, these statistics are easy to obtain through PROC CLUSTER and PROC TREE. In R, however, they are less convenient to calculate. Last semester I was a teaching assistant for a course on multivariate statistical analysis, and the professor assigned the students to write R functions that calculate one of these test statistics. In order to check their code, I also wrote an R function that calculates all of these test statistics at the same time. The output of this function is similar to the SAS output. If you want to view the source code, please click this link.

Further discussion

Besides writing your own function, a package called NbClust offers a simpler and better way to determine the number of clusters. It provides 30 popular indices and also proposes a recommended number of clusters to the user. More details can be found in the reference manual of the package.

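# Note: the NbClust release available in 2014 accepted diss="NULL" and method="ward";
# newer releases are assumed here to expect diss=NULL and method="ward.D" or "ward.D2".
library(NbClust)
data(USArrests)
NbClust(USArrests, diss = NULL, distance = "euclidean",
        min.nc = 2, max.nc = 8, method = "ward.D", index = "pseudot2",
        alphaBeale = 0.1)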

Please note that the output is a little different from the SAS output.

Reference

Timm, Neil H. Applied multivariate analysis. Springer, 2002.
