Overview of clustering methods in R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Clustering is a very popular technique in data science because of its unsupervised characteristic – we don’t need true labels of groups in data. In this blog post, I will give you a “quick” survey of various clustering methods applied to synthetic but also real datasets.
What is clustering?
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
It is a technique of unsupervised learning, so clustering is used when no a priori information about data is available. This makes clustering a very strong technique for gaining insights into data and making more accurate decisions.
What is it good for?
Clustering is used for:
- To gain insight into data, generate hypotheses, detect anomalies, and identify salient features,
- To identify the degree of similarity among objects (i.e. organisms),
- As a method for organizing the data and summarising it through cluster prototypes (compression).
Classification to groups
The first use case is to group data, e.g. classify them into groups. For explanation purposes, I will generate synthetic data from three normal distributions plus three outliers (anomalies). Let’s load needed packages, generate randomly some data, and show the first use case in the visualization:
We can see three nicely divided groups of data.
Anomaly detection
Clustering can be also used as an anomaly detection technique, some methods of clustering can detect automatically outliers (anomalies). Let’s show visually what it looks like.
Data compression
In an era of a large amount of data (also many times used buzzword - big data), we have problems processing them in real time. Here clustering can help to reduce dimensionality by its compression feature. Created clusters, that incorporate multiple points (data), can be replaced by their representatives (prototypes) - so one point. In this way, points were replaced by its cluster representative (“+”):
Types of clustering methods
Since cluster analysis has been here for more than 50 years, there are a large amount of available methods. The basic classification of clustering methods is based on the objective to which they aim: hierarchical, non-hierarchical.
The hierarchical clustering is a multi-level partition of a dataset that is a branch of classification (clustering). Hierarchical clustering has two types of access to data. The first one, divisive clustering, starts with one big cluster that is then divided into smaller clusters. The second one, agglomerative clustering, starts with individual objects that are single-element clusters, and then they are gradually merged. The whole process of hierarchical clustering can be expressed (visualized) as a dendrogram.
The non-hierarchical clustering is dividing a dataset into a system of disjunctive subsets (clusters) so that an intersection of clusters would be an empty set.
Clustering methods can be also divided in more detail based on the processes in the method (algorithm) itself:
Non-hierarchical:
- Centroid-based
- Model-based
- Density-based
- Grid-based
Hierarchical:
- Agglomerative
- Divisive
But which to choose in your use case? Let’s dive deeper into the most known methods and discuss their advantages and disadvantages.
Centroid-based
The most basic (maybe just for me) type of clustering method is centroid-based. This type of clustering creates prototypes of clusters - centroids or medoids.
The best well-known methods are:
- K-means
- K-medians
- K-medoids
- K-modes
K-means
Steps:
- Create random K clusters (and compute centroids).
- Assign points to the nearest centroids.
- Update centroids.
- Go to step 2 while the centroids are changing.
Pros and cons:
- [+] Fast to compute. Easy to understand.
- [-] Various initial clusters can lead to different final clustering.
- [-] Scale-dependent.
- [-] Creates only convex (spherical) shapes of clusters.
- [-] Sensitive to outliers.
K-means - example
It is very easy to try K-means in R (by the kmeans
function), only needed parameter is a number of clusters.
We can see example, when K-means fails most often, so when there are outliers in the dataset.
K-medoids
The problem with outliers solves K-medoids because prototypes are medoids - members of the dataset. So, not artificially created centroids, which helps to tackle outliers.
Pros and cons:
- [+] Easy to understand.
- [+] Less sensitive to outliers.
- [+] Possibility to use any distance measure.
- [-] Various initial clusters can lead to different final clustering.
- [-] Scale-dependent.
- [-] Slower than K-means.
K-medoids - example
K-medoids problem can be solved by the Partition Around Medoids (PAM) algorithm (function pam
in cluster
package).
We can see that medoids stayed nicely in the three main groups of data.
The determination of the number of clusters
The disadvantage of centroid-based methods is that the number of clusters needs to be known in advance (it is a parameter of the methods). However, we can determine the number of clusters by its Internal validation (index). Basic steps are based on that we compute some internal validation index with many ( K ) and we choose ( K ) with the best index value.
Many indexes are there…
- Silhouette
- Davies-Bouldin index
- Dunn index
- etc.
However, every index has similar characteristics:
\[\frac{within-cluster-similarity}{between-clusters-similarity} .\]so, it is the ratio of the average distances in clusters and between clusters.
Elbow diagram
The Elbow diagram is a simple method (rule) how to determine the number of clusters - we compute the internal index with a set of K and choose K where positive change is largest.
So for example, I chose the Davies-Bouldin index implemented in the clusterCrit
package. For our simple dataset, I will generate clusterings with 2-6 number of clusters and compute the index.
We can see that the Elbow diagram rule chose 4 clusters - makes sense to me actually…
We can also try it with PAM - K-medoids.
It is the same result.
Model-based
Model-based clustering methods are based on some probabilistic distribution. It can be:
- Gaussian normal distribution
- Gamma distribution
- Student’s t-distribution
- Poisson distribution
- etc.
Since we cluster multivariate data, model-based clustering uses Multivariate distributions and a so-called Mixture of models (Mixtures -> clusters). When using clustering with Gaussian normal distribution, we are using the theory of Gaussian Mixture Models (GMM).
GMM
The target is to maximize likelihood: \(L(\boldsymbol{\mu_1}, \dots, \boldsymbol{\mu_k}, \boldsymbol{\Sigma_1}, \dots, \boldsymbol{\Sigma_k} | \boldsymbol{x_1}, \dots, \boldsymbol{x_n}).\) Here, cluster is represented by mean (( \mathbf{\mu} )) and covariance matrix (( \mathbf{\Sigma} )). So not just centroid as in the case of K-means.
This optimization problem is typically solved by the EM algorithm (Expectation Maximization).
Pros and cons:
- [+] Ellipsoidal clusters,
- [+] Can be parameterized by covariance matrix,
- [+] Scale-independent,
- [-] Very slow for high-dimensional data,
- [-] Can be difficult to understand.
EM algorithm with GMM is implemented in the mclust
package. You can optimize various shapes of mixtures (clusters) by the modelNames
parameter (check the ?mclustModelNames
function for more details).
Pretty interesting red ellipse that was created, but generally clustering is OK.
BIC
The Bayesian Information Criterion (BIC) for choosing the optimal number of clusters can be used with model-based clustering.
In the mclust
package, you can just add multiple modelNames
and it chooses by BIC the best one.
We can try also to vary the dependency of covariance matrix ( \mathbf{\Sigma} ).
The result:
So, the methodology chose 6 clusters - 3 main groups of data and all 3 anomalies in separate clusters.
Density-based
Density-based clusters are based on maximally connected components of the set of points that lie within some defined distance from some core object.
Methods:
DBSCAN
In the well-known method DBSCAN, density is defined as neighborhood, where points have to be reachable within a defined distance (( \epsilon ) distance - first parameter of the method), however, clusters must have at least some number of minimal points (second parameter of the method). Points that weren’t connected with any cluster and did not pass the minimal points criterion are marked as noise (outliers).
Pros and cons:
- [+] Extracts automatically outliers,
- [+] Fast to compute,
- [+] Can find clusters of arbitrary shapes,
- [+] The number of clusters is determined automatically based on data,
- [-] Parameters (( \epsilon ), minPts) must be set by a practitioner,
- [-] Possible problem with neighborhoods - can be connected.
DBSCAN is implemented in the same named function and package, so let’s try it.
We can see that DBSCAN found 3 clusters and 3 outliers correctly when parameters are wisely chosen.
Bananas - DBSCAN result
To demonstrate the strength of DBSCAN, researchers created many dummy artificial datasets, which are many times called bananas.
Bananas - K-means result
K-means here is not a good choice obviously…but these datasets are far from real-world either.
Spectral clustering
Spectral clustering methods are based on the spectral decomposition of data, so the creation of eigen vectors and eigen values.
Steps:
- N = number of data, d = dimension of data,
- ( \mathbf{A} ) = affinity matrix, ( A_{ij} = \exp(- (data_i - data_j)^2 / (2*\sigma^2) ) ) - N by N matrix,
- ( \mathbf{D} ) = diagonal matrix whose (i,i)-element is the sum of ( \mathbf{A} ) i-th row - N by N matrix,
- ( \mathbf{L} ) = ( \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2} ) - N by N matrix,
- ( \mathbf{X} ) = union of k largest eigenvectors of ( \mathbf{L} ) - N by k matrix,
- Renormalising each of ( \mathbf{X} ) rows to have unit length - N by k matrix,
- Run K-means algorithm on ( \mathbf{X} ).
Typical use case for spectral clustering
We will try spectral clustering on the Spirals artificial dataset.
Spectral clustering is implemented in the kernlab
package and specc
function.
Let’s it try on more advanced data - compound data.
This is not a good result, let’s try DBSCAN.
Again, the nice result for DBSCAN on the artificial dataset.
Hierarchical clustering
The result of a hierarchical clustering is a dendrogram. The dendrogram can be cut at any height to form a partition of the data into clusters. How data points are connected in the dendrogram has multiple possible ways (linkages) and criteria:
- Single-linkage
- Complete-linkage
- Average-linkage
- Centroid-linkage
- Ward’s minimum variance method
- etc.
Criteria:
- single-linkage: ( \min { d(a,b):a\in A, b\in B } )
- complete-linkage: ( \max { d(a,b):a\in A, b\in B } )
-
average-linkage: ( \frac{1}{ A B }\sum_{a\in A}\sum_{b\in B}d(a,b) ) -
centroid-linkage: ( c_t - c_s ), where ( c_s ) and ( c_t ) are the centroids of clusters ( s ) and ( t ).
where ( d(a,b) ) is the distance between points ( a ) and ( b ).
IRIS dataset use case
Let’s try hierarchical clustering on the IRIS dataset.
Single linkage:
Complete linkage:
Average linkage:
Let’s compute the precision of these three clusterings with the clusterCrit
package:
The best results were obtained with average linkage with precision of 81.9%.
Connected data
I have prepared for you the last use case for most shown methods where data (and clusters) are closely connected, so the closest scenario of real data.
DBSCAN - result for connected data
Chosen parameters are ( \epsilon = 0.08 ), ( minPts = 18 ).
The result is quite good enough, where the main three core groups are identified. Let’s change minPts to 10.
We can see a use case where DBSCAN is very sensitive to parameter settings, and you have to be very careful with some automatic settings of these parameters (in your use cases).
K-means - result for connected data
Nice result to be fair for this simple method.
Gaussian model-based clustering result
Almost perfect result, but due to the normality of sampled data.
Spectral clustering result
Very nice result again!
Other types of clustering methods
Other types of clustering methods that were not covered in this blog post are:
- Grid-based
- Subspace clustering
- Multi-view clustering
- Based on artificial neural networks (e.g. SOM)
- Consensus (ensemble) clustering
- Data stream(s) clustering
- etc.
Conclusions
- We have many types of clustering methods
- Different datasets need different clustering methods
- Automatic determination of a number of clusters can be tricky
- Real datasets are usually connected - density-based methods can fail
- Outliers (anomalies) can significantly influence clustering results - the solution is to preprocess the data or use density-based clustering
- Some methods are not suited for large (or high-dimensional) datasets - model-based or spectral clustering
- Some methods are not suited for non-convex clusters - K-means, basic hierarchical clustering
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.