Introducing mlr3cluster: Cluster Analysis Package
Tired of learning to use multiple packages to access clustering algorithms?
Does using different packages make it difficult to compare the performance of clusterers?
Wouldn't it be great to have just one package that makes interfacing with all things clustering easy?
mlr3cluster to the rescue!
mlr3cluster is a cluster analysis extension package within the mlr3 ecosystem. It is a successor of mlr’s cluster capabilities in spirit and functionality.
In order to understand the following introduction and tutorial you need to be familiar with R6 and mlr3 basics. See chapters 1-2 of the mlr3book if you need a refresher.
Installation
To install the package, run the following code chunk:
install.packages("mlr3cluster")
Getting Started
Assuming you know all the basics and you’ve installed the package, here’s an example of how to perform k-means clustering on the classic USArrests data set (available as the built-in task "usarrests"):
library(mlr3)
library(mlr3cluster)
task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$train(task)
preds = learner$predict(task = task)
preds
## <PredictionClust> for 50 observations:
##     row_id partition
##          1         1
##          2         1
##          3         1
## ---
##         48         2
##         49         2
##         50         2
Integrated Learners
What built-in clusterers does the package come with? Here is a list of integrated learners:
mlr_learners$keys("clust")
## [1] "clust.agnes"       "clust.cmeans"      "clust.dbscan"
## [4] "clust.diana"       "clust.fanny"       "clust.featureless"
## [7] "clust.kmeans"      "clust.pam"         "clust.xmeans"
The package contains all the basic types of clusterers: partitional, hierarchical, density-based and fuzzy. Below is a detailed list of all the learners.
ID | Learner | Package |
---|---|---|
clust.agnes | Agglomerative Hierarchical Clustering | cluster |
clust.cmeans | Fuzzy C-Means Clustering | e1071 |
clust.dbscan | Density-based Clustering | dbscan |
clust.diana | Divisive Hierarchical Clustering | cluster |
clust.fanny | Fuzzy Clustering | cluster |
clust.featureless | Simple Featureless Clustering | mlr3cluster |
clust.kmeans | K-Means Clustering | stats |
clust.pam | Clustering Around Medoids | cluster |
clust.xmeans | K-Means with Automatic Determination of k | RWeka |
Integrated Measures
List of integrated cluster measures:
mlr_measures$keys("clust")
## [1] "clust.ch"         "clust.db"         "clust.dunn"
## [4] "clust.silhouette"
Below is a detailed list of all the integrated measures.
ID | Measure | Package |
---|---|---|
clust.db | Davies-Bouldin Cluster Separation | clusterCrit |
clust.dunn | Dunn index | clusterCrit |
clust.ch | Calinski Harabasz Pseudo F-Statistic | clusterCrit |
clust.silhouette | Rousseeuw’s Silhouette Quality Index | clusterCrit |
Integrated Tasks
There is only one built-in Task in the package:
mlr_tasks$get("usarrests")
## <TaskClust:usarrests> (50 x 4)
## * Target: -
## * Properties: -
## * Features (4):
##   - int (2): Assault, UrbanPop
##   - dbl (2): Murder, Rape
As you can see, the biggest difference between clustering tasks and the rest of the tasks in mlr3 is the absence of the Target column.
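Since only one task ships with the package, you will usually want to build your own. The following is a minimal sketch, assuming the `TaskClust$new()` constructor accepts a plain data.frame as backend; the id "iris_features" and the choice of the iris data are purely illustrative and not part of the package.

```r
library(mlr3)
library(mlr3cluster)

# Keep only the four numeric columns of iris; the Species label is dropped
# because a clustering task has no target column.
features = iris[, 1:4]

# "iris_features" is a made-up id used only for this illustration.
task_iris = TaskClust$new(id = "iris_features", backend = features)

task_iris$nrow
task_iris$feature_names
```

The resulting task can be passed to any clustering learner in the same way as the built-in "usarrests" task.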
Hyperparameters
Setting hyperparameters for clusterers is as easy as setting parameters for any other mlr3 learner:
task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$param_set
## <ParamSet>
##           id    class lower upper                             levels
## 1:   centers ParamUty    NA    NA
## 2:  iter.max ParamInt     1   Inf
## 3: algorithm ParamFct    NA    NA Hartigan-Wong,Lloyd,Forgy,MacQueen
## 4:    nstart ParamInt     1   Inf
## 5:     trace ParamInt     0   Inf
##          default value
## 1:             2     2
## 2:            10
## 3: Hartigan-Wong
## 4:             1
## 5:             0
learner$param_set$values = list(centers = 3L, algorithm = "Lloyd", iter.max = 100L)
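Note that assigning a whole list to `param_set$values` replaces every value that was set before. As a small sketch of an alternative, hyperparameters can be passed directly to the `lrn()` shortcut from mlr3 and then adjusted one at a time; the specific values below are arbitrary.

```r
library(mlr3)
library(mlr3cluster)

# Set hyperparameters at construction time via the lrn() shortcut
learner = lrn("clust.kmeans", centers = 3L)

# Change a single value without touching the others
learner$param_set$values$iter.max = 100L

# Inspect what is currently set
learner$param_set$values
```

Updating single entries this way keeps `centers = 3L` in place, which is convenient when tweaking one setting between runs.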
Train and Predict
The “train” method simply creates a model with cluster assignments for the data, while the “predict” method’s functionality varies depending on the clusterer in question. Read each learner’s documentation for details.
For example, the kmeans learner’s predict method uses clue::cl_predict, which performs cluster assignments for new data by looking at the “closest” neighbors of the new observations.
Following the example from the previous section:
task = mlr_tasks$get("usarrests")
train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)
learner = mlr_learners$get("clust.kmeans")
learner$train(task, row_ids = train_set)
preds = learner$predict(task, row_ids = test_set)
preds
## <PredictionClust> for 10 observations:
##     row_id partition
##          1         1
##          3         1
##          4         1
## ---
##         31         1
##         32         1
##         44         2
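To make the idea of assigning new observations to the “closest” cluster more concrete, here is a base R sketch that does it by hand for k-means: fit on a training subset, then give each held-out row the label of its nearest cluster center. This is an illustration of nearest-center assignment under the assumption of plain Euclidean distance, not a reproduction of what clue::cl_predict does internally; the seed and the variable names are arbitrary.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

x = as.matrix(USArrests)
train_rows = sample(nrow(x), 40)
test_rows = setdiff(seq_len(nrow(x)), train_rows)

# Fit k-means on the training rows only
fit = kmeans(x[train_rows, ], centers = 3)

# Squared Euclidean distance from every test row to every cluster center:
# one column per center, one row per held-out observation
sq_dist = sapply(seq_len(nrow(fit$centers)), function(k) {
  rowSums(sweep(x[test_rows, , drop = FALSE], 2, fit$centers[k, ])^2)
})

# Each held-out observation gets the index of its nearest center
nearest = apply(sq_dist, 1, which.min)
nearest
```

Other clusterers may have no natural notion of a center, which is one reason the behavior of “predict” differs between learners.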
Benchmarking and Evaluation
To assess the quality of any machine learning experiment, you need to choose an evaluation metric that makes the most sense. Let’s design an experiment that will allow you to compare the performance of three different clusterers on the same task. The mlr3 package provides benchmarking functionality that lets you create such experiments.
# design an experiment by specifying task(s), learner(s), resampling method(s)
design = benchmark_grid(
  tasks = tsk("usarrests"),
  learners = list(
    lrn("clust.kmeans", centers = 3L),
    lrn("clust.pam", k = 3L),
    lrn("clust.cmeans", centers = 3L)),
  resamplings = rsmp("holdout"))
print(design)
##               task                  learner              resampling
## 1: <TaskClust[41]> <LearnerClustKMeans[30]> <ResamplingHoldout[19]>
## 2: <TaskClust[41]>    <LearnerClustPAM[30]> <ResamplingHoldout[19]>
## 3: <TaskClust[41]> <LearnerClustCMeans[30]> <ResamplingHoldout[19]>
# execute benchmark
bmr = benchmark(design)
## INFO [18:11:46.839] Benchmark with 3 resampling iterations
## INFO [18:11:46.883] Applying learner 'clust.kmeans' on task 'usarrests' (iter 1/1)
## INFO [18:11:46.894] Applying learner 'clust.pam' on task 'usarrests' (iter 1/1)
## INFO [18:11:46.907] Applying learner 'clust.cmeans' on task 'usarrests' (iter 1/1)
## INFO [18:11:46.923] Finished benchmark
# define measure
measures = list(msr("clust.silhouette"))
bmr$aggregate(measures)
##         resample_result nr   task_id   learner_id resampling_id iters
## 1: <ResampleResult[19]>  1 usarrests clust.kmeans       holdout     1
## 2: <ResampleResult[19]>  2 usarrests    clust.pam       holdout     1
## 3: <ResampleResult[19]>  3 usarrests clust.cmeans       holdout     1
##    clust.silhouette
## 1:        0.4263124
## 2:        0.4986824
## 3:        0.5038553
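A full benchmark is not needed if you only want to evaluate one clustering. As a minimal sketch, a single prediction can be scored directly; this assumes that clustering measures need the task passed to `$score()` because the measure is computed from the feature values, and that the package backing the silhouette measure is installed. The name `sil` is just an illustrative variable.

```r
library(mlr3)
library(mlr3cluster)

task = tsk("usarrests")
learner = lrn("clust.kmeans", centers = 3L)
learner$train(task)
preds = learner$predict(task)

# Score the cluster assignments; the task supplies the feature values
sil = preds$score(msr("clust.silhouette"), task = task)
sil
```

The exact value depends on the random initialization of k-means, so expect it to vary between runs unless a seed is set.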
Visualization
How do you visualize clustering tasks and results?
The mlr3viz package (version >= 0.40) now provides that functionality.
install.packages("mlr3viz")
library(mlr3viz)
task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$param_set$values = list(centers = 3L)
learner$train(task)
preds = learner$predict(task)
# Task visualization
autoplot(task)
# Pairs plot with cluster assignments
autoplot(preds, task)
# Silhouette plot with mean silhouette value as reference line
autoplot(preds, task, type = "sil")
# Performing PCA on task data and showing cluster assignments
autoplot(preds, task, type = "pca")
Keep in mind that mlr3viz::autoplot also provides more options depending on the kind of plots you’re interested in. For example, to draw borders around clusters, provide appropriate parameters from ggfortify::autoplot.kmeans:
autoplot(preds, task, type = "pca", frame = TRUE)
You can also easily visualize dendrograms:
task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.agnes")
learner$train(task)
# Simple dendrogram
autoplot(learner)
# More advanced options from `factoextra::fviz_dend`
autoplot(learner,
  k = learner$param_set$values$k, rect_fill = TRUE,
  rect = TRUE, rect_border = c("red", "cyan"))
Further Development
If you have any issues with the package or would like to request a new feature, feel free to open an issue here.
Acknowledgements
I would like to thank the following people for their help and guidance: Michel Lang, Lars Kotthoff, Martin Binder, Patrick Schratz, Bernd Bischl.