Back in September 2016 I implemented the ClusterR package. One of the algorithms included in ClusterR was the ‘Partition Around Medoids’ (Cluster_Medoids) algorithm, which was based on the paper “Anja Struyf, Mia Hubert, Peter J. Rousseeuw, (Feb. 1997), Clustering in an Object-Oriented Environment, Journal of Statistical Software, Vol 1, Issue 4”. At that time I didn’t have access to the book by Kaufman and Rousseeuw, Finding Groups in Data (1990), where the exact algorithm is described, so I implemented the code and compared my results with the output of the cluster::pam() function, which was available back then. My implementation was therefore an approximate rather than an exact one. Recently, a user of the ClusterR package opened an issue mentioning that the results were not optimal compared to the cluster::pam() function. This prompted me to go through my code once again and to compare my results with newer R packages that did not exist at that time. Most of these R packages include a new version of the ‘Partition Around Medoids’ algorithm, based on “Erich Schubert, Peter J. Rousseeuw, ”Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms” 2019, <doi:10.1007/978-3-030-32047-8_16>”.
In this blog post, I’ll use the following R packages to compare the (as of December 2022) existing ‘Partition Around Medoids’ implementations in terms of output dissimilarity cost and elapsed time.
# required R packages

require(ClusterR)
## Loading required package: ClusterR
## Loading required package: gtools

require(cluster)
## Loading required package: cluster

require(kmed)
## Loading required package: kmed

require(fastkmedoids)
## Loading required package: fastkmedoids
## 
## Attaching package: 'fastkmedoids'
## The following object is masked from 'package:cluster':
## 
##     pam

require(sf)
## Loading required package: sf
## Linking to GEOS 3.7.1, GDAL 2.2.2, PROJ 4.9.2; sf_use_s2() is TRUE

require(data.table)
## Loading required package: data.table

require(geodist)
## Loading required package: geodist

require(glue)
## Loading required package: glue

require(magrittr)
## Loading required package: magrittr

require(ggplot2)
## Loading required package: ggplot2

require(mapview)
## Loading required package: mapview

require(knitr)
## Loading required package: knitr
Datasets
For comparison purposes, I’ll use the following datasets,
- a 2-column dataset using values of the Normal Distribution (standard deviation is equal to 0.25 and the mean parameter takes the value of 0 or 1)
- the ‘dietary_survey_IBS’ dataset which exists in the ClusterR package
- the ‘soybean’ dataset which exists in the ClusterR package
- the ‘agriculture’ dataset which exists in the cluster package
- a ‘geospatial’ dataset that shows how nicely the ‘Partition Around Medoids’ algorithm can cluster coordinate points based on a ‘geodesic’ distance matrix (assuming we can visually decide on the number of clusters)
The next function generates the previously mentioned datasets:
datasets_clust = function(data_name) {

  if (data_name == 'rnorm_data') {
    n = 100
    set.seed(1)
    x = rbind(matrix(rnorm(n, mean = 0, sd = 0.25), ncol = 2),
              matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2))
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'dietary_survey_IBS') {
    data(dietary_survey_IBS, package = 'ClusterR')
    x = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'soybean') {
    data(soybean, package = 'ClusterR')
    x = soybean[, -ncol(soybean)]
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'agriculture') {
    data(agriculture, package = 'cluster')
    x = agriculture
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'geospatial') {
    wkt_lst = list(black_sea = "POLYGON ((31.47957 43.64944, 31.47957 42.82356, 36.17885 42.82356, 36.17885 43.64944, 31.47957 43.64944))",
                   caspian_sea = "POLYGON ((49.23243 42.75324, 49.23243 41.99335, 51.54848 41.99335, 51.54848 42.75324, 49.23243 42.75324))",
                   mediterranean_sea = "POLYGON ((16.55062 35.08059, 16.55062 34.13804, 20.20771 34.13804, 20.20771 35.08059, 16.55062 35.08059))",
                   red_sea = "POLYGON ((36.61694 23.61829, 36.61694 22.60845, 38.18655 22.60845, 38.18655 23.61829, 36.61694 23.61829))")

    nam_wkts = names(wkt_lst)
    CRS_wkt = 4326
    sample_point_size = 250          # sample that many random points from each sea WKT polygon

    count = 1
    all_dfs = list()

    for (nam in nam_wkts) {
      WKT = wkt_lst[[nam]]
      read_wkt_inp = sf::st_as_sfc(WKT, crs = sf::st_crs(CRS_wkt))

      # to sample random points see: https://stackoverflow.com/a/70073632/8302386
      random_lat_lons = sf::st_sample(x = read_wkt_inp,
                                      size = sample_point_size,
                                      type = "random",
                                      crs = sf::st_crs(CRS_wkt))
      random_df = sf::st_coordinates(random_lat_lons)
      random_df = data.table::as.data.table(random_df)
      colnames(random_df) = c('lon', 'lat')

      random_df$sea = rep(x = nam, times = nrow(random_df))
      random_df$code = rep(x = count, times = nrow(random_df))

      all_dfs[[count]] = random_df
      count = count + 1
    }

    dat = data.table::rbindlist(all_dfs)
    dat_sf = sf::st_as_sf(dat, coords = c('lon', 'lat'), crs = CRS_wkt)

    # add also an outlier which is between the 'mediterranean', 'black' and 'red' sea
    # and will be assigned to the one that is closest in terms of distance
    outlier_lon = 34.8988917972
    outlier_lat = 35.0385655983

    dat_sf_update = data.frame(lon = outlier_lon, lat = outlier_lat)
    x_mt_update = geodist::geodist(x = dat[, 1:2], y = dat_sf_update, measure = 'geodesic')
    x_mt_update = data.table::data.table(x_mt_update)
    x_mt_update$sea = dat$sea
    x_mt_update = x_mt_update[order(x_mt_update$V1, decreasing = F), ]

    dat_sf_update_obs = data.frame(lon = outlier_lon,
                                   lat = outlier_lat,
                                   sea = x_mt_update$sea[1],
                                   code = unique(subset(dat, sea == x_mt_update$sea[1])$code))

    # it is assigned (based on distance) to the 'black sea' although it lies in the mediterranean
    dat_use = rbind(dat_sf_update_obs, dat)

    # leaflet map
    dat_sf_upd = sf::st_as_sf(dat_use, coords = c('lon', 'lat'), crs = CRS_wkt)
    mp = mapview::mapview(dat_sf_upd, zcol = 'sea', legend = TRUE)

    x = dat_use[, 1:2]
    lst_out = list(x = x,
                   dataset = data_name,
                   sea_names = dat_use$sea,
                   code = dat_use$code,
                   leaflet_map = mp)
  }
  else {
    stop(glue::glue("The dataset '{data_name}' does not exist!"), call. = FALSE)
  }

  return(lst_out)
}
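As a quick sanity check of the function above, a single call for the ‘rnorm_data’ case returns the data matrix and the dataset name (the expected dimensions follow from the code of the function):

# single call to the 'datasets_clust()' function defined above
dat_rnorm = datasets_clust(data_name = 'rnorm_data')

dim(dat_rnorm$x)            # 100 observations, 2 columns
dat_rnorm$dataset           # "rnorm_data"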
‘Partition Around Medoids’ R package implementations
The next function includes the
- cluster::pam(do.swap = TRUE, variant = ‘original’)
- cluster::pam(do.swap = TRUE, variant = ‘faster’)
- kmed::skm()
- fastkmedoids::pam()
- fastkmedoids::fastpam(initializer = “LAB”)
- ClusterR::Cluster_Medoids(swap_phase = TRUE)
implementations and allows comparing the dissimilarity cost and elapsed time for the mentioned datasets. In the ClusterR package, I included the ClusterR::cost_clusters_from_dissim_medoids() function, which takes
- a dissimilarity matrix
- the output medoids of each implementation
and returns the total dissimilarity cost.
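To make the cost concrete, here is a minimal sketch using hypothetical toy data and arbitrarily chosen medoid indices (purely for illustration): the total dissimilarity cost is the sum, over all observations, of each observation's dissimilarity to its closest medoid, which is what ClusterR::cost_clusters_from_dissim_medoids() returns.

# minimal illustration (hypothetical toy data): the total cost equals the sum of
# each observation's dissimilarity to its closest medoid
set.seed(1)
toy = matrix(rnorm(20), ncol = 2)                        # 10 observations, 2 columns
toy_dist = ClusterR::distance_matrix(toy, method = "euclidean", upper = TRUE, diagonal = TRUE)

toy_medoids = c(2, 7)                                    # hypothetical medoid indices
manual_cost = sum(apply(toy_dist[, toy_medoids], 1, min))

pkg_cost = ClusterR::cost_clusters_from_dissim_medoids(data = toy_dist, medoids = toy_medoids)$cost
all.equal(manual_cost, pkg_cost)                         # expected to be TRUE

The comparison function itself follows: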
compare_medoid_implementations = function(x,
                                          y = NULL,
                                          max_k = 15,
                                          geodesic_dist = FALSE,
                                          round_digits = 5,
                                          compute_rand_index = FALSE) {

  if (!geodesic_dist) {
    x = ClusterR::center_scale(data = x, mean_center = TRUE, sd_scale = TRUE)
    x_mt = ClusterR::distance_matrix(x, method = "euclidean", upper = TRUE, diagonal = TRUE)
  }
  else {
    x_mt = geodist::geodist(x = x, measure = 'geodesic')
  }

  dv = as.vector(stats::dist(x))       # compute distance matrix for 'fastkmedoids' (lower triangular part)

  results = cluster_out = list()

  for (k in 2:max_k) {

    # cluster [ 'original' ]
    t_start = proc.time()
    set.seed(k)
    pm = cluster::pam(x = x_mt, k, metric = "euclidean", do.swap = TRUE, stand = FALSE, variant = 'original')
    pm_secs = as.numeric((proc.time() - t_start)["elapsed"])
    pm_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = pm$id.med)$cost

    # cluster [ 'faster' ]
    t_start = proc.time()
    set.seed(k)
    pm_fast = cluster::pam(x = x_mt, k, metric = "euclidean", do.swap = TRUE, stand = FALSE, variant = 'faster')
    pm_fast_secs = as.numeric((proc.time() - t_start)["elapsed"])
    pm_fast_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = pm_fast$id.med)$cost

    # kmed
    t_start = proc.time()
    km = kmed::skm(distdata = x_mt, ncluster = k, seeding = k, iterate = 10)
    km_secs = as.numeric((proc.time() - t_start)["elapsed"])
    km_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = km$medoid)$cost

    # fastkmedoids ('pam' function)
    t_start = proc.time()
    set.seed(k)
    fkm = fastkmedoids::pam(rdist = dv, n = nrow(x), k = k)
    fkm_secs = as.numeric((proc.time() - t_start)["elapsed"])
    fkm_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = fkm@medoids + 1)$cost           # output indices correspond to Cpp indexing, thus add 1

    # fastkmedoids ('fastpam' function with 'initializer' set to 'LAB')
    t_start = proc.time()
    fkm_lab = fastkmedoids::fastpam(rdist = dv, n = nrow(x), k = k, initializer = "LAB", seed = k)
    fkm_lab_secs = as.numeric((proc.time() - t_start)["elapsed"])
    fkm_lab_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = fkm_lab@medoids + 1)$cost   # output indices correspond to Cpp indexing, thus add 1

    # ClusterR
    t_start = proc.time()
    clst = ClusterR::Cluster_Medoids(data = x_mt, clusters = k, verbose = FALSE, swap_phase = TRUE, seed = k)
    clst_secs = as.numeric((proc.time() - t_start)["elapsed"])
    clst_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = clst$medoids)$cost

    clust_algos = c('cluster_pam', 'cluster_pam_fast', 'kmed_skm', 'fastkmedoids_pam', 'fastkmedoids_fastpam', 'ClusterR_Medoids')

    dissim_costs = c(round(pm_cost, digits = round_digits),
                     round(pm_fast_cost, digits = round_digits),
                     round(km_cost, digits = round_digits),
                     round(fkm_cost, digits = round_digits),
                     round(fkm_lab_cost, digits = round_digits),
                     round(clst_cost, digits = round_digits))

    time_bench = c(round(pm_secs, digits = round_digits),
                   round(pm_fast_secs, digits = round_digits),
                   round(km_secs, digits = round_digits),
                   round(fkm_secs, digits = round_digits),
                   round(fkm_lab_secs, digits = round_digits),
                   round(clst_secs, digits = round_digits))

    dtbl = list(pam_R_function = clust_algos,
                dissim_cost = dissim_costs,
                timing = time_bench,
                k = rep(k, length(clust_algos)))

    if (compute_rand_index) {
      # rand-index (or accuracy)
      clust_acc = list(k = k,
                       cluster_pam = ClusterR::external_validation(true_labels = y, clusters = pm$clustering, method = 'rand_index'),
                       cluster_pam_fast = ClusterR::external_validation(true_labels = y, clusters = pm_fast$clustering, method = 'rand_index'),
                       kmed_skm = ClusterR::external_validation(true_labels = y, clusters = km$cluster, method = 'rand_index'),
                       fastkmedoids_pam = ClusterR::external_validation(true_labels = y, clusters = fkm@assignment, method = 'rand_index'),
                       fastkmedoids_fastpam = ClusterR::external_validation(true_labels = y, clusters = fkm_lab@assignment, method = 'rand_index'),
                       ClusterR_Medoids = ClusterR::external_validation(true_labels = y, clusters = clst$clusters, method = 'rand_index'))
      data.table::setDT(clust_acc)
      cluster_out[[k]] = clust_acc
    }

    data.table::setDT(dtbl)
    results[[k]] = dtbl
  }

  results = data.table::rbindlist(results)

  if (compute_rand_index) {
    return(list(results = results, cluster_out = cluster_out))
  }
  else {
    return(results)
  }
}
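As a quick illustration of a single call (before running the full benchmark below), the function can be applied to just one dataset; the following snippet uses the ‘agriculture’ dataset and an arbitrarily small max_k, purely as an assumed example of the workflow:

# single-dataset illustration (small 'max_k' chosen arbitrarily for this example)
agri = datasets_clust(data_name = 'agriculture')
res_agri = suppressWarnings(compare_medoid_implementations(x = agri$x, max_k = 5))
head(res_agri)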
In a for-loop, we'll iterate over the datasets and the ‘Partition Around Medoids’ implementations and save the output results to a list object,
dataset_names = c('rnorm_data', 'dietary_survey_IBS', 'soybean', 'agriculture', 'geospatial')

lst_all = list()
geo_flag = FALSE
y = NULL

for (dat_name in dataset_names) {

  iter_dat = datasets_clust(data_name = dat_name)
  cat(glue::glue("Dataset: '{dat_name}' Number of rows: {nrow(iter_dat$x)}"), '\n')

  if (dat_name == 'geospatial') {
    mp_view = iter_dat$leaflet_map
    y = iter_dat$code
    geo_flag = TRUE
  }
  else {
    geo_flag = FALSE
  }

  iter_out = suppressWarnings(compare_medoid_implementations(x = iter_dat$x,
                                                             y = y,
                                                             max_k = 10,
                                                             geodesic_dist = geo_flag,
                                                             round_digits = 5,
                                                             compute_rand_index = geo_flag))
  lst_all[[dat_name]] = iter_out
}

## Dataset: 'rnorm_data' Number of rows: 100 
## Dataset: 'dietary_survey_IBS' Number of rows: 400 
## Dataset: 'soybean' Number of rows: 307 
## Dataset: 'agriculture' Number of rows: 12 
## Dataset: 'geospatial' Number of rows: 1001
then we'll use a ggplot2-based function to visualize the dissimilarity cost and elapsed time for each k from 2 to 10:
multi_plot_data = function(data_index, y_column, digits = 2, size_geom_bar_text = 7) {

  data_index$k = factor(data_index$k)                                          # convert the 'k' column to factor
  levels(data_index$k) = as.factor(glue::glue("k = {levels(data_index$k)}"))
  data_index[[y_column]] = round(data_index[[y_column]], digits = digits)

  plt = ggplot2::ggplot(data_index, ggplot2::aes(x = factor(pam_R_function), y = .data[[y_column]], fill = factor(pam_R_function))) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::facet_wrap(~k, scales = "free_y") +
    ggplot2::geom_text(ggplot2::aes(label = .data[[y_column]]),
                       position = ggplot2::position_dodge(width = 0.9),
                       col = 'black',
                       fontface = 'bold',                                      # 'fontface' (not 'face') is the valid geom_text() parameter
                       size = size_geom_bar_text) +
    ggplot2::theme(panel.background = ggplot2::element_rect(fill = "white", colour = "#6D9EC1", size = 3, linetype = "solid"),
                   strip.text.x = ggplot2::element_text(size = 20, colour = 'black', face = 'bold'),
                   strip.background = ggplot2::element_rect(fill = 'orange', colour = 'black'),
                   plot.title = ggplot2::element_text(size = 19, hjust = 0.5),
                   axis.text = ggplot2::element_text(size = 19, face = 'bold'),
                   axis.text.x = ggplot2::element_text(size = 25, face = 'bold', angle = 35, hjust = 1),
                   axis.title = ggplot2::element_text(size = 19, face = 'bold'),
                   axis.title.x = ggplot2::element_blank(),
                   panel.grid = ggplot2::element_blank(),
                   legend.position = "none")
  return(plt)
}
Visualization of the dissimilarity cost for each one of the datasets
rnorm_data dataset
y_axis = 'dissim_cost'
data_name = 'rnorm_data'

multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
dietary_survey_IBS dataset
data_name = 'dietary_survey_IBS'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
soybean dataset
data_name = 'soybean'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
agriculture dataset
data_name = 'agriculture'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
geospatial dataset
data_name = 'geospatial'
multi_plot_data(data_index = lst_all[[data_name]]$results, y_column = y_axis, size_geom_bar_text = 5)
For the ‘geospatial’ data I also computed the Rand index (or accuracy), and for k = 4 (which is the actual number of clusters, as the following leaflet map shows) we get a perfect clustering (a Rand index of 1.0). I also included a single outlier (a pair of coordinates) located between the ‘mediterranean’, ‘black’ and ‘red’ seas, which is assigned to the ‘black’ sea based on distance (this happens in all implementations):
dtbl_rand_index = data.table::rbindlist(lst_all[[data_name]]$cluster_out)
knitr::kable(round(dtbl_rand_index, digits = 3))
k | cluster_pam | cluster_pam_fast | kmed_skm | fastkmedoids_pam | fastkmedoids_fastpam | ClusterR_Medoids |
---|---|---|---|---|---|---|
2 | 0.624 | 0.624 | 0.624 | 0.624 | 0.624 | 0.624 |
3 | 0.874 | 0.874 | 0.875 | 0.846 | 0.846 | 0.875 |
4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
5 | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 | 0.969 |
6 | 0.938 | 0.938 | 0.937 | 0.938 | 0.938 | 0.938 |
7 | 0.927 | 0.906 | 0.907 | 0.927 | 0.927 | 0.927 |
8 | 0.917 | 0.896 | 0.896 | 0.917 | 0.917 | 0.917 |
9 | 0.886 | 0.886 | 0.865 | 0.885 | 0.885 | 0.885 |
10 | 0.875 | 0.854 | 0.855 | 0.875 | 0.875 | 0.855 |
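For reference, the Rand index values in the table above were computed with the ClusterR::external_validation() function; a minimal sketch with hypothetical label vectors (not related to the ‘geospatial’ data) shows how it is called:

# hypothetical ground-truth labels and cluster assignments, for illustration only
true_labels = c(1, 1, 1, 2, 2, 2, 3, 3)
clusters = c(1, 1, 2, 2, 2, 2, 3, 3)

ClusterR::external_validation(true_labels = true_labels, clusters = clusters, method = 'rand_index')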
The leaflet map below shows the 4 clusters of the geospatial dataset, which give a Rand index of 1.0:
mp_view
Visualization of the elapsed time for each one of the datasets
rnorm_data dataset
y_axis = 'timing'
data_name = 'rnorm_data'

multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
dietary_survey_IBS dataset
data_name = 'dietary_survey_IBS'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
soybean dataset
data_name = 'soybean'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
agriculture dataset
data_name = 'agriculture'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)
geospatial dataset
data_name = 'geospatial'
multi_plot_data(data_index = lst_all[[data_name]]$results, y_column = y_axis)
Conclusions
- Using different datasets and a range of values for k (from 2 to 10), we see that even the exact algorithm does not always give the lowest dissimilarity cost. Moreover, the dissimilarity cost differs between the various implementations across the different ‘k’ values and the included datasets. This probably has to do with the fact that most of these implementations are approximate and are also implemented differently.
- Regarding the elapsed time, the bar plots show that the fastkmedoids::fastpam() function returns the results faster than the other implementations in all cases, and this is most obvious in the ‘geospatial’ dataset, a medium-sized dataset of approximately 1,000 observations.
Notes
All ‘Partition Around Medoids’ functions take a dissimilarity matrix as input rather than the initial input data; therefore, the elapsed time does not include the computation of the dissimilarity (or distance) matrix.
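If the cost of computing the dissimilarity matrix is of interest, it can be timed separately; the following is a minimal sketch (using the ‘rnorm_data’ dataset from above and the same functions used in the benchmark) and is meant only as an illustration:

# time the dissimilarity matrix computation separately (illustration only)
dat_rnorm = datasets_clust(data_name = 'rnorm_data')

t_start = proc.time()
x_scaled = ClusterR::center_scale(data = dat_rnorm$x, mean_center = TRUE, sd_scale = TRUE)
x_dist = ClusterR::distance_matrix(x_scaled, method = "euclidean", upper = TRUE, diagonal = TRUE)
as.numeric((proc.time() - t_start)["elapsed"])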