
Comparison of ‘Partition Around Medoids’ R Implementations


Back in September 2016 I implemented the ClusterR package. One of the algorithms included in ClusterR is the ‘Partition Around Medoids’ algorithm (Cluster_Medoids), which I based on the paper “Anja Struyf, Mia Hubert, Peter J. Rousseeuw (Feb. 1997), Clustering in an Object-Oriented Environment, Journal of Statistical Software, Vol. 1, Issue 4”. At that time I didn’t have access to the book by Kaufman and Rousseeuw, Finding Groups in Data (1990), where the exact algorithm is described, so I implemented the code and compared my results with the output of the cluster::pam() function, which was available back then. My method was therefore not an exact but an approximate one. Recently, a user of the ClusterR package opened an issue mentioning that the results were not optimal compared to the cluster::pam() function. This allowed me to go through my code once again and to compare my results with newer R packages that did not exist at that time. Most of these R packages include a new version of the ‘Partition Around Medoids’ algorithm, described in “Erich Schubert, Peter J. Rousseeuw, Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms, 2019, <doi:10.1007/978-3-030-32047-8_16>”.

In this blog post, I’ll use the ClusterR, cluster, kmed and fastkmedoids R packages (plus a few helper packages, all loaded below) to compare the existing (as of December 2022) ‘Partition Around Medoids’ implementations in terms of output dissimilarity cost and elapsed time.


# required R packages

require(ClusterR)

## Loading required package: ClusterR

## Loading required package: gtools

require(cluster)

## Loading required package: cluster

require(kmed)

## Loading required package: kmed

require(fastkmedoids)

## Loading required package: fastkmedoids

## 
## Attaching package: 'fastkmedoids'

## The following object is masked from 'package:cluster':
## 
##     pam

require(sf)

## Loading required package: sf

## Linking to GEOS 3.7.1, GDAL 2.2.2, PROJ 4.9.2; sf_use_s2() is TRUE

require(data.table)

## Loading required package: data.table

require(geodist)

## Loading required package: geodist

require(glue)

## Loading required package: glue

require(magrittr)

## Loading required package: magrittr

require(ggplot2)

## Loading required package: ggplot2

require(mapview)

## Loading required package: mapview

require(knitr)

## Loading required package: knitr


Datasets


For comparison purposes, I’ll use the following datasets:

- ‘rnorm_data’: simulated 2-dimensional, normally distributed data consisting of two groups
- ‘dietary_survey_IBS’: the dietary survey data of the ClusterR package
- ‘soybean’: the soybean data of the ClusterR package
- ‘agriculture’: the agriculture data of the cluster package
- ‘geospatial’: random geospatial points sampled from bounding boxes over the black, caspian, mediterranean and red seas (plus a single outlier)

The next function returns each of the previously mentioned datasets by name,


datasets_clust = function(data_name) {
  
  if (data_name == 'rnorm_data') {
    n = 100
    set.seed(1)
    x = rbind(matrix(rnorm(n, mean = 0, sd = 0.25), ncol = 2),
              matrix(rnorm(n, mean = 1, sd = 0.25), ncol = 2))
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'dietary_survey_IBS') {
    data(dietary_survey_IBS, package = 'ClusterR')
    x = dietary_survey_IBS[, -ncol(dietary_survey_IBS)]
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'soybean') {
    data(soybean, package = 'ClusterR')
    x = soybean[, -ncol(soybean)]
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'agriculture') {
    data(agriculture, package = 'cluster')
    x = agriculture
    lst_out = list(x = x, dataset = data_name)
  }
  else if (data_name == 'geospatial') {
    wkt_lst = list(black_sea = "POLYGON ((31.47957 43.64944, 31.47957 42.82356, 36.17885 42.82356, 36.17885 43.64944, 31.47957 43.64944))", 
                   caspian_sea = "POLYGON ((49.23243 42.75324, 49.23243 41.99335, 51.54848 41.99335, 51.54848 42.75324, 49.23243 42.75324))", 
                   mediterranean_sea = "POLYGON ((16.55062 35.08059, 16.55062 34.13804, 20.20771 34.13804, 20.20771 35.08059, 16.55062 35.08059))", 
                   red_sea = "POLYGON ((36.61694 23.61829, 36.61694 22.60845, 38.18655 22.60845, 38.18655 23.61829, 36.61694 23.61829))")
   
    nam_wkts = names(wkt_lst)
    CRS_wkt = 4326
    sample_point_size = 250              # sample that many random points from each sea WKT polygon
    
    count = 1
    all_dfs = list()
    
    for (nam in nam_wkts) {
      WKT = wkt_lst[[nam]]
      read_wkt_inp = sf::st_as_sfc(WKT, crs = sf::st_crs(CRS_wkt))
      # to sample random points see:  https://stackoverflow.com/a/70073632/8302386
      random_lat_lons = sf::st_sample(x = read_wkt_inp, size = sample_point_size, type = "random", crs = sf::st_crs(CRS_wkt))
      random_df = sf::st_coordinates(random_lat_lons)
      random_df = data.table::as.data.table(random_df)
      colnames(random_df) = c('lon', 'lat')
      random_df$sea = rep(x = nam, times = nrow(random_df))
      random_df$code = rep(x = count, times = nrow(random_df))
      all_dfs[[count]] = random_df
      
      count = count + 1
    }
    
    dat = data.table::rbindlist(all_dfs)
    dat_sf = sf::st_as_sf(dat, coords = c('lon', 'lat'), crs = CRS_wkt)

    # also add an outlier that lies between the 'mediterranean', 'black' and 'red' seas
    # and will be assigned to whichever sea is closest in terms of distance
    outlier_lon = 34.8988917972
    outlier_lat = 35.0385655983
    dat_sf_update = data.frame(lon = outlier_lon, lat = outlier_lat)
    
    x_mt_update = geodist::geodist(x = dat[, 1:2], y = dat_sf_update, measure = 'geodesic')
    x_mt_update = data.table::data.table(x_mt_update)
    x_mt_update$sea = dat$sea
    x_mt_update = x_mt_update[order(x_mt_update$V1, decreasing = F), ]

    dat_sf_update_obs = data.frame(lon = outlier_lon, 
                                   lat = outlier_lat, 
                                   sea = x_mt_update$sea[1], 
                                   code = unique(subset(dat, sea == x_mt_update$sea[1])$code))   # based on distance it is assigned to the 'black sea' although it lies in the mediterranean
    dat_use = rbind(dat_sf_update_obs, dat)
    
    # leaflet map
    dat_sf_upd = sf::st_as_sf(dat_use, coords = c('lon', 'lat'), crs = CRS_wkt)
    mp = mapview::mapview(dat_sf_upd, zcol = 'sea', legend = TRUE)

    x = dat_use[, 1:2]
    lst_out = list(x = x, 
                   dataset = data_name, 
                   sea_names = dat_use$sea,
                   code = dat_use$code,
                   leaflet_map = mp)
  }
  else {
    stop(glue::glue("The dataset '{data_name}' does not exist!"), call. = FALSE)
  }
  
  return(lst_out)
}
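
For example, a quick check of the returned structure (not part of the original benchmark),


dat_norm = datasets_clust(data_name = 'rnorm_data')
dim(dat_norm$x)          # 100 rows, 2 columns of simulated data
dat_norm$dataset         # "rnorm_data"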


‘Partition Around Medoids’ R package implementations


The next function includes the

- cluster::pam() (‘original’ variant)
- cluster::pam() (‘faster’ variant)
- kmed::skm()
- fastkmedoids::pam()
- fastkmedoids::fastpam() (with the ‘LAB’ initializer)
- ClusterR::Cluster_Medoids()

implementations and will allow comparing the dissimilarity cost and elapsed time for the mentioned datasets. In the ClusterR package, I included the ClusterR::cost_clusters_from_dissim_medoids() function, which takes

- the dissimilarity matrix and
- the computed medoid indices

as input and returns the total dissimilarity cost.
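
To make explicit what this cost is, here is a minimal sketch (not part of the original benchmark; the medoid indices below are arbitrary) showing that it equals the sum, over all observations, of the dissimilarity to the closest medoid,


set.seed(1)
x_small = matrix(rnorm(20), ncol = 2)
x_small_mt = ClusterR::distance_matrix(x_small, method = "euclidean", upper = TRUE, diagonal = TRUE)
medoids_small = c(2, 7)                 # arbitrary medoid indices, for illustration only

# manual cost: distance of each observation to its closest medoid
cost_manual = sum(apply(x_small_mt[, medoids_small, drop = FALSE], 1, min))

# the ClusterR function should return the same value
cost_pkg = ClusterR::cost_clusters_from_dissim_medoids(data = x_small_mt, medoids = medoids_small)$cost
all.equal(cost_manual, cost_pkg)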


compare_medoid_implementations = function(x,
                                          y = NULL,
                                          max_k = 15,
                                          geodesic_dist = FALSE,
                                          round_digits = 5,
                                          compute_rand_index = FALSE) {
  if (!geodesic_dist) {
    x = ClusterR::center_scale(data = x, mean_center = TRUE, sd_scale = TRUE)
    x_mt = ClusterR::distance_matrix(x, method = "euclidean", upper = TRUE, diagonal = TRUE)
  }
  else {
    x_mt = geodist::geodist(x = x, measure = 'geodesic')
  }
  
  dv = x_mt[lower.tri(x_mt)]        # lower triangular part (column by column) for 'fastkmedoids'; taken from 'x_mt' so that the geodesic case also uses geodesic (not euclidean) distances
  results = cluster_out = list()
  
  for (k in 2:max_k) {

    # cluster [ 'original' ]
    t_start = proc.time()
    set.seed(k)
    pm = cluster::pam(x = x_mt, k = k, diss = TRUE, do.swap = TRUE, variant = 'original')       # 'diss = TRUE' so that 'x_mt' is treated as a dissimilarity matrix
    pm_secs = as.numeric((proc.time() - t_start)["elapsed"])
    pm_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = pm$id.med)$cost
    
    # cluster [ 'faster' ]
    t_start = proc.time()
    set.seed(k)
    pm_fast = cluster::pam(x = x_mt, k = k, diss = TRUE, do.swap = TRUE, variant = 'faster')
    pm_fast_secs = as.numeric((proc.time() - t_start)["elapsed"])
    pm_fast_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = pm_fast$id.med)$cost
    
    # kmed
    t_start = proc.time()
    km = kmed::skm(distdata = x_mt, ncluster = k, seeding = k, iterate = 10)
    km_secs = as.numeric((proc.time() - t_start)["elapsed"])
    km_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = km$medoid)$cost
    
    # fastkmedoids  ('pam' function)
    t_start = proc.time()
    set.seed(k)
    fkm = fastkmedoids::pam(rdist = dv, n = nrow(x), k = k)
    fkm_secs = as.numeric((proc.time() - t_start)["elapsed"])
    fkm_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = fkm@medoids + 1)$cost       # output indices correspond to Cpp indexing, thus add 1

    # fastkmedoids  ('fastpam' function with 'initializer' set to 'LAB')
    t_start = proc.time()
    fkm_lab = fastkmedoids::fastpam(rdist = dv, n = nrow(x), k = k, initializer = "LAB", seed = k)
    fkm_lab_secs = as.numeric((proc.time() - t_start)["elapsed"])
    fkm_lab_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = fkm_lab@medoids + 1)$cost       # output indices correspond to Cpp indexing, thus add 1
    
    # ClusterR
    t_start = proc.time()
    clst = ClusterR::Cluster_Medoids(data = x_mt, clusters = k, verbose = FALSE, swap_phase = TRUE, seed = k)
    clst_secs = as.numeric((proc.time() - t_start)["elapsed"])
    clst_cost = ClusterR::cost_clusters_from_dissim_medoids(data = x_mt, medoids = clst$medoids)$cost
    
    clust_algos = c('cluster_pam',
                    'cluster_pam_fast',
                    'kmed_skm',
                    'fastkmedoids_pam',
                    'fastkmedoids_fastpam',
                    'ClusterR_Medoids')
    
    dissim_costs = c(round(pm_cost, digits = round_digits),
                     round(pm_fast_cost, digits = round_digits),
                     round(km_cost, digits = round_digits),
                     round(fkm_cost, digits = round_digits),
                     round(fkm_lab_cost, digits = round_digits),
                     round(clst_cost, digits = round_digits))
    
    time_bench = c(round(pm_secs, digits = round_digits),
                   round(pm_fast_secs, digits = round_digits),
                   round(km_secs, digits = round_digits),
                   round(fkm_secs, digits = round_digits),
                   round(fkm_lab_secs, digits = round_digits),
                   round(clst_secs, digits = round_digits))
    
    dtbl = list(pam_R_function = clust_algos, 
                dissim_cost = dissim_costs,
                timing = time_bench,
                k = rep(k, length(clust_algos)))
    
    if (compute_rand_index) {
      # rand-index (or accuracy)
      clust_acc = list(k = k,
                       cluster_pam = ClusterR::external_validation(true_labels = y, clusters = pm$clustering, method = 'rand_index'),
                       cluster_pam_fast = ClusterR::external_validation(true_labels = y, clusters = pm_fast$clustering, method = 'rand_index'),
                       kmed_skm = ClusterR::external_validation(true_labels = y, clusters = km$cluster, method = 'rand_index'),
                       fastkmedoids_pam = ClusterR::external_validation(true_labels = y, clusters = fkm@assignment, method = 'rand_index'),
                       fastkmedoids_fastpam = ClusterR::external_validation(true_labels = y, clusters = fkm_lab@assignment, method = 'rand_index'),
                       ClusterR_Medoids = ClusterR::external_validation(true_labels = y, clusters = clst$clusters, method = 'rand_index'))
      
      data.table::setDT(clust_acc)
      cluster_out[[k]] = clust_acc
    }
    
    data.table::setDT(dtbl)
    results[[k]] = dtbl
  }
  
  results = data.table::rbindlist(results)
  
  if (compute_rand_index) {
    return(list(results = results, cluster_out = cluster_out))
  }
  else {
   return(results) 
  }
}


In a for-loop, we’ll iterate over the datasets, run the ‘Partition Around Medoids’ implementations and save the output results to a list object,


dataset_names = c('rnorm_data', 'dietary_survey_IBS', 'soybean', 'agriculture', 'geospatial')
lst_all = list()
geo_flag = FALSE
y = NULL

for (dat_name in dataset_names) {
  
  iter_dat = datasets_clust(data_name = dat_name)
  cat(glue::glue("Dataset: '{dat_name}'  Number of rows: {nrow(iter_dat$x)}"), '\n')
  
  if (dat_name == 'geospatial') {
    mp_view = iter_dat$leaflet_map
    y = iter_dat$code
    geo_flag = TRUE
  }
  else {
    geo_flag = FALSE
  }

  iter_out = suppressWarnings(compare_medoid_implementations(x = iter_dat$x,
                                                             y = y,
                                                             max_k = 10, 
                                                             geodesic_dist = geo_flag,
                                                             round_digits = 5,
                                                             compute_rand_index = geo_flag))
  lst_all[[dat_name]] = iter_out
}

## Dataset: 'rnorm_data'  Number of rows: 100 
## Dataset: 'dietary_survey_IBS'  Number of rows: 400 
## Dataset: 'soybean'  Number of rows: 307 
## Dataset: 'agriculture'  Number of rows: 12 
## Dataset: 'geospatial'  Number of rows: 1001


Then we’ll use a ggplot2 visualization function to visualize the dissimilarity cost and the elapsed time for each k from 2 to 10,


multi_plot_data = function(data_index, y_column, digits = 2, size_geom_bar_text = 7) {
  
  data_index$k = factor(data_index$k)               # convert the 'k' column to factor
  levels(data_index$k) = glue::glue("k = {levels(data_index$k)}")
  data_index[[y_column]] = round(data_index[[y_column]], digits = digits)

  plt = ggplot2::ggplot(data_index, ggplot2::aes(x = factor(pam_R_function), y = .data[[y_column]], fill = factor(pam_R_function))) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::facet_wrap(~k, scales = "free_y") +
    ggplot2::geom_text(ggplot2::aes(label = .data[[y_column]]), position = ggplot2::position_dodge(width = 0.9), col = 'black', fontface = 'bold', size = size_geom_bar_text) +
    ggplot2::theme(panel.background = ggplot2::element_rect(fill = "white", colour = "#6D9EC1", size = 3, linetype = "solid"),
          strip.text.x = ggplot2::element_text(size = 20, colour = 'black', face = 'bold'),
          strip.background = ggplot2::element_rect(fill = 'orange', colour = 'black'),
          plot.title = ggplot2::element_text(size = 19, hjust = 0.5),
          axis.text = ggplot2::element_text(size = 19, face = 'bold'),
          axis.text.x = ggplot2::element_text(size = 25, face = 'bold', angle = 35, hjust = 1),
          axis.title = ggplot2::element_text(size = 19, face = 'bold'),
          axis.title.x = ggplot2::element_blank(),
          panel.grid = ggplot2::element_blank(),
          legend.position = "none")
  
  return(plt)
}


Visualization of the dissimilarity cost for each one of the datasets


rnorm_data dataset

y_axis = 'dissim_cost'

data_name = 'rnorm_data'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)


dietary_survey_IBS dataset

data_name = 'dietary_survey_IBS'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)


soybean dataset

data_name = 'soybean'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)


agriculture dataset

data_name = 'agriculture'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)


geospatial dataset

data_name = 'geospatial'
multi_plot_data(data_index = lst_all[[data_name]]$results, y_column = y_axis, size_geom_bar_text = 5)


For the ‘geospatial’ data I also computed the Rand Index (or accuracy). For k = 4 (which is the actual number of clusters, as the following leaflet map shows) we have a perfect clustering (a rand-index of 1.0). I also included a single outlier (a pair of coordinates) between the ‘mediterranean’, ‘black’ and ‘red’ seas, which is assigned to the ‘black’ sea based on distance (this happens in all implementations),

dtbl_rand_index = data.table::rbindlist(lst_all[[data_name]]$cluster_out)
knitr::kable(round(dtbl_rand_index, digits = 3))
|  k | cluster_pam | cluster_pam_fast | kmed_skm | fastkmedoids_pam | fastkmedoids_fastpam | ClusterR_Medoids |
|---:|------------:|-----------------:|---------:|-----------------:|---------------------:|-----------------:|
|  2 |       0.624 |            0.624 |    0.624 |            0.624 |                 0.624 |            0.624 |
|  3 |       0.874 |            0.874 |    0.875 |            0.846 |                 0.846 |            0.875 |
|  4 |       1.000 |            1.000 |    1.000 |            1.000 |                 1.000 |            1.000 |
|  5 |       0.969 |            0.969 |    0.969 |            0.969 |                 0.969 |            0.969 |
|  6 |       0.938 |            0.938 |    0.937 |            0.938 |                 0.938 |            0.938 |
|  7 |       0.927 |            0.906 |    0.907 |            0.927 |                 0.927 |            0.927 |
|  8 |       0.917 |            0.896 |    0.896 |            0.917 |                 0.917 |            0.917 |
|  9 |       0.886 |            0.886 |    0.865 |            0.885 |                 0.885 |            0.885 |
| 10 |       0.875 |            0.854 |    0.855 |            0.875 |                 0.875 |            0.855 |
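
As a side note, the rand-index used in this table is invariant to the cluster label ids; a minimal sketch with hypothetical labels,


true_labels = c(1, 1, 2, 2)
clusters = c(2, 2, 1, 1)          # the same partition, only with swapped label ids
ClusterR::external_validation(true_labels = true_labels, clusters = clusters, method = 'rand_index')   # returns 1.0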


The Leaflet map shows the 4 clusters of the geospatial dataset that give a rand-index of 1.0,

mp_view


Visualization of the elapsed time for each one of the datasets


rnorm_data dataset

y_axis = 'timing'

data_name = 'rnorm_data'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)


dietary_survey_IBS dataset

data_name = 'dietary_survey_IBS'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)


soybean dataset

data_name = 'soybean'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)


agriculture dataset

data_name = 'agriculture'
multi_plot_data(data_index = lst_all[[data_name]], y_column = y_axis)


geospatial dataset

data_name = 'geospatial'
multi_plot_data(data_index = lst_all[[data_name]]$results, y_column = y_axis)


Conclusions


Notes

All ‘Partition Around Medoids’ functions take a dissimilarity matrix as input rather than the initial data; therefore the elapsed time does not include the computation of the dissimilarity (or distance) matrix.
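
If the computation of the distance matrix should be part of a benchmark, it can be timed separately. A minimal sketch for the Euclidean case (mirroring the pre-processing inside the comparison function above),


iter_dat = datasets_clust(data_name = 'soybean')

t_start = proc.time()
x_scaled = ClusterR::center_scale(data = iter_dat$x, mean_center = TRUE, sd_scale = TRUE)
x_mt = ClusterR::distance_matrix(x_scaled, method = "euclidean", upper = TRUE, diagonal = TRUE)
dist_secs = as.numeric((proc.time() - t_start)["elapsed"])
dist_secs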

