
Measuring feature importance in k-means clustering and variants thereof

[This article was first published on R on Datentrang, and kindly contributed to R-bloggers].

We present a novel approach for measuring feature importance in k-means clustering, or variants thereof, to increase the interpretability of clustering results. In supervised machine learning, feature importance is a widely used tool to ensure interpretability of complex models. We adapt this idea to unsupervised learning via partitional clustering. Our approach is model-agnostic in that it only requires a function that computes the cluster assignment for new data points.

In the simulation study below, we show that the algorithm finds the variables that drive the cluster assignment and scores them according to their relevance. As a further application, this provides a new approach to hyperparameter tuning for data sets of mixed type when the metric is a linear combination of a numerical and a categorical distance measure, as in Gower’s distance, for example.

In combination with stability analyses, feature importance provides a means for feature selection, i.e. the identification of a lower-dimensional subspace that offers a reasonable separation. Our package works with some popular clustering packages such as flexclust, clustMixType, base R’s kmeans function and the newly developed ClustImpute package.

Installation

You can install the package as follows:

devtools::install_github("o1iv3r/FeatureImpCluster")

Usage with flexclust

We’ll create some random data to illustrate the usage of FeatureImpCluster. The data contain 4 clusters in a 2-dimensional subspace of a 6-dimensional space.

library(FeatureImpCluster)
#> Loading required package: data.table

set.seed(7)
dat <- create_random_data(n=4000,nr_other_vars = 4)
summary(dat$data)
#>        V1                  V2                  V3           
#>  Min.   :-3.531648   Min.   :-3.578032   Min.   :-3.924400  
#>  1st Qu.:-0.670694   1st Qu.:-0.676281   1st Qu.:-0.662992  
#>  Median :-0.001917   Median :-0.001944   Median :-0.002742  
#>  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.000000  
#>  3rd Qu.: 0.654912   3rd Qu.: 0.658228   3rd Qu.: 0.678405  
#>  Max.   : 3.501554   Max.   : 3.717284   Max.   : 3.065434  
#>        V4                 x                   y           
#>  Min.   :-3.91009   Min.   :-2.255326   Min.   :-2.04657  
#>  1st Qu.:-0.67427   1st Qu.:-0.934193   1st Qu.:-0.92633  
#>  Median : 0.01396   Median :-0.004383   Median : 0.04418  
#>  Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000  
#>  3rd Qu.: 0.67657   3rd Qu.: 0.927500   3rd Qu.: 0.92785  
#>  Max.   : 3.58167   Max.   : 2.095888   Max.   : 2.18437
library(ggplot2)
true_clusters <- factor(dat$true_clusters)
ggplot(dat$data,aes(x=x,y=y,color=true_clusters)) + geom_point()

If our clustering works well, x and y should determine the partition while the other variables V1, ..., V4 should be irrelevant. Feature importance is a novel way to check whether this is the case. We’ll use the flexclust package for the clustering in this example. The main function of our package, FeatureImpCluster, computes the permutation misclassification rate for each variable of the data. The permutation misclassification rate of a feature (column) is the number of wrong cluster assignments divided by the number of observations (rows), given a permutation of the feature; an assignment counts as wrong if it differs from the assignment obtained on the unpermuted data. The mean misclassification rate over all iterations is interpreted as variable importance.
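To make the definition concrete, here is a minimal base-R sketch of the permutation misclassification rate. It is not the package’s implementation: the helper names nearest_center and perm_misclass, the toy data, and the use of stats::kmeans with a nearest-centroid prediction are all assumptions made for illustration.

```r
# Hypothetical helper: assign each row of x to its nearest centroid
# (squared Euclidean distance).
nearest_center <- function(x, centers) {
  d <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(x, 2, centers[j, ])^2)
  })
  max.col(-d, ties.method = "first")
}

# Hypothetical helper: mean share of rows whose assignment changes when
# one feature is randomly permuted, averaged over n_iter permutations.
perm_misclass <- function(x, centers, feature, n_iter = 10) {
  base <- nearest_center(x, centers)          # assignment on unpermuted data
  rates <- replicate(n_iter, {
    xp <- x
    xp[, feature] <- sample(xp[, feature])    # permute a single column
    mean(nearest_center(xp, centers) != base) # misclassification rate
  })
  mean(rates)
}

# Toy data (an assumption): "x" separates two clusters, "noise" does not.
set.seed(1)
n <- 400
x <- cbind(x = c(rnorm(n / 2, -2), rnorm(n / 2, 2)), noise = rnorm(n))
km <- kmeans(x, centers = 2, nstart = 5)

imp <- sapply(colnames(x), function(f) perm_misclass(x, km$centers, f))
print(round(imp, 2))
```

Permuting the separating column should change many assignments, while permuting the noise column should change few. The package performs this kind of computation for a fitted flexclust model: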

library(FeatureImpCluster)
library(flexclust)
#> Loading required package: grid
#> Loading required package: lattice
#> Loading required package: modeltools
#> Loading required package: stats4
set.seed(10)
res <- kcca(dat$data,k=4)
FeatureImp_res <- FeatureImpCluster(res,as.data.table(dat$data))
plot(FeatureImp_res)

Indeed, y and x are most relevant. However, V2 also has some impact on the cluster assignment. Looking at the cluster centers, we see that clusters 2 and 3 in particular have a different center in the V2 dimension than the other clusters.

barplot(res)

# bwplot(res,dat$data), image(res,which=5:6) # alternative diagnostic plots of flexclust

If we had many more than 6 variables (and possibly more clusters), the chart above might be hard to interpret. The feature importance plot instead provides an aggregate statistic per feature and is therefore always easy to interpret, particularly since it suffices to look at the top n (say, 10 or 30) features to get a first impression.

Feature selection

We know that the clustering is affected by the random initialization. It is therefore usually recommended to run the clustering algorithm several times with different seeds. As a by-product, feature importance provides us with a feature selection mechanism: instead of iterating over permutations, we can iterate over the different cluster runs (or both). This way there is a good chance that any spurious importance is identified as an outlier.

For our example we repeat the clustering + feature importance calculation 5 times:

set.seed(12)
nr_seeds <- 5
seeds_vec <- sample(1:1000,nr_seeds)

# one row per clustering run, one column per feature
savedImp <- data.frame(matrix(0,nr_seeds,dim(dat$data)[2]))
count <- 1
for (s in seeds_vec) {
  set.seed(s)
  res <- kcca(dat$data,k=4)
  # biter = 1: one permutation per run, since we iterate over runs instead
  FeatureImp_res <- FeatureImpCluster(res,as.data.table(dat$data),sub = 1,biter = 1)
  # sort by feature name so that the columns line up across runs
  savedImp[count,] <- FeatureImp_res$featureImp[sort(names(FeatureImp_res$featureImp))]
  count <- count + 1
}
names(savedImp) <- sort(names(FeatureImp_res$featureImp))

Now it becomes quite obvious that x and y are the only relevant features, and we could do our clustering based on these features alone. This is important in practice since cluster centroids with fewer features are easier to interpret, and we can save the time and money otherwise spent on collecting and pre-processing unnecessary features.

boxplot(savedImp)

Data sets of mixed type

Another application arises for data sets with numerical and categorical features. Since one cannot simply calculate a Euclidean distance for categorical variables, one often uses an L0-norm (0 for equality, 1 otherwise) for the latter and combines both metrics linearly with an appropriate weight (this choice is often referred to as Gower’s distance in the literature). In the clustMixType package, the parameter lambda defines the trade-off between the Euclidean distance of numeric variables and the simple matching coefficient between categorical variables. Feature importance can be used as an additional guide to tune this parameter.
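The role of lambda can be illustrated with a small base-R sketch. It assumes the k-prototypes form of the distance (squared Euclidean distance on the numeric part plus lambda times the number of categorical mismatches); the function mixed_dist, the helper nearest, and the toy point and prototypes are hypothetical and not part of clustMixType.

```r
# Hypothetical mixed-type distance: squared Euclidean part for numeric
# features plus lambda times the number of mismatching categories.
mixed_dist <- function(num_a, num_b, cat_a, cat_b, lambda) {
  sum((num_a - num_b)^2) + lambda * sum(cat_a != cat_b)
}

# Toy observation and two prototypes (assumed values for illustration).
point     <- list(num = c(0.5, 0.5), cat = c("yes", "no"))
proto_num <- list(num = c(0, 0),     cat = c("no", "yes")) # close numerically
proto_cat <- list(num = c(1.5, 1.5), cat = c("yes", "no")) # same categories

# Name of the prototype closest to the point for a given lambda.
nearest <- function(lambda) {
  d <- c(
    proto_num = mixed_dist(point$num, proto_num$num,
                           point$cat, proto_num$cat, lambda),
    proto_cat = mixed_dist(point$num, proto_cat$num,
                           point$cat, proto_cat$cat, lambda)
  )
  names(which.min(d))
}

nearest(0.1) # small lambda: the numeric part dominates
nearest(3)   # large lambda: the categorical part dominates
```

In this toy setting the point switches from the numerically close prototype to the categorically matching one as lambda grows, which is the kind of shift the feature importance plot makes visible per variable.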

First we add categorical variables to our data set:

ds <- as.data.table(dat$data)

n <- dim(ds)[1]
p <- dim(ds)[2]

set.seed(123)
ds[,cat1:=factor(rbinom(n,size=1,prob=0.3),labels = c("yes","no"))] # irrelevant factor
ds[,cat2:=factor(c(rep("yes",n/2),rep("no",n/2)))] # relevant factor

As the rank correlation shows, x and cat2 are strongly correlated:

cor(ds$x,as.numeric(ds$cat2),method="spearman")
#> [1] 0.8655712

Next, we’ll apply the clustering with an automatic estimation of lambda:

library(clustMixType)
res <- kproto(x=ds,k=4)
#> # NAs in variables:
#>   V1   V2   V3   V4    x    y cat1 cat2 
#>    0    0    0    0    0    0    0    0 
#> 0 observation(s) with NAs.
#> 
#> Estimated lambda: 2.17156
res$lambda
#> [1] 2.17156

With color="type" we can draw attention to the importance by data type. While cat2 correctly has some importance, that of cat1 is almost zero.

FeatureImp_res <- FeatureImpCluster(res,ds)
plot(FeatureImp_res,ds,color="type")

All in all, the numeric variables are more important for the partitioning. If, for some reason, we wanted partitions that emphasize differences between the categorical features, we’d have to increase lambda. Feature importance directly shows us the effect of this change: the two categorical features now have an equally high importance, only somewhat smaller than that of x. As above, repeated partitioning could be used to compute a more reasonable importance for the data set as a whole, and not only an importance for a specific partition.

res2 <- kproto(x=ds,k=4,lambda=3)
#> # NAs in variables:
#>   V1   V2   V3   V4    x    y cat1 cat2 
#>    0    0    0    0    0    0    0    0 
#> 0 observation(s) with NAs.
plot(FeatureImpCluster(res2,ds),ds,color="type")

Of course, further criteria should be used to determine an “optimal” lambda for the use case at hand, but feature importance certainly provides helpful guidance for data of mixed types.

Other methods: kmeans(), pam() and ClustImpute()

FeatureImpCluster can easily be used with other packages. For example, the results of stats::kmeans or cluster::pam can be used after converting them via flexclust:

cl_kcca <- flexclust::as.kcca(cl, dat$data) # cl is a kmeans or pam object
FeatureImpCluster(cl_kcca,as.data.table(dat$data))

ClustImpute, a package that efficiently imputes missing values while performing a k-means clustering, can be used directly:

library(ClustImpute)
res_clustimpute <- ClustImpute(as.data.frame(dat$data),4)

FeatureImpCluster(res_clustimpute,as.data.table(dat$data))

For other methods, a custom prediction function can be provided (see the documentation for details):

FeatureImpCluster(clusterObj, as.data.table(dat$data), predFUN = custom_prediction_function_for_clusterObj)
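As a rough sketch of what such a prediction function could look like, the following base-R example assigns new rows to the nearest centroid of a stats::kmeans fit. The name predict_kmeans and, in particular, its argument list (cluster object first, new data second) are assumptions; the signature that predFUN actually expects should be taken from the package documentation.

```r
# Hypothetical custom prediction function: nearest-centroid assignment
# for a stats::kmeans object. The argument order is an assumption.
predict_kmeans <- function(clusterObj, newData) {
  x <- as.matrix(newData)
  centers <- clusterObj$centers
  # squared Euclidean distance of every row to every centroid
  d <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(x, 2, centers[j, ])^2)
  })
  max.col(-d, ties.method = "first") # index of the closest centroid
}

# Toy data (an assumption): two well-separated clusters along "a".
set.seed(3)
x <- cbind(a = c(rnorm(50, -3), rnorm(50, 3)), b = rnorm(100))
km <- kmeans(x, centers = 2, nstart = 5)

pred <- predict_kmeans(km, x)
table(pred, km$cluster)

# Possible use with the package, following the call pattern above:
# FeatureImpCluster(km, as.data.table(x), predFUN = predict_kmeans)
```

On the training data, the predicted labels should agree with the partition stored in the kmeans object, which is a quick sanity check before passing the function on.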

Further options

There are further options not explained in the examples above:

  • For initialization, the prediction method has to be computed on the entire data set. This can be computationally expensive for large data sets. Alternatively, one can provide the current partitioning via basePred.
  • To further speed up the computation on large data sets, the permutation importance can be computed on random subsets of the data. This can be controlled via the sub parameter.
  • The number of iterations (default is 10) can be set via biter.

