Learning Data Science: Understanding and Using k-means Clustering
Posted on July 23, 2019 by Learning Machines
A few months ago I published a quite popular post on Clustering the Bible… One well-known clustering algorithm is k-means. If you want to learn how k-means works and how to apply it in a real-world example, read on…
k-means (not to be confused with k-nearest neighbours or KNN: Teach R to read handwritten Digits with just 4 Lines of Code) is a simple, yet often very effective unsupervised learning algorithm for finding similarities in large amounts of data and clustering it accordingly. The k stands for the number of clusters, which has to be set beforehand.
The guiding principles are:
- The distance between data points within clusters should be as small as possible.
- The distance between the centroids (= centres of the clusters) should be as big as possible.
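More formally (just a sketch in standard notation, not taken from the original post): with clusters C_1, …, C_k and centroids μ_1, …, μ_k, k-means tries to minimize the total within-cluster sum of squares

Σ_{i=1}^{k} Σ_{x in C_i} ||x − μ_i||²,

i.e. the summed squared Euclidean distances of all points to the centroid of their cluster.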
Because there are far too many possible assignments of data points to clusters to try them all, k-means follows an iterative approach:
- Initialization: assign clusters randomly to all data points
- E-step (for expectation): assign each observation to the “nearest” (based on Euclidean distance) cluster
- M-step (for maximization): determine new centroids based on the mean of assigned objects
- Repeat the E-step and M-step until no further changes occur
As can be seen above, k-means is an example of a so-called expectation-maximization (EM) algorithm.
To implement k-means in R, we first assign some variables and define a helper function for plotting the steps:
n <- 3 # no. of centroids
set.seed(1415) # set seed for reproducibility
M1 <- matrix(round(runif(100, 1, 5), 1), ncol = 2)   # three well-separated
M2 <- matrix(round(runif(100, 7, 12), 1), ncol = 2)  # clouds of 50 points each
M3 <- matrix(round(runif(100, 20, 25), 1), ncol = 2)
M <- rbind(M1, M2, M3) # combine into one data matrix
C <- M[1:n, ] # define centroids as first n objects
obs <- length(M) / 2 # number of objects (M has two columns)
A <- sample(1:n, obs, replace = TRUE) # assign objects to centroids at random
colors <- seq(10, 200, 25) # colour codes for the clusters
clusterplot <- function(M, C, txt) {
  plot(M, main = txt, xlab = "", ylab = "")
  for(i in 1:n) {
    points(C[i, , drop = FALSE], pch = 23, lwd = 3, col = colors[i])
    points(M[A == i, , drop = FALSE], col = colors[i])
  }
}
clusterplot(M, C, "Initialization")

Here comes the k-means algorithm as described above:
repeat {
  # calculate Euclidean distance between objects and centroids
  D <- matrix(data = NA, nrow = n, ncol = obs)
  for(i in 1:n) {
    for(j in 1:obs) {
      D[i, j] <- sqrt((M[j, 1] - C[i, 1])^2 + (M[j, 2] - C[i, 2])^2)
    }
  }
  O <- A
  ## E-step: parameters are fixed, distributions are optimized
  A <- max.col(t(-D)) # assign objects to centroids
  if(all(O == A)) break # if no change stop
  clusterplot(M, C, "E-step")
  ## M-step: distributions are fixed, parameters are optimized
  # determine new centroids based on mean of assigned objects
  for(i in 1:n) {
    C[i, ] <- apply(M[A == i, , drop = FALSE], 2, mean)
  }
  clusterplot(M, C, "M-step")
}

As can be seen, the clusters wander slowly but surely until all three are stable. We now compare the result with the kmeans function that comes with base R:
cl <- kmeans(M, n)
clusterplot(M, cl$centers, "Base R")

(custom <- C[order(C[ , 1]), ])
##        [,1]   [,2]
## [1,]  3.008  2.740
## [2,]  9.518  9.326
## [3,] 22.754 22.396

(base <- cl$centers[order(cl$centers[ , 1]), ])
##     [,1]   [,2]
## 2  3.008  2.740
## 1  9.518  9.326
## 3 22.754 22.396

round(base - custom, 13)
##   [,1] [,2]
## 2    0    0
## 1    0    0
## 3    0    0
As you can see, the result is the same!
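As an additional sanity check (not part of the original comparison, just a sketch using the objects defined above), we can also compare the total within-cluster sum of squares of our hand-rolled solution with the tot.withinss component returned by kmeans:

# within-cluster sum of squares of our own solution
wss <- sum(sapply(1:n, function(i) {
  Mi <- M[A == i, , drop = FALSE]
  sum(sweep(Mi, 2, C[i, ])^2) # squared distances to centroid i
}))
wss
cl$tot.withinss # should be (essentially) the same value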
Now for a real-world application: clustering wholesale customer data. The data set refers to clients of a wholesale distributor. It includes the annual spending on diverse product categories and comes from the renowned UCI Machine Learning Repository (I guess the category “Delicassen” should rather be “Delicatessen”).
Have a look at the following code:
data <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv", header = TRUE)
head(data)
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 4 1 3 13265 1196 4221 6404 507 1788
## 5 2 3 22615 5410 7198 3915 1777 5185
## 6 2 3 9413 8259 5126 666 1795 1451
set.seed(123)
k <- kmeans(data[ , -c(1, 2)], centers = 4) # remove columns 1 and 2, create 4 clusters
(centers <- k$centers) # display cluster centers
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 8149.837 18715.857 27756.592 2034.714 12523.020 2282.143
## 2 20598.389 3789.425 5027.274 3993.540 1120.142 1638.398
## 3 48777.375 6607.375 6197.792 9462.792 932.125 4435.333
## 4 5442.969 4120.071 5597.087 2258.157 1989.299 1053.272
round(prop.table(centers, 2) * 100) # columnwise percentages: how each category's spending is split across the clusters
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 10 56 62 11 76 24
## 2 25 11 11 22 7 17
## 3 59 20 14 53 6 47
## 4 7 12 13 13 12 11
table(k$cluster) # number of customers per cluster
##
## 1 2 3 4
## 49 113 24 254
One possible interpretation of the four clusters could be the following:
- Big general shops
- Small food shops
- Big food shops
- Small general shops
As you can see, the interpretation of some clusters found by the algorithm can be quite a challenge. If you have a better idea of how to interpret the result please tell me in the comments below!
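One way to make the interpretation a little less speculative (this goes beyond the original analysis and is only meant as a sketch) is to cross-tabulate the clusters with the Channel and Region columns that we removed before clustering; according to the UCI documentation, Channel distinguishes Horeca (hotel/restaurant/café) customers from retail customers:

table(cluster = k$cluster, channel = data$Channel)
table(cluster = k$cluster, region = data$Region)

If, say, the “food shop” clusters turned out to be dominated by one channel, that would support (or contradict) the labels above.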