Site icon R-bloggers

Use the k-means clustering, Luke

[This article was first published on R on R-house, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In my last post I scraped some character statistics from the mobile game Star Wars: Galaxy of Heroes. In this post, I’ll be aiming to try out k-means clustering in order to see if it comes out with an intuitive result, and to learn how to integrate this kind of analysis into a tidy workflow using broom.

First I’ll load the required packages and set some plot preferences.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(broom)
theme_set(theme_bw())

Here is the data I had previously scraped. It contains 176 Star Wars characters, each with 16 performance attributes.

swgoh_stats <- read_csv("content/post/data/unsupervised-learning/swgoh_stats.csv")
## Parsed with column specification:
## cols(
##   `Character Name` = col_character(),
##   Power = col_double(),
##   Speed = col_double(),
##   Health = col_double(),
##   `Max Ability` = col_double(),
##   `Physical Dmg` = col_double(),
##   `Physical Crit` = col_double(),
##   `Special Dmg` = col_double(),
##   `Special Crit` = col_double(),
##   `Armor Pen` = col_double(),
##   `Resistance Pen` = col_double(),
##   Potency = col_double(),
##   Protection = col_double(),
##   Armor = col_double(),
##   Resistance = col_double(),
##   Tenacity = col_double(),
##   `Health Steal` = col_double()
## )
glimpse(swgoh_stats)
## Observations: 176
## Variables: 17
## $ `Character Name` <chr> "Aayla Secura", "Admiral Ackbar", "Ahsoka Tano"…
## $ Power            <dbl> 18972, 18972, 21378, 21378, 21378, 24358, 25143…
## $ Speed            <dbl> 145, 139, 125, 168, 110, 124, 167, 165, 131, 13…
## $ Health           <dbl> 32108, 35392, 27138, 30636, 47459, 35381, 34870…
## $ `Max Ability`    <dbl> 11375, 6020, 11223, 10728, 4485, 7633, 11982, 4…
## $ `Physical Dmg`   <dbl> 3602, 3286, 3657, 3617, 2056, 2970, 3804, 4207,…
## $ `Physical Crit`  <dbl> 1022, 333, 1110, 1079, 470, 734, 869, 812, 575,…
## $ `Special Dmg`    <dbl> 2356, 3718, 1469, 1770, 3366, 2772, 1559, 1363,…
## $ `Special Crit`   <dbl> 0, 60, 30, 40, 75, 115, 0, 60, 0, 90, 225, 565,…
## $ `Armor Pen`      <dbl> 184, 75, 329, 164, 5, 144, 307, 517, 81, 35, 5,…
## $ `Resistance Pen` <dbl> 15, 37, 0, 0, 60, 197, 0, 5, 5, 5, 60, 185, 0, …
## $ Potency          <dbl> 0.41, 0.36, 0.23, 0.38, 0.07, 0.43, 0.33, 0.72,…
## $ Protection       <dbl> 41388, 39690, 34900, 39500, 54640, 36710, 44131…
## $ Armor            <dbl> 309, 454, 295, 301, 488, 289, 278, 217, 508, 41…
## $ Resistance       <dbl> 116, 423, 110, 152, 471, 124, 104, 76, 431, 253…
## $ Tenacity         <dbl> 0.50, 0.60, 0.33, 0.54, 0.56, 0.42, 0.38, 0.29,…
## $ `Health Steal`   <dbl> 0.15, 0.15, 0.35, 0.15, 0.30, 0.55, 0.20, 0.05,…

Some things to note straight off the bat. This algorithm can only deal with numeric data as it calculates distances from centroids. We also have to provide it with a number of clusters (centers) which requires either some a priori knowledge of the data, or some experimentation (or both).

There is also a stochastic element to the algorithm as the first step assigns each point to a random cluster. The algorithm can be repeated several times (with the nstart parameter) and the best performing run used. This means that if reproducibility is needed, a random number seed must be set. I’ll be setting a new seed every time I generate some analysis with specific attributions I’d like to have reproduced.

First, let’s have a look at a single kmeans() model using an arbitrary 3 clusters:

set.seed(1)
test_km <- kmeans(swgoh_stats[,-1], centers = 3, nstart = 50)
test_km
## K-means clustering with 3 clusters of sizes 81, 51, 44
## 
## Cluster means:
##      Power    Speed   Health Max Ability Physical Dmg Physical Crit
## 1 21025.85 149.7654 34730.53    8365.827     3255.432      730.3086
## 2 21115.14 158.9608 32173.96    9040.569     3320.804      802.4118
## 3 21369.43 141.5455 42204.59    7059.500     3200.682      566.2045
##   Special Dmg Special Crit Armor Pen Resistance Pen   Potency Protection
## 1    2652.531     97.55556  149.7160       40.86420 0.4234568   41969.47
## 2    2673.176    121.09804  169.5098       54.33333 0.4323529   34069.16
## 3    2194.000     45.79545  105.5000       11.31818 0.4095455   51363.80
##      Armor Resistance  Tenacity Health Steal
## 1 366.1728   238.2222 0.4413580    0.1709877
## 2 311.6471   209.3333 0.4249020    0.1568627
## 3 517.8409   363.3864 0.4929545    0.1977273
## 
## Clustering vector:
##   [1] 1 1 2 1 3 2 1 2 1 1 3 1 1 1 1 1 2 1 3 3 1 1 1 1 1 1 2 3 1 3 2 3 1 3 1
##  [36] 2 1 2 1 3 2 3 1 1 2 2 3 1 1 2 1 1 2 3 1 1 2 1 1 2 2 3 1 1 1 3 2 3 2 1
##  [71] 3 3 1 3 2 2 1 2 2 2 1 2 3 1 2 3 3 2 2 3 1 1 1 1 2 1 2 1 1 2 1 3 2 3 3
## [106] 1 1 1 3 2 1 2 2 2 1 1 1 1 3 2 3 2 3 1 1 3 3 2 3 1 1 2 2 1 2 1 1 1 2 2
## [141] 1 2 2 3 2 1 3 3 3 2 1 3 1 3 1 3 1 1 3 2 3 1 1 1 1 3 1 1 1 3 2 1 2 2 3
## [176] 1
## 
## Within cluster sum of squares by cluster:
## [1] 2788849852 2446171455 2763654327
##  (between_SS / total_SS =  54.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Printing the model object gives us some useful information, including cluster sizes, cluster means (centroids), a vector of cluster attributions for each character, and some information about the performance of the clustering. For k-means, the objective is to maximise the between-cluster sum of squares (variance) and minimise the within-cluster sum of squares, i.e. have tight clusters that are well separated. The total sum of squares is the total of these two figures, and so ideally we’d like to have as much of the variance as possible coming from the between-cluster variance instead of the within-cluster variance. Hence, 54.9% is a measure of cluster performance. If every point was its own cluster, it would be 100%.

Looking at the summary, we get some slightly different information, which indicates that integrating this model into a tidy workflow may be tricky as it is essentially a list of different sized elements.

summary(test_km)
##              Length Class  Mode   
## cluster      176    -none- numeric
## centers       48    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

This is where the broom package comes in, which is installed with the tidyverse. This package is used for turning messy model objects like this into tidy tibbles and has three main functions:

The glance() function gives a single row of top-level metrics. This is useful for extracting performance measures.

glance(test_km)
## # A tibble: 1 x 4
##          totss tot.withinss   betweenss  iter
##          <dbl>        <dbl>       <dbl> <int>
## 1 17741410664.  7998675634. 9742735030.     3

The tidy() function gives a row for each cluster. I can’t imagine I’d use this one too often for k-means.

tidy(test_km)
## # A tibble: 3 x 19
##       x1    x2     x3    x4    x5    x6    x7    x8    x9   x10   x11
##    <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21026.  150. 34731. 8366. 3255.  730. 2653.  97.6  150.  40.9 0.423
## 2 21115.  159. 32174. 9041. 3321.  802. 2673. 121.   170.  54.3 0.432
## 3 21369.  142. 42205. 7060. 3201.  566. 2194   45.8  106.  11.3 0.410
## # … with 8 more variables: x12 <dbl>, x13 <dbl>, x14 <dbl>, x15 <dbl>,
## #   x16 <dbl>, size <int>, withinss <dbl>, cluster <fct>

The augment() function gives a row for each observation. You not only have to provide the model object, but also the original data (exactly as it was originally used) and it augments it with cluster attribution. This is where the main results lie.

augment(test_km, swgoh_stats[,-1])
## # A tibble: 176 x 17
##    Power Speed Health Max.Ability Physical.Dmg Physical.Crit Special.Dmg
##    <dbl> <dbl>  <dbl>       <dbl>        <dbl>         <dbl>       <dbl>
##  1 18972   145  32108       11375         3602          1022        2356
##  2 18972   139  35392        6020         3286           333        3718
##  3 21378   125  27138       11223         3657          1110        1469
##  4 21378   168  30636       10728         3617          1079        1770
##  5 21378   110  47459        4485         2056           470        3366
##  6 24358   124  35381        7633         2970           734        2772
##  7 25143   167  34870       11982         3804           869        1559
##  8 21378   165  20693        4205         4207           812        1363
##  9 21378   131  36543        4020         3126           575        2226
## 10 21378   136  37121        4215         3332           904        2487
## # … with 166 more rows, and 10 more variables: Special.Crit <dbl>,
## #   Armor.Pen <dbl>, Resistance.Pen <dbl>, Potency <dbl>,
## #   Protection <dbl>, Armor <dbl>, Resistance <dbl>, Tenacity <dbl>,
## #   Health.Steal <dbl>, .cluster <fct>

The question then is how to implement these into a tidy workflow, especially when you may want to experiment with many different models. After reading about it online and some experimentation, I think I have it figured out. The trick seems to involve 3 steps:

  1. Create a tibble with a list column of models;
  2. Use one of the broom functions above to create another list column, dependent on the question being asked;
  3. Unnest the broom-derived list column to surface the appropriate model parameters in the tibble.

As an example, the first thing to do here is to find an appropriate number of clusters to use. Therefore I create a tibble with a row for a number of clusters from 1 to 20, and map each of these numbers to the kmeans algorithm to create a column of models. Since I’m interested in how model performance changes with the number of clusters, I map each model to the glance() function and unnest it. I can then generate a scree plot to try to find an appropriate elbow in the curve. Instead of using the total within-cluster sum of squares, I normalise it as it I can interpret it more usefully.

tibble(clusters = 1:20) %>% 
  mutate(mod = map(clusters, ~ kmeans(swgoh_stats[,-1], centers = .x, nstart = 50))) %>%
  mutate(glanced = map(mod, glance)) %>%
  unnest(glanced) %>%
  ggplot(aes(x=clusters, y = tot.withinss/totss)) +
  geom_line()

The scree plot says that when there is once cluster performance is at its worst possible, however it’s difficult to choose a number of clusters as there’s no obvious knee (apart from the one at clusters = 2, which has performance of around 43%). I’ll initially go with 10 as that seems to give around 80% performance.

Now I’ve chosen a number of clusters, I’ll look at how characters are clustered for each attribute. This time I use the augment() function to get cluster attributions. Due to the size of the data, I also only use a random 50% of the data from each cluster:

set.seed(123)
kmeans(swgoh_stats[,-1], centers = 10, nstart = 50) %>%
  augment(swgoh_stats[,-1]) %>%
  bind_cols(swgoh_stats[,1]) %>%
  mutate(`Character Name` = fct_reorder(`Character Name`, paste0(.cluster))) %>%
  group_by(.cluster) %>% 
  sample_frac(0.5) %>% 
  ungroup() %>% 
  gather(attribute, value, -`Character Name`, -.cluster) %>%
  ggplot(aes(x = `Character Name`, y = value, fill = .cluster)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~attribute, drop = TRUE, scales = "free") +
  theme(axis.text.y = element_blank())

In this plot, I’m surprised to see that it’s difficult to distinguish between clusters, except for health and protection. Since these have by far the biggest values and I don’t want to cluster by these alone, it’s clear I have to scale the values. One thing I learned is that the scale() function is used for scaling and centering, and it does both by default. It also returns a matrix, so I have to turn it back into a tibble.

Let’s redo the scree plot.

swgoh_scaled <- scale(swgoh_stats[,-1]) %>% as_tibble()

tibble(clusters = 1:20) %>% 
  mutate(mod = map(clusters, ~ kmeans(swgoh_scaled, centers = .x, nstart = 50))) %>%
  mutate(glanced = map(mod, glance)) %>%
  unnest(glanced) %>%
  ggplot(aes(x=clusters, y = tot.withinss/totss)) +
  geom_line() 

The scree plot doesn’t look too different, so I’ll stick with 10 clusters for now.

set.seed(124)
scaled_clustering <- kmeans(swgoh_scaled, centers = 10, nstart = 50) %>%
  augment(swgoh_scaled) %>%
  bind_cols(swgoh_stats[,1]) 

Let’s see if the clustering has improved across attributes:

scaled_clustering %>%
  mutate(`Character Name` = fct_reorder(`Character Name`, paste0(.cluster))) %>%
  group_by(.cluster) %>% 
  sample_frac(0.5) %>% 
  ungroup() %>% 
  gather(attribute, value, -`Character Name`, -.cluster) %>%
  ggplot(aes(x = `Character Name`, y = value, fill = .cluster)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~attribute, drop=TRUE, scales = "free") +
  theme(axis.text.y = element_blank())

This seems much better. Next, I’d like to see which characters are being clustered together to see if I can spot any patterns. As there’s no wordcloud geom for ggplot2 I just create a bit of a simple one, using ggrepel to avoid overlapping character names as much as possible.

scaled_clustering %>% 
  select(`Character Name`, .cluster) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x=x,y=y, label=`Character Name`)) +
  ggrepel::geom_text_repel(size = 3) +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)

From what I know about the game, some interesting things pop out to me straightaway:

Spotting these patterns is quite encouraging! I wasn’t originally intending to do this, but I am fascinated to find out how these clusters map to character roles. There’s no way I can find of downloading this data, so I manually created it from data on the website:

swgoh_roles <- read_csv("content/post/data/unsupervised-learning/swgoh_roles.csv") %>%
  mutate(role = case_when(!is.na(attacker) ~ "Attacker",
                          !is.na(support) ~ "Support",
                          !is.na(tank) ~ "Tank",
                          !is.na(healer) ~ "Healer")) %>% 
  select(`Character Name`, role)
## Parsed with column specification:
## cols(
##   `Character Name` = col_character(),
##   attacker = col_double(),
##   healer = col_double(),
##   support = col_double(),
##   tank = col_double()
## )
glimpse(swgoh_roles)
## Observations: 176
## Variables: 2
## $ `Character Name` <chr> "Aayla Secura", "Admiral Ackbar", "Ahsoka Tano"…
## $ role             <chr> "Support", "Support", "Attacker", "Attacker", "…

Let’s recreate the ten ‘wordclouds’ to see how the characters are being clustered by role:

scaled_clustering %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x=x,y=y, label=`Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)
## Joining, by = "Character Name"

It does seem that my observations were correct, and tanks are indeed focussed in cluster 7. Cluster 8 also seems to exclusively contain attackers, with the role also being prevalent in clusters 4 and 9. Clusters 2 and 10 seem to be focussed on support roles.

Given there are 4 roles, I’m curious to see how well k-means is able to distinguish between them if we assume 4 clusters:

set.seed(125)
kmeans(swgoh_scaled, centers = 4, nstart = 50) %>%
  augment(swgoh_scaled) %>%
  bind_cols(swgoh_stats[,1]) %>%
  left_join(swgoh_roles) %>%
  select(`Character Name`, .cluster, role) %>%
  mutate(x = runif(n(), 0, 10),
         y = runif(n(), 0, 10)) %>%
  ggplot(aes(x=x,y=y, label=`Character Name`)) +
  ggrepel::geom_text_repel(aes(col = role), size = 3) +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  facet_wrap(~.cluster)
## Joining, by = "Character Name"

Interesting! Three of the clusters seem to be capuring support, attacker, and tank characters pretty well, however the fourth cluster seems to be a bit of a mess…this is probably because healing abilities are given to a range of characters.

This seems to be a pretty good algorithm, I’m interested to see how others compare. Next up, hierarchical clustering!

To leave a comment for the author, please follow the link and comment on their blog: R on R-house.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.