How to Report the Distribution of Attributes per Cluster

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers].

Suppose you have run a clustering algorithm and want to report the distribution of each categorical variable per cluster in a “tidy” table. Below is one way to do it in R.

Generate the Data

Assume the algorithm produced three clusters, “C1”, “C2” and “C3”, and that each observation has three categorical attributes:

  • Gender: “M”, “F”
  • Type: “A”, “B”, “C”, “D”
  • Category: “High”, “Medium”, “Low”

library(tidyverse)

set.seed(5)

df1<-tibble(ID=seq_len(500))%>%
     mutate(Cluster = "C1",
            Gender=sample(c("M", "F"), n(), replace=TRUE, prob=c(0.6, 0.4)),
            Type=sample(c("A", "B", "C", "D"), n(), replace=TRUE, prob=c(0.20, 0.3, 0.4, 0.1)),
            Category=sample(c("High", "Medium", "Low"), n(), replace=TRUE, prob=c(0.1, 0.6, 0.3)))

df2<-tibble(ID=seq_len(300))%>%
  mutate(Cluster = "C2",
         Gender=sample(c("M", "F"), n(), replace=TRUE, prob=c(0.4, 0.6)),
         Type=sample(c("A", "B", "C", "D"), n(), replace=TRUE, prob=c(0.40, 0.1, 0.2, 0.3)),
         Category=sample(c("High", "Medium", "Low"), n(), replace=TRUE, prob=c(0.7, 0.2, 0.1)))

df3<-tibble(ID=seq_len(200))%>%
  mutate(Cluster = "C3",
         Gender=sample(c("M", "F"), n(), replace=TRUE, prob=c(0.2, 0.8)),
         Type=sample(c("A", "B", "C", "D"), n(), replace=TRUE, prob=c(0.5, 0.3, 0.1, 0.1)),
         Category=sample(c("High", "Medium", "Low"), n(), replace=TRUE, prob=c(0.1, 0.2, 0.7)))

df <- bind_rows(df1, df2, df3)

df
 

# A tibble: 1,000 x 5
      ID Cluster Gender Type  Category
   <int> <chr>   <chr>  <chr> <chr>   
 1     1 C1      M      C     Medium  
 2     2 C1      F      C     Medium  
 3     3 C1      F      C     Medium  
 4     4 C1      M      B     Low     
 5     5 C1      M      B     Low     
 6     6 C1      F      C     Medium  
 7     7 C1      M      C     Medium  
 8     8 C1      F      B     High    
 9     9 C1      F      C     Medium  
10    10 C1      M      A     Medium  
# ... with 990 more rows
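
The `prob` argument in the `sample()` calls above sets the expected share of each level, so the observed shares in a cluster will be close to those values but not identical. A minimal sketch of that behaviour, using the Gender weights of cluster C1 (the variable name `gender` is only illustrative):

```r
set.seed(5)

# Draw 500 labels where "M" is expected about 60% of the time.
gender <- sample(c("M", "F"), 500, replace = TRUE, prob = c(0.6, 0.4))

# Observed shares of each label; these always sum to 1.
shares <- prop.table(table(gender))
shares
```

The report built in the next section computes the same kind of shares, only per cluster and for every attribute at once.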

Report the Distribution of Attributes



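# Attribute columns: everything except ID and Cluster
attributes <- setdiff(names(df), c("ID", "Cluster"))

output <- NULL

for (a in attributes) {

  # group_by_() and spread() are deprecated/superseded, so this uses
  # across(all_of()) and pivot_wider() instead
  tmp <- df %>%
    group_by(across(all_of(c(a, "Cluster")))) %>%
    summarise(n = n(), .groups = "drop") %>%
    group_by(Cluster) %>%
    mutate(Prop = n / sum(n)) %>%   # share of each value within its cluster
    ungroup() %>%
    select(-n) %>%
    pivot_wider(names_from = Cluster, values_from = Prop) %>%
    mutate(Attribute = a) %>%
    select(Attribute, everything())
  colnames(tmp)[1:2] <- c("attribute", "values")

  output <- rbind(output, tmp)

}

output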
 

# A tibble: 9 x 5
  attribute values    C1    C2    C3
  <chr>     <chr>  <dbl> <dbl> <dbl>
1 Gender    F      0.398 0.593 0.78 
2 Gender    M      0.602 0.407 0.22 
3 Type      A      0.188 0.413 0.425
4 Type      B      0.318 0.1   0.365
5 Type      C      0.39  0.193 0.105
6 Type      D      0.104 0.293 0.105
7 Category  High   0.114 0.683 0.065
8 Category  Low    0.312 0.103 0.75 
9 Category  Medium 0.574 0.213 0.185
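
The same report can probably be produced without a loop by first stacking the attribute columns into long format. The sketch below is an alternative, not the original method; it runs on a small hypothetical data frame `toy` that mimics the layout of `df`, and it assumes all attribute columns share one type (character here), which `pivot_longer()` needs in order to stack them.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `df`: same layout, only six rows.
toy <- tibble(
  ID       = 1:6,
  Cluster  = c("C1", "C1", "C1", "C1", "C2", "C2"),
  Gender   = c("M", "M", "M", "F", "F", "F"),
  Category = c("High", "Low", "Low", "Low", "High", "High")
)

report <- toy %>%
  # Stack every attribute column into (attribute, values) pairs.
  pivot_longer(-c(ID, Cluster), names_to = "attribute", values_to = "values") %>%
  # Count rows for each attribute / value / cluster combination.
  count(attribute, values, Cluster) %>%
  # Convert counts to within-cluster proportions, separately per attribute.
  group_by(attribute, Cluster) %>%
  mutate(Prop = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # One column per cluster; combinations never observed are filled with 0.
  pivot_wider(names_from = Cluster, values_from = Prop, values_fill = 0)

report
```

Two differences from the loop are worth noting: `count()` sorts the attributes alphabetically rather than in column order, and `values_fill = 0` shows an unobserved combination as 0 where the loop would leave `NA`.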
