[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Let’s say that you have run a clustering algorithm and would like to report the distribution of each categorical variable per cluster in a “tidy” report. Below is one suggestion of how to do it in R.
Generate the Data
Let’s assume that we came up with 3 clusters, “C1”, “C2” and “C3”, and that we have 3 categorical attributes:
- Gender: “M”, “F”
- Type: “A”, “B”, “C”, “D”
- Category: “High”, “Medium”, “Low”
library(tidyverse)

set.seed(5)

# Cluster C1: 500 observations
df1 <- tibble(ID = seq_len(500)) %>%
  mutate(Cluster = "C1",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.6, 0.4)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.20, 0.3, 0.4, 0.1)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.1, 0.6, 0.3)))

# Cluster C2: 300 observations
df2 <- tibble(ID = seq_len(300)) %>%
  mutate(Cluster = "C2",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.4, 0.6)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.40, 0.1, 0.2, 0.3)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.7, 0.2, 0.1)))

# Cluster C3: 200 observations
df3 <- tibble(ID = seq_len(200)) %>%
  mutate(Cluster = "C3",
         Gender = sample(c("M", "F"), n(), replace = TRUE, prob = c(0.2, 0.8)),
         Type = sample(c("A", "B", "C", "D"), n(), replace = TRUE, prob = c(0.5, 0.3, 0.1, 0.1)),
         Category = sample(c("High", "Medium", "Low"), n(), replace = TRUE, prob = c(0.1, 0.2, 0.7)))

# Stack the three clusters into one data frame
df <- rbind.data.frame(df1, df2, df3)
df

# A tibble: 1,000 x 5
      ID Cluster Gender Type  Category
   <int> <chr>   <chr>  <chr> <chr>
 1     1 C1      M      C     Medium
 2     2 C1      F      C     Medium
 3     3 C1      F      C     Medium
 4     4 C1      M      B     Low
 5     5 C1      M      B     Low
 6     6 C1      F      C     Medium
 7     7 C1      M      C     Medium
 8     8 C1      F      B     High
 9     9 C1      F      C     Medium
10    10 C1      M      A     Medium
# ... with 990 more rows
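The attributes above are drawn with sample() and a prob vector, so the observed shares in each cluster will be close to the specified probabilities but not identical to them. A minimal sketch of that behaviour, using the C1 gender weights and an arbitrary sample size of 10,000 chosen only for illustration:

```r
set.seed(5)

# Draw a large sample using the C1 gender weights (0.6 for "M", 0.4 for "F")
gender <- sample(c("M", "F"), 10000, replace = TRUE, prob = c(0.6, 0.4))

# Observed shares per level; table() orders the levels alphabetically (F, M)
shares <- prop.table(table(gender))
shares
```

This is why the report shows values such as 0.602 rather than exactly 0.6 for C1.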
Report the Distribution of Attributes
# Attribute columns: everything after ID and Cluster
attributes <- names(df[3:dim(df)[2]])

output <- NULL
for (a in attributes) {
  tmp <- df %>%
    # group_by_() is deprecated; group by the column named in `a` and by Cluster
    group_by(across(all_of(c(a, "Cluster")))) %>%
    summarise(n = n(), .groups = "drop") %>%
    # Proportion of each attribute value within its cluster
    group_by(Cluster) %>%
    mutate(Prop = n / (sum(n))) %>%
    ungroup() %>%
    select(-n) %>%
    # One column per cluster
    spread(Cluster, Prop) %>%
    mutate(Attribute = a) %>%
    select(Attribute, everything())
  colnames(tmp)[1:2] <- c("attribute", "values")
  output <- rbind(output, tmp)
}
output

# A tibble: 9 x 5
  attribute values    C1    C2    C3
  <chr>     <chr>  <dbl> <dbl> <dbl>
1 Gender    F      0.398 0.593 0.78
2 Gender    M      0.602 0.407 0.22
3 Type      A      0.188 0.413 0.425
4 Type      B      0.318 0.1   0.365
5 Type      C      0.39  0.193 0.105
6 Type      D      0.104 0.293 0.105
7 Category  High   0.114 0.683 0.065
8 Category  Low    0.312 0.103 0.75
9 Category  Medium 0.574 0.213 0.185
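The same kind of report can also be built without a loop by stacking the attribute columns first. The sketch below is an alternative to the approach in the post, not the author's method. It assumes tidyr 1.0 or later for pivot_longer() and pivot_wider(), and uses a small hand-made tibble named toy (hypothetical data) so the proportions are easy to verify by eye.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 4 rows in C1 and 2 rows in C2
toy <- tibble(
  Cluster = c("C1", "C1", "C1", "C1", "C2", "C2"),
  Gender  = c("M", "M", "F", "M", "F", "F"),
  Type    = c("A", "B", "A", "A", "B", "B")
)

report <- toy %>%
  # Stack all attribute columns into (attribute, values) pairs
  pivot_longer(-Cluster, names_to = "attribute", values_to = "values") %>%
  # Count each attribute value within each cluster
  count(Cluster, attribute, values) %>%
  # Turn counts into proportions within each cluster and attribute
  group_by(Cluster, attribute) %>%
  mutate(Prop = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  # One column per cluster; values absent from a cluster are filled with 0
  pivot_wider(names_from = Cluster, values_from = Prop, values_fill = 0)

report
```

One design difference to note: values_fill = 0 reports a value that never occurs in a cluster as 0, whereas spread() in the loop version would leave it as NA.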