R for more powerful clustering

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Vidisha Vachharajani
Freelance Statistical Consultant

R offers several useful clustering tools, but one that seems particularly powerful is the marriage of hierarchical clustering with a visual display of its results in a heatmap. The term “heatmap” is often confusing, because it is used in two senses: a "colorful visual representation of data in a matrix", or "a (thematic) map in which areas are represented in patterns ("heat" colors) that are proportionate to the measurement of some information being displayed on the map". For our clustering purposes, the former meaning is the appropriate one; the latter is better known as a choropleth.

The reason to link a heatmap with hierarchical clustering (HC) is the heatmap's ability to represent the information in an HC output lucidly, so that it is easily understood and more visually appealing. The heatmap.2() function (part of the gplots package, not base R) also applies HC to both the rows and the columns of a data matrix, so that it yields meaningful groups whose members share certain features (within the same group) and are differentiated from each other (across different groups).

Consider the following simple example, which uses the "States" data set from the car package (in current versions of car, the data are supplied by its companion package carData). States contains the following features:

  • region: U.S. Census regions. A factor with levels: ENC, East North Central; ESC, East South Central; MA, Mid-Atlantic; MTN, Mountain; NE, New England; PAC, Pacific; SA, South Atlantic; WNC, West North Central; WSC, West South Central.
  • pop: Population, in 1,000s.
  • SATV: Average score of graduating high-school students in the state on the verbal component of the Scholastic Aptitude Test (a standard university admission exam).
  • SATM: Average score of graduating high-school students in the state on the math component of the Scholastic Aptitude Test.
  • percent: Percentage of graduating high-school students in the state who took the SAT exam.
  • dollars: State spending on public education, in $1000s per student.
  • pay: Average teacher's salary in the state, in $1000s.

We want to use all but the first column (region) to create groups of states that are similar with respect to the different pieces of information we have about them. For instance, which states are similar in terms of exam scores versus state education spending? Instead of running a hierarchical clustering on its own, we can implement both the HC and the visualization in one step, using the heatmap.2() function in the gplots package.

# R CODE (output = "initial_plot.png")
library(gplots)   # provides the heatmap.2() function
library(car)      # provides the States data set
States[1:3,] # look at the data
 
scaled <- scale(States[,-1]) # scale all but the first column to make information comparable
heatmap.2(scaled, # specify the (scaled) data to be used in the heatmap
            cexRow=0.5, cexCol=0.95, # decrease  size of row/column labels
            scale="none", # we have already scaled the data
            trace="none") # cleaner heatmap
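
The clustering behind this plot can also be run on its own. The sketch below is a minimal illustration rather than the author's code: it assumes the States data are available through car (via carData), and it relies on heatmap.2() using dist() and hclust() by default, that is, Euclidean distance with complete linkage. The names row_hc and col_hc are introduced here for illustration only.

```r
library(car)   # assumption: supplies States (via carData in current versions)

scaled <- scale(States[, -1])        # z-score every numeric column

# Same defaults as heatmap.2(): Euclidean distance, complete linkage
row_hc <- hclust(dist(scaled))       # clusters the states (rows)
col_hc <- hclust(dist(t(scaled)))    # clusters the attributes (columns)

# Plain dendrogram of the states, comparable to the left margin of the heatmap
plot(row_hc, cex = 0.6, main = "States clustered on scaled attributes")
```

heatmap.2() additionally reorders the leaves of each tree by row or column means, so the left-to-right order in this standalone dendrogram can differ from the heatmap even though the tree itself is the same.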

This initial heatmap gives us a lot of information about the potential state groupings. We have a classic HC dendrogram on the far left of the plot (the output we would have gotten from plotting an hclust() result). However, to get an even cleaner look, and have groups fall right out of the plot, we can add row and column separators, giving an "all-the-information-in-one-glance" look. Placement information for the separators comes from the HC dendrograms (both row and column). Let's also play around with the colors to get a "red-yellow-green" effect for the scaling, which will render the underlying information even more clearly. Finally, we'll also hide the dendrograms, so we simply have a clean color plot with the underlying groups (this option can be easily undone in the code below).

# R CODE (output = "final_plot.png")
 
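# Build a custom palette; colorRampPalette() is part of grDevices (base R)
library(RColorBrewer)           # optional: not actually used by the palette below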
my_palette <- colorRampPalette(c('red','yellow','green'))(256)
 
scaled <- scale(States[,-1])    # scale all but the first column to make information comparable
heatmap.2(scaled,               # specify the (scaled) data to be used in the heatmap
          cexRow=0.5, 
          cexCol=0.95,          # decrease  size of row/column labels
          col = my_palette,     # arguments to read in custom colors
          colsep=c(2,4,5),      # Adding on the separators that will clarify plot even more
          rowsep = c(6,14,18,25,30,36,42,47), 
          sepcolor="black", 
          sepwidth=c(0.01,0.01),  
          scale="none",         # we have already scaled the data 
          dendrogram="none",    # hide the dendrograms (rows/columns are still reordered by the clustering)
          trace="none")         # cleaner heatmap
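
The separator positions in the call above were chosen from the dendrograms. As a hedged sketch of how such positions could be derived programmatically instead, the code below cuts the row tree into groups and records where the group label changes along the leaf order. Everything here is an assumption made for illustration: k = 9 is picked only because eight row separators imply nine groups, the names row_order, groups and breaks are hypothetical, and the resulting positions are not guaranteed to match the hand-picked values.

```r
library(car)   # assumption: supplies States (via carData in current versions)

scaled <- scale(States[, -1])

# Mirror the heatmap.2() defaults: complete-linkage tree on Euclidean distances,
# with leaves reordered by row means
hc        <- hclust(dist(scaled))
row_dend  <- reorder(as.dendrogram(hc), rowMeans(scaled))
row_order <- order.dendrogram(row_dend)   # row indices in leaf order

k      <- 9                               # hypothetical number of row groups
groups <- cutree(hc, k = k)[row_order]    # group label of each leaf, in leaf order

# Positions after which the group label changes: candidate separator locations
breaks <- which(diff(groups) != 0)
breaks
```

These positions are counted along the leaf order. Depending on which end of the plot heatmap.2() counts rowsep from, they may need to be mirrored as nrow(scaled) - breaks, which is worth confirming against the rendered plot.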

This plot gives us a nice, clear picture of the groups that emerge from the HC implementation, in the context of the column (attribute) groups as well. For instance, while Idaho, Oklahoma, Missouri and Arkansas perform well on the verbal and math SAT components, their state spending on education and average teacher salary are much lower than in the other states. These attributes are reversed for Connecticut, New Jersey, DC, New York, Pennsylvania and Alaska.
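
Readings like these can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes nine row groups (eight row separators imply nine groups) and summarizes each group by the mean of its scaled attributes. The names groups, group_means and members are hypothetical.

```r
library(car)   # assumption: supplies States (via carData in current versions)

scaled <- scale(States[, -1])
hc     <- hclust(dist(scaled))       # Euclidean distance, complete linkage
groups <- cutree(hc, k = 9)          # hypothetical choice of nine groups

# Mean z-score of every attribute within each group
group_means <- aggregate(as.data.frame(scaled),
                         by = list(group = groups), FUN = mean)
print(round(group_means, 2))

# Which states fall into which group (row names as stored in States)
members <- split(rownames(scaled), groups)
print(members)
```

A group with positive means for SATV and SATM but negative means for dollars and pay would correspond to the "high scores, low spending" pattern described above.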

This hierarchical-clustering/heatmap partnership is a useful, productive one, especially when one is digging through massive data, trying to glean some useful cluster-based conclusions, and render the conclusions in a clean, pretty, easily interpretable fashion.
