R for more powerful clustering
by Vidisha Vachharajani
Freelance Statistical Consultant
R offers several useful clustering tools, but one that is particularly powerful is the pairing of hierarchical clustering with a visual display of its results in a heatmap. The term “heatmap” is often confusing, because it is used for two different things: a “colorful visual representation of data in a matrix”, or “a (thematic) map in which areas are represented in patterns (‘heat’ colors) that are proportionate to the measurement of some information being displayed on the map”. For clustering, the former meaning is the appropriate one; the latter is more properly called a choropleth.
The reason to link a heatmap with hierarchical clustering (HC) is the heatmap's ability to represent the information in an HC output lucidly, so that it is easily understood and more visually appealing. The heatmap.2() function (part of the gplots package, not of base R) is also a mechanism for applying HC to both the rows and the columns of a data matrix, so that it yields meaningful groups that share certain features (within the same group) and are differentiated from each other (across different groups).
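As a rough illustration of what "applying HC to both rows and columns" means, here is a minimal sketch in base R. It assumes the built-in USArrests data as a stand-in for any numeric matrix, and the names demo_scaled, row_hc and col_hc are made up for this example. Euclidean distance and complete linkage are simply the dist() and hclust() defaults.

```r
# Sketch only: USArrests (base R) stands in for any numeric data matrix.
demo_scaled <- scale(as.matrix(USArrests))   # put the columns on a common scale

# Cluster the rows (observations) and the columns (variables) separately.
# dist() defaults to Euclidean distance, hclust() to complete linkage.
row_hc <- hclust(dist(demo_scaled))
col_hc <- hclust(dist(t(demo_scaled)))

# Leaf orderings of the two trees
row_order <- row_hc$order
col_order <- col_hc$order

# Reordered matrix: similar rows and similar columns now sit next to each other
reordered <- demo_scaled[row_order, col_order]
head(reordered)
```

A clustered heatmap is essentially a colored image of a matrix reordered in this way, with the two trees drawn in the margins. Note that heatmap.2() by default also reorders the branches of each tree by row or column means, so its ordering can differ from row_hc$order even though the underlying tree is the same.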
Consider the following simple example, which uses the “States” data set from the car package (in recent versions of car, the data ship in the companion carData package). States contains the following features:
- region: U. S. Census regions. A factor with levels: ENC, East North Central; ESC, East South Central; MA, Mid-Atlantic; MTN, Mountain; NE, New England; PAC, Pacific; SA, South Atlantic; WNC, West North Central; WSC, West South Central.
- pop: Population: in 1,000s.
- SATV: Average score of graduating high-school students in the state on the verbal component of the Scholastic Aptitude Test (a standard university admission exam).
- SATM: Average score of graduating high-school students in the state on the math component of the Scholastic Aptitude Test.
- percent: Percentage of graduating high-school students in the state who took the SAT exam.
- dollars: State spending on public education, in $1000s per student.
- pay: Average teacher's salary in the state, in $1000s.
We wish to use all but the first column (region) to create groups of states that are similar with respect to the different pieces of information we have about them. For instance, which states are similar vis-a-vis exam scores vs. state education spending? Instead of doing just a hierarchical clustering, we can implement both the HC and the visualization in one step, using the heatmap.2() function in the gplots package.
# R CODE (output = "initial_plot.png")
library(gplots)  # provides the heatmap.2() function
library(car)     # provides the States data (from carData in recent versions of car)
States[1:3, ]    # look at the data
scaled <- scale(States[, -1])  # scale all but the first column to make information comparable
heatmap.2(scaled,              # specify the (scaled) data to be used in the heatmap
          cexRow = 0.5, cexCol = 0.95,  # decrease font size of row/column labels
          scale = "none",      # we have already scaled the data
          trace = "none")      # cleaner heatmap
This initial heatmap gives us a lot of information about the potential state grouping. We have a classic HC dendrogram on the far left of the plot (the output we would have gotten from an "hclust()" rendering). However, to get an even cleaner look, and have the groups fall right out of the plot, we can add row and column separators, giving an "all-the-information-in-one-glance" view. The placement of the separators comes from the HC dendrograms (both row and column). Let's also play around with the colors to get a "red-yellow-green" effect for the scaling, which will render the underlying information even more clearly. Finally, we'll also hide the dendrograms, so that we simply have a clean color plot with the underlying groups (this option can easily be undone in the code below).
# R CODE (output = "final_plot.png")
library(RColorBrewer)  # optional: only needed if you switch to a brewer.pal() palette
# colorRampPalette() itself comes with base R (grDevices)
my_palette <- colorRampPalette(c('red', 'yellow', 'green'))(256)
scaled <- scale(States[, -1])  # scale all but the first column to make information comparable
heatmap.2(scaled,              # specify the (scaled) data to be used in the heatmap
          cexRow = 0.5, cexCol = 0.95,  # decrease font size of row/column labels
          col = my_palette,    # use the custom colors
          colsep = c(2, 4, 5), # separators that make the groups easier to see
          rowsep = c(6, 14, 18, 25, 30, 36, 42, 47),
          sepcolor = "black",
          sepwidth = c(0.01, 0.01),
          scale = "none",      # we have already scaled the data
          dendrogram = "none", # do not draw the dendrograms in this one
          trace = "none")      # cleaner heatmap
This plot gives us a nice, clear picture of the groups that come out of the HC implementation, in the context of the column (attribute) groups as well. For instance, while Idaho, Oklahoma, Missouri and Arkansas perform well on the verbal and math SAT components, their state spending on education and average teacher salary are much lower than in the other states. These attributes are reversed for Connecticut, New Jersey, DC, New York, Pennsylvania and Alaska.
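The separator positions used for the final plot were read off the dendrograms. One way such positions could be derived programmatically is sketched below. This is an illustration under stated assumptions, not the procedure used for the plot above: USArrests (base R) stands in for the scaled data, cluster_breaks() is a hypothetical helper written for this example, and k = 4 is an arbitrary number of groups.

```r
# Sketch only: USArrests (base R) stands in for the scaled data matrix.
demo_scaled <- scale(as.matrix(USArrests))
row_hc <- hclust(dist(demo_scaled))          # complete linkage by default

# Hypothetical helper: given a tree and a number of groups k, return the
# positions (in dendrogram leaf order) after which the cluster label changes.
cluster_breaks <- function(hc, k) {
  groups_in_order <- cutree(hc, k = k)[hc$order]
  unname(which(diff(groups_in_order) != 0))
}

row_breaks <- cluster_breaks(row_hc, k = 4)  # k = 4 is an arbitrary choice
row_breaks
```

These positions are relative to the leaf order of row_hc. heatmap.2() reorders branches by row means by default and, as far as I can tell, counts rowsep from the top of the image, so any values obtained this way should be checked against the drawn plot (or recomputed from the rowInd ordering that heatmap.2() returns invisibly) before being passed as rowsep.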
This hierarchical-clustering/heatmap partnership is a useful, productive one, especially when one is digging through massive data, trying to glean some useful cluster-based conclusions, and render the conclusions in a clean, pretty, easily interpretable fashion.