Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Recently I’ve been working on a review of urban energy modelling literature. It’s a very broad field and a quick search through Web of Knowledge turns up about 400 papers that look relevant. How on earth can you distill these reams of paper into something sensible?
One technique I’ve found helpful, especially at the earlier stages of a literature review when you’re trying to get the big picture straight, is to use clustering techniques. There are many different algorithms depending on what the goal of your analysis is, but here I use a two-step process.
- Hierarchical clustering: These methods start by treating every point in your data set as its own cluster. The algorithms then successively merge the closest clusters, as measured by the “distance” or “dissimilarity” between data points, until only one large cluster containing all of the data remains. The results of this process can be plotted as a dendrogram, from which the number of clusters within the data can be judged.
- Partitioning clustering: Once you know how many clusters are in your data set, you can use a partitioning method. These methods divide the data into a fixed number of clusters and can report the typical characteristics and membership of each group.
For both of these steps, R’s cluster package provides all of the methods you will need.
To demonstrate, here are the first few rows of the data within my literature review. As you can see, I’ve categorized each paper along several different criteria representing the spatial and temporal scale of the model, the application domain, and the model’s treatment of energy supply and demand variables. Your data will of course have different categories depending on the subject of interest but the general structure is likely to be similar.
> head(data)
     Spatial   Temporal       Family     Supply     Demand          Category
1 Technology      Daily Optimization endogenous       none        Technology
2       City      Daily   Regression       none endogenous Demand estimation
3   Building     Annual   Regression  exogenous endogenous Demand estimation
4       City     Annual   Regression       none endogenous Demand estimation
5 Technology Sub-hourly   Simulation endogenous       none        Technology
6   Building     Annual      Various endogenous  exogenous       Descriptive
The data in this table represent a mix of ordinal and nominal data; that is, categorical data with and without an inherent order respectively. Both data types are represented in R by a factor object, but these have to be declared before attempting the analysis. The following code therefore modifies the data frame and categorizes the variables appropriately. (Note that I’ve also explicitly coded the levels; this isn’t necessary with nominal data but often is with ordinal values, because R otherwise sorts the levels alphabetically. The ordered=TRUE argument creates an ordered factor.)
data <- transform(data,
  # Ordinal variables: levels listed from smallest to largest scale
  Spatial=factor(Spatial,
    levels=c("Individuals","Technology","Building","Sub 1km",
             "District","City","National/regional","Various"),
    ordered=TRUE),
  Temporal=factor(Temporal,
    levels=c("Sub-hourly","Hourly","Daily","Weekly","Monthly",
             "Annual","Decadal","Multiperiod","Static","Various"),
    ordered=TRUE),
  # Nominal variables: the order of the levels carries no meaning
  Family=factor(Family,
    levels=c("Empirical","Regression","Optimization","Simulation","Various")),
  Category=factor(Category,
    levels=c("Building design","Demand estimation","Descriptive",
             "Impact assessment","Policy assessment","System design",
             "Technology","Transport","Urban climate","Urban planning")),
  Supply=factor(Supply,
    levels=c("none","endogenous","endogenous (indirect)","exogenous")),
  Demand=factor(Demand,
    levels=c("none","endogenous","endogenous (indirect)","exogenous")))
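One thing worth checking after declaring factors like this is that no value has fallen outside the declared levels. The snippet below is a minimal base-R sketch using made-up values (the vector `raw` and its lower-case typo are hypothetical, not taken from my data); it shows that unlisted values silently become NA and that ordered factors can be compared.

```r
# Hypothetical values; "daily" (lower case) is a deliberate typo
raw <- c("Daily", "Hourly", "Annual", "daily")
temporal <- factor(raw, levels = c("Hourly", "Daily", "Annual"), ordered = TRUE)

# Values that are not listed in 'levels' silently become NA
print(temporal)
n.missing <- sum(is.na(temporal))
print(n.missing)

# Ordered factors support comparisons between levels
daily.before.annual <- temporal[1] < temporal[3]
print(daily.before.annual)

# Confirm that the factor really is ordered
print(is.ordered(temporal))
```

Running a quick `sum(is.na(...))` on each column after the `transform()` call is a cheap way to catch spelling inconsistencies in a hand-coded literature table.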
Next we want to calculate a dissimilarity matrix. The cluster package’s daisy function will do that, automatically detecting from our input data which variables are ordinal and which are nominal (when some columns are not numeric, it uses Gower’s dissimilarity coefficient). This is an important step because most of the clustering algorithms assume that the input variables are numerical. In the literature review case, each paper’s attributes are typically represented by non-numerical factors, so the dissimilarity matrix must be calculated first.
# Load the cluster package, which provides daisy, agnes and pam
library(cluster)
diss <- daisy(data)
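To see what daisy has actually detected, it can help to try it on a tiny example first. This is only a sketch: the three-row data frame `toy` is hypothetical, and I am relying on the "Types" attribute that daisy attaches to its result for mixed-type input.

```r
library(cluster)

# Hypothetical mini data set: one nominal and one ordinal attribute
toy <- data.frame(
  Family   = factor(c("Regression", "Simulation", "Regression")),
  Temporal = factor(c("Hourly", "Daily", "Annual"),
                    levels = c("Hourly", "Daily", "Annual"), ordered = TRUE)
)
toy.diss <- daisy(toy)

# "N" marks nominal and "O" ordinal variables, as detected by daisy
types <- attr(toy.diss, "Types")
print(types)

# View the pairwise dissimilarities as an ordinary matrix
m <- as.matrix(toy.diss)
print(round(m, 2))
```

In this toy case, papers 1 and 3 share a Family and differ only on the ordered Temporal scale, so they come out closer to each other than papers 1 and 2, which differ on both attributes.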
Now we can run the hierarchical clustering to determine how many clusters are in the data. We do this with the agnes method which can process the dissimilarity matrix directly.
# Run the agnes hierarchical clustering
agnes.clust <- agnes(diss)
# Plot the result
plot(agnes.clust)
This gives the following figure:
From this clustering hierarchy, we can judge that there are about 5 clusters within the data. This is a somewhat subjective decision but in general, you want to identify the points where there are large vertical gaps between successive levels of the tree (a height of just below 0.6 on this plot). This document (PDF) provides a nice summary of how to interpret hierarchical clustering results.
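If reading the gaps off the plot feels too subjective, the same idea can be roughed out in code. The sketch below is not part of my original analysis: it uses a hypothetical data frame `toy` with three deliberately obvious groups, finds the largest jump between successive merge heights (assuming the default average linkage, where heights increase with each merge), and then cross-checks the answer with the average silhouette width from pam, which is a different technique from the visual inspection described above.

```r
library(cluster)

# Hypothetical toy data with three obvious groups of "papers"
toy <- data.frame(
  Family = factor(rep(c("Regression", "Simulation", "Optimization"), each = 4)),
  Supply = factor(rep(c("none", "endogenous", "exogenous"), each = 4)),
  Scale  = factor(rep(c("Building", "City", "District", "City"), times = 3))
)
toy.diss <- daisy(toy)

# Merge heights from the agglomerative tree, sorted from first to last merge
heights <- sort(agnes(toy.diss)$height)

# Position of the largest jump between successive merge heights
gap.at <- which.max(diff(heights))

# Cutting the tree inside that gap leaves this many clusters
k.gap <- length(heights) - gap.at + 1
print(k.gap)

# Cross-check: average silhouette width for several candidate values of k
ks <- 2:5
avg.width <- sapply(ks, function(k) pam(toy.diss, k)$silinfo$avg.width)
names(avg.width) <- ks
print(round(avg.width, 2))

# The k with the largest average width is one reasonable candidate
best.k <- ks[which.max(avg.width)]
print(best.k)
```

On real data the two approaches will not always agree, so I would treat both as prompts for judgement rather than as a definitive answer.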
We can then run the pam analysis, specifying the number of clusters. The resulting pam object contains several useful elements: medoids, which identifies the representative centre of each cluster (when pam is given a dissimilarity matrix this holds observation labels rather than attribute values, so id.med, the row index of each medoid, is the handier way to look up a centre’s properties in the original data frame), and clustering, which tells you which group each data point has been assigned to. We can then make some summary plots as below.
# Calculate 5 pam clusters, directly from dissimilarity matrix
pam.cl <- pam(diss, 5)

# Show medoid characteristics by looking up the medoid rows in the data
data[pam.cl$id.med, ]

# Use ggplot2 to make a summary plot
# Note that since there are six dimensions in the raw data
# the figure can't show the clustering perfectly
library(ggplot2)

# Define the category labels
cats <- as.character(data[pam.cl$id.med, ]$Category)

# Create the ggplot object
gg2 <- ggplot(data, aes(x=Spatial, y=Temporal)) +
  geom_jitter(aes(colour=factor(pam.cl$clustering, labels=cats))) +
  scale_color_brewer(name="Category", palette="Paired") +
  theme_bw(11) +
  theme(axis.text.x=element_text(angle=90, hjust=1)) +
  facet_wrap(~Family)

# Display the plot
print(gg2)
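Alongside the plot, a couple of plain tables can make the cluster contents easier to read. This is a minimal sketch on a hypothetical six-row data frame `toy` (three clusters assumed), showing how the medoid rows and a cross-tabulation of cluster membership might be pulled out of a pam result.

```r
library(cluster)

# Hypothetical toy data standing in for the literature table
toy <- data.frame(
  Family   = factor(c("Regression", "Regression", "Simulation",
                      "Simulation", "Optimization", "Optimization")),
  Category = factor(c("Demand estimation", "Demand estimation", "Technology",
                      "Technology", "System design", "System design"))
)

# Three clusters is an assumption that suits this toy data
toy.pam <- pam(daisy(toy), 3)

# Attributes of the representative (medoid) paper in each cluster
medoid.rows <- toy[toy.pam$id.med, ]
print(medoid.rows)

# Cross-tabulate cluster membership against one attribute
xtab <- table(Cluster = toy.pam$clustering, Category = toy$Category)
print(xtab)

# Number of papers assigned to each cluster
sizes <- table(toy.pam$clustering)
print(sizes)
```

The same three calls should work on the full data set by swapping in `data` and `pam.cl`; a cross-tabulation against each attribute in turn gives a quick feel for what distinguishes the groups.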
Not every literature review will be amenable to this type of analysis. But if you have a fairly large set of papers to get through, where it’s hard to see the forest for the trees, a clustering analysis with R can be a great way to get a bit of perspective.
Further reading: Quick-R also has a brief summary of cluster analysis with R.