Topological Data Analysis with R

Joseph Rickert

8 years ago

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

by Joseph Rickert

When I was in graduate school in the mid '70s Mathematics departments were still under the spell of abstraction for its own sake. At that time, Algebraic Topology which uses concepts from Abstract Algebra to study topological spaces was a major gateway to the realm of abstraction. On my first visit, it was not at all clear that any of the exotic creatures to be found there: simplices, homotopy groups, homology groups etc. would have anything to say about the Real line let alone the real world. So it is astounding to me that over the last several years the masters of mathematical abstraction have made Topological Data Analysis (TDA), a subfield of Computational Topology, an exciting technique for dealing with high dimensional data sets that shows great promise.

One fundamental idea of TDA is to consider a data set to be a sample or “point cloud” taken from a manifold in some high-dimensional topological space. The sample data are used to construct simplices, generalizations of intervals, which are, in turn, are “glued” together to form a kind of wireframe approximation of the manifold. This manifold represents the “shape” of the data, and once you have it, the tools of Algebraic Topology can be used to construct a class of groups, homology groups, that are algebraic analogues of certain properties of the manifold. For example, knowing the number of holes in a manifold at various dimensions characterizes the topology of the manifold. (It is a single hole that makes a torus different from a sphere but makes a coffee cup topologically equivalent to a donut.) Homology groups describe the algebraic analogues of the holes. So, investigating the properties of the homology groups can tell you something about the topology of the manifold and, hopefully, something useful about your data.

All of this so far is just basic Algebraic Topology. A recent breakthrough idea, the notion of "persistent homology", has helped to make TDA a practical tool for data analysis. Persistent homology algorithms look for topological invariants across various scales of a topological manifold. All of the several methods for constructing the simplicial complex that constitutes the wire frame described above involve a scale parameter e, but knowing the number of holes and other structural features that appear at any particular scale is not enough to characterize the manifold. What you really want to know are which features persist over the full range of the parameter e. Efficient algorithms have been developed for computing both the homology groups themselves and a way to visualize them. A “barcode” plot is a qualitative visualization of a homology group that looks like a collection of horizontal line segments. The x axis represents the parameter e and the y axis is an arbitrary ordering of homology generators. (See Ghrist for the details.) When examining a barcode plot you are looking for lines that span a good portion of the x axis. Short lines are most likely just topological noise, long lines that persist across a good bit of the e scale may tell you something about your data. The following barcode plot for the iris data set was drawn with functions from R package phom.

library(phom)
data <- as.matrix(iris[,-5])
head(data)
 
max_dim <- 0
max_f <- 1
 
irisInt0 <- pHom(data, 
                 dimension=max_dim,              # maximum dimension of persistent homology computed
                 max_filtration_value=max_f,     # maximum dimension of filtration complex
                 mode="vr",                      # type of filtration complex
                 metric="euclidean")
 
plotBarcodeDiagram(irisInt0, max_dim, max_f, title="H0 Barcode plot of Iris Data")

I interpret the plot to tell me that there are probably 3 or 4 clusters in the data set. This is a lot of heavy duty mathematical machinery to say something a bit vague about the iris data set, however, while this small example may not be particularly exciting, I hope it serves to whet your appetite. To investigate further have a look at CompTop, Stanford's group for Applied and Computational Algebraic Topology and the resources page on the Ayasdi website. This Palo Alto start up, founded by Stanford professor Gunnar Carlsson, a pioneer in Computational Topology, along with two other Stanford trained mathematicians is on a mission to make TDA mainstream. Be sure to take a look at Professor Carlsson's video which is entertaining, informative and well worth watching.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.