Manifold Learning


Many algorithms supporting AI and machine learning depend on the notion of embeddings. Data sets are mapped to, or embedded in, high-dimensional Euclidean vector spaces. Then, various mathematical strategies are employed to reduce the dimension of the data by mapping these high-dimensional points to structures in lower-dimensional spaces in ways that preserve some important structural properties of the high-dimensional data. Classic examples are the Word2Vec algorithm, which maps similar words to nearby points in vector spaces, and Principal Components Analysis (PCA), which maps multidimensional vector data to lower-dimensional spaces while preserving the variance of the data points. The mathematical term for the lower-dimensional structures sought to contain the data is manifold. Manifolds are, roughly, spaces that locally look like Euclidean vector spaces and carry enough additional structure for notions such as connectedness and smoothness to make sense. Think of a sphere as a two-dimensional manifold in a three-dimensional space, or lines and circles as one-dimensional structures in a two-dimensional space.

Over the past fifteen years or so, these kinds of geometric ideas about working with data have coalesced into the very mathematical field of Manifold Learning. In this post, I hope to provide a way into this incredibly interesting field for those of us who are not mathematicians but are willing to do some work to explore it. I’ll do this by pointing to some of the accessible literature, providing a couple of simple examples, and listing some R resources for exploration.

An overview of Manifold Learning

This section comprises some notes on the marvelous review paper Manifold Learning: What, How, and Why by Marina Meilă and Hanyu Zhang. The paper takes a comprehensive, clearly written, historical approach at a level suitable for beginners, and it is an expert guide to the vast literature on the subject. The Annual Reviews version of the paper at the link above is a pleasure to work with because almost all of the essential papers are hyperlinked.

The Basic Problem

The basic problem motivating manifold learning is data reduction. Given a data set with D features or explanatory variables, how can we transform it into a smaller data set with fewer features in a way that retains all of the essential information and provides some insight into the structure of the data? The idea is similar to that underlying PCA: we assume the data exist in a D-dimensional vector space but mostly lie in, or near, a k-dimensional subspace with k < D. PCA provides a linear mapping from $\mathbb{R}^D$ to $\mathbb{R}^k$.
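For concreteness, here is a minimal sketch of that kind of linear mapping using base R's prcomp(); the data set (iris) and the choice of k = 2 are placeholders of my own, not taken from the paper.

# Minimal PCA sketch: project D-dimensional data onto its first k
# principal components. The data set and k below are placeholders.
X <- as.matrix(iris[, 1:4])   # a small example with D = 4 features
k <- 2
pca <- prcomp(X, center = TRUE, scale. = TRUE)
X_low <- pca$x[, 1:k]         # the k-dimensional linear embedding
head(X_low)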

The Manifold Assumption

The data are a sample from a probability distribution with support on, or near, a low-dimensional manifold embedded in $\mathbb{R}^D$.
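To make the assumption concrete, here is a purely illustrative simulation of my own: a noisy sample from a one-dimensional manifold (a circle) embedded in a three-dimensional ambient space.

# Illustrative only: points sampled near a 1-dimensional manifold (a circle)
# embedded in R^3. The ambient dimension is 3; the intrinsic dimension is 1.
set.seed(42)
n <- 500
theta <- runif(n, 0, 2 * pi)                 # intrinsic coordinate on the circle
X <- cbind(cos(theta), sin(theta), 0) +      # the circle embedded in R^3
  matrix(rnorm(3 * n, sd = 0.05), n, 3)      # small off-manifold noise
pairs(X, cex = 0.3)                          # pairwise views hint at the circle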

Three Paradigms for Manifold Learning

The term manifold learning was introduced by Roweis & Saul (2000), who proposed the Locally Linear Embedding (LLE) algorithm, and Tenenbaum et al. (2000), who introduced the Isomap algorithm. There are three basic approaches to manifold learning: Local Linear Approximations, Principal Curves and Surfaces, and Embeddings.

Local Linear Approximations

  • Based on classical PCA
  • PCA performed on a weighted covariance matrix, with weights decaying with distance from any reference point
  • Approximates data locally on a curved manifold around a point
  • Reduces dimension locally but provides no global representation (a rough sketch of the weighted-PCA idea follows this list)
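To make the idea concrete, here is a rough sketch of my own (not any particular published algorithm): PCA applied to a covariance matrix whose observation weights decay with distance from a chosen reference point, here via a Gaussian kernel.

# Local PCA sketch: weighted PCA around a reference point x0.
# X is a numeric data matrix (rows are observations); bandwidth controls
# how quickly the weights decay with distance from x0.
local_pca <- function(X, x0, bandwidth = 1, k = 2) {
  d2 <- colSums((t(X) - x0)^2)            # squared distances to x0
  w  <- exp(-d2 / (2 * bandwidth^2))      # Gaussian weights
  w  <- w / sum(w)
  mu <- colSums(X * w)                    # weighted mean
  Xc <- sweep(X, 2, mu)                   # center the data
  C  <- t(Xc) %*% (Xc * w)                # weighted covariance matrix
  eigen(C)$vectors[, 1:k]                 # local k-dimensional basis
}

Calling local_pca(X, X[1, ]) returns an orthonormal basis for a local two-dimensional approximation of the data around the first observation.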

Principal Curves and Surfaces

  • Data assumed to be of the form signal plus noise, i.e., noisy observations of points lying on an underlying smooth curve
  • The Subspace Constrained Mean Shift (SCMS) algorithm of Ozertem & Erdogmus (2011) iteratively maps each data point to a point lying on the principal curve
  • Method can be extended to principal surfaces

Embeddings

Meilă and Zhang propose a provisional taxonomy of embedding algorithms, which they concede is superficial but which adequately characterizes the state of the art. All approaches begin with information about the data summarized in a weighted neighborhood graph (a rough sketch of such a graph follows the list below). An embedding algorithm then produces a smooth mapping designed to distort the neighborhood information as little as possible. The algorithms differ in the information they choose to preserve and in the constraints they place on smoothness. The fundamental categories of embedding algorithms are:

  • “One-shot” algorithms that derive embedding coordinates from principal eigenvectors of a matrix associated with the neighborhood graph of a data set or by solving an optimization problem.

  • Attraction-repulsion algorithms that proceed by iterative improvements from an initial embedding, often produced by a one-shot algorithm.
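As promised above, here is a rough sketch of my own of the common first step: a weighted k-nearest-neighbor graph with Gaussian kernel weights (each algorithm makes its own choices of k, kernel, and normalization).

# Weighted k-nearest-neighbor graph: W[i, j] > 0 only if j is among the
# k nearest neighbors of i, with Gaussian kernel weights, then symmetrized.
knn_graph <- function(X, k = 10, sigma = 1) {
  D <- as.matrix(dist(X))                 # pairwise Euclidean distances
  W <- matrix(0, nrow(D), ncol(D))
  for (i in seq_len(nrow(D))) {
    nn <- order(D[i, ])[2:(k + 1)]        # k nearest neighbors, excluding i itself
    W[i, nn] <- exp(-D[i, nn]^2 / (2 * sigma^2))
  }
  (W + t(W)) / 2                          # symmetrize the weights
}

For example, knn_graph(as.matrix(iris[, 1:4]), k = 10) returns a symmetric 150 x 150 matrix of edge weights.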

One-Shot Embedding Algorithms

One-shot algorithms include spectral methods such as Isomap, Locally Linear Embedding (LLE), Laplacian Eigenmaps, and Diffusion Maps, all of which obtain embedding coordinates from the eigenvectors of a matrix built from the neighborhood graph.

Attraction-Repulsion Embedding Algorithms

Attraction-repulsion algorithms include SNE and its refinement t-SNE, described in the next section, along with more recent methods such as UMAP and LargeVis.

SNE and t-SNE

This section briefly describes the SNE algorithm and its improved variant t-SNE, which were designed to visualize high-dimensional data by mapping it to a two- or three-dimensional space.

SNE

The intuition behind Stochastic Neighbor Embedding (SNE), described in the paper by Hinton & Roweis (2002), is to emphasize local distances and employ a cost function that enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.

Most embedding methods require each high-dimensional data point to be associated with only a single location in the low-dimensional space, making it difficult to unfold “many-to-one” mappings in which a single ambiguous object really belongs in several disparate locations in the low-dimensional space. SNE tries to place high-dimensional data points in a low-dimensional space so as to optimally preserve neighborhood identity and allow multiple different low-dimensional images. For example, because of its probabilistic formulation, SNE has the ability to be extended to mixtures in which ambiguous high-dimensional objects such as the word “bank” can be associated with several widely-separated images (e.g., both “river” and “finance”) in the low-dimensional space.

The basic idea underlying SNE is to construct a Gaussian probability distribution over the neighbors of each point $x_i$ in the high-dimensional space that describes the conditional probability $p_{j|i}$ that point $i$ would pick point $j$ as its neighbor.

Then, find a similar distribution, $q_{j|i}$, over the points $y_i$ in the low-dimensional space to which the $x_i$ are mapped. If, for all $i$ and $j$, $q_{j|i} = p_{j|i}$, then the similarities will be preserved. The points $y_i$ are found by using gradient descent to minimize the sum of all the Kullback-Leibler divergences using the cost function:

$$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.$$

In the high-dimensional space, the bandwidths $\sigma_i$ are selected by performing a binary search for the $\sigma_i$ that produces a $P_i$ with a fixed perplexity specified by the user, where $\mathrm{Perp}(P_i) = 2^{H(P_i)}$ and $H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$ is the Shannon entropy measured in bits. The bandwidth for the low-dimensional space is set to $1/\sqrt{2}$.
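To make these definitions concrete, here is a small, illustrative R sketch (my own helper functions, not the tsne package's internals) that computes the conditional probabilities $p_{j|i}$ for a single point and the perplexity of the resulting distribution.

# Conditional probabilities p_{j|i} for one point i, given a bandwidth sigma.
# X is a numeric data matrix with one observation per row.
cond_probs <- function(X, i, sigma) {
  d2 <- colSums((t(X) - X[i, ])^2)      # squared distances from x_i
  w  <- exp(-d2 / (2 * sigma^2))
  w[i] <- 0                             # a point is not its own neighbor
  w / sum(w)
}

# Perplexity of the distribution: 2^H, with the Shannon entropy H in bits
perplexity <- function(p) {
  p <- p[p > 0]
  2^(-sum(p * log2(p)))
}

Evaluating perplexity(cond_probs(X, i, sigma)) for a range of sigma values shows the relationship that the binary search exploits when tuning each $\sigma_i$ to the user-specified perplexity.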

t-SNE

The t-SNE algorithm, described in van der Maaten and Hinton (2008), is an improvement on SNE that overcomes several technical difficulties. The main differences from SNE are that (1) t-SNE uses a symmetric version of the cost function, which has a simpler gradient, and (2) t-SNE uses a t-distribution with one degree of freedom for the similarities between points in the low-dimensional space. These changes ease the optimization and mitigate the "crowding problem", in which there is not enough area in the low-dimensional map to accommodate moderately distant data points.
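For reference, the symmetrized high-dimensional similarities, the heavy-tailed low-dimensional similarities, and the cost function used by t-SNE (van der Maaten and Hinton, 2008) are:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$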

Van der Maaten and Hinton point out that in both SNE and t-SNE,

“…the gradient may be interpreted as the resultant force created by a set of springs between the map point $y_i$ and all other map points $y_j$ . . . The spring between $y_i$ and $y_j$ repels or attracts the map points depending on whether the distance between the two in the map is too small or too large to represent the similarities between the two high dimensional points.”

The final result, t-SNE, is an algorithm that preserves local similarities between points while retaining enough of the global structure to recognize clusters. The art of both these algorithms lies not only in marshaling appropriate mathematics to realize intuitive geometric ideas about data relationships, but also in working through the many technical difficulties and optimization problems to achieve reasonable performance.

R Examples

This section provides examples of using the tsne package to compute two- and three-dimensional embeddings of two different data sets: the palmerpenguins data and the Ames housing data. Note that because it takes several minutes to fit the models on my underpowered MacBook Air, the code below shows how to run the tsne() command but actually reads each model fit in from an RDS file.
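The examples assume the following packages are loaded (my reconstruction of the setup; the original post does not show its setup chunk):

# Packages used in the examples below
library(dplyr)           # data manipulation
library(ggplot2)         # 2D plots
library(tsne)            # the tsne() function
library(threejs)         # scatterplot3js() for 3D plots
library(palmerpenguins)  # penguins data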

Example 1: The Penguins

For our first example, let’s look at the penguins data set from the palmerpenguins package. It has six variables we can use to feed the t-SNE algorithm, and we know that we would like any clusters identified to correspond with the three species of penguins.

df_p <- palmerpenguins::penguins
# glimpse(df_p)

Prepare data frames for fitting the model and then for subsequent plotting.

df_p_fit <- df_p |>
  mutate(island = as.integer(island), sex = as.integer(sex)) |>  # factors to integers
  select(-year, -species) |>                                     # drop variables not fed to tsne()
  na.omit()                                                      # drop rows with missing values

df_p_plot <- df_p |> select(-year) |> na.omit() 

Fit the t-SNE model.

# fit_pen <- tsne(df_p_fit, epoch_callback = NULL, perplexity=50)
fit_pen <- readRDS("fit_pen_2D")

Next, we add the coordinates fit by the model to our plotting data frame and plot. As we would expect, the clusters identified by t-SNE line up very nicely with both penguin species and island. All of the Gentoos are on Biscoe Island.

df_p_plot <- df_p_plot |> mutate(x = fit_pen[, 1], y = fit_pen[, 2])

df_p_plot |> ggplot(aes(x, y, colour = species, shape = island)) +
  geom_point() +
  scale_color_manual(values = c("blue", "red", "green")) +
  ggtitle("2D Embedding of Penguins Data")

Here is a projection onto a three-dimensional space.

# fit_pen_3D = tsne(df_p_fit, epoch_callback = NULL, perplexity=50, k=3)
fit_pen_3D <- readRDS("fit_pen_3D")

The threejs visualization emphasizes the single Chinstrap observation floating in space near the Adelie clusters and the two Gentoos reaching toward the edge of the Chinstrap cluster.

x <- fit_pen_3D[, 1]
y <- fit_pen_3D[, 2]
z <- fit_pen_3D[, 3]

# Assign a plotting color to each observation by species
df_p_plot <- df_p_plot |> mutate(
  color = if_else(species == "Adelie", "blue",
                  if_else(species == "Gentoo", "green", "red")))

scatterplot3js(x, y, z, color = df_p_plot$color, cex.symbols = .3,
               labels = df_p_plot$species)

Example 2: The Ames Housing Data

With 74 variables, the Ames housing data provide a convincing display of the usefulness of the t-SNE algorithm.

data(ames, package = "modeldata")
#glimpse(ames)

Prepare ames for processing with tsne() by changing all factors to numeric, and then fit the model.

df_ames <- ames %>% mutate_if(is.factor, as.numeric)
#fit_ames <- tsne(df_ames, epoch_callback = NULL, perplexity=50)
fit_ames <- readRDS("tsne_fit_ames")
#head(fit_ames)

df_ames_plot <- ames |> mutate(x = fit_ames[,1], y = fit_ames[,2],
                               MS_Zoning = as.character(MS_Zoning),
                               MS_Zoning = replace(MS_Zoning, MS_Zoning == "C_all", "C or I"),
                               MS_Zoning = replace(MS_Zoning, MS_Zoning == "I_all" , "C or I")
                               )

df_ames_plot |> ggplot(aes(x, y, shape = MS_Zoning, color = Neighborhood)) + 
                geom_point() +
                guides(color = "none", size = "none") +
                ggtitle("2D Embedding of Ames Data Colored by Neighborhood")

Here is the projection onto a three-dimensional space, also colored by Neighborhood. It shows the separation among the clusters, which appear to reside within a three-dimensional ellipsoid. As was the case with the two-dimensional plot, neighborhoods appear to be fairly well mixed among the clusters.

set.seed(1234)

#fit_ames_3D <- tsne(ames, epoch_callback = NULL, perplexity=50,k=3)
fit_ames_3D <- readRDS("tsne_fit_ames_3D")
#head(fit_ames_3D)

df_plot_ames_3D <- df_ames |> mutate(
  x = fit_ames_3D[, 1],
  y = fit_ames_3D[, 2],
  z = fit_ames_3D[, 3],
  neighborhood = ames$Neighborhood,
  zone = ames$MS_Zoning
)

x <- fit_ames_3D[, 1]
y <- fit_ames_3D[, 2]
z <- fit_ames_3D[, 3]

# One color per neighborhood level, applied point by point
neighborhood_cols <- rainbow(nlevels(df_plot_ames_3D$neighborhood))
scatterplot3js(x, y, z, cex.symbols = .1,
               color = neighborhood_cols[as.integer(df_plot_ames_3D$neighborhood)])