The initial motivation for me to begin reading about Gaussian process (GP) regression came from Markus Gesmann’s blog entry about generalized linear models in R. The class of models implemented or available with the glm function in R comprises several interesting members that are standard tools in machine learning and data science, e.g. the logistic regression model.
Looking at the scatter plots shown in Markus’ post reminded me of the amazing talk by Michael Betancourt (there are actually two videos, but GPs only appear in the second – make sure you watch them both!). In one of the examples, he uses a Gaussian process with a logistic link function to model data on the acceptance of gay marriage as a function of age. The results he presented were quite remarkable, and I thought that applying the methodology to Markus’ ice cream data set was a great opportunity to learn what Gaussian process regression is and how to implement it in Stan. I initially planned not to spend too much time on the theoretical background, but to get to the meat and potatoes quickly, i.e. try GPs in practice on a data set, see how they work, make some plots, etc. But all the introductory texts that I found were either (a) very mathy, or (b) superficial and ad hoc in their motivation. I wasn’t satisfied and had the feeling that GPs remained a black box to me. Maybe you had the same impression and have now landed on this site?
I was therefore very happy to find this outstanding introduction by David MacKay (DM). I think it is just perfect – a meticulously prepared lecture by someone who is passionate about teaching.
This provided me with just the right amount of intuition and theoretical backdrop to get to grips with GPs and explore their properties in R and Stan.
In this post I will follow DM’s game plan and reproduce some of his examples, which gave me a good intuition for what Gaussian process regression is. To use the words of David MacKay: “Throwing mathematical precision to the winds, a Gaussian process can be defined as a probability distribution on a space of functions (…)”. In a future post, I will walk through an implementation in Stan, i.e. show how a GP regression model can be fitted to data and used for prediction.
Gaussian processes
Let’s start with a standard definition of a Gaussian process:
Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
That’s a fairly general definition, and moreover it’s not at all obvious what such a collection of random variables has to do with regression. To draw the connection, let me plot a draw from a bivariate Gaussian:
library(mixtools)
library(mvtnorm)

K <- matrix(c(1, 0.9093729, 0.9093729, 1), ncol = 2)
set.seed(1906)
p <- rmvnorm(1, mean = rep(0, 2), sigma = K)
plot(p, type = "n", xlim = c(-5, 5), ylim = c(-5, 5),
     ylab = expression(y[2]), xlab = expression(y[1]),
     main = "Fig. 1: Contours of bivariate Gaussian distribution")
abline(h = 0, v = 0)
points(p, cex = 2, pch = 20)
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.05, npoints = 250, col = "darkred")
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.2, npoints = 250, col = "red")
p
##           [,1]      [,2]
## [1,] -1.116858 -1.196803
where the mean and covariance matrix are given in the R code above. The point p has coordinates $y_1 = -1.117$ and $y_2 = -1.197$.
To draw the connection to regression, I plot the point p in a different coordinate system:
plot(c(1, 2), p, xlim = c(0, 3), ylim = c(-2, 0),
     type = "p", pch = 20, cex = 2,
     xlab = "x", ylab = "y",
     main = "Fig. 2: Points in x-y-plane")
where the x-coordinates are simply the indices 1 and 2 of the two components and the y-coordinates are the drawn values $y_1$ and $y_2$. In our simple starting example I can draw a line connecting the two dots, much as a regression line would, to illustrate this for two dimensions. But you can perhaps imagine how I could go to higher-dimensional distributions and fill up any of the gaps before, after or between the two points.
The upshot of this is: every draw from the bivariate Gaussian can be read as a pair of function values at the two input locations $x = 1$ and $x = 2$. Let me draw a larger number of samples to make this clearer:
par(mfcol = c(1, 2))
N <- 15
set.seed(1906)
p <- rmvnorm(N, mean = rep(0, 2), sigma = K)
plot(p, type = "n", xlim = c(-5, 5), ylim = c(-5, 5),
     ylab = expression(y[2]), xlab = expression(y[1]),
     main = "Fig. 3a: Multiple draws from the Gaussian")
abline(h = 0, v = 0)
points(p, cex = 2, pch = 20)
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.05, npoints = 250, col = "darkred")
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.2, npoints = 250, col = "red")
plot(rep(1, N), p[, 1], xlim = c(0, 3), ylim = c(-3, 3),
     type = "p", pch = 20, cex = 2,
     xlab = "x", ylab = "y",
     main = "Fig. 3b: Multiple draws in x-y-plane")
points(rep(2, N), p[, 2], pch = 20, cex = 2)
for (i in 1:N) {
  segments(x0 = 1, y0 = p[i, 1], x1 = 2, y1 = p[i, 2], col = i, lwd = 1.5)
}
I have also drawn the line segments connecting the sampled values from the bivariate Gaussian. This illustrates nicely how a zero-mean Gaussian distribution with a simple covariance matrix defines random lines in the right-hand plot. A multivariate Gaussian is like a probability distribution over (finitely many) values of a function.
For now we only have two points on the right, but by going from the bivariate to a higher-dimensional Gaussian we can add more and more points along the x-axis and thereby describe the function at more and more locations.
Another instructive view on this is when I introduce measurement errors or noise into the equation:

$$y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2).$$

With this, my model very much looks like a non-parametric or non-linear regression model with some unknown function $f(x)$.
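To make this concrete, here is a minimal sketch of such a noise model, assuming the standard additive Gaussian noise formulation; the noise level sigma below is a hypothetical value chosen purely for illustration and does not appear in the original post.

# Draw latent function values f at the two inputs and add independent
# Gaussian measurement noise to obtain the observed values y.
library(mvtnorm)

K <- matrix(c(1, 0.9093729, 0.9093729, 1), ncol = 2)   # same covariance as above
sigma <- 0.3                                           # hypothetical noise sd

set.seed(1906)
f <- as.vector(rmvnorm(1, mean = rep(0, 2), sigma = K))  # latent function values
y <- f + rnorm(2, mean = 0, sd = sigma)                  # noisy observations
rbind(f = f, y = y)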
Updating the prior
If the Gaussian distribution that we started with is nothing but a prior belief about the shape of a function, then we can update this belief in light of the data.
There is a nice way to illustrate how learning from data actually works in this setting. Say we get to learn the value of the first coordinate, $y_1$. Since $y_1$ and $y_2$ are correlated, knowing $y_1$ should also tell us something about the likely values of $y_2$.
We can try to confirm this intuition using the fact that if

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix} \right)$$

is the Gaussian we started with, then the conditional distribution of $y_2$ given $y_1$ is again Gaussian (see here):

$$y_2 \mid y_1 \sim N\!\left( K_{21} K_{11}^{-1}\, y_1,\; K_{22} - K_{21} K_{11}^{-1} K_{12} \right).$$

In our bivariate case with unit variances this simplifies to a normal over $y_2$ with mean $K_{12}\, y_1$ and variance $1 - K_{12}^2$.
One noteworthy feature of the conditional distribution of $y_2$ given $y_1$ is that its variance, $1 - K_{12}^2$, is smaller than the prior variance of 1 and does not depend on the observed value $y_1$ itself: observing $y_1$ always reduces our uncertainty about $y_2$ by the same amount.
Let’s update Figure 3:
par(mfcol = c(1, 2))
set.seed(1906)
y1.actual <- 1.1
y2.hat <- 0 + y1.actual * K[1, 2]
y2.var <- (1 - K[1, 2]^2)
p2 <- data.frame(y1 = rep(y1.actual, N),
                 y2 = rnorm(N, mean = y2.hat, sd = sqrt(y2.var)))
plot(p2, type = "n", xlim = c(-5, 5), ylim = c(-5, 5),
     ylab = expression(y[2]), xlab = expression(y[1]),
     main = "Fig. 3a: Multiple draws from the two-dimensional Gaussian")
abline(h = 0, v = 0)
abline(v = y1.actual, lty = 2, col = "grey", lwd = 1.5)
points(p, cex = 2, pch = 20, col = "lightgrey")
points(p2, cex = 2, pch = 20)
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.05, npoints = 250, col = "darkred")
ellipse(mu = rep(0, 2), sigma = K, alpha = 0.2, npoints = 250, col = "red")
plot(rep(1, N), p2[, 1], xlim = c(0, 3), ylim = c(-3, 3),
     type = "n", pch = 20, cex = 2,
     xlab = "x", ylab = "y",
     main = "Fig. 3b: Multiple draws in x-y-plane")
points(rep(1, N), p[, 1], pch = 20, cex = 2, col = "lightgrey")
points(rep(2, N), p[, 2], pch = 20, cex = 2, col = "lightgrey")
for (i in 1:N) {
  segments(x0 = 1, y0 = p[i, 1], x1 = 2, y1 = p[i, 2],
           col = "lightgrey", lwd = 1.5, lty = 2)
}
for (i in 1:N) {
  segments(x0 = 1, y0 = p2[i, 1], x1 = 2, y1 = p2[i, 2], col = i, lwd = 1.5)
}
points(1, y1.actual, pch = 20, cex = 2)
points(rep(2, N), p2[, 2], pch = 20, cex = 2)
The conditional distribution is considerably more pointed, and the right-hand plot shows how this helps to narrow down the likely values of $y_2$.
The upshot here is: there is a straightforward way to update the prior GP and obtain simple expressions for the predictive distribution of points not in our training sample.
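As a sketch of what this update looks like in code (my own helper function, not from the original post): it is the same Gaussian conditioning formula as above, written for an arbitrary split of the components into observed and unobserved points.

# Conditional mean and covariance of the unobserved components given the
# observed ones, for a zero-mean Gaussian with covariance matrix K.
gp_condition <- function(K, y_obs, idx_obs) {
  idx_new <- setdiff(seq_len(nrow(K)), idx_obs)
  K11 <- K[idx_obs, idx_obs, drop = FALSE]
  K21 <- K[idx_new, idx_obs, drop = FALSE]
  K22 <- K[idx_new, idx_new, drop = FALSE]
  A <- K21 %*% solve(K11)
  list(mean = as.vector(A %*% y_obs),
       cov  = K22 - A %*% t(K21))
}

# Reproduces the bivariate example: conditioning on y1 = 1.1 gives
# mean 1.1 * K[1, 2] and variance 1 - K[1, 2]^2.
K <- matrix(c(1, 0.9093729, 0.9093729, 1), ncol = 2)
gp_condition(K, y_obs = 1.1, idx_obs = 1)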
Higher dimensions
The connection to non-linear regression becomes more apparent if we move from a bivariate Gaussian to a higher-dimensional distribution. With more than two dimensions I cannot draw the underlying contours of the Gaussian anymore, but I can continue to plot the result in the x-y-plane:
# n and K are not defined in the code shown so far; the following definition
# (a squared exponential kernel on x = 1:6 with bandwidth 0.095) reproduces
# the covariance matrix printed below.
n <- 6
x <- 1:n
K <- outer(x, x, function(a, b) exp(-0.095 * (a - b)^2))

p <- rmvnorm(N, mean = rep(0, n), sigma = K)
plot(1, type = "n", xlim = c(0, 7), ylim = c(-3, 5), xlab = "x", ylab = "y")
abline(h = 0, v = 0)
for (i in 1:n) {
  points(rep(i, N), p[, i], pch = 20, cex = 2)
}
for (i in 1:N) {
  for (j in 1:(n - 1)) {
    segments(x0 = j, y0 = p[i, j], x1 = j + 1, y1 = p[i, j + 1], col = i, lwd = 1.5)
  }
}
where again the mean of the Gaussian is zero and now the covariance matrix is
K
##            [,1]      [,2]      [,3]      [,4]      [,5]       [,6]
## [1,] 1.00000000 0.9093729 0.6838614 0.4252832 0.2187119 0.09301449
## [2,] 0.90937293 1.0000000 0.9093729 0.6838614 0.4252832 0.21871189
## [3,] 0.68386141 0.9093729 1.0000000 0.9093729 0.6838614 0.42528319
## [4,] 0.42528319 0.6838614 0.9093729 1.0000000 0.9093729 0.68386141
## [5,] 0.21871189 0.4252832 0.6838614 0.9093729 1.0000000 0.90937293
## [6,] 0.09301449 0.2187119 0.4252832 0.6838614 0.9093729 1.00000000
Having added more points confirms our intuition that a Gaussian process is like a probability distribution over functions. It also seems that if we added more and more points, the lines would become smoother and smoother.
And in fact for the most common specification of Gaussian processes this will be the case, i.e. the GP prior will imply a smooth function.
Covariance function
Drawing more points into the plots was easy for me because I had the mean and the covariance matrix given, but how exactly did I choose them? Randomly? I will give you the details below, but it should be clear that when we want to define a Gaussian process over an arbitrary (but finite) number of points, we need some systematic way that gives us a covariance matrix and the vector of means. The former is usually denoted as $K$ and is generated by a covariance function (or kernel) $k$, with entries $K_{ij} = k(x_i, x_j)$; the vector of means is very often simply set to zero. With such a kernel function we can compute the covariance between the function values at any two input points. Hence, we see one way we can model our prior beliefs: if we have a formula that returns covariance matrices that generate this pattern, we are able to postulate a prior belief for an arbitrary (finite) number of points.
Squared exponential kernel
The formula I used to generate the $ij$th element of the covariance matrix of the process was the squared exponential kernel

$$K_{ij} = k(x_i, x_j) = \exp\!\left(-\rho\,(x_i - x_j)^2\right),$$

evaluated at the input points $x_i = i$. More generally, one may write this covariance function in terms of hyperparameters, for example

$$k(x, x') = \eta^2 \exp\!\left(-\rho\,(x - x')^2\right),$$

where $\eta^2$ sets the marginal variance of the function values and $\rho$ controls how quickly the correlation decays with the squared distance between the inputs. Because the covariance falls off smoothly as the inputs move apart, function values at nearby inputs are strongly correlated, which is exactly what produces the smooth-looking draws in the plots above.
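For reference, here is a small R implementation of this kernel (my own helper; the default bandwidth below is only chosen so that the function reproduces the covariance matrices used earlier in this post, since the original hyperparameter values are not shown).

# Squared exponential covariance function k(x, x') = eta^2 * exp(-rho * (x - x')^2),
# returning the full covariance matrix for two vectors of input points.
se_kernel <- function(x1, x2, eta = 1, rho = 0.095) {
  eta^2 * outer(x1, x2, function(a, b) exp(-rho * (a - b)^2))
}

se_kernel(1:2, 1:2)              # the 2 x 2 matrix used in the bivariate examples
round(se_kernel(1:6, 1:6), 7)    # the 6 x 6 matrix shown above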
The squared exponential kernel is apparently the most common functional form for the covariance function in applied work, but it may still seem like a very ad hoc assumption about the covariance structure. It turns out, however, that the squared exponential kernel can be derived from a linear model of basis functions of $x$.
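Here is a rough numerical check of that claim, as a sketch under my own assumptions (Gaussian-shaped basis functions on a dense grid with i.i.d. standard normal weights; neither the grid nor the width lambda come from the original post): the covariance implied by such a linear model is, up to scaling, a squared exponential in the distance between the inputs.

# A linear model f(x) = sum_i w_i * phi_i(x) with i.i.d. N(0, 1) weights has
# cov(f(x), f(x')) = sum_i phi_i(x) * phi_i(x'). With many Gaussian bumps on a
# dense grid this sum is (approximately) a squared exponential in x - x'.
centers <- seq(-10, 10, by = 0.1)   # centres of the Gaussian basis functions
lambda  <- 1                        # width of each bump (assumed value)
phi <- function(x) exp(-(x - centers)^2 / (2 * lambda^2))

cov_basis <- function(x1, x2) sum(phi(x1) * phi(x2))

d <- seq(0, 3, by = 0.5)
implied     <- sapply(d, function(h) cov_basis(0, h)) / cov_basis(0, 0)  # normalised
squared_exp <- exp(-d^2 / (4 * lambda^2))   # SE kernel with length-scale lambda * sqrt(2)
round(rbind(implied, squared_exp), 4)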
Up next
Now that I have a rough idea of what Gaussian process regression is and how it can be used to do non-linear regression, the question is how to make it operational. In terms of the Bayesian paradigm, we would like to learn which function values and which hyperparameters of the covariance function are plausible in light of the observed data.
Fitting a GP to data will be the topic of the next post on Gaussian processes.