Gaussian Processes

Gaussian processes are Bayesian, nonparametric models for learning unknown functions. Instead of choosing a fixed functional form such as a line or polynomial, a Gaussian process defines a probability distribution over functions and then updates that distribution after observing data.

They are most commonly used for regression, but they can also be used for classification by combining the latent Gaussian process with a non-Gaussian likelihood, such as a Bernoulli likelihood. Their main appeal is that they produce both a prediction and an uncertainty estimate at each test point. Their main limitation is computational cost, because exact Gaussian process regression requires solving linear systems involving an $n \times n$ covariance matrix, where $n$ is the number of training points, which takes $O(n^3)$ time with a standard factorization.

Intuition

A Gaussian process says: before seeing data, many functions are possible. After observing training data, we restrict attention to the functions that are consistent with those observations. Where the data are dense, the posterior is confident; where the data are sparse, the posterior remains uncertain.

Background

Multivariate Gaussian Distributions

As the name suggests, the Gaussian distribution is the basic building block of Gaussian processes. In particular, we are interested in the multivariate Gaussian distribution, where each random variable is Normally distributed and the joint distribution is also Gaussian.

The multivariate Gaussian distribution is defined by a mean vector $\mu$ and a covariance matrix $\Sigma$:

  1. The mean vector $\mu$ describes the expected value of the distribution, with each component $\mu_i$ describing the mean of the corresponding dimension.
  2. The covariance matrix $\Sigma$ describes the variance along each dimension and the covariance between dimensions. The diagonal entry $\Sigma_{ii}$ is the variance of the $i$th random variable, and the off-diagonal entry $\Sigma_{ij}$ is the covariance between the $i$th and $j$th random variables.
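
To make the roles of the mean vector and covariance matrix concrete, the following is a minimal sketch that draws samples from a two-dimensional Gaussian and compares the empirical moments with the parameters. The specific numbers in `mu` and `Sigma` are illustrative assumptions, not values from this text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed for this example only).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Draw many samples; each row is one draw of the 2-D random vector.
samples = rng.multivariate_normal(mu, Sigma, size=50_000)

# Empirical mean approximates mu; empirical covariance approximates Sigma.
emp_mean = samples.mean(axis=0)
emp_cov = np.cov(samples, rowvar=False)

print("empirical mean:", emp_mean)
print("empirical covariance:\n", emp_cov)
```

The diagonal of `emp_cov` estimates the per-dimension variances, and the off-diagonal entry estimates the covariance between the two dimensions.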

Marginalization and Conditioning

Gaussian distributions have the useful algebraic property of being closed under marginalization and conditioning: the resulting distributions are also Gaussian. Both operations work on subsets of a joint distribution, so we use the following notation:

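$$ z = \begin{bmatrix} z_A \\ z_B \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} \right) $$

where $z_A$ and $z_B$ represent subsets of the original random variables.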

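Through marginalization, we can extract partial information from a multivariate probability distribution:

$$ z_A \sim \mathcal{N}(\mu_A, \Sigma_{AA}), \qquad z_B \sim \mathcal{N}(\mu_B, \Sigma_{BB}) $$

Interpretation

Each partition $z_A$ and $z_B$ depends only on its corresponding entries in $\mu$ and $\Sigma$.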

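Conditioning is used to describe the distribution of one subset after another subset has been observed. Similar to marginalization, this operation yields a modified Gaussian distribution:

$$ z_A \mid z_B \sim \mathcal{N}\left( \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (z_B - \mu_B),\ \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA} \right) $$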

Interpretation

Conditioning can be viewed as making a cut through the multivariate distribution. After observing one subset, the remaining subset still has a Gaussian distribution, but its mean and covariance have changed.
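
Marginalization and conditioning can be sketched in a few lines of NumPy. The helper names `marginal_gaussian` and `condition_gaussian`, the index-list interface, and the example numbers are assumptions made for illustration. The conditioning step uses the standard Gaussian conditioning formula.

```python
import numpy as np

def marginal_gaussian(mu, Sigma, idx):
    """Marginal of the variables in idx: pick the matching entries."""
    return mu[idx], Sigma[np.ix_(idx, idx)]

def condition_gaussian(mu, Sigma, idx_a, idx_b, observed_b):
    """Mean and covariance of the variables in idx_a given idx_b = observed_b."""
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    # gain = S_ab @ inv(S_bb), computed with a solve instead of an explicit inverse.
    gain = np.linalg.solve(S_bb, S_ab.T).T
    cond_mean = mu_a + gain @ (observed_b - mu_b)
    cond_cov = S_aa - gain @ S_ab.T
    return cond_mean, cond_cov

# Illustrative 2-D example (numbers assumed).
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

marg_mean, marg_cov = marginal_gaussian(mu, Sigma, [0])
cond_mean, cond_cov = condition_gaussian(mu, Sigma, [0], [1], np.array([3.0]))
print(marg_mean, marg_cov)
print(cond_mean, cond_cov)
```

In this example the conditional variance of the first variable is smaller than its marginal variance, which reflects the information gained from observing the second variable.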

Gaussian Processes

A Gaussian process is a collection of random variables such that every finite subset has a joint multivariate Gaussian distribution. In function-learning problems, the random variables are function values. For example, $f(x_1), f(x_2), \ldots, f(x_n)$ are treated as jointly Gaussian random variables.

Definition

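A Gaussian process is written

$$ f(x) \sim \mathcal{GP}\left( m(x),\ k(x, x') \right) $$

where $m(x)$ is the mean function and $k(x, x')$ is the covariance function. This means that for any finite set of inputs $x_1, \ldots, x_n$,

$$ \begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}(\mu, K) $$

where $\mu_i = m(x_i)$ and $K_{ij} = k(x_i, x_j)$.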

The goal is not to estimate a single fixed function directly. Instead, we predict function values at concrete test points by conditioning on observed training data. The key idea is to model the function values at the training points together with the function values at the test points as one multivariate Gaussian distribution.

Suppose the training inputs are $X = (x_1, \ldots, x_n)$ with observed responses $y = (y_1, \ldots, y_n)$, and the test inputs are $X_*$ with unknown function values $f_* = f(X_*)$. Gaussian process regression forms the joint distribution of the observed values $y$ and the unknown values $f_*$. Conditioning on $y$ gives the posterior predictive distribution for $f_*$.

This is a Bayesian inference problem. The prior describes which functions are plausible before seeing data, and the posterior describes which functions remain plausible after seeing data.

Intuition

If we evaluate the unknown function at $m$ test points, the Gaussian process gives an $m$-dimensional Gaussian distribution over those possible function values. A sample from this distribution is one possible curve evaluated at the test points.

Kernels

How do we set up this distribution and define the mean vector $\mu$ and covariance matrix $K$?

In Gaussian processes, it is common to assume the mean function is zero:

$$ m(x) = 0 $$

This simplifies the conditioning equations. This is less restrictive than it may first seem: if the data have a nonzero trend, we can subtract a mean function before fitting the Gaussian process and add it back after prediction.

The main modeling choice is the covariance matrix. We generate it by evaluating a kernel $k$, also called a covariance function, pairwise on all inputs:

$$ K_{ij} = k(x_i, x_j) $$

The entry $K_{ij}$ describes how strongly the function value at $x_i$ is related to the function value at $x_j$. If $x_i$ is close to $x_j$, then $k(x_i, x_j)$ is often large, so nearby function values are strongly correlated. If $x_i$ is far from $x_j$, then $k(x_i, x_j)$ may be close to zero, so the two points have little influence on each other.

There are a few common kernel functions with various properties:

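  1. RBF kernel:

$$ k(x, x') = \sigma_f^2 \exp\left( -\frac{\lVert x - x' \rVert^2}{2 \ell^2} \right) $$

This produces smooth functions. The length-scale $\ell$ controls how quickly the function can change, and $\sigma_f^2$ controls the vertical scale.

  2. Periodic kernel:

$$ k(x, x') = \sigma_f^2 \exp\left( -\frac{2 \sin^2\left( \pi \lvert x - x' \rvert / p \right)}{\ell^2} \right) $$

This produces repeating patterns with period $p$.

  3. Linear kernel:

$$ k(x, x') = \sigma_b^2 + \sigma_v^2 \, x^\top x' $$

This produces functions that behave like Bayesian linear regression, with $\sigma_b^2$ and $\sigma_v^2$ acting as prior variances on the intercept and slope.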

Interpretation

The kernel encodes assumptions about the shape of the unknown function. Smooth kernels prefer smooth functions, periodic kernels prefer repeating functions, and linear kernels prefer linear functions.
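
The three kernels can be sketched for one-dimensional inputs as follows. The function names, parameter names, and default values are assumptions for illustration, and the formulas are the common textbook parameterizations rather than necessarily the exact forms intended here.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def periodic_kernel(x1, x2, period=1.0, length_scale=1.0, variance=1.0):
    """Periodic kernel: covariance repeats every `period` units."""
    d = np.abs(x1[:, None] - x2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale ** 2)

def linear_kernel(x1, x2, bias_variance=0.0, slope_variance=1.0):
    """Linear kernel: corresponds to a Gaussian prior on intercept and slope."""
    return bias_variance + slope_variance * x1[:, None] * x2[None, :]

# Evaluate each kernel pairwise on a small grid of inputs.
x = np.linspace(0.0, 2.0, 5)
K_rbf = rbf_kernel(x, x)
K_per = periodic_kernel(x, x, period=1.0)
K_lin = linear_kernel(x, x)

print(np.round(K_rbf, 3))
print(np.round(K_per, 3))
print(np.round(K_lin, 3))
```

In `K_rbf` the entries decay as inputs move apart, whereas in `K_per` inputs separated by exactly one period are as strongly correlated as an input is with itself.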

Prior

Gaussian processes define a probability distribution over possible functions: each sample from the corresponding multivariate Gaussian represents one realization of the function values at the chosen inputs.

First consider the case where we have not observed any training data. In the context of Bayesian inference, this is the prior distribution. For test inputs $X_*$, the prior over the unknown function values $f_*$ is

$$ f_* \sim \mathcal{N}(\mu_*, K_{**}) $$

where $\mu_*$ is the mean vector evaluated at the test inputs and $K_{**}$ is the covariance matrix obtained by evaluating the kernel on all pairs of test inputs.

Intuition

Before seeing data, the prior lets us sample possible functions. These functions look different depending on the kernel, because the kernel controls how nearby values move together.
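
Sampling from the prior can be sketched as below, assuming a zero mean function, an RBF kernel, and one-dimensional inputs. The grid of test inputs, the number of samples, and the small diagonal "jitter" added for numerical stability are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel on 1-D inputs (assumed form)."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

rng = np.random.default_rng(0)

# Test inputs at which the sampled functions are evaluated.
x_test = np.linspace(-5.0, 5.0, 50)
K_ss = rbf_kernel(x_test, x_test)

# Jitter keeps the Cholesky factorization numerically stable.
jitter = 1e-6 * np.eye(len(x_test))
L = np.linalg.cholesky(K_ss + jitter)

# If u ~ N(0, I), then L @ u ~ N(0, K_ss + jitter).
n_samples = 3
prior_samples = (L @ rng.standard_normal((len(x_test), n_samples))).T

print(prior_samples.shape)  # one row per sampled function
```

Each row of `prior_samples` is one possible function evaluated on the grid. Changing `length_scale` or swapping in a different kernel changes how wiggly or structured those rows look.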

Posterior Predictive Distribution

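If we observe training data, we can incorporate it into the model to obtain a posterior predictive distribution. To do so, we form the joint distribution between the observed training values $y$ and the unknown test values $f_*$, here with a zero mean function:

$$ \begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K & K_*^\top \\ K_* & K_{**} \end{bmatrix} \right) $$

Here:

  1. $K = k(X, X)$ is the covariance matrix between training inputs.
  2. $K_* = k(X_*, X)$ is the covariance matrix between test inputs and training inputs.
  3. $K_{**} = k(X_*, X_*)$ is the covariance matrix between test inputs.

Using conditioning, the posterior predictive distribution is

$$ f_* \mid X, y, X_* \sim \mathcal{N}\left( K_* K^{-1} y,\ K_{**} - K_* K^{-1} K_*^\top \right) $$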

Intuition

The training points constrain the set of plausible functions. Without observation noise, the posterior functions pass exactly through the training points. With observation noise, they are encouraged to pass near the training points instead.
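
The noise-free case can be checked numerically with a small sketch. It assumes a zero mean function, an RBF kernel, and a toy data set generated from `sin`; all of these, along with the variable names, are illustrative choices. A tiny jitter is added to the training covariance purely for numerical stability.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel on 1-D inputs (assumed form)."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

# Toy training data, treated as exact function values.
x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)

# Test inputs include the training inputs (indices 0, 2, 3) plus new points.
x_test = np.array([-2.0, -1.0, 0.0, 1.5, 3.0])

K = rbf_kernel(x_train, x_train) + 1e-10 * np.eye(len(x_train))
K_s = rbf_kernel(x_test, x_train)    # test-by-train block
K_ss = rbf_kernel(x_test, x_test)    # test-by-test block

# Gaussian conditioning with zero prior mean.
post_mean = K_s @ np.linalg.solve(K, y_train)
post_cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
post_var = np.diag(post_cov)

print(np.round(post_mean, 4))
print(np.round(post_var, 4))
```

At the test inputs that coincide with training inputs, the posterior mean reproduces the observed values and the posterior variance collapses to approximately zero. Away from the data, the variance grows back toward the prior variance.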

Incorporating Random Noise

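Up until now, we have treated the training values as perfect measurements of the unknown function, which is often unrealistic. Gaussian processes handle noisy observations by adding an error term to each training point:

$$ y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2) $$

This gives the joint distribution

$$ \begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & K_*^\top \\ K_* & K_{**} \end{bmatrix} \right) $$

where $K$, $K_*$, and $K_{**}$ are the train-train, test-train, and test-test covariance matrices from the noise-free case. Again, we use conditioning to derive the posterior predictive distribution:

$$ f_* \mid X, y, X_* \sim \mathcal{N}\left( K_* (K + \sigma_n^2 I)^{-1} y,\ K_{**} - K_* (K + \sigma_n^2 I)^{-1} K_*^\top \right) $$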

Interpretation

Adding $\sigma_n^2 I$ increases the variance assigned to the training observations. This tells the model not to treat every observed value as exact truth.
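
A common way to compute the noisy posterior is through a Cholesky factorization of the noisy training covariance, which avoids forming an explicit inverse. The sketch below assumes a zero mean function, an RBF kernel, and one-dimensional inputs. The function name `gp_predict`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel on 1-D inputs (assumed form)."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=0.01):
    """Posterior mean and covariance of latent test values under Gaussian noise."""
    K = rbf_kernel(x_train, x_train)
    K_s = rbf_kernel(x_test, x_train)    # test-by-train block
    K_ss = rbf_kernel(x_test, x_test)    # test-by-test block

    # Factor the noisy training covariance: K + noise_var * I = L @ L.T
    L = np.linalg.cholesky(K + noise_var * np.eye(len(x_train)))

    # alpha = (K + noise_var * I)^{-1} y via two triangular systems.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha

    # v = L^{-1} K_s^T, so v.T @ v = K_s (K + noise_var * I)^{-1} K_s^T.
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss - v.T @ v
    return mean, cov

# Toy noisy data (assumed for illustration).
rng = np.random.default_rng(0)
x_train = np.linspace(-3.0, 3.0, 8)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(len(x_train))
x_test = np.linspace(-3.0, 3.0, 5)

mean, cov = gp_predict(x_train, y_train, x_test, noise_var=0.01)
print(np.round(mean, 3))
print(np.round(np.diag(cov), 4))
```

Unlike the noise-free case, the posterior variance of the latent function stays strictly positive even at training inputs, and the posterior mean is no longer forced to pass exactly through the observations.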

Predictive Uncertainty

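Since the posterior predictive distribution is a full probability distribution, a Gaussian process gives more than a single point prediction. For each test input $x_{*,i}$, we can marginalize the posterior predictive distribution to get a one-dimensional Normal distribution:

$$ f(x_{*,i}) \mid X, y \sim \mathcal{N}\left( \hat{\mu}_i,\ \hat{\sigma}_i^2 \right) $$

where $\hat{\mu}_i$ is the $i$th entry of the posterior mean vector and $\hat{\sigma}_i^2$ is the $i$th diagonal entry of the posterior covariance matrix.

The posterior mean $\hat{\mu}_i$ is usually used as the point prediction, and the posterior standard deviation $\hat{\sigma}_i$ measures uncertainty at that test point. Since each marginal predictive distribution is Normal, we can use the usual Normal quantiles to form uncertainty intervals. For example, an approximate 95% posterior interval for the latent function value is

$$ \hat{\mu}_i \pm 1.96 \, \hat{\sigma}_i $$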

Warning

In a Bayesian Gaussian process, this interval is usually called a posterior interval or credible interval, not a classical confidence interval. The calculation looks similar to a Normal-based confidence interval because the marginal predictive distribution is Normal, but the interpretation is Bayesian: conditional on the model and observed data, the function value is treated as random.

Intuition

Extracting $\hat{\mu}_i$ and $\hat{\sigma}_i$ (the posterior mean and standard deviation at the $i$th test point) helps because, after marginalization, each test point is just an ordinary one-dimensional Normal distribution. Once we know its mean and standard deviation, we can use familiar Normal-distribution tools to say how spread out the prediction is.

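If we want uncertainty for a future noisy observation $y_{*,i}$ rather than the latent function value, we include the observation noise variance $\sigma_n^2$ as well. With posterior mean $\hat{\mu}_i$ and posterior variance $\hat{\sigma}_i^2$ at the $i$th test point:

$$ y_{*,i} \mid X, y \sim \mathcal{N}\left( \hat{\mu}_i,\ \hat{\sigma}_i^2 + \sigma_n^2 \right), \qquad \hat{\mu}_i \pm 1.96 \sqrt{\hat{\sigma}_i^2 + \sigma_n^2} $$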

This predictive interval is wider because it includes both uncertainty about the underlying function and noise in the future measurement.
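
The two kinds of interval can be compared with a short sketch. The posterior means, posterior variances, and noise variance below are made-up numbers standing in for the output of a fitted Gaussian process. In practice they would come from the posterior mean vector and the diagonal of the posterior covariance matrix.

```python
import numpy as np

# Hypothetical posterior summaries at three test points (assumed values).
post_mean = np.array([0.2, 0.9, 0.4])
post_var = np.array([0.01, 0.25, 0.04])
noise_var = 0.09

z = 1.96  # approximate 97.5% quantile of the standard Normal

# Interval for the latent function value.
latent_sd = np.sqrt(post_var)
latent_lo = post_mean - z * latent_sd
latent_hi = post_mean + z * latent_sd

# Interval for a future noisy observation: add the noise variance first.
obs_sd = np.sqrt(post_var + noise_var)
obs_lo = post_mean - z * obs_sd
obs_hi = post_mean + z * obs_sd

for i in range(len(post_mean)):
    print(f"point {i}: latent [{latent_lo[i]:.3f}, {latent_hi[i]:.3f}], "
          f"observation [{obs_lo[i]:.3f}, {obs_hi[i]:.3f}]")
```

Variances add, not standard deviations, so the noise is combined with the posterior variance before taking the square root.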

Sources and Extended Readings