gaussian processes for regression

20 June 2017

these notes follow chapter 2 of the gp book (Rasmussen & Williams, 2006) and the tutorial (Rasmussen, 2004).

let \((x_n, y_n)_{n = 1}^N, x_n \in \mathcal X, y_n \in \mathbb R\) be data. our goal is to predict \(y_\ast\) for a new input \(x_\ast \in \mathcal X\) (possibly predict multiple outputs given multiple inputs).

we summarize two views. the weight-space view develops gp regression by extending bayesian linear regression. the function-space view develops the same by viewing gps as distribution over functions.

referring to zoubin’s cube below, the weight-space view starts from bayesian linear regression and follows the downward arrow to obtain gp regression. the function-space view, on the other hand, directly tackles gp regression. the latter is the real deal.

weight-space view

bayesian linear regression

define

feature mapping \(\phi: \mathcal X \to \mathbb R^D\),
features \(\phi_n := \phi(x_n), n = 1, \dotsc, N\),
\(\phi_\ast := \phi(x_\ast)\),
feature matrix \(\Phi := [\phi_1, \dotsc, \phi_N]^T \in \mathbb R^{N \times D}\),
inputs \(x := \{x_1, \dotsc, x_N\}\),
outputs \(y := [y_1, \dotsc, y_N]^T \in \mathbb R^D\),
weights \(w \in \mathbb R^D\).

let the generative model be \begin{align} p(w) &= \mathrm{Normal}(w; 0, \Sigma), \\
p(y \given x, w) &= \mathrm{Normal}(y; \Phi w, \sigma^2 I), \end{align} where \(0\) is a zero vector and \(I\) is an identity matrix of suitable dimensions, \(\Sigma \in \mathbb R^{D \times D}\) is the prior covariance matrix, \(\sigma^2 > 0\) is the observation noise variance.

the posterior over weights is (via completing the square) \begin{align} p(w \given x, y) &= \frac{p(y \given x, w) p(w)}{p(y \given x)} \\
&= \mathrm{Normal}(w; m, S) \\
m &= S \cdot \left(\frac{1}{\sigma^2} \Phi^T y\right) \\
S &= \left(\Sigma^{-1} + \frac{1}{\sigma^2} \Phi^T \Phi\right)^{-1}. \end{align}

the posterior predictive is (via equation 5 of michael osborne’s note) \begin{align} p(y_\ast \given x_\ast, x, y) &= \int p(y_\ast \given x_\ast) p(w \given x, y) \,\mathrm dw \\
&= \mathrm{Normal}(y_\ast \given \phi_\ast^T m, \sigma^2 + \phi_\ast^T S \phi_\ast). \end{align}

the marginal likelihood is (via equation 5 of michael osborne’s note) \begin{align} p(y \given x) &= \int p(y \given x, w) p(w) \,\mathrm dw \\
&= \mathrm{Normal}(y \given 0, \sigma^2 I + \Phi \Sigma \Phi^T) \\
&= (2\pi)^{-N/2} \cdot \mathrm{det}(\sigma^2 I + \Phi \Sigma \Phi^T)^{-1/2} \cdot \exp\left(-\frac{1}{2} y^T (\sigma^2 I + \Phi \Sigma \Phi^T)^{-1} y \right). \label{eq:gp-regression/bayesian-linear-regression-marginal-likelihood} \end{align}

bayesian linear regression to gp regression

we can rewrite the posterior predictive (using matrix inversion lemmas described on page 12 of (Rasmussen & Williams, 2006)) \begin{align} p(y_\ast \given x_\ast, x, y) &= \mathrm{Normal}(y_\ast \given \phi_\ast^T m, \sigma^2 + \phi_\ast^T S \phi_\ast) \\
&= \mathrm{Normal}(y_\ast \given \color{red}{\phi_\ast^T \Sigma \Phi^T} (\color{blue}{\Phi \Sigma \Phi^T} + \sigma^2 I)^{-1} y, \color{green}{\phi_\ast^T \Sigma \phi_\ast} - \color{red}{\phi_\ast^T \Sigma \Phi^T} (\color{blue}{\Phi \Sigma \Phi^T} + \sigma^2 I)^{-1} \color{red}{\Phi \Sigma \phi_\ast} ), \\
&= \mathrm{Normal}(y_\ast \given \color{red}{k_\ast^T} (\color{blue}{K} + \sigma^2 I)^{-1} y, \color{green}{k_{\ast \ast}} - \color{red}{k_\ast^T} (\color{blue}{K} + \sigma^2 I)^{-1} \color{red}{k_\ast} ), \label{eq:gp-regression/bayesian-linear-regression-prediction} \\
\color{blue}{K} &:= \color{blue}{\Phi \Sigma \Phi^T}, \\
\color{red}{k_\ast} &:= \color{red}{\Phi \Sigma \phi_\ast}, \\
\color{green}{k_{\ast \ast}} &:= \color{green}{\phi_\ast^T \Sigma \phi_\ast}. \end{align}

similarly, we can rewrite the marginal likelihood \begin{align} p(y \given x) &= (2\pi)^{-N/2} \cdot \mathrm{det}(\sigma^2 I + \color{blue}{K})^{-1/2} \cdot \exp\left(-\frac{1}{2} y^T (\sigma^2 I + \color{blue}{K})^{-1} y \right). \end{align}

define a kernel (or covariance) function \(k: \mathcal X \times \mathcal X \to \mathbb R\): \begin{align} k(x, x’) := \phi(x)^T \Sigma \phi(x’), \forall x, x’ \in \mathcal X. \end{align} hence, if we can evaluate \(k\), we can obtain \(\color{blue}{K}, \color{red}{k_\ast}, \color{green}{k_{\ast \ast}}\) without having to do the matrix multiplications. this is exactly what we do in zero-mean gp regression: we specify a kernel function \(k\) using which we can do prediction and evaluate marginal likelihood. see kernel literature for why choosing a positive semidefinite kernel function \(k\) implies existence of feature map \(\phi\). for example, it can be shown that the exponentiated-square kernel function corresponds to bayesian linear regression model with an infinite number of basis functions. using \(k\) instead of performing matrix multiplications is sometimes called kernel trick.

function-space view

this section heavily follows (Rasmussen, 2004).

gaussian process

definition (gaussian process). a gaussian process is a collection of random variables, any finite number of which have consistent joint normal distributions.

the collection of random variables is indexed by \(\mathcal X\) which can be arbitrary. a random variable in this collection indexed by \(x \in \mathcal X\) is denoted \(f(x)\) (instead of \(f_x\)).

consider an arbitrary, finite number \(N\), of random variables from the gaussian process, \((f(x_1), \dotsc, f(x_N)), x_n \in \mathcal X\) (note that the indices \(n\) don’t imply any ordering other than notational convenience. \(x_n\) are the indices of the gaussian process!). by definition, \((f(x_1), \dotsc, f(x_N))\) has a multivariate normal distribution, say \(\mathrm{Normal}(\mu, \Sigma)\).

the consistent part means that any subset of \((f(x_1), \dotsc, f(x_N))\) has a multivariate normal distribution \(\mathrm{Normal}(\mu', \Sigma')\) where \(\mu', \Sigma'\) are the corresponding sub-entries of \(\mu, \Sigma\).

a gp is specified by its mean function \(m: \mathcal X \to \mathbb R\) and covariance function \(k: \mathcal X \times \mathcal X \to \mathbb R\). we consider a gp to be a distribution over functions and write \begin{align} f \sim \mathcal{GP}(m, k). \end{align}

the mean and covariance functions specify fully the distribution of \(f\) at any finite number \(N\) of indices \(x_n \in \mathcal X, n = 1, \dotsc, N\): \begin{align} (f(x_1), \dotsc, f(x_N)) \sim \mathrm{Normal}\left( \begin{bmatrix} m(x_1) \\
\vdots \\
m(x_N) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_N) \\
\vdots & \ddots & \vdots \\
k(x_N, x_1) & \dots & k(x_N, x_N) \end{bmatrix} \right). \end{align} we can also see (by inspection) that the consistency criterion is satisfied.

posterior gaussian process

if we condition on a finite number \(N\) random variables of the gp \(f\), \(f(x_1), \dotsc, f(x_N)\), we also obtain a gp which we call the posterior gp. to this end, let \begin{align} \vec f &:= [f(x_1), \dotsc, f(x_N)]^T \\
\vec f_\ast &:= [f(x_{1, \ast}), \dotsc, f(x_{M, \ast})]^T, \end{align} where \(\vec f\) are the random variables we condition on and \(\vec f_\ast\) are \(M\) random variables at indices \(x_{m, \ast} \in \mathcal X\).

we have \begin{align} \begin{bmatrix} \vec f \\
\vec f_\ast \end{bmatrix} &\sim \mathrm{Normal}\left( \begin{bmatrix} \vec \mu \\
\vec \mu_\ast \end{bmatrix}, \begin{bmatrix} \Sigma & \Sigma_\ast \\
\Sigma_\ast^T & \Sigma_{\ast\ast} \end{bmatrix} \right), \\
\vec \mu &:= [m(x_1), \dotsc, m(x_N)]^T, \\
\vec \mu_\ast &:= [m(x_{1, \ast}), \dotsc, m(x_{M, \ast})]^T, \\
\Sigma &:= \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_N) \\
\vdots & \ddots & \vdots \\
k(x_N, x_1) & \dots & k(x_N, x_N) \end{bmatrix}, \\
\Sigma_\ast &:= \begin{bmatrix} k(x_1, x_{1, \ast}) & \dots & k(x_1, x_{M, \ast}) \\
\vdots & \ddots & \vdots \\
k(x_N, x_{1, \ast}) & \dots & k(x_N, x_{M, \ast}) \end{bmatrix}, \\
\Sigma_{\ast\ast} &:= \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_{M, \ast}) \\
\vdots & \ddots & \vdots \\
k(x_{M, \ast}, x_1) & \dots & k(x_{M, \ast}, x_{M, \ast}) \end{bmatrix}. \end{align}

using equation 2 in michael osborne’s note, we obtain \begin{align} \vec f_\ast \given \vec f &\sim \mathrm{Normal}\left(\vec \mu_\ast + \Sigma_\ast^T \Sigma^{-1} (\vec f - \vec \mu), \Sigma_{\ast \ast} - \Sigma_\ast^T \Sigma^{-1} \Sigma_\ast\right). \label{eq:gp-regression/finite-posterior} \end{align}

by inspection, \begin{align} f \given f(x_1), \dotsc, f(x_N) &\sim \mathcal{GP}(m_{\text{posterior}}, k_{\text{posterior}}), \\
m_{\text{posterior}}(x) &:= m(x) + \Sigma_{\ast, x}^T \Sigma^{-1} (\vec f - \vec \mu), \\
k_{\text{posterior}}(x, x’) &:= k(x, x’) - \Sigma_{\ast, x}^T \Sigma^{-1} \Sigma_{\ast, x’}, \\
\Sigma_{\ast, x} &:= [k(x_1, x), \dotsc, k(x_N, x)]^T. \end{align} this is true by inspection in the sense that the distribution of \(M\) arbitrary random variables \(x_{1, \ast}, \dotsc, x_{M, \ast}\) from this process is a multivariate normal distribution given in \eqref{eq:gp-regression/finite-posterior}, i.e. \begin{align} \begin{bmatrix} m_{\text{posterior}}(x_{1, \ast}) \\
\vdots \\
m_{\text{posterior}}(x_{M, \ast}) \end{bmatrix} &= \vec \mu_\ast + \Sigma_\ast^T \Sigma^{-1} (\vec f - \vec \mu), \\
\begin{bmatrix} k_{\text{posterior}}(x_{1, \ast}, x_{1, \ast}) & \dots & k_{\text{posterior}}(x_{1, \ast}, x_{M, \ast}) \\
\vdots & \ddots & \vdots \\
k_{\text{posterior}}(x_{M, \ast}, x_{1, \ast}) & \dots & k_{\text{posterior}}(x_{M, \ast}, x_{M, \ast}) \end{bmatrix} &= \Sigma_{\ast \ast} - \Sigma_\ast^T \Sigma^{-1} \Sigma_\ast. \end{align}

noisy observations

consider the case when the points we observe are noisy observations of the gp at indices \(x_1, \dotsc, x_N\), i.e. \begin{align} y_n = f(x_n) + \epsilon_n, && \epsilon_n \sim \mathrm{Normal}(0, \sigma^2), n = 1, \dotsc, N. \end{align}

let \begin{align} \vec y &:= [y_1, \dotsc, y_N]^T \\
\vec \epsilon &:= [\epsilon_1, \dotsc, \epsilon_N]^T. \end{align}

we have \begin{align} \begin{bmatrix} \vec y \\
\vec f_\ast \end{bmatrix} &= \begin{bmatrix} \vec f \\
\vec f_\ast \end{bmatrix} + \begin{bmatrix} \vec \epsilon \\
0 \end{bmatrix}. \end{align}

now we have to use some property of multivariate normal distributions (which one?) to obtain \begin{align} \begin{bmatrix} \vec y \\
\vec f_\ast \end{bmatrix} &\sim \mathrm{Normal}\left( \begin{bmatrix} \vec \mu \\
\vec \mu_\ast \end{bmatrix}, \begin{bmatrix} \Sigma + \sigma^2 I & \Sigma_\ast \\
\Sigma_\ast^T & \Sigma_{\ast\ast} \end{bmatrix} \right). \label{eq:gp-regression/joint-noise} \end{align}

using the same line as before, we obtain \begin{align} \vec f_\ast \given \vec y &\sim \mathrm{Normal}\left(\vec \mu_\ast + \Sigma_\ast^T (\Sigma + \sigma^2 I)^{-1} (\vec f - \vec \mu), \Sigma_{\ast \ast} - \Sigma_\ast^T (\Sigma + \sigma^2 I)^{-1} \Sigma_\ast\right). \label{eq:gp-regression/finite-posterior-noisy} \end{align} and \begin{align} f \given \vec y &\sim \mathcal{GP}(m_{\text{posterior (noisy)}}, k_{\text{posterior (noisy)}}), \\
m_{\text{posterior (noisy)}}(x) &:= m(x) + \Sigma_{\ast, x}^T (\Sigma + \sigma^2 I)^{-1} (\vec f - \vec \mu), \\
k_{\text{posterior (noisy)}}(x, x’) &:= k(x, x’) - \Sigma_{\ast, x}^T (\Sigma + \sigma^2 I)^{-1} \Sigma_{\ast, x’}. \end{align} if \(m \equiv 0\), the distribution of this gp at one test index \(x_\ast\) corresponds to the predictive distribution of bayesian linear regression in \eqref{eq:gp-regression/bayesian-linear-regression-prediction}.

gp learning

usually, the mean and covariance functions of a gp will have hyperparameters \(\theta\). ideally, we want to perform inference to obtain \(p(\theta \given \vec y)\), marginalizing out \(f\). first we need the marginal likelihood \(p(\vec y \given \theta)\).

marginal likelihood

we will ignore the hyperparameters \(\theta\) in this section.

first, note that equation \eqref{eq:gp-regression/joint-noise} implies another gp over \(f_{\text{noisy}}\): \begin{align} f_{\text{noisy}} &\sim \mathcal{GP}(m_{\text{noisy}}, k_{\text{noisy}}) \\
m_{\text{noisy}} &\equiv m \\
k_{\text{noisy}}(x, x’) &:= \begin{cases} k(x, x’) + \sigma^2 & \text{if } x \equiv x’ \equiv x_n \text{ for some } n \\
k(x, x’) & \text{otherwise} \end{cases}. \end{align}

hence, by the marginalization property of a gp, \begin{align} p(\vec y) &= \mathrm{Normal}(\vec y \given \vec \mu, \sigma^2 I + \Sigma) \\
&= (2\pi)^{-N/2} \cdot \mathrm{det}(\sigma^2 I + \Sigma)^{-1/2} \cdot \exp\left(-\frac{1}{2} (\vec y - \vec \mu)^T (\sigma^2 I + \Sigma)^{-1} (\vec y - \vec \mu) \right). \label{eq:gp-regression/gp-marginal-likelihood} \end{align} for \(m \equiv 0\), this corresponds to the marginal likelihood of bayesian linear regression in \eqref{eq:gp-regression/bayesian-linear-regression-marginal-likelihood}.

inference

to obtain \(p(\theta \given \vec y)\), we place a prior on \(\theta\) and use the gp marginal likelihood \eqref{eq:gp-regression/gp-marginal-likelihood} as the likelihood: \begin{align} p(\theta \given \vec y) &\propto p(\theta) p(\vec y \given \theta). \end{align}

optimization

or we can just optimize \(p(\vec y \given \theta)\) from \eqref{eq:gp-regression/gp-marginal-likelihood} directly using gradient methods.

References

Rasmussen, C., & Williams, C. (2006). Gaussian Processes for Machine Learning. In Gaussian Processes for Machine Learning. MIT Press.

@book{rasmussen2006gaussian,
  title = {Gaussian Processes for Machine Learning},
  author = {Rasmussen, Carl and Williams, Chris},
  journal = {Gaussian Processes for Machine Learning},
  year = {2006},
  publisher = {MIT Press}
}

Rasmussen, C. E. (2004). Gaussian processes in machine learning. In Advanced lectures on machine learning (pp. 63–71). Springer.

@incollection{rasmussen2004gaussian,
  title = {Gaussian processes in machine learning},
  author = {Rasmussen, Carl Edward},
  booktitle = {Advanced lectures on machine learning},
  pages = {63--71},
  year = {2004},
  publisher = {Springer}
}

[back]

Tuan Anh Le