Tuan Anh Le

gaussian processes for regression

20 June 2017

these notes follow chapter 2 of the gp book (Rasmussen & Williams, 2006) and the tutorial (Rasmussen, 2004).

let be data. our goal is to predict for a new input (possibly predict multiple outputs given multiple inputs).

we summarize two views. the weight-space view develops gp regression by extending bayesian linear regression. the function-space view develops the same by viewing gps as distribution over functions.

referring to zoubin’s cube below, the weight-space view starts from bayesian linear regression and follows the downward arrow to obtain gp regression. the function-space view, on the other hand, directly tackles gp regression. the latter is the real deal.

zoubin's cube

weight-space view

bayesian linear regression


let the generative model be \begin{align} p(w) &= \mathrm{Normal}(w; 0, \Sigma), \\ p(y \given x, w) &= \mathrm{Normal}(y; \Phi w, \sigma^2 I), \end{align} where is a zero vector and is an identity matrix of suitable dimensions, is the prior covariance matrix, is the observation noise variance.

the posterior over weights is (via completing the square) \begin{align} p(w \given x, y) &= \frac{p(y \given x, w) p(w)}{p(y \given x)} \\ &= \mathrm{Normal}(w; m, S) \\ m &= S \cdot \left(\frac{1}{\sigma^2} \Phi^T y\right) \\ S &= \left(\Sigma^{-1} + \frac{1}{\sigma^2} \Phi^T \Phi\right)^{-1}. \end{align}

the posterior predictive is (via equation 5 of michael osborne’s note) \begin{align} p(y_\ast \given x_\ast, x, y) &= \int p(y_\ast \given x_\ast) p(w \given x, y) \,\mathrm dw \\ &= \mathrm{Normal}(y_\ast \given \phi_\ast^T m, \sigma^2 + \phi_\ast^T S \phi_\ast). \end{align}

the marginal likelihood is (via equation 5 of michael osborne’s note) \begin{align} p(y \given x) &= \int p(y \given x, w) p(w) \,\mathrm dw \\ &= \mathrm{Normal}(y \given 0, \sigma^2 I + \Phi \Sigma \Phi^T) \\ &= (2\pi)^{-N/2} \cdot \mathrm{det}(\sigma^2 I + \Phi \Sigma \Phi^T)^{-1/2} \cdot \exp\left(-\frac{1}{2} y^T (\sigma^2 I + \Phi \Sigma \Phi^T)^{-1} y \right). \label{eq:gp-regression/bayesian-linear-regression-marginal-likelihood} \end{align}

bayesian linear regression to gp regression

we can rewrite the posterior predictive (using matrix inversion lemmas described on page 12 of (Rasmussen & Williams, 2006)) \begin{align} p(y_\ast \given x_\ast, x, y) &= \mathrm{Normal}(y_\ast \given \phi_\ast^T m, \sigma^2 + \phi_\ast^T S \phi_\ast) \\ &= \mathrm{Normal}(y_\ast \given \color{red}{\phi_\ast^T \Sigma \Phi^T} (\color{blue}{\Phi \Sigma \Phi^T} + \sigma^2 I)^{-1} y, \color{green}{\phi_\ast^T \Sigma \phi_\ast} - \color{red}{\phi_\ast^T \Sigma \Phi^T} (\color{blue}{\Phi \Sigma \Phi^T} + \sigma^2 I)^{-1} \color{red}{\Phi \Sigma \phi_\ast} ), \\ &= \mathrm{Normal}(y_\ast \given \color{red}{k_\ast^T} (\color{blue}{K} + \sigma^2 I)^{-1} y, \color{green}{k_{\ast \ast}} - \color{red}{k_\ast^T} (\color{blue}{K} + \sigma^2 I)^{-1} \color{red}{k_\ast} ), \label{eq:gp-regression/bayesian-linear-regression-prediction} \\ \color{blue}{K} &:= \color{blue}{\Phi \Sigma \Phi^T}, \\ \color{red}{k_\ast} &:= \color{red}{\Phi \Sigma \phi_\ast}, \\ \color{green}{k_{\ast \ast}} &:= \color{green}{\phi_\ast^T \Sigma \phi_\ast}. \end{align}

similarly, we can rewrite the marginal likelihood \begin{align} p(y \given x) &= (2\pi)^{-N/2} \cdot \mathrm{det}(\sigma^2 I + \color{blue}{K})^{-1/2} \cdot \exp\left(-\frac{1}{2} y^T (\sigma^2 I + \color{blue}{K})^{-1} y \right). \end{align}

define a kernel (or covariance) function : \begin{align} k(x, x’) := \phi(x)^T \Sigma \phi(x’), \forall x, x’ \in \mathcal X. \end{align} hence, if we can evaluate , we can obtain without having to do the matrix multiplications. this is exactly what we do in zero-mean gp regression: we specify a kernel function using which we can do prediction and evaluate marginal likelihood. see kernel literature for why choosing a positive semidefinite kernel function implies existence of feature map . for example, it can be shown that the exponentiated-square kernel function corresponds to bayesian linear regression model with an infinite number of basis functions. using instead of performing matrix multiplications is sometimes called kernel trick.

function-space view

this section heavily follows (Rasmussen, 2004).

gaussian process

definition (gaussian process). a gaussian process is a collection of random variables, any finite number of which have consistent joint normal distributions.

the collection of random variables is indexed by which can be arbitrary. a random variable in this collection indexed by is denoted (instead of ).

consider an arbitrary, finite number , of random variables from the gaussian process, (note that the indices don’t imply any ordering other than notational convenience. are the indices of the gaussian process!). by definition, has a multivariate normal distribution, say .

the consistent part means that any subset of has a multivariate normal distribution where are the corresponding sub-entries of .

a gp is specified by its mean function and covariance function . we consider a gp to be a distribution over functions and write \begin{align} f \sim \mathcal{GP}(m, k). \end{align}

the mean and covariance functions specify fully the distribution of at any finite number of indices : \begin{align} (f(x_1), \dotsc, f(x_N)) \sim \mathrm{Normal}\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_N) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \dots & k(x_N, x_N) \end{bmatrix} \right). \end{align} we can also see (by inspection) that the consistency criterion is satisfied.

posterior gaussian process

if we condition on a finite number random variables of the gp , , we also obtain a gp which we call the posterior gp. to this end, let \begin{align} \vec f &:= [f(x_1), \dotsc, f(x_N)]^T \\ \vec f_\ast &:= [f(x_{1, \ast}), \dotsc, f(x_{M, \ast})]^T, \end{align} where are the random variables we condition on and are random variables at indices .

we have \begin{align} \begin{bmatrix} \vec f \\ \vec f_\ast \end{bmatrix} &\sim \mathrm{Normal}\left( \begin{bmatrix} \vec \mu \\ \vec \mu_\ast \end{bmatrix}, \begin{bmatrix} \Sigma & \Sigma_\ast \\ \Sigma_\ast^T & \Sigma_{\ast\ast} \end{bmatrix} \right), \\ \vec \mu &:= [m(x_1), \dotsc, m(x_N)]^T, \\ \vec \mu_\ast &:= [m(x_{1, \ast}), \dotsc, m(x_{M, \ast})]^T, \\ \Sigma &:= \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \dots & k(x_N, x_N) \end{bmatrix}, \\ \Sigma_\ast &:= \begin{bmatrix} k(x_1, x_{1, \ast}) & \dots & k(x_1, x_{M, \ast}) \\ \vdots & \ddots & \vdots \\ k(x_N, x_{1, \ast}) & \dots & k(x_N, x_{M, \ast}) \end{bmatrix}, \\ \Sigma_{\ast\ast} &:= \begin{bmatrix} k(x_1, x_1) & \dots & k(x_1, x_{M, \ast}) \\ \vdots & \ddots & \vdots \\ k(x_{M, \ast}, x_1) & \dots & k(x_{M, \ast}, x_{M, \ast}) \end{bmatrix}. \end{align}

using equation 2 in michael osborne’s note, we obtain \begin{align} \vec f_\ast \given \vec f &\sim \mathrm{Normal}\left(\vec \mu_\ast + \Sigma_\ast^T \Sigma^{-1} (\vec f - \vec \mu), \Sigma_{\ast \ast} - \Sigma_\ast^T \Sigma^{-1} \Sigma_\ast\right). \label{eq:gp-regression/finite-posterior} \end{align}

by inspection, \begin{align} f \given f(x_1), \dotsc, f(x_N) &\sim \mathcal{GP}(m_{\text{posterior}}, k_{\text{posterior}}), \\ m_{\text{posterior}}(x) &:= m(x) + \Sigma_{\ast, x}^T \Sigma^{-1} (\vec f - \vec \mu), \\ k_{\text{posterior}}(x, x’) &:= k(x, x’) - \Sigma_{\ast, x}^T \Sigma^{-1} \Sigma_{\ast, x’}, \\ \Sigma_{\ast, x} &:= [k(x_1, x), \dotsc, k(x_N, x)]^T. \end{align} this is true by inspection in the sense that the distribution of arbitrary random variables from this process is a multivariate normal distribution given in \eqref{eq:gp-regression/finite-posterior}, i.e. \begin{align} \begin{bmatrix} m_{\text{posterior}}(x_{1, \ast}) \\ \vdots \\ m_{\text{posterior}}(x_{M, \ast}) \end{bmatrix} &= \vec \mu_\ast + \Sigma_\ast^T \Sigma^{-1} (\vec f - \vec \mu), \\ \begin{bmatrix} k_{\text{posterior}}(x_{1, \ast}, x_{1, \ast}) & \dots & k_{\text{posterior}}(x_{1, \ast}, x_{M, \ast}) \\ \vdots & \ddots & \vdots \\ k_{\text{posterior}}(x_{M, \ast}, x_{1, \ast}) & \dots & k_{\text{posterior}}(x_{M, \ast}, x_{M, \ast}) \end{bmatrix} &= \Sigma_{\ast \ast} - \Sigma_\ast^T \Sigma^{-1} \Sigma_\ast. \end{align}

noisy observations

consider the case when the points we observe are noisy observations of the gp at indices , i.e. \begin{align} y_n = f(x_n) + \epsilon_n, && \epsilon_n \sim \mathrm{Normal}(0, \sigma^2), n = 1, \dotsc, N. \end{align}

let \begin{align} \vec y &:= [y_1, \dotsc, y_N]^T \\ \vec \epsilon &:= [\epsilon_1, \dotsc, \epsilon_N]^T. \end{align}

we have \begin{align} \begin{bmatrix} \vec y \\ \vec f_\ast \end{bmatrix} &= \begin{bmatrix} \vec f \\ \vec f_\ast \end{bmatrix} + \begin{bmatrix} \vec \epsilon \\ 0 \end{bmatrix}. \end{align}

now we have to use some property of multivariate normal distributions (which one?) to obtain \begin{align} \begin{bmatrix} \vec y \\ \vec f_\ast \end{bmatrix} &\sim \mathrm{Normal}\left( \begin{bmatrix} \vec \mu \\ \vec \mu_\ast \end{bmatrix}, \begin{bmatrix} \Sigma + \sigma^2 I & \Sigma_\ast \\ \Sigma_\ast^T & \Sigma_{\ast\ast} \end{bmatrix} \right). \label{eq:gp-regression/joint-noise} \end{align}

using the same line as before, we obtain \begin{align} \vec f_\ast \given \vec y &\sim \mathrm{Normal}\left(\vec \mu_\ast + \Sigma_\ast^T (\Sigma + \sigma^2 I)^{-1} (\vec f - \vec \mu), \Sigma_{\ast \ast} - \Sigma_\ast^T (\Sigma + \sigma^2 I)^{-1} \Sigma_\ast\right). \label{eq:gp-regression/finite-posterior-noisy} \end{align} and \begin{align} f \given \vec y &\sim \mathcal{GP}(m_{\text{posterior (noisy)}}, k_{\text{posterior (noisy)}}), \\ m_{\text{posterior (noisy)}}(x) &:= m(x) + \Sigma_{\ast, x}^T (\Sigma + \sigma^2 I)^{-1} (\vec f - \vec \mu), \\ k_{\text{posterior (noisy)}}(x, x’) &:= k(x, x’) - \Sigma_{\ast, x}^T (\Sigma + \sigma^2 I)^{-1} \Sigma_{\ast, x’}. \end{align} if , the distribution of this gp at one test index corresponds to the predictive distribution of bayesian linear regression in \eqref{eq:gp-regression/bayesian-linear-regression-prediction}.

gp learning

usually, the mean and covariance functions of a gp will have hyperparameters . ideally, we want to perform inference to obtain , marginalizing out . first we need the marginal likelihood .

marginal likelihood

we will ignore the hyperparameters in this section.

first, note that equation \eqref{eq:gp-regression/joint-noise} implies another gp over : \begin{align} f_{\text{noisy}} &\sim \mathcal{GP}(m_{\text{noisy}}, k_{\text{noisy}}) \\ m_{\text{noisy}} &\equiv m \\ k_{\text{noisy}}(x, x’) &:= \begin{cases} k(x, x’) + \sigma^2 & \text{if } x \equiv x’ \equiv x_n \text{ for some } n \\ k(x, x’) & \text{otherwise} \end{cases}. \end{align}

hence, by the marginalization property of a gp, \begin{align} p(\vec y) &= \mathrm{Normal}(\vec y \given \vec \mu, \sigma^2 I + \Sigma) \\ &= (2\pi)^{-N/2} \cdot \mathrm{det}(\sigma^2 I + \Sigma)^{-1/2} \cdot \exp\left(-\frac{1}{2} (\vec y - \vec \mu)^T (\sigma^2 I + \Sigma)^{-1} (\vec y - \vec \mu) \right). \label{eq:gp-regression/gp-marginal-likelihood} \end{align} for , this corresponds to the marginal likelihood of bayesian linear regression in \eqref{eq:gp-regression/bayesian-linear-regression-marginal-likelihood}.


to obtain , we place a prior on and use the gp marginal likelihood \eqref{eq:gp-regression/gp-marginal-likelihood} as the likelihood: \begin{align} p(\theta \given \vec y) &\propto p(\theta) p(\vec y \given \theta). \end{align}


or we can just optimize from \eqref{eq:gp-regression/gp-marginal-likelihood} directly using gradient methods.


  1. Rasmussen, C., & Williams, C. (2006). Gaussian Processes for Machine Learning. Gaussian Processes for Machine Learning. MIT Press.
      title = {Gaussian Processes for Machine Learning},
      author = {Rasmussen, Carl and Williams, Chris},
      journal = {Gaussian Processes for Machine Learning},
      year = {2006},
      publisher = {MIT Press}
  2. Rasmussen, C. E. (2004). Gaussian processes in machine learning. In Advanced lectures on machine learning (pp. 63–71). Springer.
      title = {Gaussian processes in machine learning},
      author = {Rasmussen, Carl Edward},
      booktitle = {Advanced lectures on machine learning},
      pages = {63--71},
      year = {2004},
      publisher = {Springer}