# Occam’s Razor and Model Selection

21 February 2017

Given different models $\mathcal M_k, k = 1, \dotsc, K$ of data $y$, how can we make predictions?

## Inference

If we assume that $\mathcal M_k$ is the correct model, we can perform posterior inference on latent variables $x$: \begin{align} p(x \given y, \mathcal M_k) &= \frac{p(y \given x, \mathcal M_k)p(x \given \mathcal M_k)}{p(y \given \mathcal M_k)} \end{align} where

• $p(x \given \mathcal M_k)$ is the prior,
• $p(y \given x, \mathcal M_k)$ is the likelihood,
• $p(y \given \mathcal M_k) = \int_{\mathcal X} p(y \given x, \mathcal M_k)p(x \given \mathcal M_k) \,\mathrm dx$ is the evidence.

## Bayesian Model Averaging

Let’s go one level higher and perform inference on the models: \begin{align} p(\mathcal M_k \given y) = \frac{p(y \given \mathcal M_k) p(\mathcal M_k)}{p(y)} \end{align} where

• $p(\mathcal M_k)$ is the prior over models,
• $p(y \given \mathcal M_k)$ is the evidence (from before),
• $p(y)$ is the normalizing constant.

We would actually like to know $p(x \given y)$, and expectations of functions of $x$ under it. To compute it, we do the following (usually called Bayesian model averaging): \begin{align} p(x \given y) &= \sum_{k = 1}^K p(x, \mathcal M_k \given y) \\ &= \sum_{k = 1}^K p(x \given y, \mathcal M_k) p(\mathcal M_k \given y) \end{align} where

• $p(x \given y, \mathcal M_k)$ is the posterior given the model $\mathcal M_k$ and
• $p(\mathcal M_k \given y)$ is the model posterior.
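Continuing the toy example above, here is a sketch of Bayesian model averaging over two hypothetical models; the per-model posteriors, evidences, and priors are made-up numbers, not results from any real dataset.

```python
import numpy as np

# Toy Bayesian model averaging over two hypothetical models M_1, M_2 sharing the
# same three-valued latent variable; all numbers are made up.
post_given_model = np.array([
    [0.278, 0.667, 0.056],  # p(x | y, M_1)
    [0.600, 0.300, 0.100],  # p(x | y, M_2)
])
evidence = np.array([0.18, 0.07])   # p(y | M_k)
model_prior = np.array([0.5, 0.5])  # p(M_k), uniform

model_post = evidence * model_prior
model_post /= model_post.sum()      # p(M_k | y); p(y) is the normalizer

bma_posterior = model_post @ post_given_model  # p(x | y) = sum_k p(x | y, M_k) p(M_k | y)
print(model_post)  # [0.72 0.28]
print(bma_posterior)
```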

This is what we would ideally like to do, but it can be very difficult; in that case we turn to a poor man's Bayesian model averaging: Bayesian model selection. Think of it as finding a maximum a posteriori estimate over models instead of doing full posterior inference.

## Bayesian Model Selection

Here, we assume that we can’t do Bayesian model averaging and turn to Bayesian model selection instead. We want to find the best model: \begin{align} \mathcal M^\ast = \argmax_{\mathcal M_k} p(\mathcal M_k \given y) \end{align}

We see that to compare two models $\mathcal M_i$ and $\mathcal M_j$, we just need to compare

• $p(y \given \mathcal M_i) p(\mathcal M_i)$ and
• $p(y \given \mathcal M_j) p(\mathcal M_j)$.

The prior over models is usually uniform (we don’t prefer any model a priori). Hence, we can compare the evidence to select the best model.
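A minimal sketch of this selection step, assuming the evidences have already been computed (made-up numbers again); working in log space is the usual choice to avoid underflow with tiny evidences.

```python
import numpy as np

# Toy Bayesian model selection: pick M* = argmax_k p(y | M_k) p(M_k).
# With a uniform model prior this is just an evidence comparison.
log_evidence = np.log([0.18, 0.07, 0.02])    # log p(y | M_k)
log_model_prior = np.log(np.full(3, 1 / 3))  # uniform log p(M_k)

best = int(np.argmax(log_evidence + log_model_prior))
print(best)  # 0, i.e. the model with the largest evidence
```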

## Occam’s Razor

The principle of Occam’s razor says that we should prefer simple models over complicated ones. Bayesian model selection has an Occam’s razor built in. A model $\mathcal M_k$ is complex if the probability mass of $p(y \given \mathcal M_k)$ is spread over a large region of data space. Because a probability distribution must integrate to one, a more complex model necessarily assigns a lower value of $p(y \given \mathcal M_k)$ to any particular dataset. On the other hand, a model that is too simple concentrates its mass on a small region and may assign little or no probability to the observed $y$ at all. The evidence therefore favors a model that is just right for the data: flexible enough to explain them, but no more. See, for example, the illustration in MacKay [3].

Note that having many parameters does not by itself make a model complex in this sense. It is complex only if $p(y \given \mathcal M_k)$ is spread over a large region of data space.
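A small worked example (a toy of my own, not taken from the references below): compare a fixed fair-coin model with a model that places a uniform prior on the heads probability. The flexible model spreads its evidence uniformly over all possible head counts, so unremarkable data favor the simple model while extreme data favor the flexible one.

```python
from math import comb

# Two models of y = number of heads in n coin flips (toy example):
#   M_simple : a fair coin, theta fixed at 0.5 (no free parameters)
#   M_flex   : theta ~ Uniform(0, 1), integrated out analytically
n = 10

def evidence_simple(k):
    return comb(n, k) * 0.5 ** n  # p(k | M_simple)

def evidence_flex(k):
    # Integrating the binomial likelihood against the uniform prior gives a
    # distribution that is uniform over the counts 0, ..., n.
    return 1 / (n + 1)            # p(k | M_flex)

print(evidence_simple(5), evidence_flex(5))  # ~0.246 vs ~0.091: unremarkable data favor M_simple
print(evidence_simple(9), evidence_flex(9))  # ~0.010 vs ~0.091: extreme data favor M_flex
```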

1. Rasmussen, C. E., & Ghahramani, Z. (2001). Occam’s razor. Advances in Neural Information Processing Systems, 294–300.

2. Ghahramani, Z., & Murray, I. (2005). A note on the evidence and Bayesian Occam’s razor. Gatsby Unit Technical Report GCNU-TR 2005–003.

3. MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge University Press.

