21 February 2017
Given different models \(\mathcal M_k, k = 1, \dotsc, K\) of data \(y\), how can we make predictions?
If we assume that \(\mathcal M_k\) is the correct model, we can perform posterior inference on latent variables \(x\): \begin{align} p(x \given y, \mathcal M_k) &= \frac{p(y \given x, \mathcal M_k)p(x \given \mathcal M_k)}{p(y \given \mathcal M_k)} \end{align} where \(p(y \given x, \mathcal M_k)\) is the likelihood, \(p(x \given \mathcal M_k)\) is the prior, and \(p(y \given \mathcal M_k)\) is the evidence (marginal likelihood) of model \(\mathcal M_k\).
Let’s go one level higher and perform inference on the models: \begin{align} p(\mathcal M_k \given y) = \frac{p(y \given \mathcal M_k) p(\mathcal M_k)}{p(y)} \end{align} where \(p(y \given \mathcal M_k)\) is the evidence of model \(\mathcal M_k\), \(p(\mathcal M_k)\) is the prior over models, and \(p(y) = \sum_{k=1}^K p(y \given \mathcal M_k)\,p(\mathcal M_k)\).
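As a concrete sketch of this model-level Bayes' rule, here is a toy coin-flip example of my own (not from the post): two models of \(n\) flips with \(h\) heads, one with a fixed fair coin and one with a uniform prior on the bias, both of which have closed-form evidences.

```python
import math

# Toy setup (my own construction): n coin flips, h of them heads.
n, h = 10, 8

# Evidence p(y | M_k) of each model for this particular sequence of flips.
# M1: fair coin, theta = 0.5 -> every sequence has probability 0.5^n.
evidence_m1 = 0.5 ** n
# M2: theta ~ Uniform(0, 1); integrating theta out gives the closed form
#     p(y | M2) = h! (n - h)! / (n + 1)!  for a specific sequence.
evidence_m2 = math.factorial(h) * math.factorial(n - h) / math.factorial(n + 1)

# Uniform prior over models, then Bayes' rule: p(M_k | y) ∝ p(y | M_k) p(M_k).
prior = [0.5, 0.5]
joint = [evidence_m1 * prior[0], evidence_m2 * prior[1]]
p_y = sum(joint)                      # p(y) = sum_k p(y | M_k) p(M_k)
posterior = [j / p_y for j in joint]  # p(M_k | y)
print(posterior)
```

With 8 heads out of 10, the flexible model ends up with more posterior mass than the fair coin, as one would expect.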
Ideally, we would like to know \(p(x \given y)\) itself, along with expectations of various functions under it.
To calculate this, we do the following (which is usually called Bayesian model averaging):
\begin{align}
p(x \given y) &= \sum_{k = 1}^K p(x, \mathcal M_k \given y) \\
&= \sum_{k = 1}^K p(x \given y, \mathcal M_k) p(\mathcal M_k \given y)
\end{align}
where the first equality marginalizes over the models and the second applies the product rule: each model's posterior \(p(x \given y, \mathcal M_k)\) is weighted by that model's posterior probability \(p(\mathcal M_k \given y)\).
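Continuing the toy coin-flip example from above (my own construction, not from the post), Bayesian model averaging of the posterior predictive for the next flip looks like this:

```python
import math

# Toy coin-flip setup: n flips, h heads; two models with uniform model prior.
n, h = 10, 8
prior = [0.5, 0.5]
evidences = [0.5 ** n,
             math.factorial(h) * math.factorial(n - h) / math.factorial(n + 1)]
p_y = sum(e * p for e, p in zip(evidences, prior))
model_post = [e * p / p_y for e, p in zip(evidences, prior)]  # p(M_k | y)

# Per-model posterior predictive for the next flip being heads:
# M1 (fair coin) always predicts 0.5; under M2 the posterior over theta is
# Beta(h + 1, n - h + 1), whose predictive mean is (h + 1) / (n + 2).
per_model_pred = [0.5, (h + 1) / (n + 2)]

# Bayesian model averaging: p(x | y) = sum_k p(x | y, M_k) p(M_k | y).
bma_pred = sum(q * w for q, w in zip(per_model_pred, model_post))
print(bma_pred)
```

The averaged prediction lands between the two per-model predictions, pulled toward whichever model has more posterior mass.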
This is what we would like to do ideally, but it can be very difficult, in which case we turn to a poor man's Bayesian model averaging: Bayesian model selection. Think of finding the maximum a posteriori estimate instead of doing full posterior inference.
Instead of averaging over all models, we look for the single best one: \begin{align} \mathcal M^\ast = \argmax_{\mathcal M_k} p(\mathcal M_k \given y) \end{align}
We see that to compare two models \(\mathcal M_i\) and \(\mathcal M_j\), we just need to compare the posterior odds \(\frac{p(\mathcal M_i \given y)}{p(\mathcal M_j \given y)} = \frac{p(y \given \mathcal M_i)\,p(\mathcal M_i)}{p(y \given \mathcal M_j)\,p(\mathcal M_j)}\), since the normalizer \(p(y)\) cancels.
The prior over models is usually uniform (we don’t prefer any model a priori). Hence, we can compare the evidence to select the best model.
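Under a uniform model prior, selecting the MAP model therefore reduces to comparing evidences. A minimal sketch, reusing the toy coin-flip evidences from my example above (model names are my own labels):

```python
import math

# With a uniform prior p(M_k) = 1/K, the model posterior is proportional to
# the evidence, so MAP model selection is just an argmax over evidences.
# Toy coin-flip example (my own): n = 10 flips, h = 8 heads.
n, h = 10, 8
evidences = {
    "fair coin":     0.5 ** n,  # theta fixed at 0.5
    "uniform theta": math.factorial(h) * math.factorial(n - h)
                     / math.factorial(n + 1),
}
best = max(evidences, key=evidences.get)
print(best)
```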
The principle of Occam’s razor says that we should prefer simple models over complicated ones. Bayesian model selection has a somewhat built-in Occam’s razor. A model \(\mathcal M_k\) is complex if the probability mass of \(p(y \given \mathcal M_k)\) is spread over a large region of the space of possible datasets \(y\). Because a probability distribution must integrate to one, a more complex model necessarily assigns a lower value of \(p(y \given \mathcal M_k)\) to any particular dataset. On the other hand, if the model is too simple, its mass is concentrated on a narrow range of datasets and it may assign almost no probability to the observed \(y\). We must somehow pick a model that is just right. See the illustration below.
Jupyter notebook for generating this plot
Note that a model having many parameters doesn’t necessarily mean that it is complex in this sense. It is complex only if \(p(y \given \mathcal M_k)\) is spread over a large region of possible datasets \(y\).
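The built-in Occam's razor can be sketched numerically with the same coin-flip toy example (my own construction): summarize each possible dataset by the number of heads \(h\) and compare how the two models spread their evidence over all datasets.

```python
import math

# How two models spread evidence over all possible datasets, summarized by
# the number of heads h in n = 10 flips (toy example, my own).
n = 10

def p_h_fair(h):
    # Simple model: fair coin -> Binomial(n, 0.5); mass concentrated near n/2.
    return math.comb(n, h) * 0.5 ** n

def p_h_uniform(h):
    # Flexible model: theta ~ Uniform(0, 1) -> p(h) = 1/(n + 1); mass spread
    # evenly over every possible dataset.
    return 1 / (n + 1)

for h in range(n + 1):
    winner = "fair" if p_h_fair(h) > p_h_uniform(h) else "uniform"
    print(f"h = {h:2d}  p_fair = {p_h_fair(h):.4f}  "
          f"p_unif = {p_h_uniform(h):.4f}  -> {winner}")
```

Both rows sum to one, so neither model can win everywhere: the simple model beats the flexible one on the datasets it concentrates on (balanced counts), and loses on the extreme counts it barely covers.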