28 September 2016
Given a probabilistic model \(p(x, y \given \theta)\) with latents \(x\), observed variables \(y\) and parameters \(\theta\), we want to find the maximum likelihood estimate (MLE) or the maximum a posteriori (MAP) estimate of \(\theta\). The Expectation-Maximization (EM) algorithm does this. These notes follow Section 9.4 of Bishop's Pattern Recognition and Machine Learning.
Goal: Find \(\mathrm{argmax}_{\theta} p(y \given \theta)\).
Consider an auxiliary probability distribution \(q(x)\) over the latents.
Consider a function(al) \(\mathcal L(q, \theta)\) such that:
\begin{align}
\log p(y \given \theta) &= \mathcal L(q, \theta) + \KL{q(x)}{p(x \given y; \theta)}. \label{eq:notes/em/p}
\end{align}
Hence:
\begin{align}
\mathcal L(q, \theta) &= \log p(y \given \theta) - \int q(x) (\log q(x) - \log p(x \given y; \theta)) \,\mathrm dx \\
&= \int q(x) \log p(y \given \theta) \,\mathrm dx - \int q(x) (\log q(x) - \log p(x \given y; \theta)) \,\mathrm dx & \text{(since \(\int q(x) \,\mathrm dx = 1\))} \\
&= \int q(x) (\log p(y \given \theta) - \log q(x) + \log p(x \given y; \theta)) \,\mathrm dx \\
&= \int q(x) (\log p(x, y \given \theta) - \log q(x)) \,\mathrm dx. & \text{(since \(p(x \given y; \theta) \, p(y \given \theta) = p(x, y \given \theta)\))}
\end{align}
Since \(\KL{q(x)}{p(x \given y; \theta)} \geq 0\), \(\mathcal L\) is a lower bound on the log-likelihood for any choice of \(q\): \begin{align} \log p(y \given \theta) \geq \mathcal L(q, \theta). \end{align}
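As a quick numerical sanity check, the decomposition \eqref{eq:notes/em/p} and the bound can be verified on a toy model. A minimal sketch, assuming a single binary latent \(x\), one fixed observation \(y\), and a joint \(p(x, y \given \theta)\) given directly as two numbers (all values illustrative):

```python
import numpy as np

# Toy model: one binary latent x and one fixed observation y, with the
# joint p(x, y | theta) specified directly (illustrative numbers).
p_joint = np.array([0.3, 0.2])          # p(x = 0, y | theta), p(x = 1, y | theta)
p_y = p_joint.sum()                     # p(y | theta), marginalizing out x
p_post = p_joint / p_y                  # p(x | y; theta)

q = np.array([0.7, 0.3])                # an arbitrary auxiliary distribution q(x)

# L(q, theta) = sum_x q(x) (log p(x, y | theta) - log q(x))
L = np.sum(q * (np.log(p_joint) - np.log(q)))
# KL(q || p(x | y; theta))
kl = np.sum(q * (np.log(q) - np.log(p_post)))

assert np.isclose(np.log(p_y), L + kl)  # log p(y | theta) = L(q, theta) + KL
assert np.log(p_y) >= L                 # the bound, since KL >= 0
```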
The algorithm improves \(\theta\) iteratively by updating \(q\) in an E-step and \(\theta\) in an M-step:
\begin{align}
\text{E-step:} \quad q^{(t)} &= \mathrm{argmax}_{q} \mathcal L(q, \theta^{(t - 1)}) = p(x \given y; \theta^{(t - 1)}), \label{eq:notes/em/E} \\
\text{M-step:} \quad \theta^{(t)} &= \mathrm{argmax}_{\theta} \mathcal L(q^{(t)}, \theta). \label{eq:notes/em/M}
\end{align}
After the E-step in \eqref{eq:notes/em/E}, \(\mathcal L(q^{(t)}, \theta^{(t - 1)}) = \log p(y \given \theta^{(t - 1)})\) because the KL divergence in \eqref{eq:notes/em/p} becomes zero.
After the M-step in \eqref{eq:notes/em/M}, we obtain \(\mathcal L(q^{(t)}, \theta^{(t)}) \geq \mathcal L(q^{(t)}, \theta^{(t - 1)})\).
Since \(q^{(t)} = p(x \given y; \theta^{(t - 1)})\) while the parameters have moved on to \(\theta^{(t)}\), the KL divergence in \eqref{eq:notes/em/p} is no longer necessarily zero, and \(\log p(y \given \theta^{(t)}) \geq \mathcal L(q^{(t)}, \theta^{(t)})\). Hence
\begin{align}
\log p(y \given \theta^{(t)}) &\geq \mathcal L(q^{(t)}, \theta^{(t)}) \\
&\geq \mathcal L(q^{(t)}, \theta^{(t - 1)}) \\
&= \log p(y \given \theta^{(t - 1)}).
\end{align}
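Putting the two steps together gives a concrete picture of this monotonicity. A minimal runnable sketch for a two-component Gaussian mixture with unit variances and fixed equal weights, so that only the two means play the role of \(\theta\) (an illustrative simplification; names and constants are not from Bishop):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two unit-variance Gaussians with equal weights.
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def log_norm(y, mu):
    # log N(y; mu, 1)
    return -0.5 * (y - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

mu = np.array([-1.0, 1.0])  # initial parameters theta^(0)
prev_ll = -np.inf
for t in range(50):
    # log p(x, y | theta) for every data point and both latent components
    log_joint = np.log(0.5) + np.stack([log_norm(y, m) for m in mu], axis=1)
    log_py = np.logaddexp(log_joint[:, 0], log_joint[:, 1])  # log p(y | theta)
    ll = log_py.sum()
    assert ll >= prev_ll - 1e-9  # the log-likelihood never decreases
    prev_ll = ll
    # E-step: q(x) = p(x | y; theta), the posterior responsibilities
    resp = np.exp(log_joint - log_py[:, None])
    # M-step: argmax_theta L(q, theta) has a closed form for the means
    mu = resp.T @ y / resp.sum(axis=0)

print(mu)  # approaches the true means (-2, 3)
```

The assertion checks exactly the chain of inequalities above: each pass through an E-step and an M-step leaves \(\log p(y \given \theta)\) no smaller than before.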
Goal: Find \(\mathrm{argmax}_{\theta} p(\theta \given y)\).
Consider the identity
\begin{align}
\log p(\theta \given y) &= \log p(y \given \theta) + \log p(\theta) - \log p(y) \\
&= \mathcal L(q, \theta) + \KL{q(x)}{p(x \given y; \theta)} + \log p(\theta) - \log p(y). \label{eq:notes/em/p2}
\end{align}
Since the KL divergence is nonnegative, \begin{align} \log p(\theta \given y) \geq \mathcal L(q, \theta) + \log p(\theta) - \log p(y). \end{align}
The algorithm proceeds as before, except that the M-step also takes the prior into account:
\begin{align}
\text{E-step:} \quad q^{(t)} &= p(x \given y; \theta^{(t - 1)}), \\
\text{M-step:} \quad \theta^{(t)} &= \mathrm{argmax}_{\theta} \left( \mathcal L(q^{(t)}, \theta) + \log p(\theta) \right). \label{eq:notes/em/M2}
\end{align}
By a line of reasoning similar to the MLE case, we observe
\begin{align}
\log p(\theta^{(t)} \given y) &\geq \mathcal L(q^{(t)}, \theta^{(t)}) + \log p(\theta^{(t)}) - \log p(y) & \text{(because the KL divergence in \eqref{eq:notes/em/p2} is nonnegative)} \\
&\geq \mathcal L(q^{(t)}, \theta^{(t - 1)}) + \log p(\theta^{(t - 1)}) - \log p(y) & \text{(because of the M-step in \eqref{eq:notes/em/M2})} \\
&= \log p(\theta^{(t - 1)} \given y). & \text{(because the KL divergence in \eqref{eq:notes/em/p2} is zero)}
\end{align}
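Only the M-step changes relative to the MLE loop sketched earlier: the objective gains the \(\log p(\theta)\) term. A minimal sketch, assuming a zero-mean Gaussian prior \(\mathcal N(0, \sigma_0^2)\) on each component mean of that mixture (an illustrative choice of prior, not Bishop's):

```python
import numpy as np

def map_m_step(resp, y, sigma0_sq=10.0):
    # MAP M-step for the unit-variance mixture above: maximize
    # L(q, theta) + log p(theta) with an assumed N(0, sigma0_sq) prior on
    # each mean. Setting the derivative to zero gives the MLE update
    # shrunk toward the prior mean 0.
    return resp.T @ y / (resp.sum(axis=0) + 1.0 / sigma0_sq)
```

As \(\sigma_0^2 \to \infty\) the prior flattens and this reduces to the MLE update; with the prior in place, the quantity that is nondecreasing across iterations is \(\log p(\theta \given y)\) rather than \(\log p(y \given \theta)\).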