28 September 2016
Given a probabilistic model \(p(x, y \given \theta)\) with latents \(x\), observed variables \(y\) and parameters \(\theta\), we want to find the maximum likelihood estimate (MLE) or the maximum a posteriori (MAP) estimate of \(\theta\). The Expectation-Maximization (EM) algorithm does this. These notes follow Section 9.4 of Bishop's Pattern Recognition and Machine Learning.
Goal: Find \(\mathrm{argmax}_{\theta} p(y \given \theta)\).
Consider an auxiliary probability distribution \(q(x)\) over the latents.
Consider a function(al) \(\mathcal L(q, \theta)\) such that:
\begin{align}
\log p(y \given \theta) &= \mathcal L(q, \theta) + \KL{q(x)}{p(x \given y; \theta)}. \label{eq:notes/em/p}
\end{align}
Hence:
\begin{align}
\mathcal L(q, \theta) &= \log p(y \given \theta) - \int q(x) (\log q(x) - \log p(x \given y; \theta)) \,\mathrm dx \\
&= \int q(x) \log p(y \given \theta) \,\mathrm dx - \int q(x) (\log q(x) - \log p(x \given y; \theta)) \,\mathrm dx & \text{(since \(\int q(x) \,\mathrm dx = 1\))} \\
&= \int q(x) (\log p(y \given \theta) - \log q(x) + \log p(x \given y; \theta)) \,\mathrm dx \\
&= \int q(x) (\log p(x, y \given \theta) - \log q(x)) \,\mathrm dx. & \text{(since \(p(x \given y; \theta) \, p(y \given \theta) = p(x, y \given \theta)\))}
\end{align}
Since \(\KL{q(x)}{p(x \given y; \theta)} \geq 0\), \(\mathcal L\) is a lower bound on the log-likelihood for any choice of \(q\): \begin{align} \log p(y \given \theta) \geq \mathcal L(q, \theta). \end{align}
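As a quick numerical sanity check, the decomposition \eqref{eq:notes/em/p} and the bound can be verified on a toy model. A minimal sketch, assuming a single binary latent \(x\), one fixed observation \(y\), and a joint \(p(x, y \given \theta)\) given directly as two numbers (all values illustrative):

```python
import numpy as np

# Toy model: one binary latent x and one fixed observation y, with the
# joint p(x, y | theta) specified directly (illustrative numbers).
p_joint = np.array([0.3, 0.2])          # p(x = 0, y | theta), p(x = 1, y | theta)
p_y = p_joint.sum()                     # p(y | theta), marginalizing out x
p_post = p_joint / p_y                  # p(x | y; theta)

q = np.array([0.7, 0.3])                # an arbitrary auxiliary distribution q(x)

# L(q, theta) = sum_x q(x) (log p(x, y | theta) - log q(x))
L = np.sum(q * (np.log(p_joint) - np.log(q)))
# KL(q || p(x | y; theta))
kl = np.sum(q * (np.log(q) - np.log(p_post)))

assert np.isclose(np.log(p_y), L + kl)  # log p(y | theta) = L(q, theta) + KL
assert np.log(p_y) >= L                 # the bound, since KL >= 0
```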
The algorithm improves \(\theta\) iteratively by updating \(q\) in an E-step and \(\theta\) in an M-step:
\begin{align}
\text{E-step:} \quad q^{(t)} &= \mathrm{argmax}_{q} \mathcal L(q, \theta^{(t - 1)}) = p(x \given y; \theta^{(t - 1)}), \label{eq:notes/em/E} \\
\text{M-step:} \quad \theta^{(t)} &= \mathrm{argmax}_{\theta} \mathcal L(q^{(t)}, \theta). \label{eq:notes/em/M}
\end{align}
After the E-step in \eqref{eq:notes/em/E}, \(\mathcal L(q^{(t)}, \theta^{(t - 1)}) = \log p(y \given \theta^{(t - 1)})\) because the KL divergence in \eqref{eq:notes/em/p} becomes zero.
After the M-step in \eqref{eq:notes/em/M}, we obtain \(\mathcal L(q^{(t)}, \theta^{(t)}) \geq \mathcal L(q^{(t)}, \theta^{(t - 1)})\).
Since \(q^{(t)} = p(x \given y; \theta^{(t - 1)})\) while the parameters have moved on to \(\theta^{(t)}\), the KL divergence in \eqref{eq:notes/em/p} is no longer necessarily zero, and \(\log p(y \given \theta^{(t)}) \geq \mathcal L(q^{(t)}, \theta^{(t)})\). Hence
\begin{align}
\log p(y \given \theta^{(t)}) &\geq \mathcal L(q^{(t)}, \theta^{(t)}) \\
&\geq \mathcal L(q^{(t)}, \theta^{(t - 1)}) \\
&= \log p(y \given \theta^{(t - 1)}).
\end{align}
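Putting the two steps together gives a concrete picture of this monotonicity. A minimal runnable sketch for a two-component Gaussian mixture with unit variances and fixed equal weights, so that only the two means play the role of \(\theta\) (an illustrative simplification; names and constants are not from Bishop):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two unit-variance Gaussians with equal weights.
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def log_norm(y, mu):
    # log N(y; mu, 1)
    return -0.5 * (y - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

mu = np.array([-1.0, 1.0])  # initial parameters theta^(0)
prev_ll = -np.inf
for t in range(50):
    # log p(x, y | theta) for every data point and both latent components
    log_joint = np.log(0.5) + np.stack([log_norm(y, m) for m in mu], axis=1)
    log_py = np.logaddexp(log_joint[:, 0], log_joint[:, 1])  # log p(y | theta)
    ll = log_py.sum()
    assert ll >= prev_ll - 1e-9  # the log-likelihood never decreases
    prev_ll = ll
    # E-step: q(x) = p(x | y; theta), the posterior responsibilities
    resp = np.exp(log_joint - log_py[:, None])
    # M-step: argmax_theta L(q, theta) has a closed form for the means
    mu = resp.T @ y / resp.sum(axis=0)

print(mu)  # approaches the true means (-2, 3)
```

The assertion checks exactly the chain of inequalities above: each pass through an E-step and an M-step leaves \(\log p(y \given \theta)\) no smaller than before.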
Goal: Find \(\mathrm{argmax}_{\theta} p(\theta \given y)\).
Consider the identity
\begin{align}
\log p(\theta \given y) &= \log p(y \given \theta) + \log p(\theta) - \log p(y) \\
&= \mathcal L(q, \theta) + \KL{q(x)}{p(x \given y; \theta)} + \log p(\theta) - \log p(y). \label{eq:notes/em/p2}
\end{align}
Since the KL divergence is nonnegative, \begin{align} \log p(\theta \given y) \geq \mathcal L(q, \theta) + \log p(\theta) - \log p(y). \end{align}
The algorithm proceeds as before, except that the M-step also takes the prior into account:
\begin{align}
\text{E-step:} \quad q^{(t)} &= p(x \given y; \theta^{(t - 1)}), \\
\text{M-step:} \quad \theta^{(t)} &= \mathrm{argmax}_{\theta} \left( \mathcal L(q^{(t)}, \theta) + \log p(\theta) \right). \label{eq:notes/em/M2}
\end{align}
By a line of reasoning similar to the MLE case, we observe
\begin{align}
\log p(\theta^{(t)} \given y) &\geq \mathcal L(q^{(t)}, \theta^{(t)}) + \log p(\theta^{(t)}) - \log p(y) & \text{(because the KL divergence in \eqref{eq:notes/em/p2} is nonnegative)} \\
&\geq \mathcal L(q^{(t)}, \theta^{(t - 1)}) + \log p(\theta^{(t - 1)}) - \log p(y) & \text{(because of the M-step in \eqref{eq:notes/em/M2})} \\
&= \log p(\theta^{(t - 1)} \given y). & \text{(because the KL divergence in \eqref{eq:notes/em/p2} is zero)}
\end{align}
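Only the M-step changes relative to the MLE loop sketched earlier: the objective gains the \(\log p(\theta)\) term. A minimal sketch, assuming a zero-mean Gaussian prior \(\mathcal N(0, \sigma_0^2)\) on each component mean of that mixture (an illustrative choice of prior, not Bishop's):

```python
import numpy as np

def map_m_step(resp, y, sigma0_sq=10.0):
    # MAP M-step for the unit-variance mixture above: maximize
    # L(q, theta) + log p(theta) with an assumed N(0, sigma0_sq) prior on
    # each mean. Setting the derivative to zero gives the MLE update
    # shrunk toward the prior mean 0.
    return resp.T @ y / (resp.sum(axis=0) + 1.0 / sigma0_sq)
```

As \(\sigma_0^2 \to \infty\) the prior flattens and this reduces to the MLE update; with the prior in place, the quantity that is nondecreasing across iterations is \(\log p(\theta \given y)\) rather than \(\log p(y \given \theta)\).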