Tuan Anh Le

The Expectation-Maximization algorithm

28 September 2016

Given a probabilistic model $p(x, y \given \theta)$ with latents $x$, observed variables $y$ and parameters $\theta$, we want to find the maximum likelihood estimate (MLE) or the maximum a posteriori (MAP) estimate of the parameters. The Expectation-Maximization (EM) algorithm does exactly that. These notes follow Chapter 9.4 of Bishop's Pattern Recognition and Machine Learning.

EM for MLE

Goal: Find $\theta^*_{\text{MLE}} = \mathrm{argmax}_{\theta} \log p(y \given \theta)$.

Consider an auxiliary probability distribution $q(x)$ over the latents. Consider a functional $\mathcal L(q, \theta)$ such that: \begin{align} \log p(y \given \theta) &= \mathcal L(q, \theta) + \KL{q(x)}{p(x \given y; \theta)}. \label{eq:notes/em/p} \end{align} Hence: \begin{align} \mathcal L(q, \theta) &= \log p(y \given \theta) - \int q(x) (\log q(x) - \log p(x \given y; \theta)) \,\mathrm dx \\ &= \int q(x) \log p(y \given \theta) \,\mathrm dx - \int q(x) (\log q(x) - \log p(x \given y; \theta)) \,\mathrm dx \\ &= \int q(x) (\log p(y \given \theta) - \log q(x) + \log p(x \given y; \theta)) \,\mathrm dx \\ &= \int q(x) (\log p(x, y \given \theta) - \log q(x)) \,\mathrm dx, \end{align} where the last equality uses $\log p(x \given y; \theta) + \log p(y \given \theta) = \log p(x, y \given \theta)$.
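As a numeric sanity check on \eqref{eq:notes/em/p}, here is a minimal Python sketch for a model with a single binary latent $x$ and one fixed observation $y$; the joint probabilities and the choice of $q$ are hypothetical values picked for illustration.

```python
import numpy as np

# Tiny model: one binary latent x, one fixed observation y.
# The joint p(x, y | theta) is specified by two numbers (hypothetical values).
log_joint = np.log(np.array([0.3, 0.1]))  # [log p(x=0, y | theta), log p(x=1, y | theta)]

log_evidence = np.logaddexp.reduce(log_joint)  # log p(y | theta)
log_posterior = log_joint - log_evidence       # log p(x | y; theta)

q = np.array([0.7, 0.3])  # an arbitrary auxiliary distribution q(x)

elbo = np.sum(q * (log_joint - np.log(q)))    # L(q, theta)
kl = np.sum(q * (np.log(q) - log_posterior))  # KL(q(x) || p(x | y; theta))

assert np.isclose(elbo + kl, log_evidence)  # the decomposition holds
print(elbo, kl, log_evidence)
```

Setting $q$ to the posterior drives the KL term to zero, which is exactly what the E-step below does.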

Since $\KL{q(x)}{p(x \given y; \theta)} \geq 0$, \begin{align} \log p(y \given \theta) \geq \mathcal L(q, \theta). \end{align}

The algorithm proceeds by iteratively improving $\mathcal L(q, \theta)$, updating $q$ in an E-step and $\theta$ in an M-step:

  1. Initialize $\theta^{(0)}$.
  2. For $t = 1, 2, \dotsc$:
    • E-step:
      • Maximize $\mathcal L(q, \theta^{(t - 1)})$ with respect to $q$ by \begin{align} q^{(t)} \leftarrow p(x \given y; \theta^{(t - 1)}). \label{eq:notes/em/E} \end{align}
    • M-step:
      • Maximize $\mathcal L(q^{(t)}, \theta)$ with respect to $\theta$: \begin{align} \mathcal L(q^{(t)}, \theta) &= \int q^{(t)}(x) (\log p(x, y \given \theta) - \log q^{(t)}(x)) \,\mathrm dx \\ &= \int p(x \given y; \theta^{(t - 1)}) (\log p(x, y \given \theta) - \log p(x \given y; \theta^{(t - 1)})) \,\mathrm dx \\ &= \int p(x \given y; \theta^{(t - 1)}) \log p(x, y \given \theta) \,\mathrm dx - \int p(x \given y; \theta^{(t - 1)}) \log p(x \given y; \theta^{(t - 1)}) \,\mathrm dx \\ &= \int p(x \given y; \theta^{(t - 1)}) \log p(x, y \given \theta) \,\mathrm dx + \text{const} \\ \implies \theta^{(t)} &\leftarrow \mathrm{argmax}_{\theta} \int p(x \given y; \theta^{(t - 1)}) \log p(x, y \given \theta) \,\mathrm dx. \label{eq:notes/em/M} \end{align}

After the E-step in \eqref{eq:notes/em/E}, $\mathcal L(q^{(t)}, \theta^{(t - 1)}) = \log p(y \given \theta^{(t - 1)})$ because the KL divergence in \eqref{eq:notes/em/p} becomes zero.

After the M-step in \eqref{eq:notes/em/M}, we obtain $\theta^{(t)}$ with $\mathcal L(q^{(t)}, \theta^{(t)}) \geq \mathcal L(q^{(t)}, \theta^{(t - 1)})$. Since $q^{(t)} = p(x \given y; \theta^{(t - 1)})$ need not equal $p(x \given y; \theta^{(t)})$, the KL divergence in \eqref{eq:notes/em/p} is no longer necessarily zero and $\log p(y \given \theta^{(t)}) \geq \mathcal L(q^{(t)}, \theta^{(t)})$. Hence \begin{align} \log p(y \given \theta^{(t)}) &\geq \mathcal L(q^{(t)}, \theta^{(t)}) \\ &\geq \mathcal L(q^{(t)}, \theta^{(t - 1)}) \\ &= \log p(y \given \theta^{(t - 1)}). \end{align}
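To make the E-step \eqref{eq:notes/em/E}, the M-step \eqref{eq:notes/em/M} and the monotonicity argument concrete, here is a minimal Python sketch of EM for MLE in a two-component one-dimensional Gaussian mixture; the synthetic data, initialization and all names are assumptions of the sketch, not part of the notes above. Both steps have closed forms for this model, and the printed log-likelihood never decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from a two-component 1D Gaussian mixture (hypothetical parameters).
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# theta = (mixture weights pi, means mu, standard deviations sigma); crude initialization.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_normal_pdf(y, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma)**2

for t in range(50):
    # E-step: q^(t)(x_n) = p(x_n | y_n; theta^(t-1)), the posterior "responsibilities".
    log_joint = np.log(pi) + log_normal_pdf(y[:, None], mu, sigma)  # shape (N, 2)
    log_evidence = np.logaddexp.reduce(log_joint, axis=1)           # log p(y_n | theta^(t-1))
    r = np.exp(log_joint - log_evidence[:, None])

    print(t, log_evidence.sum())  # log p(y | theta^(t-1)); non-decreasing in t

    # M-step: argmax_theta of the expected complete-data log likelihood, in closed form.
    N_k = r.sum(axis=0)
    pi = N_k / len(y)
    mu = (r * y[:, None]).sum(axis=0) / N_k
    sigma = np.sqrt((r * (y[:, None] - mu)**2).sum(axis=0) / N_k)
```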

EM for MAP

Goal: Find $\theta^*_{\text{MAP}} = \mathrm{argmax}_{\theta} \log p(\theta \given y)$.

Consider the identity \begin{align} \log p(\theta \given y) &= \log p(y \given \theta) + \log p(\theta) - \log p(y) \\ &= \mathcal L(q, \theta) + \KL{q(x)}{p(x \given y; \theta)} + \log p(\theta) - \log p(y). \label{eq:notes/em/p2} \end{align}

Since the KL divergence is nonnegative, \begin{align} \log p(\theta \given y) \geq \mathcal L(q, \theta) + \log p(\theta) - \log p(y). \end{align}

The algorithm proceeds as follows:

  1. Initialize $\theta^{(0)}$.
  2. For $t = 1, 2, \dotsc$:
    • E-step:
      • Maximize $\mathcal L(q, \theta^{(t - 1)})$ with respect to $q$ by \begin{align} q^{(t)} \leftarrow p(x \given y; \theta^{(t - 1)}). \end{align}
    • M-step:
      • Maximize $\mathcal L(q^{(t)}, \theta) + \log p(\theta)$ with respect to $\theta$ by \begin{align} \theta^{(t)} \leftarrow \mathrm{argmax}_{\theta} \left\{\log p(\theta) + \int p(x \given y; \theta^{(t - 1)}) \log p(x, y \given \theta) \,\mathrm dx\right\}. \label{eq:notes/em/M2} \end{align}

With a line of reasoning similar to the MLE case, we observe \begin{align} \log p(\theta^{(t)} \given y) &\geq \mathcal L(q^{(t)}, \theta^{(t)}) + \log p(\theta^{(t)}) - \log p(y) & \text{(because the KL divergence in \eqref{eq:notes/em/p2} is nonnegative)} \\ &\geq \mathcal L(q^{(t)}, \theta^{(t - 1)}) + \log p(\theta^{(t - 1)}) - \log p(y) & \text{(because of the M-step in \eqref{eq:notes/em/M2})} \\ &= \log p(\theta^{(t - 1)} \given y). & \text{(because the KL divergence in \eqref{eq:notes/em/p2} is zero after the E-step)} \end{align}
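Continuing the Gaussian-mixture sketch from the MLE section, the MAP variant only changes the M-step through the $\log p(\theta)$ term in \eqref{eq:notes/em/M2}. As an illustration, here is a sketch under a hypothetical prior $\mu_k \sim \mathrm{Normal}(0, \tau^2)$ on each mean (and flat priors on the weights and standard deviations): the mean update becomes a shrinkage of the MLE update toward the prior mean. For simplicity the mean update treats $\sigma_k$ as fixed at its current value, so the M-step is coordinate-wise; each coordinate update still cannot decrease the objective, so monotonicity is preserved.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
tau = 10.0  # prior standard deviation of mu_k ~ Normal(0, tau^2) (assumption)

def log_normal_pdf(y, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma)**2

for t in range(50):
    # E-step: identical to the MLE version.
    log_joint = np.log(pi) + log_normal_pdf(y[:, None], mu, sigma)
    log_evidence = np.logaddexp.reduce(log_joint, axis=1)
    r = np.exp(log_joint - log_evidence[:, None])
    N_k = r.sum(axis=0)

    # log p(y | theta^(t-1)) + log p(theta^(t-1)), up to a constant; non-decreasing in t.
    print(t, log_evidence.sum() - 0.5 * np.sum(mu**2) / tau**2)

    # M-step: log p(theta) only affects the mean update, which solves
    # d/dmu_k [log p(mu_k) + sum_n r_nk log Normal(y_n | mu_k, sigma_k)] = 0.
    pi = N_k / len(y)
    mu = (r * y[:, None]).sum(axis=0) / sigma**2 / (1.0 / tau**2 + N_k / sigma**2)
    sigma = np.sqrt((r * (y[:, None] - mu)**2).sum(axis=0) / N_k)
```

As $\tau \to \infty$ the prior becomes flat and the mean update recovers the MLE M-step of the previous sketch.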
