Tuan Anh Le

Reverse vs Forward KL

17 December 2017



Bernoulli Example


Let $p(x) = \mathrm{Bernoulli}(x; 0.5)$ be the target distribution and $q_\phi(x) = \mathrm{Bernoulli}(x; \phi)$, with $\phi \in [0, 1]$, be the approximating distribution. Hence, the support of our random variable is $\{0, 1\}$.

Reverse KL: Zero-Forcing/Mode-Seeking

\begin{align} \KL{q_\phi}{p} &= \sum_x q_\phi(x) \log \frac{q_\phi(x)}{p(x)} \\ &= \phi \log \frac{\phi}{0.5} + (1 - \phi) \log \frac{1 - \phi}{0.5} \\ &= \cdots \\ &= \log\left[\phi^\phi (1 - \phi)^{(1 - \phi)}\right] - \log 0.5. \end{align} Hence, minimizing this quantity is the same as minimizing $\log\left[\phi^\phi (1 - \phi)^{(1 - \phi)}\right]$. Since this quantity is convex in $\phi$ and symmetric about $\phi = 0.5$, it is minimized by $\phi = 0.5$, where $q_\phi = p$.

Forward KL: Mass-Covering/Mean-Seeking

\begin{align} \KL{p}{q_\phi} &= \sum_x p(x) \log \frac{p(x)}{q_\phi(x)} \\ &= 0.5 \log \frac{0.5}{\phi} + 0.5 \log \frac{0.5}{1 - \phi} \\ &= \cdots \\ &= \log 0.5 - 0.5 \log[\phi (1 - \phi)]. \end{align} Hence, minimizing this quantity is the same as maximizing $\phi (1 - \phi)$, which yields $\phi = 0.5$. Note that in this example the approximating family contains $p$, so both divergences attain their minimum of zero at the same $\phi$; the difference in behavior only shows up when $q_\phi$ cannot represent $p$, as in the Gaussian example below.
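As a quick numerical sanity check (a sketch, not part of the original note), we can evaluate both divergences on a grid of $\phi$ values and confirm that each is minimized at $\phi = 0.5$:

```python
import numpy as np

# Bernoulli target p = Bernoulli(0.5) and approximation q_phi = Bernoulli(phi).
def reverse_kl(phi):
    """KL(q_phi || p): expectation under q_phi of log(q_phi / p)."""
    return phi * np.log(phi / 0.5) + (1 - phi) * np.log((1 - phi) / 0.5)

def forward_kl(phi):
    """KL(p || q_phi): expectation under p of log(p / q_phi)."""
    return 0.5 * np.log(0.5 / phi) + 0.5 * np.log(0.5 / (1 - phi))

phis = np.linspace(0.01, 0.99, 99)  # stay inside (0, 1) to avoid log(0)
print(round(float(phis[np.argmin(reverse_kl(phis))]), 2))  # -> 0.5
print(round(float(phis[np.argmin(forward_kl(phis))]), 2))  # -> 0.5
```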

Gaussian Example

Let \begin{align} p(x) &= \sum_{k = 1}^K \pi_k \mathrm{Normal}(x; \mu_k, \sigma_k^2), \\ q_{\phi}(x) &= \mathrm{Normal}(x; \mu_q, \sigma_q^2), \end{align} where $\pi_k \geq 0$ are mixture weights ($\sum_{k = 1}^K \pi_k = 1$) and $\phi = (\mu_q, \sigma_q^2)$. In the plots below, the separation between the mixture components of $p$ increases from the first plot to the last.

The behavior of minimizing the forward and reverse KL divergences with respect to $\phi$ is as follows:

Python script for generating this figure.
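In the same spirit, the sketch below grid-searches $(\mu_q, \sigma_q)$ for each divergence against an assumed, illustrative target (equal weights, two unit-variance components at $\pm 4$; these parameters are my choice, not necessarily those used for the figure):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed illustrative target: K = 2, equal weights, well-separated modes.
def p(x):
    return 0.5 * normal_pdf(x, -4.0, 1.0) + 0.5 * normal_pdf(x, 4.0, 1.0)

xs = np.linspace(-12.0, 12.0, 2001)
dx = xs[1] - xs[0]
eps = 1e-300  # guards log(0) in the tails

def kl(a, b):
    """Quadrature estimate of KL(a || b) from densities tabulated on xs."""
    return float(np.sum(a * np.log((a + eps) / (b + eps))) * dx)

px = p(xs)
best = {}
for name in ("reverse", "forward"):
    candidates = []
    for mu_q in np.linspace(-6.0, 6.0, 61):
        for sigma_q in np.linspace(0.5, 6.0, 23):
            qx = normal_pdf(xs, mu_q, sigma_q)
            val = kl(qx, px) if name == "reverse" else kl(px, qx)
            candidates.append((val, float(mu_q), float(sigma_q)))
    best[name] = min(candidates)[1:]  # (mu_q, sigma_q) with smallest KL

# Reverse KL locks onto a single mode; forward KL straddles both.
print(best)
```

On this target the reverse-KL optimum sits on one mode with a narrow $\sigma_q$, while the forward-KL optimum is the moment-matching Gaussian centered between the modes with a wide $\sigma_q$.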

Intuitive explanations and speculations for the origins of the various terms:

Reverse KL: Zero-Forcing/Mode-Seeking

The terms mode-seeking and zero-forcing for minimizing the reverse KL come from the fact that this minimization forces $q_\phi$ to be zero where $p$ is zero and hence makes it concentrate on one of the modes (last two plots). While the zero-forcing behavior can be explained by looking at the expression for the reverse KL divergence (when $p(x)$ is (almost) zero and $q_\phi(x)$ is non-(almost) zero, this KL is (almost) infinity), the mode-seeking behavior is only a corollary of the zero-forcing behavior and doesn't always occur. For example, note that in the second and third plots there are two modes, but minimizing the reverse KL is not mode-seeking because $p$ is not (almost) zero between the modes. The reverse KL is also called the exclusive KL because it excludes a mode.
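A minimal discrete illustration of the zero-forcing penalty (a sketch with assumed toy distributions, not from the note): any $q$ that leaks mass onto an outcome where $p$ is exactly zero makes the reverse KL infinite, while a $q$ that excludes one of $p$'s modes pays only a finite price.

```python
import math

def kl(q, p):
    """KL(q || p) over a finite support; 0 log(0/.) contributes 0."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0:
            total += qi * math.log(qi / pi) if pi > 0 else math.inf
    return total

p = [0.5, 0.5, 0.0]            # p puts no mass on the third outcome
q_spread = [0.4, 0.4, 0.2]     # q leaks mass onto p's zero
q_exclusive = [1.0, 0.0, 0.0]  # q concentrates on a single mode of p

print(kl(q_spread, p))     # inf: reverse KL forbids mass where p is zero
print(kl(q_exclusive, p))  # log 2: finite, despite ignoring a mode of p
```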

Forward KL: Mass-Covering/Mean-Seeking

Since there is a $p(x) \log \frac{p(x)}{q_\phi(x)}$ term in the forward KL, to make this term small we must make sure that there is some mass under $q_\phi$ wherever there is some mass under $p$. We can see this in all the plots. This is where the term mass-covering comes from. The term mean-seeking is merely to contrast with mode-seeking, and the term inclusive KL contrasts with exclusive KL.
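The same kind of toy check works here (again with assumed illustrative distributions): the forward KL is infinite as soon as $q$ assigns zero mass to an outcome that $p$ gives mass to.

```python
import math

def kl(p, q):
    """KL(p || q) over a finite support; 0 log(0/.) contributes 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log(pi / qi) if qi > 0 else math.inf
    return total

p = [0.5, 0.5]           # two equally likely outcomes under p
q_covering = [0.5, 0.5]  # q puts mass everywhere p does
q_missing = [1.0, 0.0]   # q ignores an outcome that p cares about

print(kl(p, q_covering))  # 0.0
print(kl(p, q_missing))   # inf: forward KL forces q to cover all of p's mass
```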

For more detailed intuitions, check out this blog post and Wittawat’s comment. For more detailed treatment, check out the references.


References

  1. Minka, T. (2005). Divergence measures and message passing. Technical report, Microsoft Research.
  2. Turner, R. E., & Sahani, M. (2011). Two problems with variational expectation maximisation for time-series models. Cambridge University Press.