*17 December 2017*

Consider

- $q_\phi(x)$, a probability density of $x$, parametrized by $\phi$, and
- $p(x)$, a probability density of $x$.

Then

- If we minimize $\KL{q_\phi}{p}$ (or the *reverse/exclusive KL*) with respect to $\phi$, the *zero-forcing*/*mode-seeking* behavior arises.
- If we minimize $\KL{p}{q_\phi}$ (or the *forward/inclusive KL*) with respect to $\phi$, the *mass-covering*/*mean-seeking* behavior arises.
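Written out explicitly, the two objectives differ only in which density the expectation is taken under:

\begin{align} \KL{q_\phi}{p} = \int q_\phi(x) \log \frac{q_\phi(x)}{p(x)} \,\mathrm{d}x, \qquad \KL{p}{q_\phi} = \int p(x) \log \frac{p(x)}{q_\phi(x)} \,\mathrm{d}x, \end{align}

with sums replacing integrals in the discrete case.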

Let

- $q_\phi$ be a Bernoulli distribution with probability $\phi \in [0, 1]$, i.e. $q_\phi(x) = \phi^x (1 - \phi)^{1 - x}$, and
- $p$ be a Bernoulli distribution with probability $0.5$, i.e. $p(x) = 0.5^x (1 - 0.5)^{1 - x} = 0.5$.

Hence, the support of our random variable is $\{0, 1\}$.

\begin{align}
\KL{q_\phi}{p} &= \sum_x q_\phi(x) \log \frac{q_\phi(x)}{p(x)} \\
&= \phi \log \frac{\phi}{0.5} + (1 - \phi) \log \frac{1 - \phi}{0.5} \\
&= \cdots \\
&= \log\left[\phi^\phi (1 - \phi)^{(1 - \phi)}\right] - \log 0.5.
\end{align}
Hence, minimizing this quantity is the same as minimizing $\log\left[\phi^\phi (1 - \phi)^{(1 - \phi)}\right] = \phi \log \phi + (1 - \phi) \log(1 - \phi)$. Since this quantity is the negative entropy of $q_\phi$, it is minimized by maximizing the entropy, i.e. by $\phi = 0.5$.
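As a quick numerical sanity check (an illustrative sketch, not part of the original note; only NumPy is assumed), we can verify the closed form and the location of the minimum by a grid search:

```python
import numpy as np

def reverse_kl(phi):
    """KL(q_phi || p) with q_phi = Bernoulli(phi) and p = Bernoulli(0.5),
    using the convention 0 * log(0) = 0 at the boundary."""
    total = 0.0
    for q in (phi, 1.0 - phi):
        if q > 0:
            total += q * np.log(q / 0.5)
    return total

# The closed form log[phi^phi (1 - phi)^(1 - phi)] - log 0.5 matches ...
phi = 0.3
closed = np.log(phi**phi * (1 - phi)**(1 - phi)) - np.log(0.5)
assert np.isclose(reverse_kl(phi), closed)

# ... and a grid search confirms the minimum at phi = 0.5.
phis = np.linspace(0.0, 1.0, 1001)
print(phis[np.argmin([reverse_kl(t) for t in phis])])  # 0.5
```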

\begin{align}
\KL{p}{q_\phi} &= \sum_x p(x) \log \frac{p(x)}{q_\phi(x)} \\
&= 0.5 \log \frac{0.5}{\phi} + 0.5 \log \frac{0.5}{1 - \phi} \\
&= \cdots \\
&= \log 0.5 - 0.5 \log[\phi (1 - \phi)].
\end{align}
Hence, minimizing this quantity is the same as maximizing $\phi (1 - \phi)$, which yields $\phi = 0.5$. In this simple discrete case, the two divergences therefore agree on the minimizer.
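The forward case can be checked the same way (again an illustrative sketch using NumPy):

```python
import numpy as np

def forward_kl(phi):
    """KL(p || q_phi) with p = Bernoulli(0.5) and q_phi = Bernoulli(phi)."""
    return 0.5 * np.log(0.5 / phi) + 0.5 * np.log(0.5 / (1 - phi))

# Matches the closed form log 0.5 - 0.5 * log[phi * (1 - phi)] ...
phi = 0.3
assert np.isclose(forward_kl(phi), np.log(0.5) - 0.5 * np.log(phi * (1 - phi)))

# ... and is minimized where phi * (1 - phi) is maximized, i.e. phi = 0.5.
phis = np.linspace(0.001, 0.999, 999)
print(round(phis[np.argmin([forward_kl(t) for t in phis])], 3))  # 0.5
```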

Let \begin{align} p(x) &= \sum_{k = 1}^K \pi_k \, \mathrm{Normal}(x; \mu_k, \sigma_k^2), \\ q_{\phi}(x) &= \mathrm{Normal}(x; \mu_q, \sigma_q^2), \end{align} where $\phi = (\mu_q, \sigma_q)$, $\pi_k \geq 0$ ($\sum_{k = 1}^K \pi_k = 1$), and the separation between the component means $\mu_k$ is increasing from plot to plot.
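The two behaviors can be reproduced with a small brute-force experiment (an illustrative sketch, not part of the original note: the particular mixture $0.5\,\mathrm{Normal}(-3, 1) + 0.5\,\mathrm{Normal}(3, 1)$ and the search grids are my own choices). Each divergence is computed by numerical integration on a grid and minimized over $(\mu_q, \sigma_q)$:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target: an equal mixture of Normal(-3, 1) and Normal(3, 1).
xs = np.linspace(-12, 12, 4001)
dx = xs[1] - xs[0]
p = 0.5 * normal_pdf(xs, -3.0, 1.0) + 0.5 * normal_pdf(xs, 3.0, 1.0)

def kl(a, b):
    """Numerical KL(a || b) on the grid; points where a ~ 0 contribute 0."""
    mask = a > 1e-12
    return np.sum(a[mask] * (np.log(a[mask]) - np.log(np.maximum(b[mask], 1e-300)))) * dx

# Brute-force search over Gaussians q_phi = Normal(mu_q, sigma_q^2).
best = {"reverse": (np.inf, None), "forward": (np.inf, None)}
for mu in np.linspace(-4, 4, 81):
    for sigma in np.linspace(0.5, 5, 46):
        q = normal_pdf(xs, mu, sigma)
        for name, val in (("reverse", kl(q, p)), ("forward", kl(p, q))):
            if val < best[name][0]:
                best[name] = (val, (mu, sigma))

print("reverse KL argmin (mu_q, sigma_q):", best["reverse"][1])  # hugs one mode
print("forward KL argmin (mu_q, sigma_q):", best["forward"][1])  # covers both modes
```

The reverse minimizer sits on one of the modes with a narrow $\sigma_q$, while the forward minimizer sits between the modes with a large $\sigma_q$ (moment matching gives $\mu_q = 0$, $\sigma_q^2 = 10$ here).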

The behavior of minimizing the forward and reverse KL divergences with respect to $\phi$ is as follows:

*(Figure: plots of $p$ and the minimizing $q_\phi$, with the separation between the modes of $p$ increasing across plots.)*

Intuitive explanations and speculations for the origins of the various terms:

The terms *mode-seeking* and *zero-forcing* for minimizing the reverse KL come from the fact that this minimization forces $q_\phi$ to be zero where $p$ is zero and hence makes it concentrate on one of the modes (last two plots).
While the *zero-forcing* behavior can be explained by looking at the expression for the reverse KL divergence (when $p(x)$ is (almost) zero and $q_\phi(x)$ is not (almost) zero, this KL is (almost) infinite), the *mode-seeking* behavior is only a corollary of the *zero-forcing* behavior and need not always occur.
For example, note that in the second and third plots there are two modes, but $q_\phi$ is not *mode-seeking* because there is no region of (almost) zero density under $p$ between the modes.
The reverse KL is called the *exclusive KL* because it excludes a mode.
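The zero-forcing penalty is easy to see numerically. Here is a tiny three-outcome example (the distributions are my own assumed toy values, not from the note):

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions, with 0 * log(0) taken as 0."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

# p is (almost) zero on the third outcome.
p = [0.4995, 0.4995, 0.001]

q_spread = [0.4, 0.4, 0.2]  # keeps mass where p is almost zero
q_forced = [0.5, 0.5, 0.0]  # forced to (exactly) zero there

print(kl(q_spread, p))  # large, driven by 0.2 * log(0.2 / 0.001)
print(kl(q_forced, p))  # close to zero
```

As `p` on the third outcome shrinks toward zero, the first reverse KL grows without bound, so the minimizer must put (almost) no mass there.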

Since there is a $p(x) \log \frac{p(x)}{q_\phi(x)}$ term in the forward KL, to make this term small we must make sure that there is some mass under $q_\phi$ wherever there is some mass under $p$.
We can see this in all the plots.
This is where the term *mass-covering* comes from.
The term *mean-seeking* is merely to contrast with *mode-seeking*.
The term *inclusive KL* is to contrast with *exclusive KL*.
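The mirror-image illustration for the forward KL (again an assumed toy example, not from the note): missing a region where $p$ has mass makes $\KL{p}{q_\phi}$ blow up.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions, with 0 * log(0) taken as 0."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

p = [0.5, 0.5]

q_covering = [0.5, 0.5]      # mass everywhere p has mass
q_missing = [0.999, 0.001]   # almost no mass on the second outcome

print(kl(p, q_covering))  # 0.0
print(kl(p, q_missing))   # large, dominated by 0.5 * log(0.5 / 0.001)
```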

For more detailed intuitions, check out this blog post and Wittawat’s comment. For more detailed treatment, check out the references.

**References**

- Minka, T. (2005). *Divergence measures and message passing*. Technical report, Microsoft Research.

- Turner, R. E., & Sahani, M. (2011). *Two problems with variational expectation maximisation for time-series models*. In *Bayesian Time Series Models*. Cambridge University Press.
