05 September 2016
This note on variational autoencoders (VAEs) is inspired by Jaan Altosaar’s article.
There are at least two perspectives on variational autoencoders, which we discuss in turn: that of variational inference and that of neural network autoencoders.
Consider a generative model \(p_{\theta}(\vec x, \vec y) = p_{\theta}(\vec x) p_{\theta}(\vec y \given \vec x)\), parametrised by \(\theta\), where \(\vec x\) is a latent variable and \(\vec y\) is the observed data.
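For concreteness, one common choice (for binary data such as black-and-white images) is a standard normal prior over the latent code and a Bernoulli likelihood for each component of \(\vec y\):
\begin{align}
p_{\theta}(\vec x) = \mathcal{N}(\vec x; \vec 0, I), \qquad
p_{\theta}(\vec y \given \vec x) = \prod_i \operatorname{Bernoulli}(y_i; \pi_i(\vec x)),
\end{align}
where the probabilities \(\pi_i(\vec x)\) are computed from \(\vec x\) (by the decoder network introduced below).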
In the variational inference perspective, we posit a distribution \(q_{\phi}(\vec x \given \vec y)\), parametrised by \(\phi\), which approximates the posterior \(p_{\theta}(\vec x \given \vec y)\); it is called the variational approximation.
We minimise the following KL divergence with respect to \(\phi\):
\begin{align}
\KL{q_{\phi}(\vec x \given \vec y)}{p_{\theta}(\vec x \given \vec y)}
&= \int q_{\phi}(\vec x \given \vec y) (\log q_{\phi}(\vec x \given \vec y) - \log p_{\theta}(\vec x \given \vec y)) \,\mathrm d\vec x \\
&= \int q_{\phi}(\vec x \given \vec y) (\log q_{\phi}(\vec x \given \vec y) - \log p_{\theta}(\vec x, \vec y) + \log p_{\theta}(\vec y)) \,\mathrm d\vec x \\
&= \log p_{\theta}(\vec y) - \E_{q_{\phi}(\vec x \given \vec y)}[f(\vec x)], \label{eq:vae/kl}\\
f(\vec x) &= \log p_{\theta}(\vec x, \vec y) - \log q_{\phi}(\vec x \given \vec y).
\end{align}
Since the first term in \eqref{eq:vae/kl} is constant with respect to \(\phi\), the sum \(\KL{q_{\phi}(\vec x \given \vec y)}{p_{\theta}(\vec x \given \vec y)} + \E_{q_{\phi}(\vec x \given \vec y)}[f(\vec x)]\) is constant with respect to \(\phi\).
Hence maximising \(\E_{q_{\phi}(\vec x \given \vec y)}[f(\vec x)]\), also called the evidence lower bound (ELBO), minimises the KL divergence with respect to \(\phi\).
The name is apt: since the KL divergence is non-negative, \eqref{eq:vae/kl} shows that \(\E_{q_{\phi}(\vec x \given \vec y)}[f(\vec x)] \le \log p_{\theta}(\vec y)\), a lower bound on the log evidence.
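The expectation in the ELBO can be estimated by plain Monte Carlo, which is worth seeing once concretely. Below is a minimal Python sketch (the names sample_q, log_p_joint and log_q are hypothetical stand-ins for whatever model is at hand), together with a one-dimensional Gaussian toy check:

import numpy as np

def elbo_estimate(y, sample_q, log_p_joint, log_q, n_samples=1000):
    # Monte Carlo estimate of E_q[f(x)], f(x) = log p(x, y) - log q(x | y).
    total = 0.0
    for _ in range(n_samples):
        x = sample_q(y)  # draw x ~ q_phi(x | y)
        total += log_p_joint(x, y) - log_q(x, y)
    return total / n_samples

# Toy 1-D check: p(x) = N(0, 1), p(y | x) = N(x, 1), so p(y) = N(0, 2) and
# the exact posterior is p(x | y) = N(y / 2, 1 / 2).  If q is the exact
# posterior, every sample gives f(x) = log p(y) and the bound is tight.
rng = np.random.default_rng(0)
log_norm = lambda v, m, s2: -0.5 * (np.log(2 * np.pi * s2) + (v - m) ** 2 / s2)
y = 0.7
print(elbo_estimate(
    y,
    sample_q=lambda y: rng.normal(y / 2, np.sqrt(0.5)),
    log_p_joint=lambda x, y: log_norm(x, 0.0, 1.0) + log_norm(y, x, 1.0),
    log_q=lambda x, y: log_norm(x, y / 2, 0.5),
))
print(log_norm(y, 0.0, 2.0))  # log p(y), for comparison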
The ELBO can be rewritten as follows:
\begin{align}
\E_{q_{\phi}(\vec x \given \vec y)}[f(\vec x)]
&= \int q_{\phi}(\vec x \given \vec y) (\log p_{\theta}(\vec x, \vec y) - \log q_{\phi}(\vec x \given \vec y)) \,\mathrm d\vec x \label{eq:vae/objective} \\
&= \int q_{\phi}(\vec x \given \vec y) (\log p_{\theta}(\vec y \given \vec x) - \log q_{\phi}(\vec x \given \vec y) + \log p_{\theta}(\vec x)) \,\mathrm d\vec x \\
&= \E_{q_{\phi}(\vec x \given \vec y)}[\log p_{\theta}(\vec y \given \vec x)] - \KL{q_{\phi}(\vec x \given \vec y)}{p_{\theta}(\vec x)}. \label{eq:vae/autoencoder}
\end{align}
Equation \eqref{eq:vae/autoencoder} brings us to the second perspective on the VAE: the neural network autoencoder.
We actually lied when we said we were optimising only with respect to \(\phi\): we optimise with respect to \(\theta\) as well.
Moreover, the parameters of \(q_{\phi}(\vec x \given \vec y)\), a parametric distribution on \(\vec x\) (usually a multivariate normal), are the output of a neural network with weights \(\phi\) whose input is \(\vec y\).
Similarly, the parameters of \(p_{\theta}(\vec y \given \vec x)\), a parametric distribution on \(\vec y\), are the output of a different neural network with weights \(\theta\) whose input is \(\vec x\).
The prior \(p_{\theta}(\vec x)\), although nominally parametrised by \(\theta\), is usually fixed to a multivariate normal distribution with zero mean and identity covariance.
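When, as is typical, the encoder outputs a diagonal Gaussian \(q_{\phi}(\vec x \given \vec y) = \mathcal{N}(\vec x; \vec\mu, \operatorname{diag}(\vec\sigma^2))\) and the prior is this standard normal, the KL term in \eqref{eq:vae/autoencoder} has the well-known closed form
\begin{align}
\KL{q_{\phi}(\vec x \given \vec y)}{p_{\theta}(\vec x)}
= \frac{1}{2} \sum_i \left(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1\right),
\end{align}
so during training only the reconstruction term needs to be estimated by sampling.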
In this perspective, \(q_{\phi}(\vec x \given \vec y)\) is also called the encoding model and \(p_{\theta}(\vec y \given \vec x)\) is called the decoding model.
Given this, we can interpret maximising \eqref{eq:vae/autoencoder} with respect to both \(\phi\) and \(\theta\) as follows.
Maximising the first term, \(\E_{q_{\phi}(\vec x \given \vec y)}[\log p_{\theta}(\vec y \given \vec x)]\), forces the reconstruction of \(\vec y\) from its encoding \(\vec x\) to be accurate.
The second term, \(-\KL{q_{\phi}(\vec x \given \vec y)}{p_{\theta}(\vec x)}\), acts as a regulariser which penalises encodings that stray too far from the prior.
We obtain a regularised autoencoder.
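Putting the pieces together, here is a minimal numpy sketch of the (negative) per-datapoint objective for a diagonal-Gaussian encoder and a Bernoulli decoder. The encode and decode functions are hypothetical toy stand-ins for the two neural networks, and a real implementation would use an automatic differentiation framework to backpropagate through this computation:

import numpy as np

rng = np.random.default_rng(0)

def encode(y):
    # Stand-in for the encoder network: maps y to the mean and
    # log-variance of q_phi(x | y).  (Arbitrary toy mapping, not learned.)
    mu = 0.1 * y[:2]
    log_var = np.zeros_like(mu)
    return mu, log_var

def decode(x, dim):
    # Stand-in for the decoder network: maps x to Bernoulli
    # probabilities for each component of y (x tiled up to length dim).
    logits = np.resize(x, dim)
    return 1.0 / (1.0 + np.exp(-logits))

def negative_elbo(y):
    mu, log_var = encode(y)
    sigma = np.exp(0.5 * log_var)
    x = mu + sigma * rng.standard_normal(mu.shape)  # reparametrised sample
    p = decode(x, y.shape[0])
    # Reconstruction term: log p_theta(y | x) for Bernoulli outputs.
    recon = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I).
    kl = 0.5 * np.sum(mu**2 + sigma**2 - log_var - 1)
    return kl - recon

y = rng.integers(0, 2, size=8).astype(float)  # a toy binary datapoint
print(negative_elbo(y))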
We refer the reader to the original paper (Kingma and Welling, 2013) for specifics on how to perform backpropagation on the objective in \eqref{eq:vae/objective}, such as the reparametrisation trick (see this note) and the REINFORCE trick (Williams, 1992).
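In brief, the difficulty is differentiating \(\E_{q_{\phi}(\vec x \given \vec y)}[f(\vec x)]\) with respect to \(\phi\) when the sampling distribution itself depends on \(\phi\). The reparametrisation trick rewrites a Gaussian sample as \(\vec x = \vec\mu + \vec\sigma \odot \vec\epsilon\) with \(\vec\epsilon \sim \mathcal{N}(\vec 0, I)\), moving \(\phi\) out of the distribution and into a deterministic, differentiable transformation. The REINFORCE trick instead uses the score function identity
\begin{align}
\nabla_{\phi} \E_{q_{\phi}(\vec x \given \vec y)}[f(\vec x)]
= \E_{q_{\phi}(\vec x \given \vec y)}\left[f(\vec x) \nabla_{\phi} \log q_{\phi}(\vec x \given \vec y)\right],
\end{align}
where the \(\phi\)-dependence inside \(f\) contributes nothing in expectation because \(\E_{q_{\phi}}[\nabla_{\phi} \log q_{\phi}(\vec x \given \vec y)] = 0\); this estimator is more general but typically has higher variance.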
The family of work around this idea in effect performs amortised inference: a single network, trained on the whole dataset, produces the variational parameters for any input \(\vec y\), rather than a separate optimisation being run per datapoint.
It should be noted that the generative model is in this case not specified directly; the likelihood \(p_{\theta}(\vec y \given \vec x)\) is itself obtained during training by optimising \(\theta\).
Moreover, real-world data \(\{\vec y\}\) are required for this training.
References
@article{kingma2013auto,
  title = {Auto-Encoding Variational Bayes},
  author = {Kingma, Diederik P and Welling, Max},
  journal = {arXiv preprint arXiv:1312.6114},
  year = {2013}
}

@article{williams1992simple,
  title = {Simple statistical gradient-following algorithms for connectionist reinforcement learning},
  author = {Williams, Ronald J},
  journal = {Machine Learning},
  volume = {8},
  number = {3-4},
  pages = {229--256},
  year = {1992},
  publisher = {Springer}
}