04 February 2017
Given a family of \(\mathcal X\)-valued random variables \(X_{\theta}\) and a function \(f: \mathcal X \to \mathbb R\), we are interested in estimating gradients of its expectation with respect to \(\theta\): \begin{align} \frac{\partial}{\partial \theta} \E[f(X_{\theta})]. \end{align}
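The reparametrization trick relies on finding a random variable \(Z\), whose distribution does not depend on \(\theta\), and a function \(g_{\theta}\), differentiable in \(\theta\), such that \(X_{\theta} = g_{\theta}(Z)\).
Assuming \(f\) is differentiable, the gradient can then be estimated with a Monte Carlo estimator built from i.i.d. samples \(z_1, \dots, z_N\) of \(Z\):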
\begin{align}
\frac{\partial}{\partial \theta} \E[f(X_{\theta})] &= \frac{\partial}{\partial \theta} \E[f(g_{\theta}(Z))] \\
&= \E\left[\frac{\partial}{\partial \theta} f(g_{\theta}(Z)) \right] \\
&= \E\left[f'(g_{\theta}(Z)) \frac{\partial}{\partial \theta} g_{\theta}(Z) \right] \\
&\approx \frac{1}{N} \sum_{n = 1}^N f'(g_{\theta}(z_n)) \frac{\partial}{\partial \theta} g_{\theta}(z_n) \\
&=: I_{\text{reparam}}.
\end{align}
This estimator has a standard Monte Carlo variance: \begin{align} \Var[I_{\text{reparam}}] = \frac{1}{N} \Var\left[f'(g_{\theta}(Z)) \frac{\partial}{\partial \theta} g_{\theta}(Z)\right]. \end{align}
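As a concrete illustration of \(I_{\text{reparam}}\), here is a minimal NumPy sketch. The example is an assumption made for illustration and does not come from the text above: \(X_{\theta} \sim \mathcal N(\theta, 1)\), \(Z \sim \mathcal N(0, 1)\), \(g_{\theta}(z) = \theta + z\) and \(f(x) = x^2\), so that \(\E[f(X_{\theta})] = \theta^2 + 1\) and the true gradient is \(2\theta\). The function name `reparam_gradient` is hypothetical.

```python
import numpy as np


def reparam_gradient(theta, num_samples, rng):
    """Reparametrization estimate of d/dtheta E[f(X_theta)].

    Assumed example (not from the original text):
    X_theta ~ N(theta, 1), g_theta(z) = theta + z, f(x) = x**2.
    """
    z = rng.standard_normal(num_samples)  # z_n ~ N(0, 1), independent of theta
    x = theta + z                         # g_theta(z_n)
    f_prime = 2.0 * x                     # f'(g_theta(z_n)) for f(x) = x**2
    dg_dtheta = 1.0                       # d/dtheta g_theta(z_n) for g = theta + z
    return np.mean(f_prime * dg_dtheta)   # average over the N samples


rng = np.random.default_rng(0)
theta = 1.5
estimate = reparam_gradient(theta, 100_000, rng)
print("reparam estimate:", estimate, "true gradient:", 2 * theta)
```

With enough samples the estimate should sit close to \(2\theta\); other choices of \(f\) and \(g_{\theta}\) only change the two lines that compute `f_prime` and `dg_dtheta`.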
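The REINFORCE trick (also known as the score-function or likelihood-ratio estimator) relies on knowing \(\frac{\partial}{\partial \theta} \log p_{\theta}(x)\), where \(p_{\theta}(x)\) is the density of \(X_{\theta}\).
The gradient is estimated as follows, using i.i.d. samples \(x_1, \dots, x_N\) of \(X_{\theta}\):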
\begin{align}
\frac{\partial}{\partial \theta} \E[f(X_{\theta})] &= \frac{\partial}{\partial \theta} \int f(x)p_{\theta}(x) \,\mathrm dx \\
&= \int f(x) \frac{\partial}{\partial \theta} p_{\theta}(x) \,\mathrm dx \\
&= \int f(x) \left(\frac{\partial}{\partial \theta} \log p_{\theta}(x) \right) p_{\theta}(x) \,\mathrm dx \\
&= \E\left[f(X_{\theta}) \frac{\partial}{\partial \theta} \log p_{\theta}(X_{\theta})\right] \\
&\approx \frac{1}{N} \sum_{n = 1}^N f(x_n) \frac{\partial}{\partial \theta} \log p_{\theta}(x_n) \\
&=: I_{\text{reinforce}}.
\end{align}
This estimator has a standard Monte Carlo variance: \begin{align} \Var[I_{\text{reinforce}}] = \frac{1}{N} \Var\left[f(X_{\theta}) \frac{\partial}{\partial \theta} \log p_{\theta}(X_{\theta})\right]. \end{align}
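The same kind of sketch can be written for \(I_{\text{reinforce}}\). Again the setup is an assumption chosen for illustration: \(X_{\theta} \sim \mathcal N(\theta, 1)\) and \(f(x) = x^2\), for which \(\frac{\partial}{\partial \theta} \log p_{\theta}(x) = x - \theta\) and the true gradient is \(2\theta\). The function name `reinforce_gradient` is hypothetical. Note that \(f\) is only evaluated, never differentiated.

```python
import numpy as np


def reinforce_gradient(theta, num_samples, rng):
    """REINFORCE (score-function) estimate of d/dtheta E[f(X_theta)].

    Assumed example (not from the original text):
    X_theta ~ N(theta, 1), f(x) = x**2.
    """
    x = theta + rng.standard_normal(num_samples)  # x_n ~ N(theta, 1)
    f_values = x ** 2                             # f(x_n)
    score = x - theta                             # d/dtheta log p_theta(x_n) for unit-variance Gaussian
    return np.mean(f_values * score)              # average over the N samples


rng = np.random.default_rng(0)
theta = 1.5
estimate = reinforce_gradient(theta, 100_000, rng)
print("reinforce estimate:", estimate, "true gradient:", 2 * theta)
```

For this particular example the estimate converges to the same value, but typically fluctuates more from run to run than the reparametrized one at the same \(N\).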
We want to compare \(\Var[I_{\text{reparam}}]\) and \(\Var[I_{\text{reinforce}}]\).
It turns out that neither estimator has uniformly lower variance: no ordering between the two holds for all \(f\) and \(X_{\theta}\). For details, see Proposition 1 in Section 3.1.2 of Yarin Gal’s thesis (Gal, 2016). Tables 3.1 and 3.2 there give examples of different \(f\)s for which the variance comparisons are inconsistent.
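The lack of a uniform ordering can be checked empirically. The sketch below is an illustration under assumptions chosen here, not a reproduction of the thesis's tables: \(X_{\theta} \sim \mathcal N(\theta, 1)\) with \(\theta = 1\), and two test functions, \(f(x) = x^2\) and \(f(x) = \sin(10x)\). It estimates the per-sample variance (the \(N = 1\) case) of each estimator. The helper name `per_sample_variances` is hypothetical.

```python
import numpy as np


def per_sample_variances(f, f_prime, theta, num_samples, rng):
    """Empirical per-sample variances of the two gradient estimators.

    Assumes X_theta ~ N(theta, 1), so g_theta(z) = theta + z,
    d g_theta / d theta = 1, and d/dtheta log p_theta(x) = x - theta.
    Returns (variance of reparam terms, variance of REINFORCE terms).
    """
    z = rng.standard_normal(num_samples)
    x = theta + z
    reparam_terms = f_prime(x) * 1.0      # f'(g_theta(z)) * dg/dtheta
    reinforce_terms = f(x) * (x - theta)  # f(x) * score
    return np.var(reparam_terms), np.var(reinforce_terms)


def quadratic(x):
    return x ** 2


def quadratic_prime(x):
    return 2.0 * x


def fast_sine(x):
    return np.sin(10.0 * x)


def fast_sine_prime(x):
    return 10.0 * np.cos(10.0 * x)


rng = np.random.default_rng(0)
theta = 1.0
for name, f, f_prime in [
    ("x^2", quadratic, quadratic_prime),
    ("sin(10x)", fast_sine, fast_sine_prime),
]:
    var_reparam, var_reinforce = per_sample_variances(f, f_prime, theta, 200_000, rng)
    print(f"{name}: reparam variance = {var_reparam:.3f}, reinforce variance = {var_reinforce:.3f}")
```

Under these assumptions the quadratic favors the reparametrization estimator, while the rapidly oscillating sine favors REINFORCE, because its derivative is large in magnitude even though the function itself stays bounded.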
@phdthesis{gal2016uncertainty,
  title = {Uncertainty in Deep Learning},
  author = {Gal, Yarin},
  year = {2016},
  school = {University of Cambridge},
  link = {http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf}
}