Tuan Anh Le

Importance Sampling

14 September 2017

Consider a probability distribution on a space $\mathcal X$ with density $p$. (We will interchangeably denote the probability distribution and its density by the same letter.) Given a function $f: \mathcal X \to \mathbb R$, our goal is to estimate \begin{align} I := \E[f(X)] := \int_{\mathcal X} f(x) p(x) \,\mathrm dx. \end{align}

Monte Carlo Estimator

We can estimate this using $N$ independent, identically distributed (iid) samples from $p$ as follows: \begin{align} X_n &\sim p \\ I_N^{\text{MC}} &= \frac{1}{N} \sum_{n = 1}^N f(X_n). \end{align}
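As a quick numerical sketch (the choices here are illustrative, not from the notes: $X \sim \mathrm{Normal}(0, 1)$ and $f(x) = x^2$, so that $I = \E[X^2] = 1$):

```python
# Sketch of the plain Monte Carlo estimator I_N^MC.
# Illustrative setup: X ~ Normal(0, 1), f(x) = x^2, so the true value is I = 1.
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sample_p, N):
    """Plain Monte Carlo: average f over N iid draws from p."""
    x = sample_p(N)
    return np.mean(f(x))

I_hat = mc_estimate(lambda x: x**2, lambda N: rng.normal(size=N), N=100_000)
print(I_hat)  # should be close to 1
```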

The expectation of $I_N^{\text{MC}}$ is equal to $I$: \begin{align} \E\left[I_N^{\text{MC}}\right] &= \E\left[\frac{1}{N} \sum_{n = 1}^N f(X_n)\right] \\ &= \frac{1}{N} \sum_{n = 1}^N \E\left[f(X_n)\right] \\ &= \frac{1}{N} \sum_{n = 1}^N I \\ &= I. \end{align}

The variance is: \begin{align} \Var\left[I_N^{\text{MC}}\right] &= \E\left[{I_N^{\text{MC}}}^2\right] - \E\left[I_N^{\text{MC}}\right]^2 \\ &= \E\left[\left(\frac{1}{N} \sum_{n = 1}^N f(X_n)\right)^2\right] - I^2 \\ &= \frac{1}{N^2} \E\left[\sum_{n = 1}^N f(X_n)^2 + \sum_{n, \ell, n \neq \ell} f(X_n) f(X_\ell)\right] - I^2 \\ &= \frac{1}{N^2} \left(\sum_{n = 1}^N \E\left[f(X_n)^2\right] + \sum_{n, \ell, n \neq \ell} \E\left[f(X_n) f(X_\ell)\right]\right) - I^2 \\ &= \frac{1}{N^2} \left(N \E\left[f(X)^2\right] + (N^2 - N) \E\left[f(X)\right]^2 \right) - I^2 \\ &= \frac{\E\left[f(X)^2\right] - \E\left[f(X)\right]^2}{N} \\ &= \frac{\Var[f(X)]}{N}. \end{align}

Importance Sampling

Instead of drawing samples from $p$, assume that we can only draw samples from another distribution $q$, called the proposal, whose density we can evaluate and which satisfies $q(x) > 0$ whenever $f(x) p(x) \neq 0$.

The importance sampling estimator is as follows: \begin{align} X_n &\sim q \\ W_n &= \frac{p(X_n)}{q(X_n)} \\ I_N^{\text{IS}} &= \frac{1}{N} \sum_{n = 1}^N W_n f(X_n). \label{eq:is_est} \end{align}
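A small numerical sketch of this estimator (the setup is illustrative: target $p = \mathrm{Normal}(0, 1)$, proposal $q = \mathrm{Normal}(0, 2^2)$, and $f(x) = x^2$, so that $I = 1$):

```python
# Sketch of the importance sampling estimator I_N^IS.
# Illustrative setup: p = Normal(0, 1), q = Normal(0, 2^2), f(x) = x^2 (I = 1).
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, 2.0, size=N)                        # X_n ~ q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # W_n = p(X_n) / q(X_n)
I_hat = np.mean(w * x**2)                               # (1/N) sum_n W_n f(X_n)
print(I_hat)  # should be close to 1
```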

The expectation of $I_N^{\text{IS}}$ is equal to $I$: \begin{align} \E\left[I_N^{\text{IS}}\right] &= \E\left[\frac{1}{N} \sum_{n = 1}^N W_n f(X_n)\right] \\ &= \frac{1}{N} \sum_{n = 1}^N \E\left[W_n f(X_n)\right] \\ &= \frac{1}{N} \sum_{n = 1}^N \E\left[\frac{p(X_n)}{q(X_n)} f(X_n)\right] \\ &= \frac{1}{N} \sum_{n = 1}^N \int_{\mathcal X} \frac{p(x_n)}{q(x_n)} f(x_n) q(x_n) \,\mathrm dx_n \\ &= \frac{1}{N} \sum_{n = 1}^N \int_{\mathcal X} p(x) f(x) \,\mathrm dx \\ &= I. \end{align}

The variance of $I_N^{\text{IS}}$ is: \begin{align} \Var\left[I_N^{\text{IS}}\right] &= \E\left[{I_N^{\text{IS}}}^2\right] - I^2 \\ &= \E\left[\left(\frac{1}{N} \sum_{n = 1}^N W_n f(X_n)\right)^2\right] - I^2 \\ &= \frac{1}{N^2} \E\left[\sum_{n = 1}^N \left(W_n f(X_n)\right)^2 + \sum_{n, \ell, n \neq \ell} \left(W_n f(X_n)\right)\left(W_\ell f(X_\ell)\right)\right] - I^2 \\ &= \frac{1}{N^2} \sum_{n = 1}^N \E\left[\left(W_n f(X_n)\right)^2\right] + \frac{1}{N^2} \sum_{n, \ell, n \neq \ell} \E\left[\left(W_n f(X_n)\right)\left(W_\ell f(X_\ell)\right)\right] - I^2 \\ &= \frac{1}{N} \left(\int_{\mathcal X} \left(\frac{p(x)}{q(x)} f(x)\right)^2 q(x) \,\mathrm dx - I^2 \right) \\ &= \frac{1}{N} \left(\int_{\mathcal X} \frac{p(x)^2 f(x)^2}{q(x)} \,\mathrm dx - I^2 \right). \end{align}

The “optimal” proposal has the density \begin{align} q^{\text{opt}}(x) &= \frac{|f(x)| p(x)}{\int_{\mathcal X} |f(x')| p(x') \,\mathrm dx'}. \end{align} This proposal is optimal in the sense that the resulting estimator's variance is the lowest possible: the variance of the estimator $I_N^{\text{IS, opt}}$ using the optimal proposal $q^{\text{opt}}$ is not greater than the variance of an estimator using an arbitrary proposal $q$: \begin{align} \Var\left[I_N^{\text{IS, opt}}\right] &= \frac{1}{N} \left(\int_{\mathcal X} \frac{p(x)^2 f(x)^2}{\frac{|f(x)| p(x)}{\int_{\mathcal X} |f(x')| p(x') \,\mathrm dx'}} \,\mathrm dx - I^2 \right) \\ &= \frac{1}{N} \left(\int_{\mathcal X} |f(x)| p(x) \,\mathrm dx \cdot \int_{\mathcal X} \frac{p(x)^2 f(x)^2}{|f(x)| p(x)} \,\mathrm dx - I^2 \right) \\ &= \frac{1}{N} \left(\int_{\mathcal X} |f(x)| p(x) \,\mathrm dx \cdot \int_{\mathcal X} |f(x)| p(x) \,\mathrm dx - I^2 \right) \\ &= \frac{1}{N} \left(\left(\int_{\mathcal X} \frac{|f(x)| p(x)}{q(x)} \cdot 1 \cdot q(x) \,\mathrm dx\right)^2 - I^2 \right) \\ &\leq \frac{1}{N} \left(\left(\int_{\mathcal X} \left(\frac{|f(x)| p(x)}{q(x)}\right)^2 \cdot q(x) \,\mathrm dx\right) \cdot \left(\int_{\mathcal X} 1^2 \cdot q(x) \,\mathrm dx\right) - I^2 \right) & \text{(Cauchy-Schwarz ineq.)}\\ &= \frac{1}{N} \left(\int_{\mathcal X} \frac{f(x)^2 p(x)^2}{q(x)} \,\mathrm dx - I^2 \right) \\ &= \Var\left[I_N^{\text{IS}}\right]. \end{align}
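When $f \geq 0$, the optimal proposal makes every term $W_n f(X_n)$ equal $\int_{\mathcal X} f(x) p(x) \,\mathrm dx = I$ exactly, so the estimator has zero variance. A numerical sketch under illustrative choices ($p = \mathrm{Normal}(0, 1)$, $f(x) = x^2$, where $q^{\text{opt}}(x) \propto x^2 e^{-x^2/2}$ can be sampled as a $\chi_3$ draw with a random sign):

```python
# Sketch: with f(x) = x^2 >= 0 and p = Normal(0, 1), the optimal proposal is
# q_opt(x) = x^2 exp(-x^2/2) / sqrt(2 pi), sampled as |X| ~ chi(3) with a
# random sign. Every weighted term W_n f(X_n) then equals I = 1 exactly.
import numpy as np

rng = np.random.default_rng(3)

N = 10_000
x = np.sqrt(rng.chisquare(3, size=N)) * rng.choice([-1.0, 1.0], size=N)
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)                # target density
q_opt = x**2 * np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)     # optimal proposal
terms = (p / q_opt) * x**2                                  # W_n f(X_n)
print(terms.mean(), terms.std())  # mean ~1, standard deviation ~0
```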

Self-Normalized Importance Sampling

Now, assume that we can only evaluate an unnormalized density $\tilde p$, where $\tilde p(x) = Z p(x)$ for an unknown normalizing constant $Z = \int_{\mathcal X} \tilde p(x) \,\mathrm dx$, and that, as before, we can sample from a proposal $q$ and evaluate its density.

The self-normalized importance sampling estimator is as follows: \begin{align} X_n &\sim q \\ \tilde W_n &= \frac{\tilde p(X_n)}{q(X_n)} \\ \tilde I_N^{\text{IS}} &= \frac{\frac{1}{N}\sum_{n = 1}^N \tilde W_n f(X_n)}{\frac{1}{N}\sum_{n = 1}^N \tilde W_n}. \label{eq:self_norm_est} \end{align}
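A numerical sketch of the self-normalized estimator (illustrative setup: $\tilde p(x) = 3 \cdot \mathrm{Normal}(x; 0, 1)$ with the constant $Z = 3$ hidden from the estimator, $q = \mathrm{Normal}(0, 2^2)$, $f(x) = x^2$, so $I = 1$):

```python
# Sketch of the self-normalized importance sampling estimator.
# Illustrative setup: tilde_p(x) = 3 * Normal(x; 0, 1) (unknown Z = 3),
# q = Normal(0, 2^2), f(x) = x^2, so the true value is I = 1.
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

N = 100_000
x = rng.normal(0.0, 2.0, size=N)                                  # X_n ~ q
w = 3.0 * normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)       # tilde W_n
I_hat = np.sum(w * x**2) / np.sum(w)   # ratio of weighted sums; Z cancels
print(I_hat)  # should be close to 1
```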

Consistency of the Estimator

Theorem. $\tilde I_N^{\text{IS}}$ converges to $I$ almost surely, i.e. \begin{align} P\left(\left\{\omega: \lim_{N \to \infty} \tilde I_N^{\text{IS}}(\omega) = I\right\}\right) = 1. \end{align}

Proof. Denote the numerator of \eqref{eq:self_norm_est} by $\alpha_N$ and the denominator by $\beta_N$. By the strong law of large numbers, $\alpha_N$ converges to $ZI$ almost surely and $\beta_N$ converges to $Z$ almost surely. Now, let \begin{align} A &= \left\{\omega: \lim_{N \to \infty} \alpha_N(\omega) = ZI\right\} \\ B &= \left\{\omega: \lim_{N \to \infty} \beta_N(\omega) = Z\right\} \\ C &= \left\{\omega: \lim_{N \to \infty} \tilde I_N^{\text{IS}}(\omega) = I\right\}. \end{align} We have $P(A) = P(B) = 1$ and hence $P(A \cap B) = 1$ (see the properties of probability measures). We also have $A \cap B \subseteq C$ since for any $\omega \in A \cap B$, $\lim_{N \to \infty} \alpha_N(\omega) = ZI$ and $\lim_{N \to \infty} \beta_N(\omega) = Z$, and given the properties of limits, $\lim_{N \to \infty} \tilde I_N^{\text{IS}}(\omega) = ZI / Z = I$. Hence $P(C) \geq P(A \cap B) = 1$. Since $P(C) \leq 1$, $P(C) = 1$.

Asymptotic Variance

It is more difficult to compute the bias and variance of the self-normalized importance sampling estimator \eqref{eq:self_norm_est} because it cannot be decomposed into a sum of independent terms as in \eqref{eq:is_est}. We resort to the central limit theorem and the delta method, which state the following.

Central limit theorem. Given iid random vectors $v_n$ ($n = 1, 2, \dotsc$) on $\mathbb R^d$ with a common mean $\mu$ and covariance $\Sigma$, \begin{align} \sqrt{N} \left(\frac{1}{N} \sum_{n = 1}^N v_n - \mu\right) \xrightarrow{d} \mathrm{Normal}(0, \Sigma), \end{align} where $\xrightarrow{d}$ denotes convergence in distribution and $\mathrm{Normal}(0, \Sigma)$ denotes a multivariate normal distribution with mean $0$ and covariance $\Sigma$.

Delta method. Given $\mu \in \mathbb R^d$ and its estimator $\hat \mu_N$ with the following convergence \begin{align} \sqrt{N} \left(\hat \mu_N - \mu\right) \xrightarrow{d} \mathrm{Normal}(0, \Sigma), \end{align} then, given a function $g: \mathbb R^d \to \mathbb R$ and the corresponding gradient $\nabla g$, we have \begin{align} \sqrt{N} \left(g(\hat \mu_N) - g(\mu)\right) \xrightarrow{d} \mathrm{Normal}\left(0, \nabla g(\mu)^T \Sigma \nabla g(\mu)\right). \end{align}

We can apply these theorems directly to self-normalized importance sampling, where \begin{align} v_n &= \begin{bmatrix} \tilde W_n f(X_n) \\ \tilde W_n \end{bmatrix}, \\ g\left(\begin{bmatrix} a \\ b \end{bmatrix}\right) &= \frac{a}{b}, \end{align} so that \begin{align} \mu &= \begin{bmatrix} ZI \\ Z \end{bmatrix}, \\ \nabla g\left(\begin{bmatrix} a \\ b \end{bmatrix}\right) &= \begin{bmatrix} 1 / b \\ -a / b^2 \end{bmatrix}, \\ \Sigma &= \begin{bmatrix} \Var(\tilde W_n f(X_n)) & \Cov(\tilde W_n, \tilde W_n f(X_n)) \\ \Cov(\tilde W_n, \tilde W_n f(X_n)) & \Var(\tilde W_n) \end{bmatrix}. \end{align} This results in the following convergence for the self-normalized importance sampling estimator: \begin{align} \sqrt{N} (\tilde I_N^{\text{IS}} - I) \xrightarrow{d} \mathrm{Normal}\left(0, \tilde \sigma_{\text{asymptotic}}^2\right), \end{align} where the asymptotic variance is \begin{align} \tilde \sigma_{\text{asymptotic}}^2 = \int \frac{p(x)^2 (f(x) - I)^2}{q(x)} \,\mathrm dx. \end{align}
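For completeness, here is a sketch of the omitted step from the delta method to the stated asymptotic variance, using $\nabla g(\mu) = \begin{bmatrix} 1/Z & -I/Z \end{bmatrix}^T$ together with $\E[\tilde W_n f(X_n)] = ZI$ and $\E[\tilde W_n] = Z$: \begin{align} \nabla g(\mu)^T \Sigma \nabla g(\mu) &= \frac{1}{Z^2} \left(\Var\left(\tilde W_n f(X_n)\right) - 2I \Cov\left(\tilde W_n, \tilde W_n f(X_n)\right) + I^2 \Var\left(\tilde W_n\right)\right) \\ &= \frac{1}{Z^2} \left(\E\left[\tilde W_n^2 \left(f(X_n) - I\right)^2\right] - \left(\E\left[\tilde W_n f(X_n)\right] - I \E\left[\tilde W_n\right]\right)^2\right) \\ &= \frac{1}{Z^2} \E\left[\tilde W_n^2 \left(f(X_n) - I\right)^2\right] \\ &= \frac{1}{Z^2} \int_{\mathcal X} \frac{\tilde p(x)^2}{q(x)^2} (f(x) - I)^2 q(x) \,\mathrm dx \\ &= \int_{\mathcal X} \frac{p(x)^2 (f(x) - I)^2}{q(x)} \,\mathrm dx, \end{align} where the second equality expands each variance and covariance, and the third uses $\E[\tilde W_n f(X_n)] - I \E[\tilde W_n] = ZI - ZI = 0$.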

Optimal Proposal

A proposal that minimizes the asymptotic variance has the following form \begin{align} \tilde q^{\text{opt}}(x) &= \frac{p(x)|f(x) - I|}{\int_{\mathcal X} p(x) |f(x) - I| \,\mathrm dx}. \end{align}

To see this, we can show that the asymptotic variance associated with $\tilde q^{\text{opt}}$ is not greater than the asymptotic variance associated with any other proposal $q$: \begin{align} \int \frac{p(x)^2 (f(x) - I)^2}{\tilde q^{\text{opt}}(x)} \,\mathrm dx &= \int \frac{p(x)^2 (f(x) - I)^2}{\frac{p(x)|f(x) - I|}{\int_{\mathcal X} p(x') |f(x') - I| \,\mathrm dx'}} \,\mathrm dx \\ &= \int_{\mathcal X} p(x) |f(x) - I| \,\mathrm dx \cdot \int_{\mathcal X} p(x) |f(x) - I| \,\mathrm dx \\ &= \E_q\left[\frac{p(X)}{q(X)} |f(X) - I|\right]^2 \\ &\leq \E_q\left[\frac{p(X)^2 (f(X) - I)^2}{q(X)^2}\right] && \text{(Jensen's ineq.)} \\ &= \int \frac{p(x)^2 (f(x) - I)^2}{q(x)} \,\mathrm dx. \end{align}